Models scoring near 90% on standard benchmarks may simply memorize answers rather than demonstrate true problem-solving abilities.