More challenging benchmarks such as Humanity's Last Exam and ARC reveal advanced AI capabilities that more basic tests such as AIME math do not.