Artificial intelligence (AI) is advancing at a rapid pace, but measuring its true abilities is becoming harder. Existing benchmarks, like those testing general knowledge or reasoning, are no longer challenging enough.
To address this, the Center for AI Safety (CAIS) and Scale AI introduced "Humanity's Last Exam" (HLE), a groundbreaking test designed to push AI models to their limits.
Why Humanity's Last Exam?
AI models are increasingly achieving near-perfect scores on traditional tests. Dan Hendrycks, CAIS co-founder, explains, "We wanted problems that test models at the frontier of human knowledge and reasoning."
To create HLE, nearly 1,000 researchers and professors from over 500 institutions across 50 countries contributed their toughest questions. From over 70,000 submissions, 3,000 highly complex, expert-level questions were finalized. These questions span diverse fields like mathematics, natural sciences, and humanities.
Testing Advanced AI Models
HLE was used to test leading AI systems, including OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3.5 Sonnet. These models excel on standard benchmarks, yet they struggled with HLE, scoring below 10%.
This result highlights a key limitation: while these models can process vast amounts of information, they still fall short in expert-level reasoning and specialized knowledge.
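To make "scoring below 10%" concrete: benchmark accuracy is usually computed by looping over the question set, collecting the model's answer to each question, and counting matches against the reference answers. Below is a minimal, illustrative sketch in Python; the question/answer layout and the query_model callable are hypothetical stand-ins, not the official HLE evaluation harness.

```python
# Minimal sketch of a benchmark evaluation loop (illustration only).
# The {"question", "answer"} layout and query_model are hypothetical
# stand-ins, not the official HLE harness or dataset schema.

def exact_match(predicted: str, expected: str) -> bool:
    """Compare answers after trimming whitespace and normalising case."""
    return predicted.strip().lower() == expected.strip().lower()

def evaluate(questions, query_model) -> float:
    """Return the fraction of questions a model answers correctly."""
    correct = 0
    for item in questions:
        prediction = query_model(item["question"])  # e.g. an API call to the model
        if exact_match(prediction, item["answer"]):
            correct += 1
    return correct / len(questions)

# Toy usage with a dummy "model" that always answers "4":
toy_questions = [
    {"question": "What is 2 + 2? Answer with a number.", "answer": "4"},
    {"question": "What is 3 + 4? Answer with a number.", "answer": "7"},
]
print(f"Accuracy: {evaluate(toy_questions, lambda q: '4'):.0%}")  # prints 50%
```

A score below 10% on HLE simply means that, over the 3,000 expert-written questions, this kind of accuracy calculation comes out under one correct answer in ten.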
AI can’t pass Humanity’s Last Exam, yet…
Example Question
Here’s a sample HLE question from the field of ecology:
“Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.”
This level of complexity requires deep understanding and precise reasoning, areas where current AI systems still face challenges.
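Questions that end with "Answer with a number" also lend themselves to automated grading. One common approach, shown in the generic sketch below (a heuristic assumption, not necessarily HLE's official grading procedure), is to pull the final numeric token out of the model's reply and compare it with the reference value.

```python
import re

def extract_number(text: str):
    """Return the last numeric token in a free-text reply, or None.
    A generic heuristic, not the official HLE grading procedure."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def grade_numeric(reply: str, reference: float) -> bool:
    """Mark the reply correct only if the extracted number equals the reference."""
    value = extract_number(reply)
    return value is not None and value == reference

# A verbose reply still grades correctly if it ends with the right number.
print(grade_numeric("After simplifying, the result is 12.", 12))  # True
print(grade_numeric("I am not sure.", 12))                        # False
```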
What This Means for AI Development
HLE demonstrates that while AI has made remarkable progress, it is far from mastering human-like reasoning. Current models can analyze and predict patterns but struggle with nuanced, high-level problems.
However, AI has a history of rapid improvement. When the MATH benchmark was launched in 2021, top models scored less than 10%. Today, they score over 90%. Hendrycks believes similar progress could occur with HLE.
The Role of New Benchmarks
Advanced benchmarks like HLE are essential to track AI's progress. They encourage researchers to address weaknesses and focus on developing models that can handle real-world challenges.
More importantly, HLE reminds us that AI should be evaluated not only for its accuracy but also for its reasoning and problem-solving skills. As AI systems become more integrated into daily life, these qualities will be critical for ensuring reliability and safety.
The introduction of HLE signals a shift in how we measure AI capabilities. As AI evolves, tests must also evolve. Benchmarks like HLE will help ensure that AI continues to develop in ways that benefit society, fostering systems that can think critically, reason effectively, and support human decision-making.
Humanity’s Last Exam sets a high bar for AI models. Current systems may have underperformed, but this is only the beginning. With continuous research and innovation, AI can rise to the challenge.
Benchmarks like HLE are essential to guide this journey, ensuring AI becomes a reliable tool for solving humanity’s most complex problems.