Why We Are Running Out of Mathematics-Based AI Reasoning Benchmarks

Benjamin Skuse

Recently, the idea that Claude – a popular general-purpose large language model (LLM) from Anthropic – is conscious was mooted by the company’s CEO. For its part, Claude modestly gave itself a 15 to 20 percent probability of being conscious. Meanwhile, other popular LLMs and related AI systems have successfully completed complicated tasks once thought to be solely the preserve of humans – including composing and recording chart-topping hits, penning award-winning novels and even creating art worthy of top prizes.

Anthropic CEO Dario Amodei. Image credit: TechCrunch (CC-BY-2.0)

Do these achievements mean that AI has reached or exceeded human capabilities in terms of creativity and reasoning? And how would we know? For the latter, we can turn to mathematics, at least for now. 

Foundational Years

Mathematics is one of AI researchers’ favourite testing grounds because it requires precise step-by-step reasoning built on the firm foundations of logic, and it can be verified rigorously and automatically, avoiding subjective judgment or expensive tests.

Since 2017 and the advent of the neural network transformer architecture, which uses attention mechanisms to process sequential data, AI language model capabilities have increased at a staggering pace. To keep up, AI benchmarks have become more and more sophisticated.

One of the first mathematical AI benchmarks was Google DeepMind’s Mathematics Dataset. Introduced in 2019, it contained well over 100 million question-and-answer pairs that a bright schoolchild could answer, with questions limited to 160 characters in length and answers to 30 characters. As the data were generated automatically, questions were varied but procedural, and therefore straightforward to solve for anyone who remembers a little of their school mathematics lessons. For example:

Question: Calculate \(-841880142.544 + 411127\).

Answer: \(-841469015.544\).

Question: Three letters picked without replacement from \(\mathrm{qqqkkklkqkkk}\). Give prob of sequence \(\mathrm{qql}\).

Answer: \(1/110\).
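Both answers can be checked by hand or in a few lines of code. The sketch below is an illustration only, not part of the dataset's own tooling, and the helper name `sequence_probability` is hypothetical. It uses exact decimal and fraction arithmetic to avoid rounding, and for the second question it multiplies the chance of drawing each letter in turn, removing that letter from the pool afterwards.

```python
from collections import Counter
from decimal import Decimal
from fractions import Fraction


def sequence_probability(pool: str, sequence: str) -> Fraction:
    """Probability of drawing `sequence` in order, without replacement, from `pool`."""
    remaining = Counter(pool)      # how many of each letter are still available
    total = len(pool)              # letters left in the pool
    probability = Fraction(1)
    for letter in sequence:
        probability *= Fraction(remaining[letter], total)
        remaining[letter] -= 1     # the drawn letter is not put back
        total -= 1
    return probability


# First question: exact decimal addition.
print(Decimal("-841880142.544") + Decimal("411127"))   # -841469015.544

# Second question: 4/12 * 3/11 * 1/10.
print(sequence_probability("qqqkkklkqkkk", "qql"))      # 1/110
```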

Though some state-of-the-art AI models scored roughly 35% as soon as the benchmark was released, many others, including OpenAI’s GPT-3, released in 2020, struggled because they lacked chain-of-thought reasoning: rather than carrying out multi-step algebraic manipulations, they predicted an answer immediately.

Mathematics Dataset was finally consigned to the dustbin of history in 2023 with the release of GPT-4 and Google Gemini, whose advanced reasoning skills led to scores of 90%+, a stage in a benchmark’s lifecycle known as saturation or – more harshly – obsolescence.

High School Mathematics

Well before Mathematics Dataset had reached saturation, AI researchers were devising more challenging mathematics-based benchmarks. In 2021, GSM8K and the imaginatively titled MATH benchmarks were both released with the intention of testing AI’s mettle in multistep mathematical reasoning.

Created by human problem writers, OpenAI’s GSM8K contained 8,500 mathematical problems that a bright middle-school student should be able to solve in just two to eight steps. Below is an example question and answer:

Question: Ali is a dean of a private school where he teaches one class. John is also a dean of a public school. John has two classes in his school. Each class has \(1/8\) the capacity of Ali’s class which has the capacity of \(120\) students. What is the combined capacity of both schools?

Answer: \(150\).
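Written out, the reasoning behind this answer takes three short steps. The snippet below is only an illustrative sketch of that chain; the variable names are made up for the example and are not part of GSM8K.

```python
# Step 1: each of John's classes holds 1/8 of the capacity of Ali's class.
ali_class_capacity = 120
john_class_capacity = ali_class_capacity // 8

# Step 2: John's school has two such classes.
john_school_capacity = 2 * john_class_capacity

# Step 3: Ali's school has a single class, so add the two schools together.
combined_capacity = ali_class_capacity + john_school_capacity

print(john_class_capacity, john_school_capacity, combined_capacity)  # 15 30 150
```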

Less than two years after GSM8K’s release, GPT-4 scored 92% on the benchmark, effectively saturating it.

MATH fared somewhat better. This benchmark, devised by researchers from the University of California, Berkeley, contained 12,500 challenging problems from high-school mathematics competitions. MATH was shown to be difficult even for an average computer science PhD student, who scored around 40%. In comparison, upon its release, state-of-the-art models were scoring just 5%. Below is an example question and answer:

Question: What are all values of \(p\) such that for every \(q>0\), we have \(\frac{3(pq^2+p^2q+3q^2+3pq)}{p+q}>2p^2q\)? Express your answer in interval notation in decimal form.

Answer: \([0,3)\).
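A sketch of why this answer holds, reconstructed here rather than taken from the benchmark's reference solution: the numerator factorises, so whenever \(p+q \neq 0\) the left-hand side collapses to

\[
\frac{3(pq^2+p^2q+3q^2+3pq)}{p+q} = \frac{3q(p+3)(p+q)}{p+q} = 3q(p+3).
\]

Dividing both sides of the inequality by \(q>0\) then gives

\[
3(p+3) > 2p^2 \iff 2p^2-3p-9 < 0 \iff (2p+3)(p-3) < 0 \iff -\tfrac{3}{2} < p < 3.
\]

Finally, the fraction must be defined for every \(q>0\), which requires \(p \geq 0\) (if \(p<0\), the choice \(q=-p\) makes the denominator zero). Intersecting the two conditions leaves \(0 \leq p < 3\).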

But by 2024, MATH was going the way of GSM8K, with frontier models achieving 90%+ scores, including 94.8% from OpenAI’s o1.

AI researchers needed a new focus, and some already had their sights set on the ultimate high-school mathematics competition – the International Mathematical Olympiad (IMO).

People onstage holding various countries' flags.
2015 IMO closing ceremony. Image credit: Z3144228 (CC-BY-4.0)

The Ultimate Mathematics Competition

Held annually, the IMO attracts participants from over 100 countries and is widely regarded as the most prestigious mathematical competition in the world. Though only six problems need to be solved over the course of two 4.5-hour sessions, each IMO question is fiendishly difficult.

Unsurprisingly, IMO prize winners are disproportionately likely to go on to produce important mathematical breakthroughs in their careers. For example, the first woman to win the Fields Medal, Maryam Mirzakhani, was an IMO gold medallist. And Terence Tao, one of the most recognised mathematicians on the planet and another Fields Medallist, received bronze, silver and gold IMO medals at the ages of 10, 11 and 12, respectively.

In 2024, a combined system comprising Google’s AlphaProof and AlphaGeometry 2 became the first AI to solve 4 of the 6 problems on the IMO. Though this is equivalent to a silver-medal performance, the test was not conducted under competition rules.

Crucially, each of the questions had to be manually translated into formal mathematical language for the systems to make sense of them, a long and labour-intensive task. However, once this was done, AlphaGeometry 2 – an AI system combining a Gemini-like LLM with a formal engine based on the laws of geometry – solved one problem within 19 seconds. Meanwhile, AlphaProof – which couples an LLM with a reinforcement learning algorithm – solved two algebra problems and one number theory problem in a total of three days.
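To give a flavour of what "formal mathematical language" looks like, here is a deliberately trivial statement written in Lean 4, the proof assistant AlphaProof works in. This is an illustration only: the theorem name `add_comm_example` is made up for this sketch, and genuine IMO problems need far longer formal statements and proofs.

```lean
-- Informal statement: "adding two natural numbers gives the same result in either order."
-- Formal statement and proof, which Lean checks mechanically:
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Once a problem is stated this way, any proposed proof can be accepted or rejected automatically, which is what makes the translation step worth the labour.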

Just a year later, the IMO was no longer a challenging test for the most advanced AI models. Experimental systems from Google DeepMind and OpenAI achieved gold-level performance on the 2025 IMO, both answering 5 of the 6 questions correctly. Importantly, they did so within the competition time limits and in natural language, with no need for manual translation of the questions into formal mathematics. They also implemented large-scale tree search, allowing them to explore thousands of logical branches before committing to a line of reasoning.
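How these systems search is not detailed here, so the following is only a toy sketch of the general idea of exploring many branches before committing to one. It runs a breadth-first search over a tiny made-up puzzle (reach a target number from 1 using "+1" and "*2" steps). The function name, the puzzle and the `max_nodes` budget are all assumptions chosen for illustration, not a description of any real system.

```python
from collections import deque


def shortest_derivation(start, target, moves, max_nodes=10_000):
    """Breadth-first search: return the shortest list of move names from start to target."""
    queue = deque([(start, [])])   # each entry: (current value, moves taken so far)
    seen = {start}                 # values already reached, to avoid re-exploring them
    explored = 0
    while queue and explored < max_nodes:
        value, path = queue.popleft()
        explored += 1
        if value == target:
            return path            # commit only once a complete route is found
        for name, step in moves:
            nxt = step(value)
            # Prune overshoots (valid here because both moves only increase the value)
            # and values that have already been reached.
            if nxt <= target and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None                    # budget exhausted or no route exists


MOVES = [("+1", lambda x: x + 1), ("*2", lambda x: x * 2)]
print(shortest_derivation(1, 10, MOVES))  # ['+1', '*2', '+1', '*2']
```

In a reasoning system the "moves" would be candidate proof steps rather than arithmetic operations, and the tree would be far too large to search exhaustively, so some form of learned guidance would be needed to decide which branches to expand first.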

Real Open Mathematical Problems

These landmark results leave mathematics-based AI benchmarking in a tricky situation. The technology is so advanced that coming up with challenging questions that have clean, simple answers is proving ever more difficult. In addition, the pace of development is so fast that the likelihood of a benchmark lasting more than a year before reaching saturation is dwindling.

This is why new benchmarks are raising the bar significantly, asking AI to solve problems at, and even beyond, the cutting edge of human knowledge.

For example, non-profit research organisation Epoch AI has a benchmark called FrontierMath. This contains over 350 challenging problems, from undergraduate through to early postdoc level, whose answers are known and were derived by humans. Up to now, the best performance has come from GPT-5.4 Pro (xhigh), with a score of 50%, but this is increasing all the time, and the team behind FrontierMath expects the benchmark to saturate in the next year or two.

That prospect is why they devised FrontierMath: Open Problems. Open Problems contains 15 open questions from research mathematics that professional mathematicians have tried and failed to answer. Answers to these questions range from, at the very least, being moderately interesting for some human mathematicians to, at best, representing a major breakthrough. Since its release, only one of the moderately interesting problems has been solved by AI (first by GPT-5.4 Pro, and later by Claude Opus 4.6 (max) and Gemini 3.1 Pro).

Martin Hairer onstage.
Martin Hairer at the 11th Heidelberg Laureate Forum, 2024, in Heidelberg, Germany. Image credit: Kreutzer/HLFF

Just a month after Open Problems’ release, a group of 11 highly distinguished mathematicians (including 2014 Fields Medallist Martin Hairer and 2010 Nevanlinna Prize winner Daniel Spielman) proposed the First Proof challenge. This is a set of 10 mathematical questions, comparable to lemmas in difficulty, that arose naturally in the authors’ own research. Their proofs run to roughly five pages or fewer and had not been shared with anyone.

Daniel Spielman sitting at a table speaking into a microphone.
Daniel Spielman at the 10th Heidelberg Laureate Forum, 2023, in Heidelberg, Germany. Image credit: Flemming/HLFF

The First Proof challenge was a preliminary effort to assess the capabilities of AI systems in solving research-level mathematics questions on their own. Experimental systems from OpenAI and Google DeepMind were the most successful, solving around half of the problems.

Building on these results, the researchers behind First Proof are now concocting a second batch of even more fiendish problems. These are being created, tested and graded between March and June 2026, and will form a formal benchmark.

Is It the End for Mathematical Benchmarking?

But will this benchmark, or indeed FrontierMath: Open Problems, last? If progress remains on its current trajectory, the answer is no. Recently, Google DeepMind’s experimental AI system Aletheia autonomously produced PhD-level research results. Though mathematically obscure – calculating certain structure constants in arithmetic geometry called eigenweights – the result is new, moderately interesting and publishable.

Elsewhere, mathematicians have been applying AI company Harmonic’s reasoning agent Aristotle and competing offerings, as well as state-of-the-art LLMs, to some of the Erdős problems. These are problems of varying difficulty that the prolific Hungarian mathematician Paul Erdős posed at some point in his career but did not solve; there are 1,217 of them and counting, of which 692 remain open. Several solutions to Erdős problems have recently been found and formally verified by AI in rapid succession. Again, these results are new and moderately interesting.

Paul Erdős teaching 10 year old Terence Tao in 1985. Image credit: Billy or Grace Tao (CC-BY-2.0)

Given these announcements and the pace of development up to this point, it doesn’t seem unrealistic to imagine that GPT-10 or Gemini 8 will be producing results that could be described as much more than ‘moderately interesting’, perhaps even significant breakthroughs.

If AI reaches this level of sophistication, humans will no longer be benchmarking AI with mathematics, but treating AI as an active participant in the mathematical process.

The post Why We Are Running Out of Mathematics-Based AI Reasoning Benchmarks originally appeared on the HLFF SciLogs blog.