Math breakthroughs, AI systems & the value of feedback

Until recently, most buzz about “AI + math” fixated on high-school/Olympiad demos rather than research-level mathematics. That work isn’t trivial: going from LLMs parroting Wikipedia (and failing to add two simple numbers accurately) to Olympiad medal-level performance in two years is simply astounding. Milestones like this, and other benchmarks*, matter even if they’re imperfect and gameable, and yes, we can also acknowledge that humans and AI didn’t compete on exactly equal footing at the Olympiad.

Now, over the past year, we've seen a number of flashes of mathematical ability. I'll pick (on) one: the claim that GPT-5 had solved multiple Erdős problems, which was walked back once the “breakthroughs” turned out to be previously known results surfaced by the model rather than genuinely new theorems.
   - Truly Novel? Most probably not!
   - Truly Useful? Absolutely Yes!

Research advances by connecting ideas to the literature; good work is often a careful synthesis, frequently borrowing from adjacent fields. Creativity is, to a large degree, combinatorial. If all LLMs did was reliably map concepts to prior art and to other fields, they could already serve as co-scientists; paired with tools and feedback, they’re useful already. I see this in everyday use.

Now, against that noisy backdrop, let's talk about Terry Tao & Co.'s significant work from last week.

What AlphaEvolve is (and isn’t)
AlphaEvolve is an agentic workflow that uses an LLM to write generators for candidate mathematical objects (functions, sets, packings, etc.), evaluates them with problem-specific verifiers, and iteratively evolves better candidates. It is, as the authors characterize it, effectively "guided exploration at scale", and it occasionally lands on results that match or improve the best known ones. DeepMind’s write-ups emphasize algorithm design more broadly, alongside the mathematics.
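The generate-then-verify pattern is easy to sketch. Here is a toy illustration (the names and the packing-style objective are my own, not AlphaEvolve's actual code): a trivial generator proposes candidate point sets in the unit square, and a problem-specific verifier scores each by its minimum pairwise distance.

```python
import random

def verifier(points):
    """Problem-specific scorer: minimum pairwise distance of a candidate
    point set in the unit square (higher = better packing-style score).
    A real verifier has to control numerical slop carefully."""
    return min(
        ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
        for i, a in enumerate(points)
        for b in points[i + 1:]
    )

def generator(n, rng):
    """Candidate generator. In AlphaEvolve, the LLM writes code playing
    this role; here it is just a random sampler."""
    return [(rng.random(), rng.random()) for _ in range(n)]

rng = random.Random(0)
candidates = [generator(4, rng) for _ in range(200)]
best = max(candidates, key=verifier)
print(round(verifier(best), 3))
```

The whole trick is that `verifier` is cheap and deterministic, so thousands of proposals can be scored honestly without a human in the loop.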

Terry Tao’s group put the system through 60+ problems across analysis, combinatorics, and geometry. These are really hard, PhD-grade problems, mind you. It's hard to quantify, but (since PhDs appear to be the currency here) I'd venture that fewer than 0.1% of PhDs would be able to solve even one of these problems. I'd guess that some of the best practicing mathematicians could address many (but perhaps not all, since they are broad in scope) of them, given a lot of time.

The authors' public comment is measured: many rediscoveries, some improvements, and lots of documented negatives, plus a clear discussion of how fragile verification can be if you allow numerical slop. In other words, the work is serious, but not magic.

The headlines will say “AI finds new paths to unsolved math problems.” But underneath is a credible workflow: propose, score, mutate, and (sometimes) hand off to a proof assistant for formalization. That’s progress, within scope.

AI as a system
Now, for some shameless self-promotion, which I will also deflate by saying that the following is more common sense than profound. What Tao and co. used is the architecture I’ve argued for in a preprint: a hypothesizer, a knowledge synthesizer, tight coupling to tools, and reinforcement signals that reward verified novelty, not just plausible text. Think slow hypothesis generation coupled to fast, deterministic checking, then close the loop.

AlphaEvolve fits that template:

  • Hypothesis generation: LLM proposes code families (search spaces).

  • Knowledge synthesis: Prior literature, prompts, and patterns steer exploration.

  • Tool coupling: Scorers/verifiers (ideally exact; otherwise carefully approximate).

  • Reinforcement/selection: Keeps only candidates that score well.

  • Proof handoff: Pipe to proof systems (Lean/AlphaProof) for formal closure.
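Tying the components above together, here is a minimal evolutionary loop in the same spirit (again a toy sketch under my own naming, not the actual system): mutate survivors, score them with the verifier, and keep only what improves.

```python
import random

def score(points):
    """Verifier: minimum pairwise distance of points in the unit square."""
    return min(
        ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
        for i, a in enumerate(points)
        for b in points[i + 1:]
    )

def mutate(points, rng, step=0.05):
    """Hypothesis generation, toy version: jitter one point, clamped to
    the unit square. In AlphaEvolve, the LLM mutates programs instead."""
    pts = list(points)
    i = rng.randrange(len(pts))
    x = min(1.0, max(0.0, pts[i][0] + rng.uniform(-step, step)))
    y = min(1.0, max(0.0, pts[i][1] + rng.uniform(-step, step)))
    pts[i] = (x, y)
    return pts

def evolve(n=5, pop=20, gens=200, seed=0):
    """Selection loop: keep the top half, refill with mutants."""
    rng = random.Random(seed)
    population = [[(rng.random(), rng.random()) for _ in range(n)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=score, reverse=True)
        survivors = population[: pop // 2]
        population = survivors + [mutate(rng.choice(survivors), rng)
                                  for _ in range(pop - len(survivors))]
    return max(population, key=score)

best = evolve()
print(round(score(best), 3))
```

The missing pieces relative to the real system are, of course, the LLM writing the generator/mutator code itself and the proof-assistant handoff for formal closure.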

The big accomplishment is acceleration and bandwidth: the system makes tinkering at scale feasible, while leaving theorems and creativity to humans + proof assistants. This is the primary way in which AI will accelerate cognitive tasks in the near future.

AlphaEvolve is neither “AI just did mathematics humans couldn’t” nor “just searching”. It is a practical instantiation of AI-as-a-system: a hypothesizer wired to robust tools and selection signals. When the scoring is trustworthy and the proof feedback is good, it can accelerate discovery, document failures, and occasionally, perhaps, surface genuinely new structures... and oh! that assessment is substantiated by the greatest active mathematician himself, whose paper states: "We present AlphaEvolve as a powerful new tool for mathematical discovery, capable of exploring vast search spaces to solve complex optimization problems at scale, often with significantly reduced requirements on preparation and computation time."


-------
* Wherever you are on the spectrum of "AI passed the Turing test", the point is that the 75-year-old Turing test is now considered largely irrelevant. There are reasonable arguments for and against why that is not a big deal, but the same will happen to other "AGI" benchmarks, and the goalposts will continue to move.
