Math-AI: Noam Brown & Panel

NeurIPS 2024 · Math-AI Workshop

OpenAI’s Noam Brown began by stating that progress in AI from 2019 to today has mostly come from scaling data and compute, specifically training compute. While this can keep working, it’s expensive.

He introduced the solution — test-time compute — by first talking about his experience building poker bots almost a decade ago.

After Claudico, his CMU team’s initial attempt at a superhuman poker bot, lost against human poker players, he had a realisation: human players had less pre-training data (fewer hands played), but would spend more time thinking before they made a game decision. The bots, on the other hand, acted almost instantly.

Building in extra “thinking time” (inference-time search) led to success at 2-player poker with Libratus in 2017 and 6-player poker with Pluribus in 2019. In a way, this method continued a decades-long tradition of inference-time compute in game bots: TD-Gammon for backgammon, Deep Blue for chess, and AlphaGo for Go.

By having the model think for 20 seconds in the middle of a hand, you get roughly the same improvement as scaling up the model size and training by 100,000x.

Noam Brown on poker bots

Over the first three years of his PhD, Brown had managed to scale up the model size and pre-training for his poker bot by 100x. Adding more test-time compute was the key to a system that could beat human players within a reasonable overall budget.

Pluribus cost under $150 to train and could run on 28 CPU cores. This was significantly cheaper than building a comparably strong system purely through training-time scaling with minimal test-time compute.

A speaker addresses an audience from behind a lectern.

Noam Brown

Since then, there has been other research establishing similar trade-offs — “scaling laws” — between train-time and test-time compute in other games.

OpenAI’s o1 model applies this technique to reasoning in natural language, using reinforcement learning to generate a chain of thought before answering. Brown described o1 as “a very effective way to scale the amount of compute used at test time and get better performance”.

He noted that while the o1 approach is good, it might not be the best way to do this, and recommended that other researchers seriously consider the benefits of additional test-time compute in natural language reasoning. For researchers wanting to get an intuition for the process behind o1, he pointed to the corresponding release blog post — especially the chain of thought examples published there.
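To make the idea concrete, here is a minimal sketch of one simple way to spend extra compute at test time on a natural-language reasoning problem: sample several chains of thought and majority-vote over the final answers (often called self-consistency). This is not how o1 works internally, and `generate_chain_of_thought` is a hypothetical placeholder for a call to whatever model you are using.

```python
from collections import Counter

def generate_chain_of_thought(question: str) -> tuple[str, str]:
    """Hypothetical stand-in for a sampled language-model call that returns
    (reasoning_trace, final_answer). Swap in whatever model API you use."""
    raise NotImplementedError

def answer_by_majority_vote(question: str, n_samples: int = 16) -> str:
    """Spend more compute at test time by sampling several chains of thought
    and taking a majority vote over the final answers (self-consistency)."""
    answers = [generate_chain_of_thought(question)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Raising `n_samples` trades extra inference compute for accuracy on problems where independent reasoning traces tend to converge on the correct answer.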

This approach works well for some problems, but not as well for others, and Brown introduced the concept of a “generator-verifier gap” to categorise promising problem domains.

Picture of a slide titled "The generator-verifier gap". The slide states: For some problems, verifying a good solution is easier than generating one.
Examples where verification is easier than generation: Many puzzles, Math, Programming.
Examples where verification isn't much easier: Information retrieval, Image recognition.
When a generator-verifier gap exists and we have a good verifier, we can spend more compute on inference to achieve better performance.

The generator-verifier gap / Noam Brown's slides
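Where a cheap, reliable verifier exists (unit tests for code, a proof checker for formal maths, an exact answer check for a puzzle), extra inference compute can be spent even more directly. The sketch below, with hypothetical `generate_candidate` and `verify` callables, illustrates the basic generate-and-verify loop the slide alludes to; it is an illustration of the idea, not Brown's implementation.

```python
from typing import Callable, Optional

def best_of_n(problem: str,
              generate_candidate: Callable[[str], str],
              verify: Callable[[str, str], bool],
              n: int = 64) -> Optional[str]:
    """Sample up to n candidate solutions and return the first one the
    verifier accepts. This only helps when verifying a solution is much
    easier than generating one (the generator-verifier gap)."""
    for _ in range(n):
        candidate = generate_candidate(problem)   # expensive: e.g. one model sample
        if verify(problem, candidate):            # cheap: e.g. run the checker
            return candidate
    return None  # no verified solution within this compute budget
```

Increasing `n` is a direct way to trade more inference compute for better performance, but it only pays off when the verifier is substantially more reliable than the generator.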

Brown clearly stated that there is still room to push inference compute much further, and that “AI can and will be more than chatbots”. Specifically, he expects that AI with additional inference compute will “enable new scientific advances by surpassing humans in some domains while still being much worse in others”.

Brown’s talk was followed by a highly anticipated panel session. Throughout, there was a line of people outside the room waiting to get in.

The panel included mathematics professor Junehyuk Jung as well as invited speakers Dawn Song, Noam Brown, and Jeremy Avigad, and was moderated by Swaroop Mishra from Google DeepMind.

A packed audience facing five speakers seated behind a table on stage.

Panel session at the NeurIPS 2024 Math-AI workshop

A few highlights from the panel discussion:

  • Progress has been extremely fast: in 2022, no one would have believed that the MATH benchmark would have been saturated by 2024.
  • Advice for anyone designing new benchmarks: have a very wide gradient of difficulty. Include some low-end examples (what models can do today) and some high-end examples (things we’re not even sure are possible).
  • On rating current AI mathematics approaches from 0 to 10: “with a little bit of human intervention we are already at 7 or 8 out of 10. With zero human intervention… less than one”.
  • OpenAI’s research is focused on informal (rather than formal) mathematics: they expect progress in this area to be more generally useful.
  • On autoformalisation: difficult and (in Lean) limited to the scope of mathlib.
  • There were general expectations that mathematical practice will change in years to come.
  • Even with AI systems that are better at theorem-proving than the best humans, humans would still have an important role to play. This has already been seen in chess and Go, where humans can be effectively paired with AI systems.

Summaries of the workshop’s afternoon sessions will be posted on this blog over the next few days.