AI for Math workshop: emergency slides, hammers, LLMs, auto-formalisation, and synthetic data

ICML 2024 · Read our full coverage of this year's conference  

The ICML 2024 AI for Math(s) workshop (ICML page / workshop website) was auspiciously timed, starting the morning after Google DeepMind announced their IMO silver-medal-level systems, AlphaProof and AlphaGeometry 2 (Google DeepMind blog).

A packed crowd in a conference hall, facing a screen with an ICML logo on it

A packed crowd for the AI for Math workshop

There were many great talks and posters in this workshop. This post covers the talks up to the lunch break, and will be followed by posts describing the competitions and afternoon talks.

M. Pawan Kumar: Mathematical Discoveries from Program Search with Large Language Models

M. Pawan Kumar presented the FunSearch system, which has discovered novel results to difficult problems by searching in function space — through manipulating Python programs — as opposed to searching in solution space directly. An LLM is used to guide the evolutionary search process (generating candidate programs). Interestingly, using a better LLM didn’t tend to improve performance — possibly related to the candidate programs being short, and the prompts being simple.

M. Pawan Kumar presenting FunSearch

Kumar described the requirements for this approach to be problems where we can:

  1. verify the correctness of solutions, and
  2. measure how good a solution is.

Both the cap set problem and the online bin packing problem meet these criteria (the former in a global sense, the latter conditional on a given distribution or dataset), and FunSearch was able to improve on existing results for these problems.

For more on FunSearch, see the Google DeepMind blog post describing the system.

Swarat Chaudhuri: Language Model Agents for Formal Theorem-Proving

Swarat Chaudhuri’s talk introduced some recent trends in LLM agents for formal theorem-proving, and described his lab’s approach which uses reinforcement learning aided by generative models and neuro-symbolic reasoning.

Swarat Chaudhuri presenting a slide titled "Our angle: Sequential Decision-Making"

He gave an overview of his lab’s COPRA agent (arxiv), which combines GPT-4 generation with augmentation, backtracking, and a lemma database. He also gave a preview of an upcoming paper which makes use of cross-language learning (Coq/Lean/Isabelle) to increase dataset diversity (e.g. induction proofs are more common in Coq software verification proofs than in Lean).

Wenda Li: From Pattern Matching and Beyond in Autoformalisation

Wenda Li’s talk started with some examples of the increasing interest in formal proofs from within the mathematics community, citing examples such as the Liquid Tensor Experiment and the Polynomial Freiman-Ruzsa conjecture.

While formal proofs have many advantages, manual formalisation is very labour-intensive and formal proofs are often less legible than informal proofs. Auto-(in)formalisation — being able to move from informal proofs to formal proofs and vice versa in an automated way — would mitigate these downsides.

Wenda Li presenting a slide titled "Applying autoformalisation: guiding provers with informal proofs"

Towards the end of the talk, Li covered the challenges of auto-formalisation, including the difficulty of mathematical formulation. As an example of this, he showed the IMO 2024 P5 — one of the problems AlphaProof failed to solve — and the formalised version (78 lines of Isabelle code) that took him over two hours to formulate.

Kun Zhang: Causal Representation Learning: Uncovering the Hidden World

Kun Zhang’s talk gave an overview of causal representation learning techniques — learning a causal representation of a system, often in the form of a directed causal graph —using examples from archeology and psychology.

Kun Zhang presenting a slide titled "Linear, Gaussian Case: With Rank Deficiency Constraints"

Piotr Miłoś: Hammers and Transformers

Piotr Miłoś’s talk introduced hammers — tools that “integrate proof assistants with external automated-theorem-provers (ATPs) to aid in finding and applying premises and proof methods”. One such commonly-used tool in the Isabelle language is called Sledgehammer.

Piotr Milos presenting a slide titled "Thor - method"

Miłoś’s lab developed two new methods:

  • Thor — a hybrid neuro-symbolic method, wihch leverages both learned methods (LLMs) and symbolic ATPs.
  • Magnushammer — an alternative to hammers, using pre-trained transformers and contrastive learning.

Albert Q. Jiang: Synthetic Data in Mathematical Reasoning

Albert Q. Jiang’s talk started with a brief mention of two recent successes in automated mathematical reasoning using LLMs — Google DeepMind’s IMO silver medal and NuminaMath’s AIMO Progress Prize win.

However, he pointed out that LLMs are auto-regressive models — they generate new tokens based on past tokens — while many mathematical problems and datasets are structured in ways that are not well-suited to auto-regressive approaches.

For example, mathematicians generally write down and publish a final result which does not include their intermediate attempts or thinking process. After they build something, they tend to remove the scaffolding they used, making it hard to see how they arrived at their final result.

Jiang argued that synthetic data generation can be a key tool in mitigating these issues, and presented several techniques for generating both formal and informal data to this end.

Albert Q. Jiang presenting a slide titled "Emergency Slide 2: AlphaProof autoformalization"

Jiang had taken the time to insert a few 'emergency slides' in response to the AlphaProof/AlphaGeometry news

One example of these techniques is what he calls rewriting: imagine a dataset of question-solution pairs where some written-up solutions start with an answer, followed by an explanation or proof. Rewriting the solution such that the explanation comes first and is followed by the answer makes it easier for auto-regressive models to learn from the data. This way it’s clear that the answer follows from the explanation, and it might be easier for these models to provide an answer once they have had the luxury of using multiple tokens’ compute to generate the explanation.