Meta-Generation Algorithms for LLMs

NeurIPS 2024 · Tutorial

So you’ve trained an LLM — now what? The meta-generation algorithms tutorial introduced various approaches for improving performance by scaling test-time compute.

It started with a brief introduction to strategies for generating sequences of tokens from LLMs, viewed through two lenses: decoding as optimisation and decoding as sampling.
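
To make those two lenses concrete, here is a minimal sketch (ours, not the tutorial's) of a single decoding step over a next-token logit vector; the temperature-zero convention for greedy decoding is an illustrative assumption.

```python
import numpy as np

def decode_step(logits: np.ndarray, temperature: float = 0.0) -> int:
    """One decoding step over a vector of next-token logits."""
    if temperature == 0.0:
        # Optimisation lens: greedily pick the single highest-scoring token.
        return int(np.argmax(logits))
    # Sampling lens: draw from the temperature-scaled softmax distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # shift by max for numerical stability
    probs /= probs.sum()
    return int(np.random.default_rng().choice(len(logits), p=probs))
```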

It then introduced meta-generation, a class of algorithms that:

  1. Take advantage of external information during generation.
  2. Call the generator more than once to search for good sequences.

Meta-generation methods can be sequential (like chain-of-thought), parallel (like rejection sampling), or search-based, and they can incorporate external information, such as feedback from a verifier when generating code.
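
As an illustration, here is a minimal sketch of a parallel meta-generator in the best-of-n / rejection-sampling style. The `generate` and `score` callables are hypothetical stand-ins for an LLM call and an external verifier (e.g. a code test runner); they are not part of the tutorial's material.

```python
from typing import Callable

def best_of_n(
    generate: Callable[[str], str],      # one call to the underlying generator
    score: Callable[[str, str], float],  # external verifier, e.g. a test runner
    prompt: str,
    n: int = 8,
) -> str:
    """Parallel meta-generation: sample n candidates, return the best-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]  # batchable in practice
    return max(candidates, key=lambda c: score(prompt, c))
```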


Sean Welleck giving an overview of meta-generation algorithms

Beyond general inference-time efficiency improvements such as speculative decoding and quantisation, the efficiency concerns for meta-generation depend on the specifics of the strategy.

Efficient meta-generation strategies are those that are amenable to:

  • Parallelisation, allowing for batched sampling of trajectories.
  • Prefix sharing, allowing re-use of key-value caches across multiple model calls.
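
As a concrete illustration of both properties, here is a minimal sketch assuming the Hugging Face transformers API (the model, prompt, and sampling parameters are placeholders): `num_return_sequences` batches several trajectories of one prompt into a single call, and serving engines such as vLLM additionally share the prompt's key-value cache across those trajectories rather than recomputing it per sample.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Q: What is 17 * 24? Let's think step by step.",
                   return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=64,
        num_return_sequences=8,               # eight trajectories in one batch
        pad_token_id=tokenizer.eos_token_id,  # gpt2 defines no pad token
    )

# Strip the shared prompt prefix and inspect each sampled trajectory.
prompt_len = inputs["input_ids"].shape[1]
for seq in outputs:
    print(tokenizer.decode(seq[prompt_len:], skip_special_tokens=True))
```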


Hailey Schoelkopf presenting efficient generation

The tutorial finished with a wide-ranging panel discussion covering OpenAI’s o1 model, the extent to which additional training-time compute can obviate the need for sophisticated inference-time methods, the tractability of meta-generation in domains lacking robust external verification, and the role of academic labs in running compute-constrained experiments.

The role of hardware was also discussed, with one panelist suggesting that more inference-targeted compute (as opposed to hardware optimised for training) would make efficient, large-scale meta-generation easier.


Moderator: Ilia Kulikov. Panelists: Nouha Dziri, Beidi Chen, Rishabh Agarwal, Jakob Foerster, and Noam Brown.

The slides are available on the tutorial website, as is the TMLR Survey Paper on which the tutorial is based.