So you’ve trained an LLM — now what? The meta-generation algorithms tutorial introduced various approaches for improving performance by scaling test-time compute.
It started with a brief introduction to strategies for generating strings of tokens from LLMs, viewed through two lenses: decoding as optimisation and decoding as sampling.
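To make the two lenses concrete, here is a minimal sketch (my own toy illustration, not code from the tutorial) contrasting a greedy step, which treats decoding as optimisation by always picking the highest-scoring token, with a sampling step, which draws from the model's tempered next-token distribution. The `logits` array is a stand-in for a real model's output.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_step(logits):
    """Decoding as optimisation: return the single highest-scoring token."""
    return int(np.argmax(logits))

def sample_step(logits, temperature=1.0):
    """Decoding as sampling: draw a token from the tempered softmax distribution."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Toy next-token logits over a five-token vocabulary.
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])
print(greedy_step(logits))                      # always token 0
print([sample_step(logits) for _ in range(5)])  # mostly token 0, but not always
```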
It then introduced meta-generation as a method that:
- Takes advantage of external information during generation
- Calls the generator more than once to search for good sequences.
Meta-generation methods can be sequential (like chain-of-thought), parallel (like rejection sampling), or search-based, and they can incorporate external information, such as feedback from a verifier when generating code.
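As a concrete illustration of a parallel strategy, the sketch below implements a simple best-of-N loop around a black-box generator and verifier. Both `generate` and `verifier` are hypothetical placeholders (think of an LLM call and a unit-test harness or reward model), not APIs from the tutorial.

```python
import random

def generate(prompt, temperature=0.8):
    """Placeholder for a single LLM call; returns one candidate completion."""
    return f"candidate-{random.random():.3f} for {prompt!r}"

def verifier(prompt, completion):
    """Placeholder external signal, e.g. unit-test pass rate or a reward model score."""
    return random.random()

def best_of_n(prompt, n=8):
    """Parallel meta-generation: sample n candidates and keep the best-scoring one.

    In practice the n generator calls would be batched; a rejection-sampling
    variant would instead discard candidates scoring below a threshold.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(verifier(prompt, c), c) for c in candidates]
    best_score, best_completion = max(scored, key=lambda sc: sc[0])
    return best_completion, best_score

completion, score = best_of_n("Write a function that reverses a list.")
print(score, completion)
```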
Meta-generation builds on a plethora of inference-time efficiency improvements, such as speculative decoding and quantisation, but the efficiency concerns of a given meta-generation strategy depend on its specifics.
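For readers who have not seen speculative decoding, here is a toy sketch of its core accept/reject rule (my own illustration, with made-up distributions): a cheap draft model proposes a token, and the target model keeps it with probability min(1, p_target/p_draft), resampling from the residual otherwise, so that the overall output distribution matches the target model's.

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_draft_token(token, p_target, p_draft):
    """Speculative decoding acceptance rule for a single proposed token.

    p_target and p_draft are the two models' next-token distributions.
    Accept with probability min(1, p_target[token] / p_draft[token]);
    on rejection, resample from the normalised residual distribution,
    which keeps the output distribution identical to the target model's.
    """
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))

# Toy next-token distributions over a four-token vocabulary.
p_draft = np.array([0.70, 0.10, 0.10, 0.10])
p_target = np.array([0.40, 0.30, 0.20, 0.10])
proposed = int(rng.choice(4, p=p_draft))
print(accept_draft_token(proposed, p_target, p_draft))
```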
Efficient meta-generation strategies are those that are amenable to:
- Parallelisation, allowing for batched sampling of trajectories.
- Prefix sharing, allowing re-use of key-value caches across multiple model calls (see the sketch after this list).
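To show what these two properties look like in practice, the sketch below uses vLLM's offline inference API as one concrete, assumed choice of serving stack (the tutorial does not prescribe it): requesting several samples per prompt gives batched trajectory sampling, and enabling prefix caching lets the shared prompt's key-value cache be reused across those samples. The model name is illustrative and argument details may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Prefix caching re-uses the KV cache of the shared prompt across samples;
# n=8 requests eight trajectories per prompt, sampled as one batched call.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(n=8, temperature=0.8, max_tokens=256)

prompt = "Write a Python function that checks whether a string is a palindrome."
outputs = llm.generate([prompt], params)

for completion in outputs[0].outputs:
    print(completion.text[:80])
```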
The tutorial finished with a wide-ranging panel discussion, covering OpenAI’s o1 model, the extent to which additional training-time compute can obviate the need for sophisticated inference-time methods, the tractability of meta-generation in domains lacking robust external verification, and the role of academic labs in running compute-constrained experiments.
The role of hardware was also discussed, with one panellist suggesting that more inference-targeted compute (as opposed to hardware optimised for training) would make efficient, large-scale meta-generation easier.
The slides are available on the tutorial website, as is the TMLR Survey Paper on which the tutorial is based.