


The Thinking Machines Lab publication addresses the challenge of achieving reproducible large language model (LLM) inference, noting that even with "greedy sampling" (temperature set to 0), results are often nondeterministic. It dismisses the common "concurrency + floating point" hypothesis as incomplete: floating-point non-associativity is indeed the fundamental cause of numerical differences, since changing the order of operations changes the result, but from a user's perspective the nondeterminism is traced to the lack of batch invariance in kernels, meaning an individual request's output can vary with the batch size, which fluctuates with server load. The article then details strategies for making key LLM operations (RMSNorm, matrix multiplication, and attention) batch-invariant, demonstrates truly deterministic LLM inference, and highlights benefits such as enabling "true on-policy reinforcement learning."
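The non-associativity point can be seen without any GPU at all. A minimal sketch (not taken from the article's code) showing that regrouping a floating-point sum changes its value:

```python
# Floating-point addition is not associative: (a + b) + c != a + (b + c).
# Here 0.1 is absorbed into 1e20's rounding when added to it first.
a, b, c = 0.1, 1e20, -1e20

grouped_left = (a + b) + c   # 0.1 + 1e20 rounds to 1e20, so the result is 0.0
grouped_right = a + (b + c)  # 1e20 - 1e20 is exactly 0.0, so the result is 0.1

print(grouped_left)   # 0.0
print(grouped_right)  # 0.1
```

The same effect, scaled up to the thousands of additions inside a matrix-multiply or attention kernel, is why two runs that reduce in different orders produce bitwise-different outputs.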
By Steven