Personal Podcast

Defeating Nondeterminism in LLM Inference



This post from Thinking Machines Lab examines why large language model (LLM) inference fails to produce reproducible results, a problem termed nondeterminism in LLM inference. The common explanation blames floating-point non-associativity combined with concurrent execution on GPUs, but the authors argue that this "concurrency + floating point" hypothesis is incomplete: core LLM operations are typically run-to-run deterministic. The real culprit, from a user's perspective, is the lack of "batch invariance" in crucial kernels such as RMSNorm, matrix multiplication, and attention: an individual request's output depends on the batch size, which varies unpredictably with concurrent user load. To defeat this nondeterminism, the post implements batch-invariant versions of these kernels and shows that the resulting numerical consistency between sampling and training enables truly on-policy reinforcement learning.
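A minimal sketch of the two ideas in the summary (illustrative only, not code from the article; NumPy is assumed): floating-point addition is non-associative, and a kernel is batch-invariant when a row's result is bitwise identical whether it is computed alone or inside a larger batch.

```python
import numpy as np

# 1) Floating-point addition is non-associative: summing the same
#    numbers in a different order can give a different result.
x = 2.0 ** 53                                # above 2^53, doubles are spaced 2 apart
assert (x + 1.0) + 1.0 != x + (1.0 + 1.0)    # each lone +1.0 is rounded away

# 2) Batch invariance: the same request's output should be bitwise
#    identical regardless of batch size. A kernel whose reduction or
#    tiling strategy changes with batch size fails this check.
rng = np.random.default_rng(0)
a = rng.standard_normal((128, 64)).astype(np.float32)  # 128 "requests"
b = rng.standard_normal((64, 32)).astype(np.float32)   # weight matrix

row_alone = a[:1] @ b        # first request processed with batch size 1
row_in_batch = (a @ b)[:1]   # the same request inside a batch of 128

# Bitwise equality here means this particular BLAS call happened to be
# batch-invariant for this shape; GPU kernels with size-dependent
# reduction orders often are not.
print(np.array_equal(row_alone, row_in_batch))
```

The point of the second check is that the two results are always numerically close, but only a batch-invariant kernel guarantees they are bit-for-bit equal, which is what reproducibility from the user's perspective requires.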


==============



Code content percentage: 4.661%

Total text length: 37718 characters

🔗 Original article: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

📋 Monday item: https://omril321.monday.com/boards/3549832241/pulses/10045252264


Personal Podcast, by John Doe