Explained: Defeating Nondeterminism in LLM Inference
Why the largest seed round in history might already pay off
TL;DR Executive Summary
The Problem: Your "deterministic" LLMs (temperature=0) aren't actually deterministic. Run the same prompt 1,000 times and you'll get dozens of distinct outputs - not because of hidden randomness, but because the floating-point arithmetic the server performs changes depending on how requests get batched together.
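To see why batching changes the arithmetic at all, recall that floating-point addition is not associative: summing the same numbers in a different grouping can change the result in the last bits, and batch size changes how kernels group their reductions. A minimal NumPy sketch (my own illustration, not code from the paper):

```python
import numpy as np

# Floating-point addition is not associative: the same values summed in a
# different grouping can give a slightly different result.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

s1 = np.sum(x)                                    # one reduction order
s2 = np.sum(x.reshape(100, 1000), axis=1).sum()   # same values, different grouping

# These often differ in the low-order bits; exact behavior depends on the
# reduction strategy of your NumPy/BLAS build.
print(s1 == s2, abs(float(s1) - float(s2)))
```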
Why It Matters:
Broken evaluations: Benchmark scores vary by up to ±5% based on server load, not model quality
Failed debugging: Can't reproduce customer-reported edge cases because batch configurations changed
Compliance risk: Regulated industries can't guarantee consistent AI behavior for audits
Wasted money: A/B tests and model comparisons are contaminated by phantom variance
Training instability: RLHF and online learning silently fail due to train/inference distribution mismatches
The Fix: Thinking Machines Lab (the company founded by Mira Murati, former CTO of OpenAI) identified that batch-size variations break numerical consistency in three key operations (normalization, matrix multiplication, attention). They built batch-invariant versions of these kernels that achieve perfect reproducibility - 1,000 identical runs now produce 1,000 identical outputs.
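Here is a minimal PyTorch sketch of the batch-invariance problem being solved (my own illustration, assuming a CUDA GPU; not code from the paper): the same row multiplied by the same weights can come out slightly different depending on the size of the batch it is computed inside, because the kernel picks a different tiling/reduction strategy for each shape.

```python
import torch

# Same row, same weights - computed once as a "batch" of 1 and once inside a
# batch of 2048. Many GPU matmul kernels choose different reduction strategies
# per shape, so the two results can differ in the low-order bits.
torch.manual_seed(0)
A = torch.randn(2048, 2048, dtype=torch.bfloat16, device="cuda")
B = torch.randn(2048, 2048, dtype=torch.bfloat16, device="cuda")

row_alone   = torch.mm(A[:1], B)     # the row on its own
row_batched = torch.mm(A, B)[:1]     # the same row inside a larger batch

# Prints 0 only if the kernel is batch-invariant; on typical hardware it isn't.
print((row_alone - row_batched).abs().max().item())
```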
The Trade-off: Deterministic inference is ~60% slower than current methods. That's the price of correctness.
What You Should Do:
Test your systems: Run the same prompt 100 times at temperature=0 and count the unique outputs (see the sketch after this list)
For critical applications: Consider adopting batch-invariant kernels (now open source) despite performance costs
For general use: Demand deterministic mode options from your inference providers
For evaluation/research: Be aware of this variance; most LLM papers still rely on single-run benchmarks
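As promised above, here is a minimal sketch of the reproducibility test, assuming an OpenAI-compatible chat completions endpoint; the model name and client configuration are placeholders for your own deployment:

```python
from collections import Counter
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()  # set base_url / api_key for your own provider as needed
PROMPT = "Explain, in one paragraph, why the sky is blue."

outputs = []
for _ in range(100):
    resp = client.chat.completions.create(
        model="your-model-name",  # placeholder - substitute your deployment
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        max_tokens=200,
    )
    outputs.append(resp.choices[0].message.content)

counts = Counter(outputs)
print(f"{len(counts)} unique outputs out of {len(outputs)} runs")
# A truly deterministic stack prints "1 unique outputs out of 100 runs".
```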
Bottom Line: We've been accepting broken behavior as inevitable when it's actually fixable. There will be use cases where an LLM that is 60% slower (or correspondingly more costly) is a much better option than a nondeterministic one. Also keep in mind that the proposed kernels will most likely be optimized further in the foreseeable future.
I once sat in a meeting where an executive confidently proclaimed that they'd "fixed" their LLM's unpredictability by setting temperature to zero. The room nodded in agreement - after all, that's what we've all been taught. Greedy sampling equals deterministic outputs, right? Thanks to new research from Thinking Machines Lab, we might finally have an actual solution to this problem - not just one that *feels* right in the absence of better knowledge.