Explained: Defeating Nondeterminism in LLM Inference
Why the largest seed round in history might already pay off
TL;DR Executive Summary
The Problem: Your "deterministic" LLMs (temperature=0) aren't actually deterministic. Run the same prompt 1,000 times and you'll get dozens of different outputs - not because of randomness, but because the math changes based on how requests get batched together on the server.
Why It Matters:
Broken evaluations: Benchmark scores vary by up to ±5% based on server load, not model quality
Failed debugging: Can't reproduce customer-reported edge cases because batch configurations changed
Compliance risk: Regulated industries can't guarantee consistent AI behavior for audits
Wasted money: A/B tests and model comparisons are contaminated by phantom variance
Training instability: RLHF and online learning silently fail due to train/inference distribution mismatches
The Fix: Thinking Machines Lab (the company founded by Mira Murati, former CTO of OpenAI) identified that batch-size variations break numerical consistency in three key operations: normalization, matrix multiplication, and attention. They built batch-invariant versions that achieve perfect reproducibility - 1,000 identical runs now produce 1,000 identical outputs.
The Trade-off: Deterministic inference is ~60% slower than current methods. That's the price of correctness.
What You Should Do:
Test your systems: Run the same prompt 100 times at temperature=0 and count the unique outputs (a minimal probe is sketched after this list).
For critical applications: Consider adopting batch-invariant kernels (now open source) despite performance costs
For general use: Demand deterministic mode options from your inference providers
For evaluation/research: Be aware of run-to-run variance; most LLM papers report single-run benchmark numbers
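Here is a minimal sketch of that probe. It assumes an OpenAI-compatible endpoint (for example a locally hosted vLLM server) and the openai Python client; the base_url, model name, and prompt are placeholders to swap for your own setup.

```python
from collections import Counter

from openai import OpenAI

# Point the client at your own endpoint; "EMPTY" works for servers that ignore API keys.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

outputs = Counter()
for _ in range(100):
    resp = client.chat.completions.create(
        model="your-model-name",  # placeholder for whatever the server is serving
        messages=[{"role": "user", "content": "Tell me about Richard Feynman"}],
        temperature=0,
        max_tokens=200,
    )
    outputs[resp.choices[0].message.content] += 1

# With batch-variant kernels under real server load, expect more than one unique output.
print(f"{len(outputs)} unique outputs across {sum(outputs.values())} greedy runs")
print(outputs.most_common(3))
```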
Bottom Line: We've been accepting broken behavior as inevitable when it's actually fixable. There are use cases where an LLM that is ~60% slower (and therefore more costly) is a far better option than a nondeterministic one. And keep in mind that the proposed solution will almost certainly be optimized further in the foreseeable future.
I once sat in a meeting where an executive confidently proclaimed that they'd "fixed" their LLM's unpredictability by setting temperature to zero. The room nodded in agreement - after all, that's what we've all been taught. Greedy sampling equals deterministic outputs, right? Thanks to new research from Thinking Machines Lab, we might finally have an actual solution to this problem - not just one that 𝘧𝘦𝘦𝘭𝘴 right in the absence of better knowledge.
The uncomfortable truth is that temperature=0 has never guaranteed determinism in practice. Send the same prompt to ChatGPT's API a thousand times with temperature set to zero, and you'll get back dozens of different responses. Run inference on your own hardware with identical inputs, and watch as supposedly deterministic models conjure up variations from the computational ether. This isn't a bug in your code or a quirk of a specific implementation; it's a fundamental issue that affects every major LLM inference engine across GPUs, CPUs, and TPUs alike.
For years, we've accepted this as the cost of doing business with neural networks. "It's just floating-point arithmetic," we'd shrug. "GPUs are nondeterministic by nature." These explanations became so entrenched that they appear in academic papers and technical documentation as accepted fact. But what if the entire field has been looking at this problem through the wrong lens?
The real culprit hiding in plain sight
The research from Thinking Machines reframes our understanding of what causes LLM nondeterminism. The conventional wisdom about "concurrency plus floating-point equals chaos" turns out to be, well, mostly wrong.
The true villain in this story isn't GPU concurrency at all. It's something far more subtle and pervasive: batch invariance failure. Or to put it in simpler terms: when you run the same computation with different batch sizes, you get different results - not just different performance, but numerically different outputs. And since modern inference servers dynamically adjust batch sizes based on load, your "deterministic" model becomes a shape-shifter, morphing its responses based on how many other requests happen to be in flight.
Let me make this concrete with an example that should alarm anyone building production LLM systems. The Thinking Machines team ran 1,000 identical completions of "Tell me about Richard Feynman" through a state-of-the-art model at temperature=0. The result? Eighty unique completions. The most common response appeared only 78 times. Every single completion started identically for the first 102 tokens, then at token 103, reality forked: 992 completions continued with "Queens, New York" while 8 chose "New York City." Same model, same weights, same temperature - different realities.
This isn't just an academic curiosity. If you're running evaluations, your benchmarks are lying to you. If you're doing reinforcement learning, you're implicitly converting on-policy training to off-policy training without realizing it. If you're in a regulated industry that requires deterministic AI behavior, you're potentially out of compliance. And if you're trying to debug why your model occasionally produces weird outputs? Good luck reproducing that edge case.
Understanding the mathematical "original sin"
To truly grasp why this happens, we need to confront what the paper elegantly calls the "original sin" of numerical computing: floating-point arithmetic breaks the basic rules of mathematics we learned in elementary school. Specifically, it violates associativity - the idea that (a + b) + c should equal a + (b + c).
In the pristine world of pure mathematics, these expressions are identical. In the messy reality of finite-precision computing, they're not. A simple example: (0.1 + 1e20) - 1e20 equals zero in floating-point arithmetic, while 0.1 + (1e20 - 1e20) equals 0.1. The order of operations matters, sometimes dramatically. The researchers found that summing the same eight-element array in different orders can produce 102 distinct results - all technically "correct" from a floating-point perspective, but obviously problematic when you're expecting deterministic behavior.
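You can see both effects in a few lines of plain Python (the exact count of distinct sums depends on the values and the shuffle order, so treat the numbers as illustrative rather than as the researchers' figures):

```python
import random

# Associativity breaks: the same three terms, grouped differently.
print((0.1 + 1e20) - 1e20)  # 0.0 -- 0.1 vanishes when absorbed into 1e20
print(0.1 + (1e20 - 1e20))  # 0.1 -- the huge terms cancel first

# Order matters: summing the same eight values in shuffled orders.
vals = [1e-10, 1e-5, 1e-2, 1.0]
vals += [-v for v in vals]  # in exact arithmetic these always sum to zero
random.seed(0)
sums = {sum(random.sample(vals, len(vals))) for _ in range(10_000)}
print(len(sums))  # many distinct floating-point "zeros"
```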
This fundamental issue cascades through every layer of neural network computation. Every matrix multiplication, every normalization, every attention operation becomes a potential source of variation. Modern GPUs, with their massively parallel architectures and dynamic scheduling, turn these theoretical variations into practical nightmares. But here's the crucial insight that shifts the narrative: the problem isn't the parallelism itself - it's how we handle different batch sizes.
The three horsemen of nondeterminism
The research identifies three specific operations that break batch invariance in modern LLMs, and understanding each one reveals just how deep this problem runs.
RMSNorm seems innocent enough - it's just normalizing vectors, after all. But the standard GPU implementation assigns each batch element to a GPU core. When your batch size is small, some cores sit idle, triggering different reduction strategies. Run the same input with batch size 1 versus batch size 8, and you get different numerical results. The fix is conceptually simple - use the same reduction strategy regardless of batch size - but it means accepting performance penalties for small batches. That's a trade-off many production systems won't want to make.
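To make the idea concrete, here is a deliberately naive sketch of a batch-invariant RMSNorm in PyTorch. It is not the researchers' kernel - real implementations fix the parallel reduction tree on the GPU - but it shows the property that matters: each row is reduced the same way no matter how many rows arrive together.

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Toy RMSNorm whose per-row result cannot depend on batch size."""
    out = torch.empty_like(x)
    for i in range(x.shape[0]):  # one row at a time: the reduction strategy is fixed per row
        row = x[i].float()
        mean_sq = torch.dot(row, row) / row.numel()
        out[i] = (row * torch.rsqrt(mean_sq + eps) * weight.float()).to(x.dtype)
    return out
```

A Python loop is obviously not how you would ship this; the point is simply that the per-row reduction no longer depends on how many other rows share the batch.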
Matrix multiplication raises the complexity considerably. Modern GPUs use specialized tensor cores that operate on tiles of data, not individual elements. When batch dimensions are small, implementations often split along the reduction dimension (the dreaded "split-K" strategy) to maintain efficiency. Different batch sizes trigger different tiling strategies, different tensor core instructions, even different accumulation orders. The researchers found that torch.mm(a[:1], b) and torch.mm(a, b)[:1] - which should theoretically be identical - differed by 1669.25. That's not a rounding error, that's a fundamentally different computation.
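You can probe this on your own hardware with a couple of lines of PyTorch (a CUDA GPU and a low-precision dtype make the effect easiest to see; the magnitude of the difference depends on your GPU, dtype, and shapes):

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

out_batched = torch.mm(a, b)[:1]  # row 0, computed as part of the full batch
out_single = torch.mm(a[:1], b)   # row 0, computed with "batch size" 1

# Mathematically identical; numerically they may differ because the two shapes
# can dispatch to different tiling / split-K strategies.
print((out_batched - out_single).abs().max())
```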
Attention, the crown jewel of transformer architectures, compounds all these problems and adds its own unique challenges. It involves two matrix multiplications, reductions over both feature and sequence dimensions, and complex optimizations like chunked prefill and prefix caching. FlashAttention, the current state-of-the-art implementation, uses different processing strategies for prefill versus decode operations. The 1000th token gets processed differently depending on whether there are 0 or 999 tokens already in the key-value cache. Each optimization that makes attention faster also makes it less batch-invariant.
From theory to practice: Making determinism real
Now for the practically relevant part: the Thinking Machines team actually built batch-invariant versions of all three operations and integrated them into vLLM, one of the most popular inference engines. The result: those 80 different Feynman completions collapsed into a single, perfectly reproducible output. One thousand runs, one thousand identical results.
The implementation details reveal both the elegance and the challenges of this approach. For RMSNorm, they simply ignore small-batch optimizations - a straightforward trade-off. For matrix multiplication, they enforce a single kernel configuration across all shapes and avoid split-K strategies, accepting some performance loss when dimensions are small. For attention, they implement a fixed split-size strategy instead of a fixed number of splits, ensuring consistent reduction orders regardless of processing mode.
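The attention fix is easiest to see as a scheduling question: a fixed number of splits makes the chunk boundaries (and therefore the reduction tree) depend on how many tokens are in the KV cache, while a fixed split size keeps them stable. A toy illustration, with hypothetical helper names:

```python
def splits_fixed_count(seq_len: int, num_splits: int = 4) -> list[int]:
    # Chunk sizes when the *number* of splits is fixed: boundaries move as seq_len grows.
    size = -(-seq_len // num_splits)  # ceil division
    return [min(size, seq_len - i * size) for i in range(num_splits) if i * size < seq_len]

def splits_fixed_size(seq_len: int, split_size: int = 256) -> list[int]:
    # Chunk sizes when the *size* of each split is fixed: leading chunks never change.
    return [min(split_size, seq_len - start) for start in range(0, seq_len, split_size)]

print(splits_fixed_count(1000))  # [250, 250, 250, 250]
print(splits_fixed_count(1001))  # [251, 251, 251, 248] -- every boundary shifted
print(splits_fixed_size(1000))   # [256, 256, 256, 232]
print(splits_fixed_size(1001))   # [256, 256, 256, 233] -- only the tail changed
```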
But keep in mind: the performance cost is real and significant. On their test configuration, deterministic inference took 62% longer than standard vLLM - 42 seconds versus 26 seconds for 1,000 sequences. That's not a trivial overhead. For many production systems, that performance penalty might seem unacceptable. Yet the researchers make a compelling argument: what's the cost of incorrect evaluations, unreproducible bugs, or compliance failures? Sometimes, correctness is worth the computational price.
The deeper implications nobody's talking about
The impact on reinforcement learning from human feedback (RLHF) deserves special attention. Current RLHF implementations unknowingly operate in a twilight zone between on-policy and off-policy learning. The model used during training has different numerics than the one used during inference, creating a subtle but persistent distribution shift. The researchers demonstrated this with concrete experiments: without importance weighting to correct for this shift, training rewards collapse partway through. With deterministic inference, the KL divergence between training and inference stays perfectly flat at zero - true on-policy learning becomes possible for the first time.
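For readers who want to see what that correction looks like, here is a hedged sketch of an importance-weighted policy-gradient loss. The names, shapes, and clamp value are my own; this is the generic off-policy correction, not the researchers' training code.

```python
import torch

def pg_loss(logp_train: torch.Tensor, logp_sampler: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # logp_train:   log-probs of the sampled tokens under the training-time numerics
    # logp_sampler: log-probs of the same tokens as computed by the inference engine
    # When the two stacks differ numerically, the sampler is a slightly different policy,
    # so we reweight by pi_train / pi_sampler (clamped for stability).
    ratio = torch.exp(logp_train - logp_sampler).clamp(max=10.0)
    return -(ratio.detach() * advantages * logp_train).mean()

# With bitwise-identical training and inference numerics, logp_train == logp_sampler,
# the ratio is exactly 1, and this reduces to plain on-policy REINFORCE.
```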
Think about what this means for the entire ecosystem of LLM development. Every benchmark score you've seen might be wrong - not dramatically wrong, but wrong enough to change rankings. Every A/B test comparing models might be contaminated by random variance masquerading as performance differences. Every debugging session trying to track down intermittent failures might be chasing ghosts created by batch size variations.
For regulated industries - healthcare, finance, legal services - this could change the conversation entirely. "Our AI system is deterministic" transforms from an impossible claim to an achievable engineering goal. Audit trails become meaningful. Edge cases become reproducible. The "black box" becomes, if not transparent, at least consistent.
The uncomfortable questions we need to ask
As important as this research is, it raises uncomfortable questions about the current state of AI infrastructure. Why did it take until 2025 for someone to properly diagnose this problem? How many research papers have reported results that were actually within the noise floor of batch-variance-induced nondeterminism? How many production systems are making critical decisions based on models that might give different answers depending on server load?
There's also the question of adoption. The researchers have open-sourced their batch-invariant kernels, but will the major frameworks integrate them? Will PyTorch, TensorFlow, and JAX offer deterministic modes? Will cloud providers offer deterministic inference endpoints, even at a premium price? Or will we continue to accept nondeterminism as the price of performance, filing this research away as an interesting but impractical academic exercise?
What happens next: A practical path forward
The optimist in me sees this research as a turning point. Thinking Machines Lab - barely nine months old, $2 billion in seed funding - has chosen to make their first major contribution fully open source. That's a statement of intent about the kind of research culture they're building.
The pragmatist recognizes the significant engineering challenges ahead. Achieving batch invariance in distributed multi-GPU settings remains an open problem. The performance penalties need optimization - that 62% overhead won't fly in many production environments. Hardware vendors need to design future accelerators with determinism in mind, not just raw throughput.
But the researcher in me is most excited about the new possibilities this unlocks. True on-policy reinforcement learning. Perfectly reproducible ablation studies. Deterministic model surgery and intervention. We can finally build rigorous unit tests for neural networks, with exact numerical expectations rather than fuzzy tolerances. We can debug edge cases that previously disappeared into the quantum foam of nondeterminism.
The message is clear: determinism is achievable, but it requires intentional engineering choices. Start by understanding where your systems break batch invariance. Measure the actual nondeterminism in your pipelines - you might be surprised by how variable your "deterministic" models really are. Consider whether the performance trade-offs are worth it for your use case. And most importantly, stop accepting nondeterminism as inevitable.

