🤗 Reinforcement Learning Without Human Feedback
The Reinforcement Learning hype just won't stop - another week full of RL papers
Welcome, Watcher! This week’s highlights dive into self-supervised adaptation, pre-computation acceleration, and a reality check on RL for reasoning.
First up, TTRL (Test-Time Reinforcement Learning) shows how LLMs can bootstrap their own learning at inference: a majority vote over multiple candidate answers serves as a pseudo-label, and the resulting binary rewards update the policy on the fly - delivering up to a 159% boost on hard math benchmarks without any human annotations.
Next, Sleep-time Compute borrows idle cycles to “pre-think” about contexts before questions arrive, slashing test-time token usage by 5x, cutting per-query cost by 2.5x when amortized across related prompts, and boosting accuracy by up to 18% - all by trading offline compute for faster, cheaper real-time answers.
Finally, the “Reasoning Capacity” study reveals that reinforcement learning with verifiable rewards (RLVR) doesn’t actually unlock new chains of thought but simply steers LLMs toward already-known high-reward paths, suggesting we need smarter distillation or exploration-driven methods to truly expand machine reasoning.
Together, these papers advance how models learn on the fly, make interactions more efficient, and force us to rethink the very foundations of LLM training.
Don’t forget to subscribe to never miss an update again!
Courtesy of NotebookLM
1. TTRL: Test-Time Reinforcement Learning
Watching: TTRL (paper/code)
What problem does it solve? Large language models (LLMs) shine on reasoning tasks when fine-tuned with labeled examples, but in real-world use you often face a stream of new, unlabeled problems - no “right answers” are available to tell the model when it’s correct. Without ground-truth labels, reinforcement learning (RL) can’t compute rewards, so the model can’t adapt or improve at test time, limiting its ability to learn continually from fresh data.
How does it solve the problem? They introduce Test-Time Reinforcement Learning (TTRL), which turns the model’s own guesses into a self-supervised training signal at inference time. For each test prompt, TTRL samples multiple candidate answers, takes a simple majority vote among them as a “pseudo-label,” and assigns a reward of 1 to every sampled answer that matches the pseudo-label (0 otherwise). These binary rewards then drive policy updates (via PPO or GRPO) on the fly - no human annotations required.
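To make the reward construction concrete, here is a minimal sketch of the majority-vote step, assuming the sampled rollouts have already been reduced to answer strings (the answer-extraction step and the PPO/GRPO update around it are omitted; this is an illustration, not the authors’ code):

```python
from collections import Counter

def ttrl_rewards(answers):
    """Majority-vote pseudo-labeling as described in TTRL (sketch).

    answers: final answers extracted from N sampled rollouts for one test
    prompt. Returns the pseudo-label and one binary reward per rollout.
    """
    # The most frequent answer among the rollouts becomes the pseudo-label.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    # Reward 1 for rollouts that agree with the pseudo-label, 0 otherwise.
    rewards = [1.0 if ans == pseudo_label else 0.0 for ans in answers]
    return pseudo_label, rewards

# Example: 8 sampled answers to the same math prompt
answers = ["42", "42", "41", "42", "7", "42", "41", "42"]
label, rewards = ttrl_rewards(answers)
print(label, rewards)  # "42" [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

These per-rollout binary rewards then play the role that ground-truth rewards would normally play in the RL update.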
What are the key findings? When applied to math benchmarks like AIME 2024, AMC, and MATH-500 with Qwen-2.5-Math-7B and LLaMA-3, TTRL delivers dramatic gains: up to a 159% increase in pass@1 on AIME and an average 84% lift across tasks using only unlabeled test data. Remarkably, TTRL often exceeds its own majority-voting “upper bound” and approaches performance of models trained with true labels. Gains grow with model size, hold across RL algorithms, and transfer to out-of-distribution problems.
Why does it matter? TTRL shows that LLMs can effectively “bootstrap” their own learning, massively reducing the need for costly human labels. This paves the way for AI systems that adapt in real time to evolving data streams - think customer support bots or scientific assistants that learn on the fly. By enabling continual, self-supervised RL, TTRL points toward a future of truly autonomous, lifelong learning in large models.
2. Sleep-time Compute: Beyond Inference Scaling at Test-time
Watching: Sleep-time Compute (paper/code)
What problem does it solve? Large language models (LLMs) can “think longer” at test-time to solve hard puzzles, but that makes them slow and costly - especially when many related questions reuse the same background. Right now, every new question forces the model to redo the same context reasoning from scratch, ballooning latency and inference bills.
How does it solve the problem? The researchers introduce “sleep-time compute,” where the model uses idle time to pre-think about a given context before any question arrives. Concretely, they split reasoning datasets (Stateful GSM-Symbolic and Stateful AIME) into “context” and “query,” prompt the model offline to generate useful inferences (a re-represented context), and then answer real queries with a much smaller test-time budget. They also build a Multi-Query GSM-Symbolic benchmark to amortize pre-computed work across related questions.
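A minimal sketch of the two-phase idea, assuming a generic `llm(prompt, max_tokens)` completion function rather than the authors’ actual pipeline; the prompt wording is illustrative only:

```python
def sleep_time_precompute(llm, context, budget=2048):
    """Offline phase: 'pre-think' about a context before any query arrives."""
    prompt = (
        "Context:\n" + context + "\n\n"
        "List the facts, intermediate results, and inferences that are likely "
        "to be useful for answering future questions about this context."
    )
    # The output is a re-represented context that can be cached and reused.
    return llm(prompt, max_tokens=budget)

def answer_query(llm, context, precomputed, query, budget=256):
    """Online phase: answer with a much smaller test-time budget by reusing
    the pre-computed notes; the same notes are shared (amortized) across
    all queries about this context."""
    prompt = (
        "Context:\n" + context + "\n\n"
        "Notes prepared earlier:\n" + precomputed + "\n\n"
        "Question: " + query + "\nAnswer concisely."
    )
    return llm(prompt, max_tokens=budget)
```

The offline budget is spent once per context, while every subsequent query pays only the small online budget - which is where the amortization gains come from.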
What are the key findings? Sleep-time compute pushes out the compute-accuracy frontier: it slashes required test-time tokens by about 5x for the same accuracy on both Stateful GSM-Symbolic and AIME. By scaling offline compute, they boost accuracy by up to 13% (GSM-Symbolic) and 18% (AIME). When multiple questions share a context, amortization cuts average cost per query by 2.5x. They further show that more predictable queries benefit most and validate gains in a realistic code-editing agent task.
Why does it matter? This work adds a new axis to LLM scaling - trading idle “sleep” compute for faster, cheaper real-time answers - much like pre-fetching or caching in classic systems. It makes interactive AI more practical in cost-sensitive, low-latency settings and suggests fresh avenues for learning natural-language representations by pre-computing reasoning ahead of user queries.
3. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Watching: Reasoning Capacity (paper/code)
What problem does it solve? Large Language Models (LLMs) have recently been fine-tuned via Reinforcement Learning with Verifiable Rewards (RLVR) to boost their reasoning in domains like math and coding. It’s widely assumed that RLVR lets these models discover “new” reasoning skills beyond what their original, or base, model could do. This paper questions that assumption by asking: does RL actually expand an LLM’s reasoning boundary, or does it simply make existing skills easier to sample?
How does it solve the problem? They adopt a rigorous “pass@k” evaluation - measuring the chance that an LLM solves a problem in k tries - across multiple model families, tasks (math, code, visual reasoning), and RL algorithms. By sweeping k to large values, they probe each model’s true upper bound of solvable problems. They also analyze the perplexity of reasoning traces to check if RL introduces novel chains of thought and compare RLVR against knowledge distillation from a stronger teacher.
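As a reference point, pass@k is usually computed with the unbiased estimator introduced with Codex (Chen et al., 2021): draw n completions per problem, count the c correct ones, and estimate the probability that at least one of k draws succeeds. A minimal sketch, not taken from the paper’s own code:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    sampled completions is correct, given c correct out of n samples."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 correct completions out of 64 samples drawn for a problem
print(round(pass_at_k(64, 4, 1), 4))   # 0.0625 - single-sample accuracy
print(round(pass_at_k(64, 4, 16), 4))  # ~0.69 - much higher given 16 tries
```

Sweeping k from 1 to large values is what separates “easier to sample” (gains at small k) from a genuinely larger set of solvable problems (gains that persist at large k).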
What are the key findings? Contrary to expectations, RLVR does not elicit fundamentally new reasoning patterns. Rather, it biases the model toward already-existing high-reward solution paths, boosting single-sample accuracy (small k) but reducing exploration and narrowing the set of solvable problems at large k. Base models, given enough sampling, match or exceed RL-trained models’ pass@k scores. By contrast, distillation genuinely injects new reasoning paths and expands the reasoning boundary.
Why does it matter? RLVR has been heralded as a route to continuously self-improving, reasoning LLMs. Showing that it merely reshuffles existing knowledge and curtails exploration forces us to rethink LLM training strategies. It highlights the need for alternative paradigms - such as smarter distillation or exploration-boosting methods - to truly push the frontiers of machine reasoning.
Papers of the Week:
Scaling sparse feature circuit finding for in-context learning
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
Continual Pre-Training is (not) What You Need in Domain Adaption
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models