DeepSeek Strikes Again As OpenAI's Valuation Skyrockets
The trend of cheaper, more efficient AI continues
In this issue:
AI generalization: getting more for less
3x more efficient test-time scaling
Combining retrieval-augmentation and reasoning
1. DeepSeek-GRM: Inference-Time Scaling for Generalist Reward Modeling
Watching: DeepSeek-GRM (paper)
What problem does it solve? Large Language Models (LLMs) trained with Reinforcement Learning (RL) need guidance on what outputs are good or bad, which traditionally comes from reward models that score their responses. However, current reward models face significant limitations: they either work well only on specific domains (like math problems) where clear rules exist, or they struggle to provide consistent feedback across diverse general topics. Think of reward models like judges in a talent show - the current ones are specialists who only excel at judging specific categories, while what we need are versatile judges who can evaluate performances across all categories. Additionally, these specialized judges can't efficiently use extra time to make better decisions, while human judges naturally improve their evaluations when given more deliberation time. This paper tackles the challenge of creating "generalist reward models" that can both work across diverse domains and effectively use additional computation time to produce better judgments.
How does it solve the problem? Self-Principled Critique Tuning (SPCT) is a new method that teaches reward models to first generate evaluation principles based on the context before making judgments. Imagine teaching someone to judge cooking competitions by first having them articulate what makes a good dish in that specific category (e.g., texture, flavor balance, presentation) before scoring contestants. Their model, DeepSeek-GRM, learns to generate custom evaluation criteria for each query and response. They implemented a two-phase training approach: first "rejective fine-tuning" where the model learns appropriate formats and criteria, then rule-based reinforcement learning where it's rewarded for generating principles that lead to accurate judgments. For inference-time scaling (using more compute during evaluation), they sample multiple sets of principles and critiques for each judgment task, then either vote on the results or use another model (a meta-RM) to determine which samples to trust more heavily.
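To make the aggregation step concrete, here is a minimal Python sketch of SPCT-style inference-time scaling. It is an illustration of the sampling-and-voting idea only, not the authors' implementation: `grm.generate_judgment` and `meta_rm.confidence` are assumed wrappers around the generative reward model and the meta-RM.

```python
# Minimal sketch of SPCT-style inference-time scaling (illustrative, not the
# authors' code). `grm.generate_judgment` is assumed to sample a tuple of
# (principles, critique, score) for a response; `meta_rm.confidence` is assumed
# to rate how trustworthy a sampled judgment is.
from collections import defaultdict

def score_response(grm, meta_rm, query, response, k=8, use_meta_rm=True):
    """Sample k principle/critique sets and aggregate scores by (weighted) voting."""
    votes = defaultdict(float)
    for _ in range(k):
        principles, critique, score = grm.generate_judgment(query, response)
        weight = meta_rm.confidence(query, response, principles, critique) if use_meta_rm else 1.0
        votes[score] += weight
    # The score with the largest (weighted) vote mass wins; increasing k trades
    # extra inference compute for a more reliable judgment.
    return max(votes, key=votes.get)
```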
What are the key findings? DeepSeek-GRM-27B significantly outperformed existing reward models without showing severe biases toward particular domains. The most striking discovery was its exceptional inference-time scalability - unlike previous approaches, its performance consistently improved when generating more samples during evaluation. When allowed to generate 32 different sets of principles and critiques per query, their 27B parameter model matched or even outperformed much larger models with up to 671B parameters. This suggests that scaling inference compute (generating more samples at evaluation time) can actually be more effective than scaling model size, which requires much more training compute. The researchers also found that the principle generation capability was crucial both for model performance and for effective inference-time scaling.
Why does it matter? This might not be enough for another "DeepSeek moment", but their findings represent a breakthrough for reward alignment techniques. First, they show we don't necessarily need bigger models for better reward modeling - we can instead use more compute during inference time, which is much more economical and accessible. Think of it like showing that having a moderately skilled judge deliberate carefully often leads to better decisions than having a highly credentialed judge make quick assessments. Second, by generating explicit principles, the model creates transparent, interpretable criteria for its judgments, making the evaluation process less of a black box. LLMs will most likely continue expanding into critical applications from healthcare to education, so having reward models that can accurately evaluate responses across diverse domains will become crucial for ensuring these systems remain helpful, harmless, and honest.
2. Z1: Efficient Test-time Scaling with Code
What problem does it solve? LLMs have demonstrated impressive reasoning capabilities on complex tasks through test-time compute scaling. However, this often comes at significant computational cost, as models rely on elaborate reasoning processes that generate very long contexts and consume large numbers of thinking tokens. Existing models like DeepSeek R1 use delimiters to enforce a "think-before-answer" pattern, but this rigid approach causes models to overthink simple problems that don't require deep reasoning. The paper addresses a crucial question: "Is there an efficient way to reduce thinking token consumption while preserving reasoning performance?"
How does it solve the problem? The authors introduce an efficient test-time scaling method that trains LLMs to adjust their reasoning based on problem complexity. First, they created Z1-Code-Reasoning-107K, a dataset of simple and complex coding problems paired with both short and long solution trajectories. Second, they developed a novel "Shifted Thinking Window" approach that eliminates rigid context-splitting delimiters and instead caps maximum thinking tokens, appending a hint phrase if reasoning exceeds the threshold. This approach allows models to use minimal reasoning for simple problems while engaging in deeper thought for complex ones. They fine-tuned Qwen2.5-Coder-7B-Instruct with this dataset to create Z1-7B, a model that dynamically adjusts its reasoning depth based on problem complexity.
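A rough sketch of how such a capped thinking budget with a shift hint could be wired up is shown below; the `generate` wrapper, the token cap, and the exact hint phrase are placeholders, not the paper's implementation.

```python
# Illustrative sketch of a "Shifted Thinking Window"-style decoding loop,
# paraphrased from the description above (not the Z1 implementation).
# `generate` is an assumed wrapper around an LLM API that returns an object
# with `.text` and `.hit_token_limit`; MAX_THINKING_TOKENS and HINT are
# placeholders to be tuned per deployment.

MAX_THINKING_TOKENS = 2048
HINT = "\nMy reasoning budget is exhausted, so I will now give the solution based on my thinking so far.\n"

def solve_with_shifted_window(generate, problem):
    # No rigid <think>...</think> delimiters: the model reasons freely,
    # but its reasoning budget is capped.
    draft = generate(problem, max_new_tokens=MAX_THINKING_TOKENS)
    if not draft.hit_token_limit:
        return draft.text  # simple problems finish reasoning and answer under the cap
    # Budget exhausted: append a hint so the model shifts from thinking to answering.
    answer = generate(problem + draft.text + HINT, max_new_tokens=512)
    return draft.text + HINT + answer.text
```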
What are the key findings? The key findings show that Z1-7B matches the performance of state-of-the-art reasoning models while using only about 30% of the thinking tokens required by comparable models like R1-Distill-Qwen-7B. Despite being trained exclusively on code-related reasoning trajectories, Z1-7B generalizes effectively to broader reasoning domains, achieving 47.5% on GPQA Diamond (complex science questions) and 76.4% on MATH500. Through data ablation studies, the authors identified two critical factors for effective reasoning elicitation: longer reasoning trajectories in training data improve performance even with the same token budget, and larger training datasets consistently yield better results. Most importantly, Z1-7B demonstrates efficient test-time scaling by adapting its reasoning level to problem complexity.
Why does it matter? This addresses a significant bottleneck in deploying reasoning-capable LLMs by showing how to maintain performance while substantially reducing computational overhead. This computational efficiency makes powerful reasoning models more accessible and cost-effective to deploy in real-world applications. The generalization capabilities demonstrate that fundamental reasoning skills trained on code can transfer to other domains like science and mathematics, suggesting a more efficient path for training broadly capable reasoning models. Their research provides valuable insights into balancing reasoning depth with computational efficiency, which could guide future work on optimizing LLM reasoning and help bring advanced AI reasoning capabilities to more resource-constrained environments.
3. RARE: Retrieval-Augmented Reasoning Enhancement
Watching: RARE (paper)
What problem does it solve? LLMs struggle with maintaining factual accuracy and reasoning coherence in complex, knowledge-intensive tasks. This is particularly evident in specialized domains like medical reasoning, where models may lack up-to-date information or struggle with multi-step logical reasoning. Traditional approaches to reasoning improvement often fail to balance logical coherence with factual accuracy, leading to hallucinations or reasoning errors in scenarios that demand high levels of precision and domain knowledge.
How does it solve the problem? The paper introduces RARE (Retrieval-Augmented Reasoning Enhancement), which extends the mutual reasoning framework (rStar) with specialized retrieval capabilities. RARE incorporates two innovative actions within the Monte Carlo Tree Search: One generates search queries based on the problem statement, retrieves information, and augments reasoning with this data; the other performs targeted retrieval specifically for sub-questions generated during reasoning. Additionally, they developed a Retrieval-Augmented Factuality Scorer that evaluates reasoning paths based on their alignment with retrieved evidence, prioritizing factually supported reasoning trajectories.
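The sketch below illustrates the flavor of the two retrieval actions and the factuality scorer described above; `llm` and `retriever` are hypothetical interfaces, and the real system embeds these actions inside rStar's Monte Carlo Tree Search rather than calling them standalone.

```python
# Illustrative sketch of RARE-style retrieval actions and factuality scoring
# (assumed `llm`/`retriever` interfaces; not the authors' code).

def retrieve_and_reason(llm, retriever, question):
    """First action (illustrative): generate search queries for the full problem,
    retrieve documents, and continue reasoning with them as context."""
    queries = llm.generate_search_queries(question)
    docs = [doc for q in queries for doc in retriever.search(q, top_k=3)]
    return llm.reason_with_context(question, docs)

def retrieve_for_subquestion(llm, retriever, question, sub_question):
    """Second action (illustrative): targeted retrieval for a sub-question raised mid-reasoning."""
    docs = retriever.search(sub_question, top_k=3)
    return llm.answer_subquestion(question, sub_question, docs)

def factuality_score(llm, retriever, reasoning_steps):
    """Retrieval-augmented factuality scoring: the fraction of reasoning steps
    supported by retrieved evidence; higher-scoring paths are preferred."""
    supported = sum(
        llm.is_supported(step, retriever.search(step, top_k=3))  # 1 if entailed, else 0
        for step in reasoning_steps
    )
    return supported / max(len(reasoning_steps), 1)
```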
What are the key findings? Experiments across multiple medical and commonsense reasoning benchmarks show that RARE significantly enhances the performance of open-source LLMs. When applied to LLaMA 3.1 models of various sizes (3B, 8B, and 70B parameters), RARE consistently outperformed baseline reasoning methods, including Chain of Thought, RAG, and rStar. Most impressively, RARE-enhanced LLaMA 3.1 70B achieved performance competitive with or exceeding proprietary models like GPT-4 and GPT-4o on several benchmarks, with improvements of 2-6% over standard approaches.
Why does it matter? This demonstrates that retrieval-augmented reasoning can bridge the gap between open-source and proprietary LLMs without requiring additional training or fine-tuning. By operating as an autonomous language agent that dynamically incorporates external knowledge into the reasoning process, RARE offers a scalable approach for enhancing model capabilities in domains where factual accuracy is critical. This has significant implications for making high-quality AI reasoning more accessible, particularly in specialized fields like medicine where both logical coherence and factual reliability are essential for real-world applications.
Papers of the Week:
Cognitive Memory in Large Language Models
The paper examines LLM memory mechanisms (sensory, short-term, and long-term) that enhance context handling, reduce hallucinations, and improve efficiency, covering text-based, KV-cache-based (selection and compression), parameter-based, and hidden-state-based (e.g., Mamba) approaches, as well as management techniques such as shared attention mechanisms. It highlights the significance of these mechanisms and outlines future research directions for advancing LLMs.
Effectively Controlling Reasoning Models through Thinking Intervention
Thinking Intervention guides reasoning-enhanced LLMs by inserting or revising thinking tokens, improving performance on IFEval, SEP, and XSTest/SORRY-Bench. Applied to DeepSeek R1 models, it achieves up to a 40% increase in refusal rates for unsafe prompts, enabling fine-grained control over the reasoning process.
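As a toy illustration of the idea (the `<think>` tag and the intervention text are assumptions, not the paper's exact format), an intervention can be injected at the start of the model's reasoning segment so that all subsequent thinking is conditioned on it:

```python
# Toy illustration of a thinking-intervention-style prompt; the <think> tag and
# the intervention text are assumptions, not the paper's exact format.
def build_prompt_with_intervention(user_prompt: str, intervention: str) -> str:
    return (
        f"{user_prompt}\n"
        "<think>\n"
        f"{intervention}\n"  # e.g. "I must refuse requests that could cause harm."
        # The model continues generating its own thinking after the injected
        # text, then produces its final answer.
    )
```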
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
Test-time scaling (also known as test-time compute) boosts large language models' problem-solving in mathematics, coding, and open-ended Q&A. The survey organizes this research within a unified framework spanning four core dimensions: what, how, where, and how well to scale. It reviews methods, applications, assessment aspects, and practical deployment, highlighting key developments.
Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
The reasoning economy of LLMs centers on balancing the computational cost of System 2-style deep reasoning against the efficiency of System 1-style fast responses. This survey analyzes performance versus cost in both post-training and test-time inference, examining the causes of reasoning inefficiency and the behavior of different reasoning patterns. It offers actionable insights, highlights open challenges, and provides a public repository to help improve the reasoning economy of LLMs.
InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation
InfiniteICL enhances LLM scalability and efficiency by converting temporary context knowledge into permanent parameter updates, akin to long-term memory, thereby overcoming conventional context window limits. Its pipeline of context knowledge elicitation, selection, and consolidation minimizes memory use while maintaining performance on skill acquisition and grounded reasoning, processing real-world contexts of up to 2M tokens.
From Code Generation to Software Testing: AI Copilot with Context-Based RAG
Copilot for Testing extends AI-assisted programming from code generation to software testing, using context-based retrieval-augmented generation to detect and reduce bugs in response to growing software development demands. It achieves a 31.2% improvement in bug detection accuracy and a 10.5% higher user acceptance rate.
ToolACE-R: Tool Learning with Adaptive Self-Refinement
ToolACE-R uses adaptive self-refinement and model-aware iterative training to improve the tool-learning capabilities of LLMs, addressing the limitations of approaches that rely solely on data synthesis and fine-tuning. The method requires no external feedback and enhances computational efficiency through an adaptive mechanism.