<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[LLM Watch: Deep Dives]]></title><description><![CDATA[Want to dig into the latest research without having to read dozens of pages yourself? Let me break it down for you. 

Everything I'll write about will be brand new at the time of publishing - you'll get to know about all the important bits before others have even read the abstract. ]]></description><link>https://www.llmwatch.com/s/deep-dives</link><image><url>https://substackcdn.com/image/fetch/$s_!WczK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d95c476-43a7-4447-9081-9298a1fc325a_1280x1280.png</url><title>LLM Watch: Deep Dives</title><link>https://www.llmwatch.com/s/deep-dives</link></image><generator>Substack</generator><lastBuildDate>Fri, 01 May 2026 00:20:35 GMT</lastBuildDate><atom:link href="https://www.llmwatch.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Pascal Biese]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[xaiguy@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[xaiguy@substack.com]]></itunes:email><itunes:name><![CDATA[Pascal Biese]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pascal Biese]]></itunes:author><googleplay:owner><![CDATA[xaiguy@substack.com]]></googleplay:owner><googleplay:email><![CDATA[xaiguy@substack.com]]></googleplay:email><googleplay:author><![CDATA[Pascal Biese]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Falcon-H1: The AI Chimera That Challenges The Transformer]]></title><description><![CDATA[How a new AI model beats models twice its size]]></description><link>https://www.llmwatch.com/p/falcon-h1-the-ai-chimera-that-challenges</link><guid isPermaLink="false">https://www.llmwatch.com/p/falcon-h1-the-ai-chimera-that-challenges</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Sat, 02 Aug 2025 17:57:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e6e93a56-8ca5-4084-9e99-e0b01eea4d9c_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Sometimes in AI, 
<strong>smaller</strong> can be <strong>smarter</strong>. A recent research release shows a 34-billion-parameter AI model matching the performance of models twice its size by doing something unconventional &#8211; giving the AI two different neural networks instead of one. This isn&#8217;t just a one-off trick: it hints at how the next generation of foundation models might be built, blending techniques and potentially modalities to achieve more with less. In this post, we break down this hybrid AI breakthrough and explore what it means for tech builders, researchers, and strategists alike.</em></p><h2>Meet Falcon-H1 &#8211; The Hybrid AI That Punches Above Its Weight</h2><p>A few months ago, the team at <strong>Technology Innovation Institute (TII)</strong> unveiled <strong><a href="https://github.com/tiiuae/falcon-h1">Falcon-H1</a></strong>, a series of large language models with a new hybrid architecture (and now, we finally got the <a href="https://www.arxiv.org/abs/2507.22448">report</a>). Instead of relying purely on the standard Transformer architecture that underpins most modern generative AIs, Falcon-H1 combines <strong>two different model paradigms in parallel</strong> within each layer. In simple terms, it&#8217;s like giving the model <strong>&#8220;two heads&#8221;</strong> for thinking: one head uses the classic attention mechanism (great for focusing on specific details in text), and the other uses a <strong>State Space Model (SSM)</strong> (great for remembering long sequences efficiently). These twin heads work side by side, and their outputs are fused together, allowing Falcon-H1 to leverage the strengths of both approaches.</p><p>Why go hybrid? 
The <strong>Transformer</strong> (attention) is excellent at understanding contextual relationships (think of it as a spotlight that highlights relevant words), while the <strong>SSM</strong> provides superior long-term memory and speed (think of it as an efficient notepad that can handle very long documents without slowing down). By marrying the two, Falcon-H1 aims to get the best of both worlds &#8211; and according to the results, it succeeds. The <strong>hybrid architecture</strong> yields <strong>faster inference and lower memory usage</strong> than a pure Transformer, without sacrificing accuracy. In fact, the researchers found that they only needed a relatively <strong>small fraction of attention heads</strong> (the Transformer part) to achieve strong performance &#8211; the SSM component carries a lot of the load for long-text understanding. This is a significant departure from the &#8220;all-Transformer, all the time&#8221; approach of most large models.</p><p><strong>So, what is Falcon-H1 exactly?</strong> It&#8217;s a family of six open-source models of varying sizes (from a tiny 0.5 billion parameters up to 34 billion) released with a permissive license (Apache 2.0). Each model comes in both a base version and an <em>instruction-tuned</em> version (trained to follow human instructions better). Despite the relatively moderate sizes (by today&#8217;s standards), Falcon-H1 models are <strong>punching above their weight class</strong>. Here are some of the highlights and features of Falcon-H1:</p><p><strong>Hybrid Architecture (Attention + SSM):</strong> Each model layer uses <em>two parallel heads</em> &#8211; one Transformer attention head and one &#8220;Mamba-2&#8221; SSM head &#8211; whose outputs are concatenated. The ratio of attention to SSM can be tuned, but it turns out you don&#8217;t need much attention to get great results. 
This design provides <strong>strong generalization across tasks with faster speed and better memory efficiency</strong> than attention alone. (Think of it like a car with both an electric motor and a gas engine &#8211; efficient for long hauls yet quick for bursts of power.)</p><p><strong>Wide Range of Model Sizes:</strong> Falcon-H1 is released in six sizes &#8211; <strong>0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B</strong> parameters &#8211; covering everything from lightweight models that could run on edge devices to hefty models for servers. The &#8220;1.5B-Deep&#8221; variant is intriguing: it has the same parameter count as the 1.5B model but <strong>far more layers (66 layers vs 24)</strong>, illustrating one of the team&#8217;s findings that <strong>deeper models can outperform wider ones</strong> under the same budget.</p><p><strong>Multilingual by Design:</strong> These models aren&#8217;t just English-centric. Falcon-H1 was trained <strong>natively on 18 languages</strong> (covering English, Chinese, Arabic, Hindi, Spanish, French, and many more) and the underlying tokenizer can scale to 100+ languages. In practice, this means the models can understand and generate multiple languages out-of-the-box, a nod to our globally connected world where AI needs to speak many tongues.</p><p><strong>Compact Models, Big Performance:</strong> Perhaps the most impressive claim: Falcon-H1 models are <strong>designed to match or exceed the performance of models at least twice their size</strong>. The <strong>tiny 0.5B model</strong> (the size of a single GPT-2, roughly) delivers performance on par with typical 7B parameter models from just a year prior. The <strong>1.5B-Deep model</strong> actually <strong>outperforms many 7B models</strong> out there. And the flagship <strong>34B model</strong> has been shown to <strong>match or beat certain 70B models</strong> (like Meta&#8217;s and Alibaba&#8217;s latest) on a range of benchmarks. 
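</p><p><em>The &#8220;deeper, not wider&#8221; trade-off can be made concrete with rough parameter accounting. The sketch below assumes a plain Transformer costing about 12d&#178; parameters per layer &#8211; Falcon-H1&#8217;s hybrid layers differ, so treat the numbers as order-of-magnitude only:</em></p>

```python
import math

def width_for_budget(n_params, n_layers, coeff=12):
    # Rough transformer accounting: ~coeff * d^2 parameters per layer
    # (attention + MLP blocks), ignoring embeddings.
    return int(math.sqrt(n_params / (n_layers * coeff)))

budget = 1.5e9  # a 1.5B-parameter budget
print("24 layers ->", width_for_budget(budget, 24))  # wider, shallower
print("66 layers ->", width_for_budget(budget, 66))  # thinner, deeper
```

<p><em>Under a fixed budget, roughly tripling the depth forces the hidden width down by about a factor of &#8730;3 &#8211; the trade the 1.5B-Deep variant makes.</em></p><p>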
In short, Falcon-H1 squeezes more juice out of every parameter &#8211; a welcome development as we confront the cost and complexity of ever-larger AI models.</p><p><strong>256K Context Length:</strong> Falcon-H1 can handle <em>really</em> long inputs. Thanks to the SSM-based design and some clever tweaking of positional embeddings, it supports up to <strong>256,000 tokens of context</strong> (for the 7B and 34B models). That&#8217;s orders of magnitude more than the 4K or 8K token limits many GPT-style models had not long ago. For perspective, 256K tokens is roughly <strong>hundreds of pages of text</strong> in one go. This opens the door to AI that can ingest and reason about long documents (or lengthy multi-turn conversations) without breaking a sweat.</p><p><strong>Domain Expertise (STEM and Code):</strong> Because of a carefully curated training set, Falcon-H1 shows strong abilities in math, science, and coding tasks. The team deliberately included high-quality technical content &#8211; like math problem sets, scientific papers, and a large code corpus &#8211; in the training mix. They even tweaked the tokenizer to handle things like LaTeX math notation and code syntax better (e.g., splitting digits and punctuation so the model learns numbers and code structure more effectively). As a result, the models have <strong>exceptional STEM capabilities</strong> compared to peers, excelling at mathematical reasoning and scientific Q&amp;A.</p><p><strong>Open-Source and Accessible:</strong> All Falcon-H1 models are released under a permissive open-source license and are available on Hugging Face for anyone to use. Crucially, the team also provides <strong>quantized versions</strong> of the instruction-tuned models (over 30 checkpoints in total), meaning you can run these models with much less memory than the full parameter count would normally require. 
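</p><p><em>Back-of-the-envelope arithmetic shows why those quantized checkpoints matter. This counts weight storage only &#8211; activations, KV cache, and SSM state add overhead on top:</em></p>

```python
def weight_memory_gb(n_params_billions, bits_per_weight):
    # Raw weight storage: parameters * bits per weight / 8 bits per byte, in GB.
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"34B weights at {bits}-bit: {weight_memory_gb(34, bits):.0f} GB")
# 16-bit: 68 GB, 8-bit: 34 GB, 4-bit: 17 GB
```

<p>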
In practice, this could allow a 34B model to run on a single high-end GPU or even a powerful laptop &#8211; bringing near state-of-the-art AI capabilities to those without mega-scale infrastructure.</p><p>In summary, Falcon-H1 arrives not as a single monolithic model but as a <strong>portfolio of AI models</strong> that are <em>smaller, multilingual, long-winded (in a good way), and freely available</em>. It&#8217;s a bold demonstration that <strong>innovation in architecture and training can rival sheer scale</strong>. But how did the researchers actually pull this off? Let&#8217;s peek under the hood.</p><h2>How Did They Do It? Rethinking the Recipe for AI Success</h2><p>It&#8217;s one thing to list features and accomplishments, but Falcon-H1&#8217;s creation is as interesting as its results. The researchers didn&#8217;t just scale up a Transformer in the usual way &#8211; they <strong>questioned a lot of AI&#8217;s conventional wisdom</strong> and tried some decidedly off-beat ideas in the process. Here are a few of the key methods and insights behind the scenes:</p><p><strong>Hybrid &#8220;Mixer&#8221; Layers:</strong> As mentioned, the core innovation was the <strong>parallel hybrid layer design</strong>. 
Instead of stacking some SSM layers and some attention layers separately, Falcon-H1&#8217;s layers each contain both components running concurrently.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B2Gm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504ee054-25f4-4396-891b-5660ef8c295c_613x547.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!B2Gm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504ee054-25f4-4396-891b-5660ef8c295c_613x547.png" width="613" height="547" class="sizing-normal" alt="Falcon-H1 hybrid layer: parallel SSM and attention branches whose outputs are concatenated and fed to a shared MLP" loading="lazy"></picture></div></a></figure></div><p>The diagram above (from the Falcon team) illustrates this: an input passes through normalization, then splits into an SSM branch (purple) and an attention branch (blue). Both process the data <em>at the same time</em>; their outputs are <strong>concatenated</strong> and passed through an MLP (feed-forward network) together before the layer emits its result. This parallel approach, as opposed to a sequential mix, was found to work best. Crucially, it lets the designers <strong>tune the ratio</strong> of how many channels go to SSM vs. attention vs. MLP. After extensive experiments, they settled on a ratio (for example, in one configuration, roughly 2 parts SSM, 1 part attention, 5 parts MLP) that optimized efficiency and learning dynamics. 
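</p><p><em>To make the parallel design tangible, here is a toy sketch of such a mixer block in NumPy &#8211; random untrained weights, a single bidirectional attention head, and a bare-bones decaying recurrence standing in for the Mamba-2 SSM, with channels split by the 2:1:5 ratio. The dimensions are arbitrary; none of this reproduces Falcon-H1&#8217;s actual implementation:</em></p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_attention(x, d_out, rng):
    # Single-head self-attention projecting to d_out channels
    # (bidirectional for simplicity; the real model is causal).
    d_in = x.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d_in, d_out)) / np.sqrt(d_in) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d_out)) @ v

def toy_ssm(x, d_out, rng):
    # Stand-in for the Mamba-2 branch: a plain decaying linear recurrence.
    W = rng.standard_normal((x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    u = x @ W
    out, h = np.empty_like(u), np.zeros(d_out)
    for t in range(len(u)):
        h = 0.9 * h + u[t]  # fixed decay; real SSMs learn input-dependent dynamics
        out[t] = h
    return out

def hybrid_mixer_block(x, ratio=(2, 1, 5), seed=0):
    # Parallel SSM + attention branches, concatenated, then an MLP.
    # ratio = (SSM, attention, MLP) channel shares, as reported for Falcon-H1.
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    unit = d // sum(ratio)
    d_ssm, d_attn, d_mlp = (unit * r for r in ratio)
    branches = np.concatenate([toy_ssm(x, d_ssm, rng), toy_attention(x, d_attn, rng)], axis=-1)
    W1 = rng.standard_normal((branches.shape[-1], d_mlp)) / np.sqrt(branches.shape[-1])
    W2 = rng.standard_normal((d_mlp, d)) / np.sqrt(d_mlp)
    return np.maximum(branches @ W1, 0.0) @ W2  # ReLU MLP back to the residual width

tokens = np.random.default_rng(1).standard_normal((16, 64))  # 16 tokens, width 64
out = hybrid_mixer_block(tokens)
print(out.shape)  # (16, 64)
```

<p><em>Both branches see the same normalized input and their concatenation feeds one shared MLP, exactly the parallel topology in the diagram &#8211; only the internals of each branch are simplified here.</em></p><p>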
Interestingly, <strong>adding too much attention actually hurt performance</strong>, whereas giving the SSM a healthy share improved it. In other words, <em>more Transformer wasn&#8217;t always better</em> &#8211; a smaller attention mechanism paired with SSM was the sweet spot.</p><p><strong>Positional Encoding Trick for Long Contexts:</strong> One challenge with super-long context windows (like 256K tokens) is how to give the model a sense of position (which word comes first, second, etc.) without losing precision. Falcon-H1 tackled this by using an unusually <strong>large rotary positional embedding (RoPE) base</strong>. Essentially, they cranked up the base parameter in the positional encoding formula to 10<sup>11</sup> (far beyond typical values), which significantly improved the model&#8217;s ability to generalize on long sequences. Normally, such an extreme base would destabilize a pure Transformer, but the hypothesis is that the <strong>SSM part naturally handles some positional information</strong>, freeing the attention part to use this extreme setting without trouble. It&#8217;s a bit like giving the model an extremely fine-grained ruler to measure position in text &#8211; and finding that it actually can make use of all those tick marks when reading a long document.</p><p><strong>Deeper, Not Just Wider:</strong> When allocating a fixed budget of parameters, the team found that making the model <strong>deeper (more layers)</strong> often yielded better results than making it wider (bigger layer dimensions). This insight led to the creation of the <strong>Falcon-H1-1.5B-Deep</strong> variant &#8211; it packs 66 layers into 1.5B params, compared to the standard 24 layers &#8211; and that deep model outperformed many models with 3B or even 7B parameters. 
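</p><p><em>Returning to the positional-encoding trick for a moment: a few lines of NumPy show what raising the RoPE base does. With a larger base, the per-channel rotation frequencies fall, so far more channels stay inside their first full rotation even at a 100,000-token offset (the embedding dimension of 64 here is arbitrary):</em></p>

```python
import numpy as np

def rope_angles(positions, dim, base):
    # Rotation angle per channel pair: theta_i(p) = p * base^(-2i/dim)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

offsets = np.array([1, 1_000, 100_000])
typical = rope_angles(offsets, dim=64, base=10_000)  # common default base
falcon = rope_angles(offsets, dim=64, base=1e11)     # Falcon-H1's reported base

# Channels still inside their first rotation period at a 100k-token offset:
within_period = lambda angles: int((angles[-1] < 2 * np.pi).sum())
print(within_period(typical), within_period(falcon))
```

<p><em>With the larger base, long-range offsets stay distinguishable in many channels instead of wrapping around &#8211; fine tick marks the attention heads can actually use while the SSM handles coarse ordering.</em></p><p>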
It&#8217;s a notable data point in the ongoing &#8220;depth vs width&#8221; debate in model design, suggesting that for some tasks a tall, thin model can beat a short, fat one (up to a point).</p><p><strong>Unorthodox Training Curriculum:</strong> In training AI, a common practice is <em>curriculum learning</em> &#8211; feed the model easy examples first and gradually move to harder ones. Falcon-H1&#8217;s team tried the opposite and it <strong>worked better</strong>. They gave even the most <strong>complex data (like tough math problems and very long texts) right from the start of training</strong>, rather than saving them for last. Counterintuitively, this early exposure to hard examples gave the model &#8220;more time to learn&#8221; the skills needed for those challenges. It&#8217;s akin to teaching a language by mixing in advanced literature from day one alongside the basics &#8211; a risky approach for humans, but apparently effective for this AI&#8217;s development.</p><p><strong>More Data, More Reuse:</strong> Falcon-H1 was trained on a colossal <strong>20 trillion token</strong> corpus (with about 18T actually used). This includes a refined web crawl (&#8220;FineWeb&#8221;), a large multilingual set (Wikipedia, Common Crawl, subtitles, etc. for 17 languages), a 67-language code dataset (deduplicated and cleaned), and specialized math and science data. One concern with such large datasets is models simply <strong>memorizing</strong> content. The prevailing wisdom is usually to avoid reusing data in training passes. However, the team found that this risk is a bit <strong>overestimated</strong> &#8211; they carefully measured the model&#8217;s &#8220;memorization window&#8221; and concluded they could <strong>reuse high-quality examples more times</strong> than usual <strong>without hurting generalization</strong>. 
By doing so, they ensured that the best content (like high-quality science articles or code) had a stronger influence on training, rather than being diluted by a flood of lower-quality web text. This approach improved the model&#8217;s grasp of those domains without turning it into a rote copy-machine.</p><p><strong>Stabilizing the Training of SSMs:</strong> State-space models are powerful for long sequences, but they can be tricky &#8211; prone to training instabilities (loss spikes). The Falcon team addressed this by inserting special <strong>&#8220;dampening&#8221; factors into the SSM components</strong> as part of their parametrization (essentially, controlling the scale at which the SSM part learns). By doing so, they eliminated the nasty training spikes that often plague SSM-based networks, resulting in a smooth learning curve. They also paid close attention to things like parameter norm growth and noise in gradients, tweaking hyperparameters (like using scheduled weight decay) to keep the training process stable. The takeaway for researchers: making novel architectures work often requires this kind of diligent debugging and fine-tuning of the training process.</p><p><strong>Efficient Scaling with &#956;P (Mu Parametrization):</strong> Training six different model sizes could have been six times the work, but Falcon-H1 applied an advanced technique called <strong>Maximal Update Parametrization (&#956;P)</strong> to streamline this. &#956;P provides a theory-backed way to set hyperparameters (like learning rate) for one model size and then scale them to larger models reliably. Using &#956;P, the team found optimal settings on a base model and transferred them to all other sizes, allowing them to train <strong>multiple models in parallel without lots of per-model tuning</strong>. 
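</p><p><em>The core of &#956;P hyperparameter transfer can be sketched in a few lines. The base width and learning rate below are made-up placeholders, and real &#956;P also rescales initializations and output multipliers, not just the learning rate:</em></p>

```python
BASE_WIDTH = 1024   # width at which hyperparameters were tuned (placeholder)
BASE_LR = 3e-3      # learning rate found optimal at that width (placeholder)

def transfer_lr(width, base_width=BASE_WIDTH, base_lr=BASE_LR):
    # muP rule of thumb: hidden-layer learning rates scale like 1/width,
    # keeping per-step feature updates the same size as the model grows.
    return base_lr * base_width / width

for width in (1024, 2048, 8192):
    print(f"width {width}: lr {transfer_lr(width):.2e}")
```

<p><em>Tune once at a small width, then scale &#8211; which is how one hyperparameter sweep can serve a whole family of model sizes.</em></p><p>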
They even went a step further by customizing the &#956;P scaling: instead of assuming the base model&#8217;s hyperparameters were perfectly tuned, they <strong>optimized 35 separate &#956;P scaling factors</strong> for different parts of the network. This fine-grained approach squeezed extra performance out of each model. For the layperson, the result is that <strong>Falcon-H1 models were trained efficiently and in concert</strong>, making this massive undertaking a bit more manageable.</p><p>All these choices &#8211; from architectural design to training strategy &#8211; contributed to Falcon-H1&#8217;s success. The researchers effectively <strong>broke some rules</strong> and <strong>set new ones</strong>. It&#8217;s a reminder that in the race for bigger and better AI, raw scale isn&#8217;t the only path; sometimes <strong>clever engineering and fresh ideas</strong> can unlock performance that brute force alone would miss.</p><h2>Results: When Smaller Models Beat Giants</h2><p>So, did the hybrid approach pay off? The proof is in the pudding (or rather, in a slew of benchmarks). The Falcon-H1 models have demonstrated <strong>state-of-the-art or near-SOTA performance</strong> across a variety of tasks, often rivaling models far larger:</p><p>The flagship <strong>Falcon-H1-34B</strong> model is reported to <strong>match or outperform systems in the 70B parameter range</strong>. In evaluations, it stood toe-to-toe with models like <strong>Qwen-72B (Alibaba&#8217;s model) and a prototype LLaMA-70B</strong> on many benchmarks. This is astonishing when you consider it has half (or even half to one-third) the parameters &#8211; a testament to <em>parameter efficiency</em>. On standard NLP benchmarks (like MMLU for knowledge, HellaSwag for common-sense reasoning, GSM8k for math, etc.), Falcon-34B&#8217;s numbers are right up there with the best, and sometimes <em>winning</em>. 
It&#8217;s not uniformly dominant, but the fact that it&#8217;s in the same league as models twice its size is a huge win for the approach.</p><p>The <strong>Falcon-H1-1.5B-Deep</strong> model (just 1.5B params) deserves a shout-out. It <strong>clearly outperforms other ~2B models</strong> (like Qwen3-1.7B) and even <strong>holds its own against many 7B models</strong>. Essentially, a well-designed 1.5B hybrid model from 2025 can do what a vanilla 7B model from 2024 could do. That&#8217;s a big deal for those who need smaller models &#8211; it means you might not need a massive cluster to get strong AI performance for many tasks. Falcon-1.5B-Deep even beat some 7B versions of Falcon&#8217;s <em>own</em> previous generation (Falcon-40B from 2023 had a 7B little sibling) and competitive models, which validates the &#8220;deeper not wider&#8221; strategy in practice.</p><p>On <strong>multilingual benchmarks</strong>, Falcon-34B shows very strong performance across languages like Arabic, Spanish, Hindi, etc., often on par with the best models evaluated for those languages. This confirms that the model&#8217;s multilingual training wasn&#8217;t just token window dressing; it genuinely learned multiple languages well. For businesses in non-English markets or operating globally, this is a welcome capability &#8211; an AI that doesn&#8217;t treat English as the default brain.</p><p>In terms of <strong>reasoning and specialized tasks</strong>, Falcon-H1 models excel in categories like math and coding. The inclusion of math and code data gave them a leg up on benchmarks like GSM8k (math word problems) and HumanEval (coding tasks). For instance, Falcon-34B performs strongly on GSM8k and is competitive on coding challenges, approaching the scores of much larger models. The team even noted that they did <em>no special fine-tuning for reasoning</em>, yet the instruction-tuned models show robust reasoning abilities out-of-the-box. 
This is a promising sign that massive fine-tune rounds (like those done for GPT-4&#8217;s &#8220;Chain-of-Thought&#8221; reasoning) might be sidestepped with a good pretraining mix.</p><p>On <strong>long-context tasks</strong>, Falcon-H1 really shines. They tested the 34B model on retrieval and long-document QA tasks where it had to read very lengthy texts (tens of thousands of tokens) and answer questions. Its performance was strong and in some cases it could handle things that a typical model (with a 4K or 8K limit) simply couldn&#8217;t attempt. This hints at new applications: think AI assistants that can digest entire books or analyze years of financial reports in one go. One can imagine loading up a 200-page contract and asking the model detailed questions &#8211; Falcon-H1&#8217;s long attention span is built for that.</p><p>To sum up the results: <strong>Falcon-H1 proves that smart design can yield outsized results</strong>. A smaller open-source model, built on novel ideas, is reaching parity with the closed giants of the field. It&#8217;s not every day that David goes up against Goliath in AI and comes out looking this good. And importantly, <em>anyone</em> can use David &#8211; the models are downloadable, finetunable, and deployable by anyone with the hardware and savvy to do so.</p><h2>Why This Matters: A Glimpse of AI&#8217;s Future (Hybrid and Multimodal?)</h2><p>Beyond the cool factor of Falcon-H1&#8217;s achievements, there&#8217;s a broader significance here. This research is a signpost for several big trends in AI:</p><p><strong>1. Rethinking &#8220;Bigger is Better&#8221;:</strong> For the last few years, the story of AI has been &#8220;scale, scale, scale.&#8221; More parameters, more data, more compute &#8211; and you&#8217;ll get better results. That&#8217;s largely held true (GPT-4 wouldn&#8217;t be GPT-4 without an astronomical scale-up). 
But Falcon-H1 shows another path: <em>architectural innovation</em> and <em>training strategy</em> can give you leaps in performance without merely doubling size. It&#8217;s a reminder that we may be entering an era of <strong>diminishing returns on brute-force scaling</strong>, and the next breakthroughs could come from making models <strong>smarter, not just larger</strong>. For AI researchers, this is an exciting validation of pursuing new model forms (like combining attention with SSMs, or other hybrids) &#8211; there&#8217;s room to beat the scaling curve by being clever.</p><p><strong>2. The Rise of Modular &amp; Multimodal Foundation Models:</strong> Falcon-H1&#8217;s hybrid approach hints at a future where AI systems aren&#8217;t monolithic slabs of one architecture, but a combination of specialized components working together. Today it&#8217;s attention + SSM. Tomorrow it could be attention + SSM + convolution, or text + vision + memory modules, etc. In other words, <strong>multimodal foundation models</strong> might internally look like a <em>collection of expert modules</em> &#8211; an image understanding module, a text reasoning module, maybe a database retrieval module &#8211; all integrated in a single system. We&#8217;re already seeing early signs: OpenAI&#8217;s GPT-4 is <em>multimodal</em> (it can see images as well as read text), and it likely achieves this by combining a vision encoder with a language model. Google&#8217;s research has been exploring &#8220;<strong>universal models</strong>&#8221; that handle text, images, and more in one network. The success of Falcon-H1&#8217;s dual-technique model is a vote of confidence for this direction. It shows that <strong>heterogeneous architectures can work at scale</strong>. 
For tech strategists, this implies that the AI solutions of the near future might be more flexible and tailored &#8211; not every company will need a 100B-parameter pure Transformer if a 10B hybrid with the right components does the job better.</p><p><strong>3. Democratization of AI Capability:</strong> The open-source nature of projects like Falcon-H1 means cutting-edge AI is not confined to a few big tech companies. We&#8217;re witnessing a fast-follower effect where open models replicate or even innovate beyond the proprietary ones, often within months. This shrinks the advantage gap. For product builders and startups, it&#8217;s great news: you can pick up a state-of-the-art foundation model (with support for your language, your long documents, etc.) <strong>for free</strong> and fine-tune it to your needs. Running a 34B model that competes with a 70B might be feasible on off-the-shelf hardware, especially with quantization and optimizations. That lowers the barrier to entry for AI-powered products. It also means more room for customization &#8211; you&#8217;re not stuck with a one-size-fits-all API from a large provider; you can have a model in-house that you understand and control. For enterprise strategists, the question shifts from &#8220;<em>Where do we buy our AI capabilities?</em>&#8221; to &#8220;<em>How can we leverage these open foundation models to build unique value?</em>&#8221;. The playing field is leveling, and the competitive edge will go to those who can best adapt and integrate these rapidly improving open models.</p><p><strong>4. New Applications Unlocked:</strong> The technical improvements aren&#8217;t just academic; they translate to new use cases. <strong>256K context</strong> means AI that can <strong>digest entire knowledge bases or code repositories</strong> at once &#8211; imagine assistants that truly &#8220;know&#8221; your company&#8217;s documentation or an AI that can refactor a whole codebase in one shot. 
<strong>Multilingual fluency</strong> means one model can serve users across markets, or analyze text in multiple languages without separate systems. <strong>Better math and reasoning</strong> means more reliable analytical assistants (maybe it won&#8217;t replace your junior analyst, but it can double-check their work). These capabilities get product people thinking: what can we build now that was impractical before? Perhaps AI tutors that read and critique <em>textbooks</em>, legal AI that cross-references entire law libraries, or business intelligence bots that comb through years of reports and data dumps to answer high-level questions. We&#8217;re inching closer to AI that isn&#8217;t just a clever chatbot, but a genuine research assistant that can handle breadth and depth of information.</p><p>To bring it down to specifics, let&#8217;s consider a few perspectives:</p><p><strong>For Product Builders:</strong> Falcon-H1 and models like it mean you can deliver advanced AI features without needing a supercomputer or a mega-cloud budget. Want a customer support AI that handles <strong>long, complex threads</strong> of emails? A 7B or 34B Falcon can do that with its long context window. Need a coding helper that fits in your IDE? A 1.5B-Deep model might give surprisingly good code suggestions without calling an API. And because it&#8217;s open source, you can fine-tune it on your proprietary data (say, your company&#8217;s internal docs or codebase) with no external data leakage. The efficiency gain &#8211; doing more with smaller models &#8211; also hints at <strong>on-device AI</strong> for specialized apps. We could soon see smartphones or AR glasses running powerful language models locally, tailored to the user&#8217;s own data, thanks to these optimizations.</p><p><strong>For AI Researchers:</strong> Falcon-H1 is a case study in breaking the mold. 
It encourages researchers to explore <strong>hybrid architectures</strong> and not assume Transformers are the end-all be-all. The success with SSMs may reinvigorate research into other sequence models or into better ways of integrating modules (maybe techniques from control theory or neuroscience-inspired models could be the next plug-in component). Also, the team&#8217;s willingness to challenge training orthodoxies (like reverse curriculum and data reuse) is a reminder that <strong>empirical results can defy expectations</strong> &#8211; we should keep testing those assumptions, especially as we venture into regimes (like 10+ trillion token training) where our old intuitions might not hold. In short, the field may become less uniform; expect a Cambrian explosion of model variants, each with different mixtures of components aimed at different niches (language, vision, speech, etc.), all coexisting and advancing the state of the art in tandem.</p><p><strong>For Tech Strategists and Business Leaders:</strong> The broader trend indicated by this work is that <strong>AI is becoming more accessible and more adaptable</strong>. The power that once required a fortune in compute and a crack research team is rapidly trickling down. This compresses the timeline of AI adoption across industries &#8211; your competitors might deploy GPT-4-level AI in their workflows via an open model well before you budget for a big vendor contract. It also suggests that proprietary advantage in AI can be short-lived; an algorithmic breakthrough today might be open-sourced tomorrow. However, it also opens opportunities: organizations can <strong>craft their own foundation models</strong> tuned to their domain (be it finance, biomedicine, law) by starting from something like Falcon-H1 and extending it. Strategic differentiation may come from <em>how</em> you use and fine-tune the foundation models, rather than who has the biggest model. 
Additionally, keep an eye on the <strong>multimodal</strong> aspect &#8211; as models begin to handle text, images, and more in one system, companies that leverage that (e.g., an AI that can read documents and analyze associated charts/graphs all at once) will have an edge in insight extraction and automation.</p><div><hr></div><p><strong>Takeaway:</strong> The Falcon-H1 project shows that the future of AI won&#8217;t just be about piling on more parameters &#8211; it will be about <em>combining ideas and domains</em> to create smarter, more efficient systems. A hybrid model with &#8220;two brains&#8221; (attention + SSM) can outperform much larger single-minded models, hinting that the next leaps in AI might come from this spirit of integration. For those of us building and using these technologies, it&#8217;s a thrilling development. We should be prepared for AI models that are <strong>more like toolkits</strong> than one-trick ponies &#8211; able to see, listen, remember vast contexts, and reason, all in one. The foundations of AI are evolving, and with efforts like Falcon-H1, we&#8217;re getting a preview of a more <strong>surprising, accessible, and multimodal</strong> AI era to come.</p><div><hr></div><h3>&#10084;&#65039; If you enjoyed this article, give it a like and share it with your peers.</h3>]]></content:encoded></item><item><title><![CDATA[Can AI Really Understand How We Think? 
]]></title><description><![CDATA[Controversial "Centaur" Model Sparks Fierce Debate]]></description><link>https://www.llmwatch.com/p/can-ai-really-understand-how-we-think</link><guid isPermaLink="false">https://www.llmwatch.com/p/can-ai-really-understand-how-we-think</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Sat, 05 Jul 2025 14:31:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e4e67fd4-352a-4711-9e76-5bcba07a5bd2_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You've probably noticed how AI seems to be getting eerily good at predicting human behavior. From recommendation algorithms that know what you'll want to watch next to chatbots that can mimic your writing style, these systems are becoming uncannily accurate at anticipating our choices. But the million-dollar question remains: does predicting behavior mean understanding cognition?</p><p>The debate around this question just exploded into a major scientific controversy with the publication of the Centaur model in <a href="https://www.nature.com/articles/s41586-025-09215-4">Nature</a>. Researchers at the Max Planck Institute fine-tuned Meta's Llama 3.1 language model on a massive dataset of human behavioral experiments, creating what they claim is a "unified model of human cognition." 
It didn't take long for the backlash from cognitive scientists to materialize.</p><p>What we&#8217;re going to cover in this article:</p><ol><li><p>What exactly the Centaur model is and why it's causing such a stir</p></li><li><p>The impressive (and potentially concerning) capabilities it demonstrates</p></li><li><p>Why leading cognitive scientists are calling it "absurd"</p></li><li><p>What this means for the future of understanding the human mind</p></li><li><p>The genuinely valuable contributions hidden beneath the controversy</p></li></ol><p>Let's unpack what could be the most divisive scientific paper published this year.</p>
      <p>
          <a href="https://www.llmwatch.com/p/can-ai-really-understand-how-we-think">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Scenario Testing: A New Paradigm for Making AI Agents More Reliable]]></title><description><![CDATA[Ship autonomous agents with confidence, not crossed fingers]]></description><link>https://www.llmwatch.com/p/scenario-testing-a-new-paradigm-for</link><guid isPermaLink="false">https://www.llmwatch.com/p/scenario-testing-a-new-paradigm-for</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Mon, 30 Jun 2025 15:30:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9342c631-f763-4ab6-ac3c-3b838d958d21_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The rise of AI agents marks a <a href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">fundamental shift</a> in software development. Unlike traditional code that follows predictable paths, AI agents show emergent behaviors, make autonomous decisions, and engage in complex multi-turn conversations. Yet most teams still test these sophisticated systems with manual, ad-hoc approaches - typing test messages, hoping they've covered edge cases, and crossing their fingers before deployment. This testing gap is becoming critical: AI-related incidents increased <strong>56.4% in 2024</strong>, and Gartner <a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027">predicts</a> <strong>40% of agentic AI projects will be cancelled by 2027</strong> due to insufficient testing and unclear ROI.</p><p>There must be a better way. Just as we use compilers to check code and linters to enforce standards, we need AI-native testing approaches for AI agents. One possible solution to this problem is scenario testing - a paradigm where AI agents systematically test other AI agents through realistic conversations and scenarios. 
LangWatch, an <a href="https://github.com/langwatch/langwatch">open-source</a> LLMOps platform, has pioneered this approach with a framework that transforms how teams validate AI agent behavior.</p><h2>The fundamental challenge of testing non-deterministic systems</h2><p>Traditional software testing relies on deterministic behavior: given input X, expect output Y. This approach breaks down completely with AI agents. Consider a customer service agent handling a billing inquiry - the same question might generate different phrasings, follow different conversation paths, or use different tools to reach the same outcome. The challenge isn't just variability; it's that <strong>the variability itself is a feature</strong>, allowing agents to handle novel situations and adapt to user needs.</p><p>This creates three critical testing challenges that manual approaches can't solve:</p><p><strong>Coverage blindness</strong>: Human testers inevitably test happy paths and obvious edge cases, missing the long tail of unexpected interactions. When an AI agent encounters "I need to speak to your manager, but first, can you explain why my bill shows charges from last Tuesday when I was in the hospital?" - has anyone tested that specific scenario?</p><p><strong>Regression invisibility</strong>: A seemingly minor prompt change to improve one interaction can break five others. Without systematic testing, these regressions only surface in production, often through frustrated user reports or, worse, silent failures that erode trust.</p><p><strong>Scale impossibility</strong>: As agents become more sophisticated - handling complex workflows, using multiple tools, maintaining context across sessions - the combinatorial explosion of possible interactions makes manual testing impractical. 
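</p><p>The growth is simple to quantify - with <em>b</em> plausible user responses per turn over <em>t</em> turns, a conversation tree has <em>b<sup>t</sup></em> distinct paths:</p>

```python
def conversation_paths(responses_per_turn: int, turns: int) -> int:
    """Number of distinct paths in a branching multi-turn dialogue."""
    return responses_per_turn ** turns

print(conversation_paths(10, 5))  # 100000
```

<p>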
Testing a simple 5-turn conversation with 10 possible user responses at each turn creates 100,000 potential paths.</p><h2>Scenario testing: Simulation at scale</h2><p>A key insight of the <a href="https://huggingface.co/learn/cookbook/en/llm_judge">LLM-as-a-judge</a> method was recognizing that <strong>the best way to test LLMs at scale is with another LLM. </strong>While &#8220;judging&#8221; can be a very opaque, complex process in itself, it&#8217;s much easier to evaluate the success of a well-defined workflow than, say, the quality of the output (which is usually the objective of LLM-as-a-judge methods). So how do we <strong>test AI agents with another AI agent</strong>? Instead of humans manually typing test messages, the current scenario testing approach uses a three-agent architecture:</p><ol><li><p><strong>Your agent</strong>: The AI system being tested</p></li><li><p><strong>User simulator agent</strong>: An AI that realistically simulates user behavior based on scenario descriptions</p></li><li><p><strong>Judge agent</strong>: An AI that evaluates whether the conversation met defined success criteria</p></li></ol><p>This approach transforms testing from manual execution to scenario definition. Let&#8217;s take a look at an example from the LangWatch repository - rather than scripting exact conversations, teams describe situations and desired outcomes in natural language:</p><pre><code><code>result = await scenario.run(
    name="billing_inquiry_edge_case",
    description="""
    A frustrated customer received a bill with unexplained charges. 
    They're skeptical of automated systems and may challenge the agent's responses.
    They have a valid concern about a duplicate charge from last month.
    """,
    agents=[
        CustomerServiceAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(
            criteria=[
                "Agent acknowledges customer frustration empathetically",
                "Agent identifies the duplicate charge without defensive language",
                "Agent offers concrete resolution within policy guidelines",
                "Conversation resolves within 5 exchanges"
            ]
        )
    ]
)
</code></code></pre><p>The user simulator agent interprets this description and dynamically generates realistic user messages, adapting based on how the conversation evolves. The judge agent evaluates the entire conversation against the success criteria, providing both pass/fail results and detailed analysis.</p><h2>The multiplicative power of automated scenario testing</h2><p>This simulation-based approach unlocks capabilities impossible with manual testing:</p><p><strong>Systematic edge case exploration</strong>: Teams can define hundreds of scenarios covering edge cases, adversarial inputs, and complex multi-step workflows. Each scenario runs automatically, exploring different conversation paths while maintaining consistent evaluation criteria.</p><p><strong>Continuous regression detection</strong>: Integrate scenario tests into CI/CD pipelines to run on every code change. That prompt tweak that seemed harmless? Scenario tests catch the three customer journeys it inadvertently broke before they reach production.</p><p><strong>Domain expert collaboration</strong>: Product managers and subject matter experts can define scenarios and success criteria without writing code. A healthcare product manager can specify: "When a patient asks about drug interactions, the agent must always recommend consulting their physician" without understanding the underlying implementation.</p><p><strong>Parallel testing at scale</strong>: Unlike human testers who process conversations sequentially, automated scenarios run in parallel. Test suites with hundreds of scenarios complete in minutes, not days.</p><p>Consider a real example: a financial services company deployed an AI agent to handle loan inquiries. Manual testing covered basic flows - checking rates, application status, payment questions. 
But scenario testing revealed critical gaps:</p><ul><li><p>The agent provided different interest rates when asked in slightly different ways</p></li><li><p>Multi-part questions about refinancing options confused the agent's context management</p></li><li><p>Edge cases around regulatory disclosures were inconsistently handled</p></li></ul><p>Each issue represented not just a bug, but a compliance risk. Scenario testing caught them before they became regulatory violations.</p><div><hr></div><h2>Beyond pass/fail: The strategic value of scenario testing</h2><p>The immediate benefit of catching bugs is obvious, but scenario testing delivers deeper strategic value:</p><p><strong>Behavioral contracts</strong>: Scenarios codify expected agent behavior, creating living documentation of system capabilities. When stakeholders ask "Can our agent handle X?", the scenario suite provides definitive answers.</p><p><strong>Confident iteration</strong>: Teams can improve agents rapidly, knowing that scenario tests catch regressions. 
This accelerates the development cycle from cautious monthly releases to confident daily deployments.</p><p><strong>Quality metrics</strong>: Track scenario pass rates over time to measure agent improvement. Use failure analysis to identify systematic issues - if multiple scenarios fail around date handling, you've identified a capability gap.</p><p><strong>Cost optimization</strong>: Every production issue caught in testing saves many times more than the compute cost of running scenarios. Organizations report <strong>80% reduction in production incidents</strong> after implementing comprehensive scenario testing.</p><h2>Implementation patterns for successful scenario testing</h2><p>Based on analysis of successful implementations, several patterns emerge:</p><h3>Start with critical paths, not comprehensive coverage</h3><p>Resist the temptation to test everything immediately. Begin with your highest-risk, highest-value user journeys:</p><pre><code><code># Start here: Core business-critical scenarios
- Customer purchase flow
- Account access recovery  
- Complaint escalation handling

# Not here: Every possible greeting variation
</code></code></pre><h3>Design scenarios for insight, not just validation</h3><p>Good scenarios reveal how agents behave under stress:</p><pre><code><code># Less useful: Happy path validation
description = "User asks for store hours"

# More useful: Stress testing
description = """
User is trying to make a return after hours on the last day 
of the return window. They're frustrated and mention considering 
switching to a competitor. They have a receipt but it's partially damaged.
"""
</code></code></pre><h3>Layer your testing strategy</h3><p>Scenario testing complements, not replaces, other testing approaches:</p><ol><li><p><strong>Unit tests</strong>: Validate individual functions and tools</p></li><li><p><strong>Component evals</strong>: Test specific capabilities (e.g., information extraction)</p></li><li><p><strong>Scenario tests</strong>: Validate end-to-end user journeys</p></li><li><p><strong>Production monitoring</strong>: Track real-world performance</p></li></ol><h3>Embrace non-determinism thoughtfully</h3><p>Rather than fighting AI variability, design scenarios that accommodate it:</p><pre><code><code>criteria = [
    # Too rigid: "Agent must say 'Thank you for contacting support'"
    
    # Better: "Agent acknowledges the customer's contact professionally"
    # Better: "Agent provides refund amount between $47-53 based on calculation"
    # Better: "Resolution offered aligns with company policy document"
]
</code></code></pre><h3>Enable cross-functional collaboration</h3><p>The most successful implementations involve diverse perspectives:</p><ul><li><p><strong>Engineers</strong> define technical constraints and integration requirements</p></li><li><p><strong>Product managers</strong> specify user journeys and success metrics</p></li><li><p><strong>Domain experts</strong> contribute realistic scenarios and evaluation criteria</p></li><li><p><strong>QA teams</strong> organize test suites and analyze failure patterns</p></li></ul><h2>Common pitfalls and how to avoid them</h2><h3>The over-automation trap</h3><p>Not every test needs to be a scenario. Simple deterministic checks remain valuable:</p><pre><code><code># Overkill: Scenario test for API availability
# Better: Simple unit test

# Appropriate: Scenario test for multi-step booking flow
# Overkill: Unit test for conversation dynamics
</code></code></pre><h3>The perfect scenario fallacy</h3><p>Scenarios don't need to cover every possibility - they need to cover meaningful possibilities:</p><pre><code><code># Too broad: "User asks about products"
# Too narrow: "User asks about blue widgets on Tuesday afternoon"
# Just right: "Price-sensitive customer comparing products for a home renovation"
</code></code></pre><h3>The isolation error</h3><p>Testing agents in isolation misses integration issues. Include scenarios that exercise:</p><ul><li><p>Tool usage and API calls</p></li><li><p>Knowledge base retrieval</p></li><li><p>Multi-agent handoffs</p></li><li><p>Session persistence</p></li></ul><h2>The path forward: Making scenario testing operational</h2><p>For teams ready to implement scenario testing, the most crucial first step is establishing a solid foundation with the right tooling and initial test coverage. Start by setting up a framework like <a href="https://github.com/langwatch/langwatch">LangWatch</a> or a similar platform that can handle the complexity of scenario-based testing. Rather than trying to cover everything at once, focus your initial efforts on creating just five to ten scenarios that cover your most critical user paths. These should be the workflows that, if broken, would have the most significant impact on your users or business.</p><p>During these first couple of weeks, integration with your existing test suite is essential. This isn't about replacing what you have but augmenting it with scenario-based coverage. Take time to establish clear team conventions around how scenarios should be written, named, and organized. This early investment in standards will pay dividends as your test suite grows.</p><p>Once you have your foundation in place, the natural progression is to expand your coverage to include edge cases and error paths. This is where scenario testing really shines compared to traditional unit tests. You'll want to implement caching mechanisms to keep your tests running efficiently as the suite grows. Getting your scenarios integrated into your CI/CD pipeline at this stage ensures that they become part of your regular development workflow rather than an afterthought.</p><p>The key to long-term success with scenario testing is treating it as a living system rather than a one-time implementation. 
Regular reviews of your scenarios ensure they stay relevant as your application evolves. Analyzing failure patterns helps you identify areas where your application might be fragile, while performance benchmarking ensures your tests remain fast enough to provide quick feedback. Perhaps most importantly, creating mechanisms for knowledge sharing across teams helps spread both the benefits and the learnings from your scenario testing efforts throughout the organization.</p><p>Remember that scenario testing is most effective when it becomes part of your team's culture rather than just another checkbox in your process. Start small, focus on value, and let the practice grow organically as your team sees the benefits firsthand.</p><div><hr></div><h3><strong>&#10084;&#65039; If you enjoyed this article, give it a like and share it with your peers.</strong></h3>]]></content:encoded></item><item><title><![CDATA[Bursting the "AI Is Just Memorization"-Bubble]]></title><description><![CDATA[A Deep Dive into Measuring LLM Capacity]]></description><link>https://www.llmwatch.com/p/bursting-the-ai-is-just-memorization</link><guid isPermaLink="false">https://www.llmwatch.com/p/bursting-the-ai-is-just-memorization</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Wed, 04 Jun 2025 15:02:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/820483b6-bcbe-4089-b308-3e2086f06863_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recent advancements in large language models (LLMs) - and more generally, Generative AI - have sparked intense debate about data memorization. Can these models reproduce their training data verbatim? How much information do they actually store? 
And perhaps most importantly, when does beneficial learning end and problematic memorization begin?</p><p>A research coalition consisting of researchers from Meta's FAIR, Google DeepMind, Cornell, and NVIDIA set out to shed some light on these questions - and with success, it seems.  Their new paper titled "<a href="https://www.arxiv.org/abs/2505.24832">How much do language models memorize?</a>" provides a rigorous mathematical framework for measuring memorization and delivers surprising insights about the fundamental capacity limits of transformer models.</p><p>What we'll cover in this article:</p><ul><li><p>Why existing definitions of memorization fall short</p></li><li><p>A new compression-based framework for measuring memorization</p></li><li><p>The surprising discovery that GPT-style models store ~3.6 bits per parameter</p></li><li><p>How memorization relates to the double descent phenomenon</p></li><li><p>Practical scaling laws for membership inference attacks</p></li><li><p>What this means for the future of LLM development</p></li></ul><p>Are you ready? Let's dive in.</p><h2>1. The Memorization Problem: Why Current Definitions Don't Work</h2><p>Before we can measure how much models memorize, we need to define what memorization actually means. This turns out to be surprisingly tricky.</p><h3>The Extraction Fallacy</h3><p>Most existing work defines memorization through extraction: if you can prompt a model to generate a specific training sequence, it must have memorized it. But the authors point out a critical flaw in this reasoning. Modern LLMs can be coerced to output almost any string with the right prompt. As they note, "the fact that a model outputs something is not necessarily a sign of memorization."</p><p>Consider this example: if you prompt a model with "What is 2^100?" and it correctly responds with "1,267,650,600,228,229,401,496,703,205,376", has it memorized this specific fact, or has it learned to perform exponentiation? 
The extraction-based definition can't distinguish between these fundamentally different scenarios.</p><h3>The Stability Problem</h3><p>Other definitions rely on differential privacy or influence functions, measuring how a model changes when a training example is added or removed. But these approaches have their own limitations:</p><ul><li><p>They depend heavily on the training algorithm</p></li><li><p>They measure worst-case behavior rather than typical memorization</p></li><li><p>They can't be applied to a single model in isolation</p></li></ul><p>The authors needed something different - a definition that could separate memorization from generalization, work at the sample level, and be independent of the training process.</p>
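<p>To preview the headline number from the bullet list above - the paper&#8217;s estimate of roughly 3.6 bits of storage per parameter - the implied raw capacity of a model is easy to compute. This is a rough illustration of the magnitude involved, not the paper&#8217;s measurement methodology:</p>

```python
BITS_PER_PARAM = 3.6  # the paper's empirical estimate for GPT-style models

def capacity_gb(n_params: float) -> float:
    """Approximate memorization capacity in GB (1 GB = 8e9 bits)."""
    return n_params * BITS_PER_PARAM / 8 / 1e9

# An 8B-parameter model could store on the order of a few gigabytes of
# raw data - far less than a multi-terabyte training corpus, so beyond
# that capacity the model has to generalize rather than memorize.
print(round(capacity_gb(8e9), 2))  # 3.6
```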
      <p>
          <a href="https://www.llmwatch.com/p/bursting-the-ai-is-just-memorization">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[AlphaEvolve: Google DeepMind's Latest Breakthrough Success]]></title><description><![CDATA[A Coding Agent for Scientific and Algorithmic Discovery]]></description><link>https://www.llmwatch.com/p/alphaevolve-google-deepminds-latest</link><guid isPermaLink="false">https://www.llmwatch.com/p/alphaevolve-google-deepminds-latest</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Thu, 15 May 2025 14:15:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0da29f15-5937-48d9-befb-8f6b5d36df72_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In order to push the boundaries of computational capabilities, researchers have long sought automated methods to discover novel and improved algorithms. The recently announced <a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf">AlphaEvolve</a> from Google DeepMind brings us one step closer to achieving this dream. Their evolutionary coding agent combines the pattern recognition and code generation capabilities of Large Language Models (LLMs) with evolutionary computation to tackle some of the most challenging problems in computer science and mathematics.</p><p>What distinguishes AlphaEvolve from previous approaches is its ability to evolve entire codebases rather than just single functions, work across multiple programming languages, and leverage rich contextual information to guide the evolutionary process. 
These capabilities have enabled breakthroughs in longstanding mathematical problems and meaningful optimizations in critical computational infrastructure.</p><p>In this deep dive, we'll explore how AlphaEvolve works, what makes it different from previous approaches, and examine its impressive results across scientific discovery and practical engineering applications.</p><div><hr></div><h2>TL;DR - Technical Innovations</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bl4U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bl4U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 424w, https://substackcdn.com/image/fetch/$s_!Bl4U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 848w, https://substackcdn.com/image/fetch/$s_!Bl4U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 1272w, https://substackcdn.com/image/fetch/$s_!Bl4U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Bl4U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png" width="1133" height="440" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:440,&quot;width&quot;:1133,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99416,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/163618315?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Bl4U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 424w, https://substackcdn.com/image/fetch/$s_!Bl4U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 848w, https://substackcdn.com/image/fetch/$s_!Bl4U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 1272w, https://substackcdn.com/image/fetch/$s_!Bl4U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d247b6-4d32-4642-92d8-7069f1928f87_1133x440.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>AlphaEvolve represents a substantial advancement over previous LLM-guided evolution systems like FunSearch. 
Key improvements include:</p><ol><li><p><strong>Scale and scope</strong>: While FunSearch only evolved single Python functions of 10-20 lines, AlphaEvolve can evolve entire code files with hundreds of lines in any programming language.</p></li><li><p><strong>Evaluation capabilities</strong>: AlphaEvolve can handle evaluations running for hours on accelerators, compared to FunSearch's limitation of &#8804;20 minutes on a single CPU.</p></li><li><p><strong>Sample efficiency</strong>: AlphaEvolve requires only thousands of LLM samples rather than millions.</p></li><li><p><strong>Model utilization</strong>: AlphaEvolve benefits significantly from state-of-the-art LLMs, whereas FunSearch showed minimal benefit from larger models.</p></li><li><p><strong>Context richness</strong>: AlphaEvolve uses rich context and feedback in prompts, beyond just previous solutions.</p></li><li><p><strong>Multi-objective optimization</strong>: AlphaEvolve can simultaneously optimize multiple metrics, not just a single objective.</p></li></ol><p>Ablation studies confirmed the importance of each component:</p><ul><li><p>The evolutionary approach (versus repeatedly feeding the same initial program to an LLM)</p></li><li><p>Rich context in prompts</p></li><li><p>Meta-prompt evolution</p></li><li><p>Full-file evolution capability</p></li><li><p>The use of powerful language models</p></li></ul><p>Each of these components contributed significantly to AlphaEvolve's performance across different tasks.</p><p>So let&#8217;s take a closer look at how the system works, exactly what kinds of breakthrough solutions it produced, and how Google was able to put them to use.</p>
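<p>To make the ablations above concrete, the core select-mutate-evaluate loop can be sketched in a few lines of Python. This is only a toy illustration of the evolutionary approach, not AlphaEvolve itself: the <code>mutate_with_llm</code> stub stands in for a real LLM call with rich prompt context, and the evaluator is a trivial numeric score rather than an hours-long accelerator run.</p>

```python
import random

def evaluate(program):
    # Toy stand-in for AlphaEvolve's automated evaluator: score a candidate
    # "program" (here just a list of integers) by how close its sum is to 100.
    return -abs(sum(program) - 100)

def mutate_with_llm(parent, rng):
    # Stub for the LLM proposal step: the real system sends the parent
    # program plus rich context (prior solutions, scores, instructions)
    # to a model and receives edited code back. Here we perturb one entry.
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-3, -1, 1, 3])
    return child

def evolve(initial, generations=500, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [(evaluate(initial), initial)]
    for _ in range(generations):
        # Select a parent from the better half of the population.
        population.sort(key=lambda sc: sc[0], reverse=True)
        parent = rng.choice(population[: max(1, len(population) // 2)])[1]
        child = mutate_with_llm(parent, rng)
        population.append((evaluate(child), child))
        # Keep only the strongest candidates (elitist truncation).
        population = sorted(population, key=lambda sc: sc[0], reverse=True)
        population = population[:population_size]
    return population[0]

best_score, best_program = evolve([10, 20, 30])
print(best_score, best_program)
```

<p>In the real system the population is additionally managed to preserve diversity, the prompts themselves are evolved, and several metrics can be optimized at once, but the feedback loop has the same shape.</p>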
      <p>
          <a href="https://www.llmwatch.com/p/alphaevolve-google-deepminds-latest">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Llama-Nemotron: NVIDIA's Foundation Model for Agentic AI]]></title><description><![CDATA[A New Generation of Efficient Reasoning Models]]></description><link>https://www.llmwatch.com/p/llama-nemotron-nvidias-foundation</link><guid isPermaLink="false">https://www.llmwatch.com/p/llama-nemotron-nvidias-foundation</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Thu, 08 May 2025 14:04:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/786eb472-ad80-4434-b4c4-814cd77e45e8_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In recent months, the emergence of &#8220;reasoning&#8221;-optimized Large Language Models (LLMs) - models capable of emitting multi-step chains of thought, self-verification, and backtracking - has reshaped what we expect from AI assistants. However, powering these capabilities at scale still poses a challenge: long, compute-intensive inference runs can become prohibitively expensive, and a one-size-fits-all reasoning strategy is not always ideal. 
</p><p>NVIDIA&#8217;s newly released <a href="https://arxiv.org/abs/2505.00949">Llama-Nemotron</a> (LN) family addresses these issues, delivering models that (1) support a user-controllable reasoning toggle, (2) pack state-of-the-art scientific and mathematical reasoning into footprints that fit on commodity hardware, and (3) offer open licenses for enterprise and research use.</p><p>In this deep dive, we will explore the architecture, training methodology, and innovations that make Llama-Nemotron stand out in an increasingly crowded landscape of LLMs.</p><h2><strong>Key Contributions in 30 Seconds</strong></h2><p>The Llama-Nemotron family introduces several notable architecture decisions:</p><ol><li><p><strong>Heterogeneous architecture</strong> optimized for inference efficiency through neural architecture search</p></li><li><p><strong>Dynamic reasoning toggle</strong> allowing users to switch between standard chat and reasoning modes</p></li><li><p><strong>FFN Fusion</strong> technique to reduce sequential depth and improve inference latency</p></li><li><p><strong>Large-scale reinforcement learning</strong> pushing reasoning capabilities beyond teacher models</p></li><li><p><strong>FP8 inference generation</strong> for significantly improved throughput</p></li></ol><p>The models come in three sizes - Nano (8B), Super (49B), and Ultra (253B) - each optimized for specific deployment scenarios while maintaining strong reasoning capabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w_RB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!w_RB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 424w, https://substackcdn.com/image/fetch/$s_!w_RB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 848w, https://substackcdn.com/image/fetch/$s_!w_RB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 1272w, https://substackcdn.com/image/fetch/$s_!w_RB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w_RB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png" width="1175" height="586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/163118463?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" 
alt="" srcset="https://substackcdn.com/image/fetch/$s_!w_RB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 424w, https://substackcdn.com/image/fetch/$s_!w_RB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 848w, https://substackcdn.com/image/fetch/$s_!w_RB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 1272w, https://substackcdn.com/image/fetch/$s_!w_RB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3fff46e-bc88-4831-a4e3-d44329487b6c_1175x586.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">As of April 2025, LN-Ultra is the most &#8220;intelligent&#8221; open model according to Artificial Analysis. <a href="https://artificialanalysis.ai/">Source</a>.</figcaption></figure></div>
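<p>Of these, the reasoning toggle is the most user-visible: it is driven by a lightweight system-prompt switch, so a single deployed model can serve both chat and reasoning traffic. A minimal sketch of how a client might build the request is below; the exact control strings follow NVIDIA's published usage (&#8220;detailed thinking on/off&#8221;) and should be checked against the model card for the revision you deploy.</p>

```python
def build_messages(user_prompt: str, reasoning: bool = True) -> list[dict]:
    # Llama-Nemotron switches between standard chat and full reasoning
    # mode via the system prompt rather than via separate checkpoints.
    # The control strings below follow NVIDIA's published usage; verify
    # them against the model card for your model revision.
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Hard question: pay for a full chain-of-thought pass.
deep = build_messages("Prove that sqrt(2) is irrational.", reasoning=True)
# Routine request: same model, terse direct answer.
fast = build_messages("Summarize this paragraph in one line.", reasoning=False)
```

<p>Because the toggle lives in the prompt, routing between the two behaviors is a one-line decision in the serving layer instead of a second deployment.</p>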
      <p>
          <a href="https://www.llmwatch.com/p/llama-nemotron-nvidias-foundation">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[ThinkPRM: More Than Just Chain-of-Thought (CoT 2.0)]]></title><description><![CDATA[AI models that verify reasoning steps with Advanced Chain-of-Thought]]></description><link>https://www.llmwatch.com/p/thinkprm-more-than-just-chain-of</link><guid isPermaLink="false">https://www.llmwatch.com/p/thinkprm-more-than-just-chain-of</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Wed, 30 Apr 2025 12:59:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7d3e0fc8-28a1-4727-ab29-24ef5481f0ea_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As Artificial Intelligence (AI) advances, large language models (LLMs) like ChatGPT and Claude have become increasingly capable of solving complex problems through step-by-step reasoning. But their solutions are only as valuable as they are accurate. The <a href="https://www.arxiv.org/abs/2504.16828">ThinkPRM</a> paper introduces a breakthrough approach to efficiently verify AI reasoning processes. This addresses a critical challenge in AI: how to reliably check if an AI's step-by-step reasoning is correct without requiring enormous amounts of human-labeled data.</p><h2><strong>Why Verification Is Hard</strong></h2><p>When an LLM solves a complex math problem or writes computer code, it produces a chain of reasoning steps. Ensuring these steps are correct is crucial for applications in education, scientific research, and critical decision-making. The traditional approach to this verification relies on process reward models (PRMs) &#8211; specialized AI systems that score each step in a solution. Until now, there have been two main verification approaches:</p><ol><li><p><strong>Discriminative PRMs</strong>: These models classify each reasoning step as correct or incorrect. They're effective but require massive datasets with step-by-step human annotations &#8211; often hundreds of thousands of labeled examples. 
Creating this data is time-consuming and expensive.</p></li><li><p><strong>LLM-as-a-Judge</strong>: This approach prompts an existing LLM to evaluate solutions without additional training. While convenient, these models often struggle with complex reasoning tasks and can produce unreliable results. They frequently suffer from problems like "overthinking" (generating excessively long verifications) or getting stuck in repetitive loops.</p></li></ol><p>Despite advances in both approaches, verification systems face persistent challenges. Discriminative PRMs depend on extensive labeled data that's costly to create, while LLM-as-a-Judge approaches often make errors in complex reasoning scenarios and struggle with consistency. These limitations have constrained progress in developing reliable verification systems that can handle sophisticated reasoning tasks efficiently.</p><h2><strong>ThinkPRM: A Potential Breakthrough</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GXU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GXU4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 424w, https://substackcdn.com/image/fetch/$s_!GXU4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 848w, 
https://substackcdn.com/image/fetch/$s_!GXU4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 1272w, https://substackcdn.com/image/fetch/$s_!GXU4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GXU4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png" width="1103" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1103,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127292,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/162520549?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GXU4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 424w, https://substackcdn.com/image/fetch/$s_!GXU4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 
848w, https://substackcdn.com/image/fetch/$s_!GXU4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 1272w, https://substackcdn.com/image/fetch/$s_!GXU4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb6d61f-8644-4e56-bef3-694b77f2b9d7_1103x605.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>ThinkPRM (Process Reward Models That Think) is a new approach that fundamentally reimagines verification as a generative, 
reasoning-based task rather than a simple classification problem.</p><p>It works by leveraging the inherent reasoning abilities of language models to verify reasoning. Instead of merely classifying steps as correct or incorrect, it "thinks through" each step, generating detailed verification chains-of-thought (CoT) that explain why a step is right or wrong. Here's the innovative process:</p><ol><li><p><strong>Foundation</strong>: The researchers start with open-source reasoning models like R1-Distill-Qwen.</p></li><li><p><strong>Synthetic Data Generation</strong>: Rather than requiring extensive human annotations, they prompt a larger language model (QwQ-32B-Preview) to generate verification chains for a sample of problem solutions.</p></li><li><p><strong>Quality Filtering</strong>: They only keep verification chains that match known step-level labels, ensuring high-quality training data.</p></li><li><p><strong>Lightweight Training</strong>: The model is fine-tuned on this small but high-quality dataset &#8211; just 1,000 carefully filtered examples (representing about 8,000 step-level annotations).</p></li></ol><p>The result is a model that can carefully analyze each step in a solution, explaining its reasoning process and providing a judgment about correctness. 
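</p><p>Step 3 of this recipe, the quality filter, is what makes the synthetic data trustworthy, and it is simple enough to sketch directly. The helper below is illustrative only: it assumes the per-step verdicts have already been parsed out of each generated verification chain, which the paper's actual pipeline must do from free-form text.</p>

```python
def filter_verification_chains(chains, gold_step_labels):
    """Keep only synthetic chains whose step verdicts all match gold labels.

    `chains` is a list of verdict lists (True = step judged correct), one
    per generated verification chain; `gold_step_labels` are the known
    step-level labels for the same solution.
    """
    kept = []
    for verdicts in chains:
        # A chain must judge every step, and agree with the label on each.
        if len(verdicts) == len(gold_step_labels) and all(
            v == g for v, g in zip(verdicts, gold_step_labels)
        ):
            kept.append(verdicts)
    return kept

gold = [True, True, False]
chains = [
    [True, True, False],   # agrees on every step: kept
    [True, False, False],  # mislabels step 2: discarded
    [True, True],          # missed a step: discarded
]
print(filter_verification_chains(chains, gold))
```

<p>Filtering this aggressively is why roughly 1,000 surviving examples suffice: every kept chain is consistent with the gold annotations end to end.</p><p>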
Here's an abbreviated example of ThinkPRM verifying a math problem solution:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j_Ay!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j_Ay!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 424w, https://substackcdn.com/image/fetch/$s_!j_Ay!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 848w, https://substackcdn.com/image/fetch/$s_!j_Ay!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!j_Ay!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j_Ay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png" width="1114" height="1124" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1124,&quot;width&quot;:1114,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:280636,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/162520549?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j_Ay!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 424w, https://substackcdn.com/image/fetch/$s_!j_Ay!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 848w, https://substackcdn.com/image/fetch/$s_!j_Ay!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!j_Ay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff880e1bf-5e67-4e0d-a0d4-0f1564a71d68_1114x1124.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This "thinking" process provides transparency into the verification, making it easier to understand why a particular step is judged correct or incorrect.</p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;21b04201-d310-4b3d-a901-91aa7f2453f9&quot;,&quot;caption&quot;:&quot;Multi-Agent Systems built using Large Language Models (LLMs) have emerged as a promising approach to complex problem-solving. By orchestrating multiple specialized agents working in concert, these systems aim to accomplish tasks that might be challenging for a single agent. 
However, despite growing enthusiasm and investment in these Multi-Agent LLM Systems (MAS), research reveals that their performance often falls short of expectations.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Multi-Agent Failure: What It Is and How to Prevent It&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:48405812,&quot;name&quot;:&quot;Pascal Biese&quot;,&quot;bio&quot;:&quot;Human engineer reporting from the cutting edge of AI research.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/256e8620-1524-4496-a84d-7943a0edc098_512x512.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-29T11:56:32.849Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05525bc3-3095-4854-9007-c276f2b2d6fc_1536x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.llmwatch.com/p/multi-agent-failure-why-complex-ai&quot;,&quot;section_name&quot;:&quot;Deep Dives&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:162316430,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:11,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;LLM Watch&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d95c476-43a7-4447-9081-9298a1fc325a_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2><strong>Convincing Results</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!0u_S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0u_S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 424w, https://substackcdn.com/image/fetch/$s_!0u_S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 848w, https://substackcdn.com/image/fetch/$s_!0u_S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 1272w, https://substackcdn.com/image/fetch/$s_!0u_S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0u_S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png" width="1102" height="574" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55117820-0917-434b-87e6-a575a619cab7_1102x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1102,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/162520549?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0u_S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 424w, https://substackcdn.com/image/fetch/$s_!0u_S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 848w, https://substackcdn.com/image/fetch/$s_!0u_S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 1272w, https://substackcdn.com/image/fetch/$s_!0u_S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55117820-0917-434b-87e6-a575a619cab7_1102x574.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>ThinkPRM achieves exceptional performance while using drastically less training data:</p><h3><strong>1. Data Efficiency</strong></h3><p>Perhaps the most striking finding is that ThinkPRM performs better than discriminative PRMs trained on <strong>100 times more data</strong>. While traditional models required 700,000+ step-level annotations, the method achieves superior results with just 8,000 annotations.</p><h3><strong>2. 
Superior Performance</strong></h3><p>ThinkPRM outperforms both discriminative PRMs and LLM-as-a-judge approaches across multiple challenging benchmarks:</p><ul><li><p>On ProcessBench (a benchmark for identifying reasoning errors), ThinkPRM-14B achieves 86.5% F1 score compared to 73.7% for the LLM-as-a-judge approach using the same base model.</p></li><li><p>When used to guide search processes in solving MATH-500 problems, ThinkPRM-1.5B outperforms discriminative PRMs by approximately 5 percentage points.</p></li></ul><h3><strong>3. Generalization to New Domains</strong></h3><p>Despite being trained only on math problems, ThinkPRM demonstrates remarkable generalization to entirely different domains:</p><ul><li><p>On physics questions from GPQA-Diamond, ThinkPRM outperforms discriminative PRMs by 8 percentage points.</p></li><li><p>On code generation tasks from LiveCodeBench, it achieves a 4.5% advantage.</p></li></ul><p>This ability to generalize suggests that ThinkPRM is learning fundamental reasoning verification skills rather than domain-specific patterns.</p><h3><strong>4. Scalable Verification</strong></h3><p>A unique advantage of ThinkPRM is its ability to scale verification compute in two ways:</p><ol><li><p><strong>Parallel Scaling</strong>: Sampling multiple verification chains independently and aggregating their decisions improves accuracy by ~5 percentage points.</p></li><li><p><strong>Sequential Scaling</strong>: The model can "think longer" by extending its verification process, checking and revising its initial judgment. This capability allows ThinkPRM to continue improving as it's given more computation time.</p></li></ol><h2><strong>How This Changes AI Verification</strong></h2><p>ThinkPRM represents a fundamental shift in how we approach verification of complex reasoning:</p><h3><strong>From Classification to Reasoning</strong></h3><p>Traditional PRMs treat verification as a classification task &#8211; binary decisions about step correctness. 
ThinkPRM reframes verification as a reasoning task, where a model must think through and justify its evaluation. This approach is not just more data-efficient but also more aligned with how humans verify reasoning.</p><h3><strong>Transparency and Interpretability</strong></h3><p>Unlike black-box discriminative models, ThinkPRM's verification process is fully transparent. Users can read the model's verification chain to understand why it judged a step correct or incorrect. This transparency is crucial for applications where understanding the rationale behind verification decisions matters.</p><h3><strong>Low-Resource Adaptation</strong></h3><p>The remarkable data efficiency of ThinkPRM opens possibilities for creating specialized verifiers for niche domains where extensive labeled data is unavailable. This could democratize access to high-quality verification systems across diverse fields of expertise.</p><h2><strong>Challenges and Limitations</strong></h2><p>Despite its improvements, ThinkPRM still faces challenges:</p><ol><li><p><strong>Calibration</strong>: Like many LLMs, ThinkPRM can be overconfident, with scores clustering at extremes (near 0 or 1) rather than expressing appropriate uncertainty.</p></li><li><p><strong>Step Label Interference</strong>: Errors in verifying earlier steps can cascade, influencing the verification of later steps in the solution.</p></li><li><p><strong>Computational Overhead</strong>: Generating detailed verification chains requires more computation than simple discriminative judgments, though the performance benefits often justify this cost.</p></li></ol><h2><strong>The Broader Significance</strong></h2><p>ThinkPRM demonstrates a powerful principle: models can "think to verify" rather than simply "classify to verify." This represents a move toward more human-like verification systems that reason through solutions rather than making opaque judgments. The implications extend beyond academic research. 
As AI systems take on increasingly complex reasoning tasks in healthcare, scientific research, and critical infrastructure, reliable verification becomes essential. ThinkPRM's approach offers a path toward more trustworthy AI systems that can not only reason but also rigorously verify their reasoning processes.</p><h2><strong>Looking Forward</strong></h2><p>The ThinkPRM approach opens several exciting research directions:</p><ol><li><p><strong>Cross-domain verification</strong>: Further exploring how these models can generalize across different domains and types of reasoning tasks.</p></li><li><p><strong>Interactive verification</strong>: Developing systems that can ask clarifying questions when verification is uncertain.</p></li><li><p><strong>Self-correction</strong>: Using verification feedback to improve initial reasoning processes in a closed loop.</p></li><li><p><strong>Human-AI collaboration</strong>: Creating verification systems that can effectively collaborate with humans in complex reasoning tasks.</p></li></ol><h2><strong>Conclusion</strong></h2><p>ThinkPRM represents a significant advancement in AI verification technology, demonstrating that process reward models can achieve superior performance with dramatically less training data by leveraging generative, chain-of-thought reasoning. This "thinking verifier" approach aligns more closely with human verification practices and offers greater transparency into the verification process.</p><p>As AI systems tackle increasingly complex reasoning challenges, the ability to efficiently and reliably verify their work becomes ever more crucial. 
ThinkPRM shows that by teaching verification models to think through their judgments step by step, we can create more efficient, effective, and transparent verification systems &#8211; an essential step toward more trustworthy artificial intelligence.</p><div><hr></div><h3><strong>&#128077; If you enjoyed this article, give it a like and share it with your peers.</strong></h3><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.llmwatch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.llmwatch.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Multi-Agent Failure: What It Is and How to Prevent It]]></title><description><![CDATA[A Deep Dive into Failure Modes and System Design]]></description><link>https://www.llmwatch.com/p/multi-agent-failure-why-complex-ai</link><guid isPermaLink="false">https://www.llmwatch.com/p/multi-agent-failure-why-complex-ai</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Tue, 29 Apr 2025 11:56:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/05525bc3-3095-4854-9007-c276f2b2d6fc_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Multi-Agent Systems built using Large Language Models (LLMs) have emerged as a promising approach to complex problem-solving. By orchestrating multiple specialized agents working in concert, these systems aim to accomplish tasks that might be challenging for a single agent. 
However, despite growing enthusiasm and investment in these Multi-Agent LLM Systems (MAS), research reveals that their performance often falls short of expectations.</p><p>A <a href="https://arxiv.org/abs/2503.13657">recent study</a> presents a systematic analysis of failure patterns in MAS and introduces MAST (Multi-Agent System Failure Taxonomy), a comprehensive framework for understanding why these systems break down. In this deep dive, we'll explore the key findings from their research, what they tell us about building reliable Multi-Agent Systems, and where the field might go next.</p><h2><strong>The Problem: Promise vs. Reality</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yqri!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yqri!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 424w, https://substackcdn.com/image/fetch/$s_!Yqri!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 848w, https://substackcdn.com/image/fetch/$s_!Yqri!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 1272w, https://substackcdn.com/image/fetch/$s_!Yqri!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yqri!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png" width="634" height="517" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:517,&quot;width&quot;:634,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61939,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/162316430?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yqri!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 424w, https://substackcdn.com/image/fetch/$s_!Yqri!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 848w, https://substackcdn.com/image/fetch/$s_!Yqri!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 1272w, https://substackcdn.com/image/fetch/$s_!Yqri!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b9e756-cf20-4767-beea-3e7acc753a23_634x517.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Despite their theoretical advantages, Multi-Agent Systems often disappoint in practice. The researchers found that even state-of-the-art MAS frameworks show high failure rates across popular benchmarks. For instance:</p><ul><li><p>ChatDev (ProgramDev): Models a miniature software company - agents take on design, coding and QA roles in sequence - but only gets about one-third of programming tasks right (33.3% correctness).</p></li><li><p>AppWorld (Test-C): Treats each everyday service (email, music, calendar, etc.) 
as its own agent under a supervisor orchestrator, yet fails 86.7% of its cross-app test cases.</p></li><li><p>HyperAgent (SWE-Bench Lite): Uses a central planner to hand off subtasks to navigator, editor and executor agents in a hierarchical software-engineering workflow, and still fails on nearly three-quarters of its problems (a 74.7% failure rate).</p></li></ul><p>These sobering statistics raise a critical question: Why do systems with sophisticated architectures and powerful underlying LLMs fail so frequently?</p>
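<p>The benchmark numbers above are simple failure ratios over annotated runs. As a toy illustration of how such a breakdown is computed - the trace data below is invented, and the category names only loosely follow MAST's top-level groups - one might tally failure modes over annotated runs like this:</p>

```python
from collections import Counter

# Hypothetical annotated runs: None marks a success, a string marks the
# MAST-style top-level failure category assigned by an annotator.
traces = [
    None,
    "inter-agent misalignment",
    "specification issues",
    "inter-agent misalignment",
    "task verification",
    None,
    "inter-agent misalignment",
    "specification issues",
]

failures = [t for t in traces if t is not None]
failure_rate = len(failures) / len(traces)   # 6 of 8 runs failed
breakdown = Counter(failures)

print(f"failure rate: {failure_rate:.1%}")
for category, count in breakdown.most_common():
    print(f"  {category}: {count / len(failures):.0%} of failures")
```

<p>A correctness figure like ChatDev's 33.3% is exactly this kind of aggregate; the taxonomy then explains <em>where</em> the failures concentrate.</p>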
      <p>
          <a href="https://www.llmwatch.com/p/multi-agent-failure-why-complex-ai">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[d1: Scaling Reasoning in Diffusion Large Language Models]]></title><description><![CDATA[Learn what's great about Diffusion LLMs and how they are different from Transformers]]></description><link>https://www.llmwatch.com/p/d1-scaling-reasoning-in-diffusion</link><guid isPermaLink="false">https://www.llmwatch.com/p/d1-scaling-reasoning-in-diffusion</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Tue, 22 Apr 2025 14:33:34 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7bc0b1d8-211e-4c8c-b4d1-70c697f642fd_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>TL;DR &#8212; Why <a href="https://arxiv.org/abs/2504.12216">this paper</a> matters</h3><p>Large language models that reason well are usually trained and fine&#8209;tuned in an <em>autoregressive</em> (left&#8209;to&#8209;right) way. <em>d1</em> shows, for the first time, that the same reinforcement&#8209;learning tricks that lifted autoregressive models to GPT&#8209;4&#8209;class reasoning can also lift <em>diffusion</em> language models, which generate text in a coarse&#8209;to&#8209;fine, non&#8209;sequential fashion. The authors introduce two key ingredients - <strong>masked supervised fine&#8209;tuning (SFT)</strong> and a new RL algorithm called <strong>diffu&#8209;GRPO</strong> - and demonstrate big gains on math and logic benchmarks without changing the base model size. </p><div><hr></div><h2>Reasoning beyond left&#8209;to&#8209;right text</h2><p>Most large language models we use today - GPT, Gemini, Claude - generate text one token after another in a single left&#8209;to&#8209;right pass. That <strong>autoregressive</strong> habit is intuitive for us humans, but it is not the only way to produce language. 
A fast&#8209;growing line of research borrows ideas from image diffusion models: start with a noisy, fully masked sentence and <em>denoise</em> it in several sweeps, filling in blanks until a coherent text emerges. The resulting <strong>diffusion LLMs (dLLMs)</strong>, such as LLaDA and Dream, can look at <em>future</em> context while writing the present token, often need fewer decoding steps, and open the door to parallel generation on hardware.</p><p>Yet all the spectacular reasoning progress you might have seen in DeepSeek&#8209;R1 or Kimi K1.5 was achieved with reinforcement learning (RL) algorithms - PPO, GRPO, variants of policy gradients - tailored to left&#8209;to&#8209;right models. Those methods rely on computing token&#8209;by&#8209;token log&#8209;probabilities. In dLLMs, the log&#8209;probability of a sentence is <em>not</em> factorized that way, so you cannot drop PPO into a diffusion model and hope it works.</p><p>That gap is exactly what <strong>d1 </strong>tries to close. The authors show that with just an 8&#8209;billion&#8209;parameter backbone and careful post&#8209;training, a diffusion model can rival or beat comparable autoregressive models on GSM8K math, MATH500 competition problems, mini&#8209;Sudoku, and the classic Countdown numbers game.</p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b1e1c51b-9727-4d71-b4b5-d47525ab1d57&quot;,&quot;caption&quot;:&quot;Welcome to State of AI Agents! Last week marked a decisive turning point for AI agents, with multiple tech giants unveiling ambitious agent strategies, frameworks, and deployments. 
This shift signals a fundamental transformation in how AI will be integrated into software and services going forward.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;State of AI Agents: Google Goes All-In on Agents&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:48405812,&quot;name&quot;:&quot;Pascal Biese&quot;,&quot;bio&quot;:&quot;Human engineer reporting from the cutting edge of AI research.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/256e8620-1524-4496-a84d-7943a0edc098_512x512.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-13T18:35:07.997Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d05d9b8-ac4b-4c52-b87e-fe91ad976ddf_1610x1104.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.llmwatch.com/p/state-of-ai-agents-google-goes-all&quot;,&quot;section_name&quot;:&quot;State of AI Agents&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:161195929,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:17,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;LLM Watch&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d95c476-43a7-4447-9081-9298a1fc325a_1280x1280.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>The d1 recipe at a glance</h2><p>The proposed pipeline has only two stages:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!lbw1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lbw1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 424w, https://substackcdn.com/image/fetch/$s_!lbw1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 848w, https://substackcdn.com/image/fetch/$s_!lbw1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 1272w, https://substackcdn.com/image/fetch/$s_!lbw1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lbw1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png" width="803" height="229" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:229,&quot;width&quot;:803,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18316,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/161567535?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lbw1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 424w, https://substackcdn.com/image/fetch/$s_!lbw1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 848w, https://substackcdn.com/image/fetch/$s_!lbw1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 1272w, https://substackcdn.com/image/fetch/$s_!lbw1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F859318f2-b038-4a4f-b1bc-50229d14fe06_803x229.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Why not merge the stages? 
In practice, SFT gives the model a basic &#8220;chain&#8209;of&#8209;thought grammar&#8221; - self&#8209;checks, backtracking, tidy XML tags - making the subsequent RL much more stable. The recipe is therefore called <strong>d1</strong>: &#8220;diffusion + one&#8209;two punch of SFT then RL.&#8221;</p>
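<p>For intuition, the coarse-to-fine decoding that dLLMs use can be sketched as a loop that starts from a fully masked sequence and, on each sweep, commits the masked positions where the denoiser is most confident. Everything below - the toy scorer, vocabulary, and unmasking schedule - is a stand-in invented for illustration, not the actual LLaDA or d1 decoding code:</p>

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat"]
rng = random.Random(0)

def toy_scorer(seq, pos):
    """Stand-in for the denoiser: propose a (token, confidence) pair for a
    masked position, conditioned on the whole partially filled sequence.
    A real dLLM would run a bidirectional transformer here."""
    return rng.choice(VOCAB), rng.random()

def diffusion_decode(length=6, sweeps=3):
    seq = [MASK] * length            # start from fully masked "noise"
    per_sweep = max(1, length // sweeps)
    while MASK in seq:
        # score every masked position, then commit the most confident ones
        scored = [(i, *toy_scorer(seq, i)) for i, t in enumerate(seq) if t == MASK]
        scored.sort(key=lambda s: s[2], reverse=True)
        for i, token, _ in scored[:per_sweep]:
            seq[i] = token
    return seq

decoded = diffusion_decode()
```

<p>The key contrast with autoregressive decoding is that every masked position is scored against the <em>full</em> context on each sweep, which is what lets a diffusion model fill in the middle of a sentence in light of its end.</p>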
      <p>
          <a href="https://www.llmwatch.com/p/d1-scaling-reasoning-in-diffusion">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Augmented Work: The AI Teammates Are Coming]]></title><description><![CDATA[How AI and humans learn to work together over time]]></description><link>https://www.llmwatch.com/p/augmented-work-the-ai-teammates-are</link><guid isPermaLink="false">https://www.llmwatch.com/p/augmented-work-the-ai-teammates-are</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Wed, 16 Apr 2025 16:17:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a02e3668-77a8-435c-8a88-48a1f5c2766d_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The collaboration between humans and AI is no longer science fiction and will become an everyday reality sooner rather than later. From healthcare to creative design, AI systems are increasingly working alongside humans as teammates rather than mere tools. But how do these partnerships develop over time? How do they adapt to challenges? And what makes some human-AI teams thrive while others struggle?</p><p>Earlier today, I stumbled upon a new paper by <a href="https://arxiv.org/abs/2504.10918">Wang and colleagues</a> that&#8217;s trying to find answers to a lot of these questions. Through the lens of "human-agent teaming" (HAT), they are offering their perspective on how these collaborations might evolve. But rather than viewing HAT as a static arrangement, they approach it as a dynamic process that unfolds over time (&#8220;process dynamics perspective&#8221;). Let&#8217;s take a closer look ourselves.</p><h2><strong>Beyond the Tool Paradigm: AI as Teammate</strong></h2><p>For decades, our relationship with technology has largely followed a tool-user paradigm - humans wielded technology like a hammer or operated it like a vehicle. 
But today's AI systems, particularly those powered by large language models, possess unprecedented levels of autonomy, social capability, and proactiveness.</p><p>This has sparked interest in a new paradigm: human-agent teaming. In HAT, AI systems aren't just passive tools but active participants that share goals, distribute responsibilities, and engage in ongoing coordination with their human counterparts.</p><p>As the authors note: "Human-agent teaming (HAT) is defined as a collaborative framework in which humans and agents pursue shared goals, distribute responsibilities, and engage in ongoing coordination and negotiation to achieve joint objectives."</p><p>Whether one agrees with this perspective or not, current AI definitely <em>feels</em> different, and this will ultimately change how we design and study human-AI interaction. Rather than focusing solely on usability or performance, researchers now need to consider team dynamics, shared understanding, and adaptive capacity.</p><p>From a practical point of view, we&#8217;ve only just begun to make the shift from static AI workflows to AI agents to Multi-Agent Systems. It&#8217;s already clear that AI (agent) orchestration will become a <em>very important</em> topic in the near future - but we often ignore the <em>human factor</em> when talking about these systems. I would argue, however, that humans will continue to play a key role in a lot of real-world settings, especially in high-risk domains. 
So thinking about Multi-Agent Systems in isolation won&#8217;t be enough - and frameworks like HAT try to fill this gap.</p><h2><strong>The T4 Framework: A Lifecycle View of Human-AI Teams</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jNwH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jNwH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 424w, https://substackcdn.com/image/fetch/$s_!jNwH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 848w, https://substackcdn.com/image/fetch/$s_!jNwH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 1272w, https://substackcdn.com/image/fetch/$s_!jNwH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jNwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png" width="1147" height="521" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:521,&quot;width&quot;:1147,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/161466481?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jNwH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 424w, https://substackcdn.com/image/fetch/$s_!jNwH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 848w, https://substackcdn.com/image/fetch/$s_!jNwH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 1272w, https://substackcdn.com/image/fetch/$s_!jNwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686208ba-4446-43b0-8f6b-c03484bd9142_1147x521.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button></div></div></div></a><figcaption class="image-caption">The T4 framework of HAT process dynamics. 
<a href="https://arxiv.org/abs/2504.10918">Source</a>.</figcaption></figure></div><p>One of the researchers&#8217; core contributions is the T4 framework &#8211; a comprehensive model that views HAT through four developmental phases:</p><ol><li><p><strong>Team Formation</strong>: Establishing team identity and shared goals</p></li><li><p><strong>Task and Role Development</strong>: Defining who does what and building individual competence</p></li><li><p><strong>Team Development</strong>: Creating shared understanding and effective coordination</p></li><li><p><strong>Team Improvement</strong>: Building adaptability and long-term viability</p></li></ol><p>What makes this framework powerful is how it integrates two key dynamics:</p><ul><li><p><strong>Task dynamics</strong>: The cyclical process through which team members set goals, execute tasks, evaluate outcomes, and adjust strategies</p></li><li><p><strong>Team developmental dynamics</strong>: The iterative progression through the four phases</p></li></ul><p>These dynamics don't operate in isolation but continuously influence each other. As the team completes tasks, it progresses through developmental phases. Meanwhile, each phase shapes how the team approaches its tasks.</p><p>The ultimate goal? A self-managing, self-regulating team capable of adapting to new challenges without external intervention.</p>
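<p>To make the interplay of the two dynamics concrete, here is a minimal sketch of the T4 lifecycle as a state machine: each completed task cycle (set goals, execute, evaluate, adjust) counts toward phase progression. The progression rule and threshold below are invented for illustration - the paper does not prescribe them:</p>

```python
from enum import Enum

class Phase(Enum):
    TEAM_FORMATION = 1     # establish team identity and shared goals
    TASK_AND_ROLE = 2      # define who does what, build competence
    TEAM_DEVELOPMENT = 3   # shared understanding and coordination
    TEAM_IMPROVEMENT = 4   # adaptability and long-term viability

def run_team(cycle_outcomes, cycles_to_advance=2):
    """Advance through the T4 phases as task cycles complete.
    `cycle_outcomes` holds one boolean per task cycle (True means the
    evaluate step judged the cycle successful)."""
    phase, streak, history = Phase.TEAM_FORMATION, 0, []
    for ok in cycle_outcomes:
        history.append(phase)
        streak = streak + 1 if ok else 0   # a failed cycle resets progress
        if streak >= cycles_to_advance and phase is not Phase.TEAM_IMPROVEMENT:
            phase, streak = Phase(phase.value + 1), 0
    return history, phase

history, final_phase = run_team([True, True, False, True, True, True, True])
```

<p>The point of the toy is the coupling: task dynamics (the cycle outcomes) drive developmental dynamics (the phase transitions), while the current phase would, in a real team, shape how the next task cycle is approached.</p>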
      <p>
          <a href="https://www.llmwatch.com/p/augmented-work-the-ai-teammates-are">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[DeepSeek-GRM: What It Is and Why You Should Care]]></title><description><![CDATA[A new generalist reward model from DeepSeek AI]]></description><link>https://www.llmwatch.com/p/deepseek-grm-what-it-is-and-why-you</link><guid isPermaLink="false">https://www.llmwatch.com/p/deepseek-grm-what-it-is-and-why-you</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Tue, 08 Apr 2025 16:45:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7712df91-9f91-4427-962c-3bcf88a1c095_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At the heart of recent AI advancements lies reward modeling&#8212;a mechanism for providing accurate reward signals to guide model optimization. A few days ago, a <a href="https://arxiv.org/abs/2504.02495v1">new paper</a> from DeepSeek AI introduced a novel approach that significantly advances the field by making reward models more effective and scalable across diverse domains. </p><p>While we've seen impressive progress in reward modeling for specific domains with clear verification criteria (like mathematics or coding), generating high-quality rewards for general domains remains challenging. 
</p><p>This might not be another &#8220;DeepSeek moment&#8221;, but it&#8217;s a pretty significant contribution that, for some reason, hasn&#8217;t really been picked up yet by the wider AI community&#8212;which is why I decided to write about it.</p><p>In this article, we'll explore the key innovations of their research and explain the technical details of the proposed method, Self-Principled Critique Tuning (SPCT), as well as inference-time scaling for reward models.</p><p>Before we dive into it though, let&#8217;s start with an (optional) executive summary that will give you a better idea of what DeepSeek-GRM is about.</p><div><hr></div><h2>Executive Summary</h2><p>DeepSeek AI&#8217;s latest release focuses on improving reward modeling (RM) for large language models (LLMs) through "inference-time scaling." In simple terms, it's about how to make reward models get better results by using more compute at inference time, rather than just making bigger models during training. </p><h3>Key themes and contributions:</h3><ol><li><p>Introducing Self-Principled Critique Tuning (SPCT), a method for generalist reward modeling</p></li><li><p>Developing a reward modeling approach using pointwise generative reward modeling (GRM)</p></li><li><p>Showing how to effectively scale inference computation for better reward quality</p></li><li><p>Creating DeepSeek-GRM models that outperform existing approaches</p></li><li><p>Introducing a meta RM to guide the voting process for better scaling performance</p></li></ol><h3>Main challenges addressed:</h3><ol><li><p>Flexibility for different input types in reward modeling</p></li><li><p>Accurate reward generation in various domains</p></li><li><p>Scaling reward quality with increased inference compute</p></li><li><p>Learning scalable behaviors for better performance-compute scaling</p></li></ol><h3>Key methods:</h3><ol><li><p>Pointwise generative reward modeling (GRM)</p></li><li><p>Self-Principled Critique Tuning (SPCT) with:</p><ul><li><p>Rejective 
fine-tuning</p></li><li><p>Rule-based online reinforcement learning</p></li></ul></li><li><p>Inference-time scaling through:</p><ul><li><p>Parallel sampling</p></li><li><p>Meta RM-guided voting</p></li></ul></li></ol><h3>Key results:</h3><ol><li><p>DeepSeek-GRM outperforms baseline methods on various reward modeling benchmarks</p></li><li><p>Inference-time scaling provides significant performance improvements</p></li><li><p>Voting with meta RM guidance further boosts performance</p></li><li><p>Inference-time scaling can outperform training-time scaling (using larger models)</p></li></ol><div><hr></div><h2>The Challenge: Building Better Generalist Reward Models</h2><p>Reward modeling serves as the backbone of Reinforcement Learning from Human Feedback (RLHF), providing the signals that guide LLMs toward generating more helpful, harmless, and honest responses. While high-quality rewards exist for specific domains with clear verification criteria (like mathematics or coding), generating accurate rewards for general domains remains challenging. 
The researchers identify four key requirements for effective generalist reward models: </p><ol><li><p><strong>Flexibility</strong> to handle different input types (single responses, paired comparisons, or multiple candidates)</p></li><li><p><strong>Accuracy</strong> in generating rewards across diverse domains</p></li><li><p><strong>Inference-time scalability</strong> to improve reward quality with more compute</p></li><li><p><strong>Learning scalable behaviors</strong> that enhance performance as compute increases</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZHCB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZHCB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 424w, https://substackcdn.com/image/fetch/$s_!ZHCB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 848w, https://substackcdn.com/image/fetch/$s_!ZHCB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 1272w, https://substackcdn.com/image/fetch/$s_!ZHCB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZHCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png" width="897" height="317" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:897,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36911,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/160838114?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZHCB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 424w, https://substackcdn.com/image/fetch/$s_!ZHCB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 848w, https://substackcdn.com/image/fetch/$s_!ZHCB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 1272w, https://substackcdn.com/image/fetch/$s_!ZHCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a129818-25da-4647-bc0d-f577d4ae484e_897x317.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Paradigms of reward modeling. 
<a href="https://arxiv.org/abs/2504.02495v1">Source</a>.</figcaption></figure></div><p>Existing approaches to reward modeling fall into several categories, each with limitations: </p><ul><li><p><strong>Scalar reward models</strong>: Output numerical scores but lack detailed reasoning</p></li><li><p><strong>Semi-scalar reward models</strong>: Generate both scores and critiques but struggle with scaling</p></li><li><p><strong>Generative reward models</strong>: Provide detailed textual critiques but need methods to scale effectively</p></li></ul><p>Additionally, reward models use either pointwise scoring (rating each response individually) or pairwise scoring (selecting the better of two responses). They found that pointwise generative reward models (GRMs) offer the best combination of flexibility and scalability potential. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!da1a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!da1a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 424w, https://substackcdn.com/image/fetch/$s_!da1a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 848w, https://substackcdn.com/image/fetch/$s_!da1a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 1272w, 
https://substackcdn.com/image/fetch/$s_!da1a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!da1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png" width="709" height="319" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:319,&quot;width&quot;:709,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32319,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/160838114?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!da1a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 424w, https://substackcdn.com/image/fetch/$s_!da1a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 848w, https://substackcdn.com/image/fetch/$s_!da1a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 1272w, 
https://substackcdn.com/image/fetch/$s_!da1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b164d61-ee1e-40d9-8a53-1ccbdcba6ca4_709x319.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Scoring patterns. <a href="https://arxiv.org/abs/2504.02495v1">Source</a>.</figcaption></figure></div><h2>The Power of Principles in Reward Generation</h2><p>A key insight from the paper is that quality principles significantly improve reward generation. 
In preliminary experiments, the researchers discovered that properly selected principles boosted reward quality on benchmarks like Chat Hard and IFEval. Rather than using static principles, they explored allowing the model to dynamically generate principles tailored to each query and response pair. For example, when evaluating responses to a coding question, the model might generate principles like: </p><ol><li><p><strong>Code Correctness</strong> (Weight: 40%)</p></li><li><p><strong>Code Efficiency</strong> (Weight: 25%)</p></li><li><p><strong>Documentation Quality</strong> (Weight: 20%)</p></li><li><p><strong>Error Handling</strong> (Weight: 15%)</p></li></ol><p>These weighted principles create a structured framework for generating comprehensive critiques and more accurate rewards. </p><h2>Self-Principled Critique Tuning: Teaching Models to Generate Better Principles and Critiques</h2><p>The core contribution of this paper is Self-Principled Critique Tuning (SPCT), a method that enables generative reward models to learn to generate adaptive, high-quality principles that guide critique generation. 
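</p><p>As a rough illustration of the idea, here is a minimal sketch of how such weighted principles could be collapsed into a single pointwise reward. The principle names, weights, and 0-10 score scale mirror the coding example above; none of this is the paper's actual code.</p>

```python
# Hypothetical sketch: collapsing weighted, model-generated principles into a
# single pointwise reward. Names, weights, and the 0-10 scale are assumptions.

def aggregate_reward(principle_scores: dict[str, float],
                     principle_weights: dict[str, float]) -> float:
    """Combine per-principle scores (0-10) into one weighted reward."""
    assert abs(sum(principle_weights.values()) - 1.0) < 1e-6, "weights must sum to 1"
    return sum(principle_weights[name] * score
               for name, score in principle_scores.items())

weights = {"code_correctness": 0.40, "code_efficiency": 0.25,
           "documentation_quality": 0.20, "error_handling": 0.15}
scores = {"code_correctness": 9.0, "code_efficiency": 6.0,
          "documentation_quality": 7.0, "error_handling": 5.0}
reward = aggregate_reward(scores, weights)  # 0.40*9 + 0.25*6 + 0.20*7 + 0.15*5 = 7.25
```

<p>Because the model regenerates the weights per query, a different question would shift which qualities dominate the final reward.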
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z6_3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z6_3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 424w, https://substackcdn.com/image/fetch/$s_!z6_3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 848w, https://substackcdn.com/image/fetch/$s_!z6_3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 1272w, https://substackcdn.com/image/fetch/$s_!z6_3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z6_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png" width="1091" height="251" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:251,&quot;width&quot;:1091,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/160838114?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z6_3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 424w, https://substackcdn.com/image/fetch/$s_!z6_3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 848w, https://substackcdn.com/image/fetch/$s_!z6_3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 1272w, https://substackcdn.com/image/fetch/$s_!z6_3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ab98a3-7c3d-4e81-8f63-e72484604730_1091x251.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A rather inaccessible visualization of SPCT training. 
<a href="https://arxiv.org/abs/2504.02495v1">Source</a>.</figcaption></figure></div><p>If this image is as confusing to you as it was to me: don&#8217;t worry, we&#8217;ll go through it together&#8212;step by step. </p><p>See the &#8220;RFT&#8221; and &#8220;RL&#8221; on the left? RFT stands for &#8220;Rejective Fine-Tuning&#8221; and RL for &#8220;Reinforcement Learning&#8221;, or more specifically in this case, &#8220;Rule-Based Online Reinforcement Learning&#8221;. Those are the two phases of SPCT training. </p><h3>Phase 1: Rejective Fine-Tuning (RFT)</h3><p>The first phase prepares the model to generate well-formatted principles and critiques for various input types: </p><ol><li><p><strong>Data construction</strong>: Sampling trajectories (principles and critiques) from pretrained GRMs for queries and responses</p></li><li><p><strong>Rejection strategy</strong>: Discarding trajectories where predicted rewards don't align with ground truth, and removing "too easy" queries where all sampled trajectories are correct</p></li><li><p><strong>Optional hinting</strong>: For challenging cases, providing hints about the correct answer, though this sometimes leads to "shortcuts" in reasoning</p></li></ol><p>This phase creates a unified format for handling different numbers of responses, unlike previous approaches that used different formats for different scenarios. 
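</p><p>The rejection strategy boils down to a simple filter over sampled trajectories. Here is a hedged sketch: the <code>Trajectory</code> record and the correctness rule (predicted best response matches the labeled one) are illustrative assumptions, not DeepSeek's implementation.</p>

```python
# Illustrative sketch of the rejective fine-tuning data filter.
# Trajectory and the correctness rule are assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class Trajectory:
    query_id: int
    predicted_best: int     # response index the sampled critique ranked highest
    ground_truth_best: int  # response index preferred in the labeled data

def rejective_filter(trajectories: list[Trajectory]) -> list[Trajectory]:
    def correct(t: Trajectory) -> bool:
        return t.predicted_best == t.ground_truth_best
    # Group samples by query so we can spot queries that are "too easy".
    by_query: dict[int, list[Trajectory]] = {}
    for t in trajectories:
        by_query.setdefault(t.query_id, []).append(t)
    too_easy = {q for q, ts in by_query.items() if all(correct(t) for t in ts)}
    # Keep only correct trajectories from queries that were not trivially easy.
    return [t for t in trajectories if correct(t) and t.query_id not in too_easy]
```

<p>The surviving trajectories form the fine-tuning set: only correct reasoning, and only on queries hard enough that the pretrained GRM sometimes got them wrong.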
</p><h3>Phase 2: Rule-Based Online Reinforcement Learning (RL)</h3><p> The second phase refines the model's ability through online reinforcement learning: </p><ol><li><p>During rollout, the model generates principles and critiques for input queries and responses</p></li><li><p>The predicted rewards are extracted and compared to ground truth</p></li><li><p>The model receives binary rewards: +1 for correctly identifying the best response, -1 otherwise</p></li><li><p>The reinforcement learning objective includes a KL penalty to maintain format quality</p></li></ol><p>This approach encourages the model to develop principles that enable accurate discrimination between responses while avoiding severe biases. </p><h2>Scaling at Inference Time: More Compute, Better Results</h2><p>Once trained with SPCT, DeepSeek-GRM models demonstrate remarkable inference-time scalability through two key methods: </p><h3>1. Voting with Generated Rewards</h3><p> The basic approach involves: </p><ol><li><p>Running the model multiple times with different random seeds (parallel sampling)</p></li><li><p>Having each sample generate its own set of principles and critiques</p></li><li><p>Extracting numerical scores from each run</p></li><li><p>Summing the scores to determine the final ranking</p></li></ol><p>Since each sample generates different principles, the model evaluates responses from multiple perspectives, leading to more robust judgments. This process effectively expands the reward space, allowing for finer distinctions between responses. </p><h3>2. 
Meta Reward Modeling for Better Voting</h3><p> To further enhance performance, the researchers introduced a meta reward model that: </p><ol><li><p>Evaluates the quality of principles and critiques generated by the main reward model</p></li><li><p>Assigns scores to indicate which samples are more reliable</p></li><li><p>Guides the voting process by filtering out low-quality samples</p></li></ol><p>This meta RM is trained to identify the correctness of principles and critiques, using both positive and negative examples to improve performance. </p><h2>Impressive Results Across Diverse Benchmarks</h2><p>The researchers evaluated DeepSeek-GRM models on multiple reward modeling benchmarks including Reward Bench, PPE (both Preference and Correctness), RMB, and ReaLMistake. The results demonstrate several significant findings: </p><h3>DeepSeek-GRM Outperforms Baseline Methods</h3><p>Even without inference-time scaling, DeepSeek-GRM-27B (based on Gemma-2-27B) achieved superior overall performance compared to baseline methods like LLM-as-a-Judge and various scalar and semi-scalar reward models. Unlike scalar and semi-scalar reward models that showed strong biases (performing well on some benchmarks but poorly on others), DeepSeek-GRM demonstrated more consistent performance across diverse domains. </p><h3>Inference-Time Scaling Significantly Improves Performance</h3><p> The most striking results come from inference-time scaling: </p><ul><li><p>With 8 samples, DeepSeek-GRM-27B showed a performance increase of 2.7 percentage points</p></li><li><p>With meta RM guidance and 32 samples, it achieved a remarkable 4.9 percentage point improvement</p></li><li><p>This scaled performance reached levels competitive with or exceeding much larger models like GPT-4o and Nemotron-4-340B-Reward</p></li></ul><p> What's particularly noteworthy is that these improvements far exceeded those of baseline methods under similar scaling conditions. </p><h3>Inference-Time Scaling vs. 
Training-Time Scaling</h3><p> Perhaps most importantly, the researchers showed that inference-time scaling can be more effective than simply using larger models: </p><ul><li><p>DeepSeek-GRM-27B with meta RM-guided voting (32 samples) achieved performance comparable to a 671B parameter model using greedy decoding</p></li><li><p>This suggests that allocating more compute to inference might be more cost-effective than training larger models</p></li></ul><p> This finding challenges the conventional wisdom that "bigger is better" and offers a more nuanced view of compute allocation. </p><h2>Why SPCT Works: Technical Analysis</h2><p> The effectiveness of SPCT stems from several key design choices: </p><ol><li><p><strong>Unified pointwise GRM format</strong> enables flexible handling of different input types within the same model</p></li><li><p><strong>Self-generated principles</strong> create a tailored framework for critique that focuses on relevant criteria</p></li><li><p><strong>Rule-based reinforcement learning</strong> teaches the model to generate principles that enable effective discrimination</p></li><li><p><strong>Parallel sampling with diverse principles</strong> evaluates responses from multiple perspectives</p></li><li><p><strong>Quality control via meta RM</strong> ensures only reliable samples contribute to the final decision</p></li></ol><p>The principles serve as a form of chain-of-thought reasoning, helping the model organize its evaluation process and producing more consistent rewards. 
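</p><p>To make the sampling-and-voting recipe above concrete, here is a toy sketch. Each "sample" stands in for one parallel GRM run's pointwise scores, and the meta-RM reliability scores are mocked; in the real system both would come from generated principles and critiques.</p>

```python
# Toy sketch of inference-time scaling via voting with meta-RM filtering.
# The per-sample scores and meta-RM judgments below are mocked stand-ins.

def vote(samples: list[dict[str, float]],
         meta_rm_scores: list[float],
         top_k: int) -> str:
    """Sum scores over the top_k most reliable samples; return the winner."""
    ranked = sorted(range(len(samples)),
                    key=lambda i: meta_rm_scores[i], reverse=True)
    totals: dict[str, float] = {}
    for i in ranked[:top_k]:
        for response, score in samples[i].items():
            totals[response] = totals.get(response, 0.0) + score
    return max(totals, key=totals.get)

# Four parallel samples scoring responses "A" and "B"; the meta RM flags the
# last sample (an outlier favoring B) as unreliable.
samples = [{"A": 8.0, "B": 6.0}, {"A": 7.0, "B": 5.0},
           {"A": 6.0, "B": 7.0}, {"A": 2.0, "B": 9.0}]
meta_scores = [0.9, 0.8, 0.7, 0.1]
best = vote(samples, meta_scores, top_k=3)
```

<p>In this toy setup, filtering to the three most reliable samples picks A, while naive voting over all four would let the outlier flip the result to B, which is exactly the failure mode the meta RM is there to prevent.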
</p><h2>Limitations and Future Directions</h2><p> Despite its impressive performance, the approach has several limitations: </p><ol><li><p><strong>Efficiency challenges</strong>: Generative reward models require more computation than scalar models for each inference</p></li><li><p><strong>Domain-specific performance</strong>: DeepSeek-GRM still lags behind scalar models on some verifiable tasks</p></li><li><p><strong>Long-horizon reasoning</strong>: The researchers found that DeepSeek-R1, which uses extensive chain-of-thought reasoning, underperformed relative to its computational cost except on reasoning-intensive tasks</p></li></ol><p> These limitations point to several promising future directions: </p><ol><li><p><strong>Tool integration</strong>: Incorporating tools like code interpreters could enhance verification capabilities</p></li><li><p><strong>Efficiency improvements</strong>: Generating principles ahead of time could improve computational efficiency</p></li><li><p><strong>LLM evaluation applications</strong>: Using generated principles as interpretable criteria for identifying LLM weaknesses</p></li><li><p><strong>Co-scaling with policy models</strong>: Combining scalable rewards with scalable policies for even greater performance</p></li></ol><h2>Conclusion: A Significant Advance in Reward Modeling</h2><p>Self-Principled Critique Tuning represents a significant advance in reward modeling for LLMs. By enabling generative reward models to dynamically produce principles tailored to specific queries and responses, SPCT creates models that are both flexible across diverse inputs and effectively scalable with increased inference compute. </p><p>The impressive performance of DeepSeek-GRM models, particularly with inference-time scaling, suggests that this approach could become an important component of future LLM training pipelines. 
The finding that inference-time scaling can outperform training-time scaling (using larger models) is particularly noteworthy, offering a potentially more efficient path to improved reward modeling. </p><p>In a field where much attention focuses on scaling model size, this research offers a refreshing perspective: sometimes, using compute more intelligently at inference time can be more effective than simply building bigger models. </p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e948ad37-9162-4c71-890c-b524aaa1fed6&quot;,&quot;caption&quot;:&quot;Welcome to the first article of this new series! We&#8217;re one week into April there&#8217;s already a lot to talk about. 
From major product launches to strategic shifts to research insights: things aren&#8217;t slowing down, it seems.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;State of AI Agents: What OpenAI &amp; Google Are Planning&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:48405812,&quot;name&quot;:&quot;Pascal Biese&quot;,&quot;bio&quot;:&quot;Human engineer reporting from the cutting edge of AI research.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/256e8620-1524-4496-a84d-7943a0edc098_512x512.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-04-06T20:08:47.654Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/961ab89c-bd33-4ab1-a178-f1471dce6b04_1280x720.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.llmwatch.com/p/state-of-ai-agents-what-openai-and&quot;,&quot;section_name&quot;:&quot;State of AI Agents&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:160663962,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;LLM Watch&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d95c476-43a7-4447-9081-9298a1fc325a_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[Think Like An AI Agent: Introduction to Planning with LLMs]]></title><description><![CDATA[Understand how AI agents plan with the help of Generative 
AI]]></description><link>https://www.llmwatch.com/p/think-like-an-ai-agent-introduction</link><guid isPermaLink="false">https://www.llmwatch.com/p/think-like-an-ai-agent-introduction</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Wed, 12 Mar 2025 15:09:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HyCn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large Language Models (LLMs) are typically seen as the &#8220;brain&#8221; behind autonomous AI because of the crucial role they play in the current emergence of autonomous agents. However, while much attention has been paid to LLMs' reasoning and tool-learning capabilities, their planning abilities&#8212;crucial for effective agent autonomy&#8212;have received less systematic analysis. </p><p>A <a href="https://arxiv.org/abs/2402.02716">new paper</a> by Huang et al. provides a comprehensive taxonomy of planning approaches in LLM-based agents. In this deep dive, we'll walk through the paper's taxonomy, examine implementation details, and discuss the challenges that lie ahead for researchers in this rapidly evolving field. </p><p>If you want to learn how LLMs are reshaping agentic planning, stay awhile and listen. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HyCn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HyCn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 424w, https://substackcdn.com/image/fetch/$s_!HyCn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 848w, https://substackcdn.com/image/fetch/$s_!HyCn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 1272w, https://substackcdn.com/image/fetch/$s_!HyCn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HyCn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png" width="679" height="623" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:679,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140882,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/158772910?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HyCn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 424w, https://substackcdn.com/image/fetch/$s_!HyCn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 848w, https://substackcdn.com/image/fetch/$s_!HyCn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 1272w, https://substackcdn.com/image/fetch/$s_!HyCn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c4192-d9af-458e-a008-5212a1e55bb7_679x623.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Taxonomy on LLM-Agent planning. <a href="https://arxiv.org/abs/2402.02716">Source</a>.</figcaption></figure></div><h2>LLM Planning: The Fundamental Paradigm Shift</h2><p>Planning&#8212;the ability to generate a sequence of actions to achieve a goal&#8212;has traditionally been the domain of symbolic AI systems or reinforcement learning approaches. 
The paper defines the general planning formulation as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p = (a_0, a_1, ..., a_t) = plan(E, g; &#920;, P)&quot;,&quot;id&quot;:&quot;BAFXIXQXNR&quot;}" data-component-name="LatexBlockToDOM"></div><p>where p is the plan (a sequence of actions), E the environment, g the goal, &#920; the LLM parameters, and P the prompt.</p><p>Traditional planning methods face significant limitations:</p><ul><li><p>Symbolic methods require converting natural language into formal representations</p></li><li><p>These approaches lack error tolerance and fail on even minor errors</p></li><li><p>Reinforcement learning methods require extensive interaction data</p></li></ul><p>LLMs offer a promising alternative by leveraging their pre-trained knowledge and reasoning capabilities to approach planning in a more flexible and robust manner.</p><p>Huang et al. systematically categorize existing LLM-Agent planning approaches into five distinct but interconnected directions:</p><ol><li><p>Task Decomposition</p></li><li><p>Multi-Plan Selection</p></li><li><p>External Planner-Aided Planning</p></li><li><p>Reflection and Refinement</p></li><li><p>Memory-Augmented Planning</p></li></ol><p>Let's examine each approach in detail.</p>
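In code, the formulation p = plan(E, g; &#920;, P) is essentially a function signature. Here's a minimal sketch; the helper names, prompt format, and line-per-action convention are illustrative, not from the paper:

```python
from typing import Callable, List

def plan(env: str, goal: str, llm: Callable[[str], str], prompt: str) -> List[str]:
    """p = (a_0, ..., a_t) = plan(E, g; Theta, P). The parameters Theta are
    implicit in the `llm` callable; the plan comes back one action per line."""
    response = llm(prompt.format(env=env, goal=goal))
    return [line.strip() for line in response.splitlines() if line.strip()]

# Toy stand-in for an LLM that decomposes the goal into steps.
def toy_llm(text: str) -> str:
    return "search flights\nbook cheapest flight\nemail confirmation"

actions = plan("travel site", "book a trip", toy_llm,
               "Environment: {env}\nGoal: {goal}\nList one action per line.")
print(actions)  # ['search flights', 'book cheapest flight', 'email confirmation']
```

Every approach in the taxonomy below changes how this single call is structured: decomposed into sub-calls, sampled multiple times, delegated to an external planner, critiqued and retried, or conditioned on memory.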
      <p>
          <a href="https://www.llmwatch.com/p/think-like-an-ai-agent-introduction">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Beyond Attention: Comparing Potential Transformer 2.0 Architectures]]></title><description><![CDATA[Google's Transformer 2.0 vs. Sakana AI's Transformer-Squared (Transformer&#178;)]]></description><link>https://www.llmwatch.com/p/beyond-attention-comparing-potential</link><guid isPermaLink="false">https://www.llmwatch.com/p/beyond-attention-comparing-potential</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Tue, 04 Mar 2025 19:26:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/37e47a3b-4d74-4c09-b7ff-d97b38790ffe_1024x672.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#8217;s talk about Transformers today, and their potential future. While the original Transformer architecture revolutionized natural language processing with its self-attention mechanism, several recent architectural innovations address two fundamental limitations that have restricted the practical application of these models: bounded context length and limited task-specific adaptability.</p><p>Among these new architectures, two are especially well-positioned to succeed to the throne&#8212;one officially, the other unofficially:</p><ol><li><p><strong>Google's Titans</strong> (Transformer 2.0), which addresses the challenge of handling extremely long sequences and maintaining memory over time</p></li><li><p><strong>Sakana AI's Transformer&#178;</strong>, which focuses on creating dynamically adaptable models that can switch between tasks without extensive retraining</p></li></ol><p>These architectures represent complementary approaches to enhancing transformer capabilities. While Titans enhances memory to process sequences beyond traditional context windows, Transformer&#178; improves real-time adaptability to diverse tasks. 
Both address critical limitations in current large language models (LLMs) while taking fundamentally different architectural decisions.</p><p>Let's dive into the technical details of each system, analyze their strengths and potential applications, and consider how these approaches might shape the future of language models.</p><h2>Google's Titans: Extending Memory for Long Sequences</h2><h3>The Challenge of Context Length</h3><p>Traditional transformer models face a significant limitation in their ability to handle long sequences, stemming from the quadratic computational complexity of the self-attention mechanism. As sequence length increases, both memory requirements and computational costs grow proportionally to the square of the sequence length, making it impractical to process very long documents, videos, or time series data.</p><p>The Titans architecture addresses this fundamental challenge by introducing a new memory system that enables efficient processing of sequences up to 2 million tokens in length&#8212;far beyond the typical context windows of current state-of-the-art LLMs, which typically range from 8k to 128k tokens.</p><h3>Three-Tier Memory Architecture</h3><p>The key innovation in Titans is its three-tier memory system:</p><ol><li><p><strong>Short-term Memory</strong>: Implemented through the standard attention mechanism, this component handles immediate context similar to working memory in humans.</p></li><li><p><strong>Long-term Neural Memory</strong>: A specialized neural module designed to learn how to store and retrieve historical context over extended periods, enabling the model to maintain information well beyond the limitations of the attention window.</p></li><li><p><strong>Persistent Memory</strong>: A stable storage mechanism for task-specific knowledge that remains consistent throughout inference, providing a foundation for domain expertise.</p></li></ol><p>This represents a significant departure from traditional transformer design by 
incorporating explicit memory mechanisms inspired by human cognitive processes rather than relying solely on attention.</p>
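To make the division of labor between the three tiers concrete, here's a toy sketch in Python. This is purely illustrative: the real Titans components are learned neural modules, not queues and dictionaries, and the spill-on-eviction rule below is our simplification of its learned storage mechanism.

```python
from collections import deque

class ThreeTierMemory:
    """Schematic of Titans' memory split (illustrative, not the real module)."""
    def __init__(self, window: int, task_knowledge: dict):
        self.short_term = deque(maxlen=window)   # stands in for the attention window
        self.long_term = {}                      # stands in for the learned historical store
        self.persistent = dict(task_knowledge)   # task knowledge, fixed during inference

    def observe(self, key, value):
        # When the short-term window is full, the oldest entry "spills"
        # into long-term memory instead of being lost.
        evicted = None
        if len(self.short_term) == self.short_term.maxlen:
            evicted = self.short_term[0]
        self.short_term.append((key, value))
        if evicted is not None:
            self.long_term[evicted[0]] = evicted[1]

    def recall(self, key):
        # Most recent context wins; fall back to long-term, then persistent.
        for k, v in reversed(self.short_term):
            if k == key:
                return v
        return self.long_term.get(key, self.persistent.get(key))

mem = ThreeTierMemory(window=2, task_knowledge={"domain": "finance"})
mem.observe("a", 1); mem.observe("b", 2); mem.observe("c", 3)  # "a" spills out
print(mem.recall("a"), mem.recall("c"), mem.recall("domain"))  # 1 3 finance
```

The point of the sketch is the lookup order: information leaving the attention window remains retrievable, which is what lets Titans reach sequence lengths far beyond the window itself.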
      <p>
          <a href="https://www.llmwatch.com/p/beyond-attention-comparing-potential">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Microsoft's Phi-4-Mini: Never Has Small Been This Good]]></title><description><![CDATA[Everything you need to know about their latest AI advancements]]></description><link>https://www.llmwatch.com/p/microsofts-phi-4-mini-never-has-small</link><guid isPermaLink="false">https://www.llmwatch.com/p/microsofts-phi-4-mini-never-has-small</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Thu, 27 Feb 2025 14:35:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e657c4df-980b-46c3-9b8b-e1466d61d86d_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Creating efficient, powerful language models has been a significant challenge in AI research. Despite the headline-grabbing capabilities of massive LLMs with hundreds of billions of parameters, there's growing interest in developing smaller, more efficient models that maintain impressive capabilities while running on less hardware. Microsoft's latest entry in this space, Phi-4, represents a notable achievement in this direction. In this article, we'll dive deep into the technical innovations that make Phi-4 stand out&#8212;even among its fierce competition.</p><h2>The Phi Series: A Family of Compact Yet Powerful Models</h2><p>Microsoft's Phi series, and more specifically, its recently released <a href="https://arxiv.org/abs/2412.08905">fourth generation</a>, challenges the conventional wisdom that bigger is always better in language model development. 
These models demonstrate that with careful data curation, architectural optimizations, and strategic training approaches, even relatively small models can achieve impressive performance across language, vision, and speech tasks.</p><p>The <a href="https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/phi_4_mm.tech_report.02252025.pdf">newest</a> Phi-4 release consists of two main models:</p><ul><li><p><strong>Phi-4-Mini</strong>: A 3.8-billion-parameter language model focused on text understanding and generation</p></li><li><p><strong>Phi-4-Multimodal</strong>: An extension that integrates vision and speech/audio capabilities while preserving the core language abilities</p></li></ul><p>What makes these models particularly interesting is their efficiency-to-performance ratio. Phi-4-Mini matches or exceeds models twice its size on certain tasks, especially in mathematics and coding, while Phi-4-Multimodal delivers competitive multimodal capabilities despite its modest parameter count.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TUZt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TUZt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 424w, https://substackcdn.com/image/fetch/$s_!TUZt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 848w, 
https://substackcdn.com/image/fetch/$s_!TUZt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 1272w, https://substackcdn.com/image/fetch/$s_!TUZt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TUZt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png" width="1122" height="587" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef411aa9-9763-4373-9e83-614c89992167_1122x587.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:587,&quot;width&quot;:1122,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.llmwatch.com/i/158025380?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TUZt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 424w, https://substackcdn.com/image/fetch/$s_!TUZt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 
848w, https://substackcdn.com/image/fetch/$s_!TUZt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 1272w, https://substackcdn.com/image/fetch/$s_!TUZt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef411aa9-9763-4373-9e83-614c89992167_1122x587.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Overview of the multimodal architecture of Phi-4-Multimodal. 
<a href="https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/phi_4_mm.tech_report.02252025.pdf">Source</a>.</figcaption></figure></div><h2>Model Architecture: Technical Foundations</h2><h3>Language Model Architecture</h3><p>At its core, Phi-4-Mini employs a decoder-only Transformer architecture with several optimizations:</p><ul><li><p><strong>32 Transformer layers</strong> with hidden state size of 3,072</p></li><li><p><strong>Tied input/output embeddings</strong> to reduce memory consumption</p></li><li><p><strong>Expanded vocabulary</strong> of 200,064 tokens (up from Phi-3.5) to better support multilingual applications</p></li><li><p><strong>Group Query Attention (GQA)</strong> with 24 query heads and 8 key/value heads, reducing KV cache usage to one-third of standard size</p></li><li><p><strong>LongRoPE positional encoding</strong> to support 128K context length</p></li><li><p><strong>Fractional RoPE dimension</strong> ensuring 25% of attention head dimensions remain position-agnostic for smoother long-context handling</p></li></ul><p>The improved tokenizer deserves special mention as it enables better multilingual and multimodal processing. The expanded vocabulary provides more efficient token representation across languages, which is essential for applications beyond English.</p><h3>Multimodal Architecture: Mixture of LoRAs</h3><p>Perhaps the most innovative aspect of Phi-4-Multimodal is its "Mixture of LoRAs" approach for handling multiple modalities. 
This technique preserves the original language model capabilities while extending functionality to vision and speech:</p><ol><li><p><strong>Vision Modality</strong>:</p><ul><li><p>Image encoder based on SigLIP-400M (440M parameters)</p></li><li><p>Vision projector to align vision features with text embeddings</p></li><li><p>Vision adapter LoRA (370M parameters) for the language decoder</p></li></ul></li><li><p><strong>Speech/Audio Modality</strong>:</p><ul><li><p>Audio encoder with conformer blocks (460M parameters)</p></li><li><p>Audio projector to map speech features to text embedding space</p></li><li><p>Audio adapter LoRA (460M parameters)</p></li></ul></li></ol><p>The key trick here is that the base language model remains <strong>entirely frozen</strong> during multimodal training, with only the modality-specific components being updated. This prevents the interference and performance degradation that often occurs when fine-tuning language models for multimodal tasks.</p><h2>Training Methodology: The Secret Sauce</h2><p>The impressive performance of Phi-4-Mini doesn't come from architectural innovations alone&#8212;careful data curation and training strategies play a crucial role.</p>
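The one-third KV cache figure quoted above follows directly from the head counts: the cache scales with the number of key/value heads, and 8 / 24 = 1/3. A quick back-of-the-envelope check, assuming fp16 storage and a head dimension of 128 (3,072 hidden size over 24 query heads):

```python
def kv_cache_bytes(layers: int, seq_len: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor of 2 accounts for the separate key and value tensors.
    return 2 * layers * seq_len * kv_heads * head_dim * dtype_bytes

# Phi-4-Mini-style shape: 32 layers, 128K context, head_dim 128.
full_mha = kv_cache_bytes(layers=32, seq_len=128_000, kv_heads=24, head_dim=128)
gqa = kv_cache_bytes(layers=32, seq_len=128_000, kv_heads=8, head_dim=128)
print(gqa / full_mha)  # exactly one-third of the standard cache
```

At 128K context that difference is measured in gigabytes, which is exactly the kind of saving that matters for a model meant to run on modest hardware.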
      <p>
          <a href="https://www.llmwatch.com/p/microsofts-phi-4-mini-never-has-small">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Introduction to LIMO: Less is More for LLM Reasoning]]></title><description><![CDATA[DeepSeek-R1 was only the beginning&#8212;find out how LIMO takes Data Efficiency to another level]]></description><link>https://www.llmwatch.com/p/introduction-to-limo-less-is-more</link><guid isPermaLink="false">https://www.llmwatch.com/p/introduction-to-limo-less-is-more</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Wed, 12 Feb 2025 16:50:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YqmK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae91f3a-7bec-4c6a-81d3-a3244f82b960_2641x1519.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The recent paper <em>"LIMO: Less is More for Reasoning"</em> challenges foundational assumptions about how large language models (LLMs) acquire complex reasoning capabilities. By demonstrating that meticulously curated instruction data&#8212;as few as <strong>817 examples</strong>&#8212;can outperform models trained on 100x more data, the authors propose a paradigm shift in how we approach reasoning in LLMs. </p><p>This article explores the technical innovations, empirical findings, and broader implications of how LLM reasoning might change in the near future.</p><h3><strong>The LIMO Hypothesis: A New Lens on Reasoning</strong></h3><p>At its core, the <strong>Less-Is-More Reasoning (LIMO) Hypothesis </strong>states that:</p><blockquote><p><strong>Sophisticated reasoning capabilities emerge when two conditions converge</strong>:</p><ol><li><p><strong>Rich pre-trained knowledge</strong> embedded during LLM pre-training.</p></li><li><p><strong>Precisely orchestrated demonstrations</strong> (cognitive templates) during fine-tuning.</p></li></ol></blockquote><p>This hypothesis overturns the long-held belief that complex reasoning tasks require massive datasets (e.g., &gt;100k samples). 
Instead, LIMO argues that modern foundation models (e.g., Qwen2.5, Llama 3) already encode extensive domain knowledge; the challenge lies in <em>eliciting </em>this knowledge through high-quality examples that guide the model to "think" systematically.</p><p>The authors define the following requirements for success:</p><ul><li><p><strong>Pre-trained Knowledge</strong>: Modern LLMs like Qwen2.5-32B incorporate vast mathematical content during pre-training (3.7T tokens for Llama 3&#8217;s math-focused training).</p></li><li><p><strong>Inference-Time Scaling</strong>: Techniques allowing extended reasoning chains (e.g., parallel sampling, tree search) provide the "cognitive workspace" for multi-step problem-solving.</p></li></ul><div><hr></div><h3><strong>Data Curation: Quality Over Quantity</strong></h3><p>Data quality is central to the LIMO hypothesis, and ensuring it requires careful curation. LIMO&#8217;s success hinges on its <strong>systematic data curation process</strong>, which prioritizes depth over breadth:</p><h4><strong>Question Selection</strong></h4><ul><li><p><strong>Difficulty</strong>: Problems are filtered using state-of-the-art models (e.g., Qwen2.5-Math-7B-Instruct), retaining only those with &lt;10% solve rates.</p></li><li><p><strong>Diversity</strong>: Selected from advanced benchmarks (AIME, OlympiadBench) and multilingual sources (Chinese Gaokao), ensuring coverage of algebra, geometry, and proof-based challenges.</p></li><li><p><strong>Out-of-Distribution (OOD) Focus</strong>: 30% of questions intentionally deviate from standard datasets to test generalization.</p></li></ul><p>From an initial pool of <strong>10M+ problems</strong>, only <strong>817 </strong>survived rigorous filtration.</p>
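The difficulty filter can be sketched in a few lines. The problem names and attempt logs below are made up; in the actual pipeline each attempt would be a solution sampled from a strong reference model (e.g., Qwen2.5-Math-7B-Instruct) and checked against the ground-truth answer:

```python
def solve_rate(attempt_results):
    """Empirical pass rate over sampled attempts (True = solved)."""
    return sum(attempt_results) / len(attempt_results)

def curate(candidates, max_rate=0.10):
    """candidates: {problem_id: list of per-attempt booleans}.
    Keep only problems the reference model almost never solves."""
    return sorted(p for p, runs in candidates.items()
                  if solve_rate(runs) < max_rate)

# Hypothetical attempt logs for three problems, 8 samples each:
pool = {
    "easy_algebra": [True] * 7 + [False],   # 87.5% solved -> dropped
    "aime_hard":    [False] * 8,            # 0% solved -> kept
    "gaokao_proof": [True] + [False] * 7,   # 12.5% solved -> dropped (>= 10%)
}
print(curate(pool))  # ['aime_hard']
```

In practice this filter is what collapses the pool from millions of problems down to the hundreds that are genuinely hard for current models.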
      <p>
          <a href="https://www.llmwatch.com/p/introduction-to-limo-less-is-more">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[DeepSeek-R1: What It Is & Why Everyone Is Talking About it]]></title><description><![CDATA[A gentle introduction to the open source GPT-o1 alternative]]></description><link>https://www.llmwatch.com/p/deepseek-r1-what-it-is-and-why-everyone</link><guid isPermaLink="false">https://www.llmwatch.com/p/deepseek-r1-what-it-is-and-why-everyone</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Tue, 21 Jan 2025 18:02:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d0eac864-e581-440e-9dca-77faa4ffe5f1_977x563.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In recent years, Large Language Models (LLMs) have made significant advancements in their ability to understand and generate human-like text. These models, such as GPT-4 and Claude 3.5, have shown impressive performance in various natural language processing tasks. However, there is still room for improvement, particularly in the area of reasoning capabilities. To address this, researchers have explored a plethora of techniques &#8212; iteratively moving towards more and more complex data regimes and most recently, scaling up <a href="https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute">test-time compute</a>.</p><p>Contrary to this trend, a <a href="https://www.deepseek.com/">rising AI research company</a> has now reported that they had more success with a far simpler approach: the use of reinforcement learning (RL) without having to rely on supervised fine-tuning <em>at all</em>.</p><h2>What is Reinforcement Learning? </h2><p>Before diving into DeepSeek's approach, let's first understand what reinforcement learning is. Imagine you're teaching a dog a new trick. You give the dog a treat every time it performs the desired action, such as sitting or rolling over. 
The dog learns to associate the action with the reward and becomes more likely to repeat the behavior in the future. This is the basic principle behind reinforcement learning.</p><p>In the context of LLMs, the model is the "dog," and the reward is a score that measures how well the model performs on a specific task. The model learns to generate text that maximizes the reward, thereby improving its performance on the task.</p><h3>DeepSeek-R1-Zero: Pure Reinforcement Learning </h3><p>The first step in DeepSeek's approach was to apply RL directly to the base model, DeepSeek-V3-Base, without any supervised fine-tuning (SFT). This model, called DeepSeek-R1-Zero, was allowed to explore different reasoning strategies, such as Chain-of-Thought (CoT), to solve complex problems.</p><p>Think of CoT as a step-by-step thought process that the model goes through to arrive at a solution. For example, if the model is asked, "What is the capital of France?", it might generate the following CoT:</p><ol><li><p>France is a country in Europe.</p></li><li><p>The capital of a country is usually its largest and most important city.</p></li><li><p>Paris is the largest and most important city in France.</p></li><li><p>Therefore, the capital of France is Paris.</p></li></ol><p>By exploring different CoT strategies, DeepSeek-R1-Zero was able to develop powerful reasoning capabilities without any supervised data. 
This was a significant finding, as it demonstrated that LLMs could improve their reasoning abilities through pure RL.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j7YB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j7YB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 424w, https://substackcdn.com/image/fetch/$s_!j7YB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 848w, https://substackcdn.com/image/fetch/$s_!j7YB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 1272w, https://substackcdn.com/image/fetch/$s_!j7YB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j7YB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png" width="977" height="563" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:977,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126222,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j7YB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 424w, https://substackcdn.com/image/fetch/$s_!j7YB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 848w, https://substackcdn.com/image/fetch/$s_!j7YB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 1272w, https://substackcdn.com/image/fetch/$s_!j7YB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01979ee5-fcf6-435f-8f92-f03dd535f749_977x563.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DeepSeek-R1-Zero having an &#8220;aha&#8221; moment. <a href="https://github.com/deepseek-ai/DeepSeek-V3">Source</a>.</figcaption></figure></div>
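Part of what made this pure-RL result possible is that the reward was deliberately simple: rule-based checks on the output rather than a learned reward model. Here's a toy reward in that spirit; the exact tags and weights are illustrative, not DeepSeek's published scheme:

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: credit for showing reasoning in <think> tags,
    plus a larger credit for a correct final <answer>. Illustrative only."""
    r = 0.0
    if re.search(r"<think>.*?</think>", completion, flags=re.S):
        r += 0.1  # format reward: the model exposed its chain of thought
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.S)
    if m and m.group(1).strip() == reference_answer:
        r += 1.0  # accuracy reward: final answer matches the reference
    return r

good = "<think>Paris is France's capital.</think><answer>Paris</answer>"
bad = "<answer>Lyon</answer>"
print(reward(good, "Paris"), reward(bad, "Paris"))  # 1.1 0.0
```

Because the reward only checks verifiable outcomes, there is no reward model to game, and the reasoning strategies that emerge, such as the long chains of thought in the "aha" moment above, are discovered rather than imitated.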
      <p>
          <a href="https://www.llmwatch.com/p/deepseek-r1-what-it-is-and-why-everyone">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Stream of Search: Teaching Language Models the Language of Search ]]></title><description><![CDATA[Generative AI needs to learn from mistakes]]></description><link>https://www.llmwatch.com/p/stream-of-search-teaching-language</link><guid isPermaLink="false">https://www.llmwatch.com/p/stream-of-search-teaching-language</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Wed, 30 Oct 2024 18:01:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Awzc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Language models have demonstrated remarkable abilities in various tasks, from natural language processing to code generation. However, these models often struggle with complex problem-solving that requires planning, searching, and backtracking. A recent paper by Gandhi et al. (2024) introduces a novel approach called Stream of Search (SoS), which aims to teach language models to solve problems by searching in language, without relying on external components.</p><p>The Stream of Search framework systematizes the elements of search into a unified language, enabling the representation of diverse search strategies in a common format. By training language models on these "streams of search," the authors demonstrate that the models can learn to solve problems more effectively than those trained solely on optimal solution trajectories. 
Furthermore, the SoS models can self-improve by optimizing for correctness using reinforcement learning techniques such as STaR and APA.</p><p>In this article, we will delve into the Stream of Search framework, exploring its key components, the problem setup, and the experimental results that showcase its potential for enhancing the problem-solving capabilities of language models.</p><h2>The Language of Search </h2><p>At the core of the Stream of Search framework is a vocabulary of primitive operations that define the components of various search algorithms. </p><p>These operations include:</p><ol><li><p>Current State (sc): The state being explored.</p></li><li><p>Goal State (sg): The target state.</p></li><li><p>State Queue (Sq): The states at the frontier of the trajectory that haven't been explored yet.</p></li><li><p>State Expansion Function (SE): A function that explores a state adjacent to the current state based on a transition function.</p></li><li><p>Exploration Choice: Choosing the order of states to explore (e.g., breadth-first search or depth-first search).</p></li><li><p>Pruning: Discarding states or subtrees that are unlikely to lead to a solution.</p></li><li><p>Backtracking: Moving between explored nodes to choose the next state for expansion.</p></li><li><p>Goal Check: Checking if the current state is the goal state.</p></li><li><p>Heuristic (h): A function that approximates the distance of the current state from the goal state, guiding the search process.</p></li></ol><p>By representing these operations in language, the Stream of Search framework allows for the creation of a dataset with diverse search strategies. Some operations, such as the current state, goal state, backtracking, goal checks, and exploration choices, are explicitly represented in the search trajectory. 
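</p>
<p>As a toy illustration of this vocabulary, the sketch below runs a depth-first search over a tiny state space and emits each operation (exploration, goal check, backtracking) as a line of text. The state space, trace wording, and format are illustrative assumptions for this article, not the paper's exact serialization.</p>

```python
# A toy "stream of search": depth-first search whose every operation is
# emitted as a line of text. The state space, trace wording, and format
# are illustrative assumptions, not the paper's exact serialization.

def stream_of_search(start, goal, neighbors):
    trace = [f"current state: {start}", f"goal state: {goal}"]
    stack = [start]             # state queue, explored in DFS order
    visited = set()
    while stack:
        state = stack.pop()
        if state in visited:
            continue
        visited.add(state)
        trace.append(f"explore: {state}")
        if state == goal:       # goal check
            trace.append("goal reached")
            break
        children = [c for c in neighbors(state) if c not in visited]
        if not children:        # dead end: backtrack to the frontier
            trace.append(f"backtrack from: {state}")
        stack.extend(children)  # state expansion
    return "\n".join(trace)

# Tiny state space: a state s expands to s + 1 and s * 2, pruned above the goal.
trace = stream_of_search(1, 4, lambda s: [c for c in (s + 1, s * 2) if c <= 4])
```

Training on such textual traces, rather than only on the final answer, is what lets the model imitate the search process itself.
<p>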
Others, like heuristic functions, state values, and pruning strategies, are kept implicit, encouraging the model to internalize abstract representations that can be improved through training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Awzc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Awzc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 424w, https://substackcdn.com/image/fetch/$s_!Awzc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 848w, https://substackcdn.com/image/fetch/$s_!Awzc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 1272w, https://substackcdn.com/image/fetch/$s_!Awzc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Awzc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png" width="860" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:860,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Awzc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 424w, https://substackcdn.com/image/fetch/$s_!Awzc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 848w, https://substackcdn.com/image/fetch/$s_!Awzc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 1272w, https://substackcdn.com/image/fetch/$s_!Awzc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe964225-47c1-4fec-a6bc-081e6af71b4c_860x300.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Visualization of how a search process is translated into a stream of search. <a href="https://arxiv.org/abs/2404.03683">Source</a>.</figcaption></figure></div><h2>Problem Setup: The Game of Countdown </h2><p>To demonstrate the utility of the Stream of Search framework, the authors focus on a generalization of the 24 Game called Countdown. In this game, a set of input numbers must be combined using simple arithmetic operations to reach a target number. Countdown presents a challenging search problem due to its high branching factor, requiring planning, search, and backtracking to solve.</p><p>The authors construct a synthetic dataset of 500,000 search trajectories using a set of diverse and suboptimal symbolic search strategies based on breadth-first search (BFS) and depth-first search (DFS) with two simple heuristic functions. 
The heuristics used are:</p><ol><li><p>The absolute difference between the sum of the remaining options and the target.</p></li><li><p>The distance to the factors of the target.</p></li></ol><p>The search trajectories are serialized as strings, representing a list of tree nodes or states in the order of traversal. The dataset is then used to train language models in two conditions:</p><ol><li><p>Optimal Paths (OP): The model is trained to predict the correct and optimal path for all problems in the dataset.</p></li><li><p>Stream of Search (SoS): The model is trained on search trajectories sampled from different search strategies.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JnEc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JnEc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 424w, https://substackcdn.com/image/fetch/$s_!JnEc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 848w, https://substackcdn.com/image/fetch/$s_!JnEc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 1272w, https://substackcdn.com/image/fetch/$s_!JnEc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JnEc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png" width="823" height="511" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:823,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149300,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JnEc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 424w, https://substackcdn.com/image/fetch/$s_!JnEc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 848w, https://substackcdn.com/image/fetch/$s_!JnEc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 1272w, https://substackcdn.com/image/fetch/$s_!JnEc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75ce4c63-66ed-451f-bae8-baec6fdb7ea3_823x511.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Overview of the Stream of Search (SoS) framework. <a href="https://arxiv.org/abs/2404.03683">Source</a>.</figcaption></figure></div><h2>Experimental Results </h2><p>The authors train a GPT-Neo model with 250M parameters on the Countdown dataset and evaluate its performance on held-out problems. The results demonstrate that the model trained on streams of search (SoS) outperforms the model trained on optimal solutions (OP). The SoS model achieves an accuracy of 51.27% on held-out inputs, compared to 25.73% for the OP model. 
This finding highlights the importance of exposing models to the messy process of problem-solving, including exploration and backtracking, rather than only the ideal solution steps.</p><p>To understand the strategies employed by the trained SoS model, the authors measure the alignment of the model-generated search trajectories with symbolic strategies. They find that the SoS model does not predominantly use any single strategy from its training data but instead exhibits a higher correlation with strategies that use the sum heuristic.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.llmwatch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">LLM Watch is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h3>Policy Improvement with Stream of Search </h3><p>The authors further investigate whether the SoS model can self-improve with feedback based on correctness and efficiency. They employ two reinforcement learning strategies: expert iteration using STaR (Self-Taught Reasoner) and Advantage-Induced Policy Alignment (APA).</p><p>In the STaR approach, the authors use problems from the training dataset to generate 100,000 correct trajectories. These trajectories are then used to fine-tune the model iteratively until convergence in performance on the validation set is observed. 
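</p>
<p>The expert-iteration loop can be sketched as follows. Here <code>generate</code>, <code>is_correct</code>, and <code>fine_tune</code> are hypothetical stand-ins for sampling trajectories from the model, verifying Countdown solutions, and the fine-tuning step; they are not APIs from the paper.</p>

```python
# STaR-style expert iteration, sketched with hypothetical callables:
# fine-tune only on self-generated trajectories that verify as correct,
# then repeat with the updated model.

def star_iteration(model, problems, generate, is_correct, fine_tune, n_iters=3):
    for _ in range(n_iters):
        # Sample one trajectory per problem and keep only the verified ones.
        correct = [(p, t) for p in problems
                   for t in [generate(model, p)] if is_correct(p, t)]
        model = fine_tune(model, correct)
    return model
```

<p>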
After three iterations of STaR fine-tuning, the SoS+STaR model solves an additional 5% of problems in the held-out test set compared to the base SoS model.</p><p>Alternatively, the authors use APA, an actor-critic reinforcement learning technique in which a copy of the language model serves as a value network that is then used to improve the policy (the original language model). A straightforward reward function is defined based on the correctness and length of the generated trajectory. The authors observe that updating the reference policy whenever the validation reward converges results in further policy improvement. After fine-tuning with APA, the SoS model achieves an improvement of about 6% over the base SoS model.</p><p>Analysis of the fine-tuned models reveals that both the STaR and APA models visit more states associated with the 'multiply' heuristic, which measures the distance to the factors of the target. The APA model, in particular, diverges more from the symbolic strategies, indicating that it searches differently and potentially discovers novel heuristics and search methods.</p><p>To further evaluate the improved models, the authors select 10,000 problems from the SoS training set that were left unsolved during dataset generation, along with 10,000 difficult problems that none of the symbolic strategies used to train the SoS models can solve. Remarkably, the models solve approximately 36% of the previously unsolved problems and about 4% of the difficult problems.</p><h2>Discussion and Future Directions </h2><p>The SoS framework introduces a new approach to teaching language models to solve problems by searching in language, without relying on external components.
By systematizing the elements of search into a unified language, the authors demonstrate that training language models on diverse streams of search leads to superior performance compared to models trained solely on optimal trajectories.</p><p>This addresses criticisms of language models for planning and problem-solving, such as the snowballing of errors and difficulty in lookahead tasks. By teaching models to backtrack and explore alternative paths, SoS enables language models to consider multiple possible outcomes before committing to a course of action. Crucially, SoS leads language models to learn an internal 'world model' for search, allowing for more adaptable and generalizable search compared to symbolic search that relies on an explicit environment model.</p><p>While the empirical results in the paper are restricted to the game of Countdown, the authors are optimistic that the SoS framework can be extended to more challenging, real-world tasks. Future research could explore integrating subgoals and hierarchical planning, as well as incorporating reflection and self-evaluation to enable models to discover and improve novel search strategies.</p><p>Generating the initial SoS dataset can be challenging, as it is not always feasible to create symbolic search algorithms to solve problems. An important question for future research is how well search abilities transfer between domains and between formal and informal domains.</p><h2>Conclusion </h2><p>SoS is a promising step forward in teaching language models to solve problems through structured search with backtracking, heuristic state evaluation, and world modeling. By exposing language models to diverse search strategies and iteratively refining them, the approach unlocks the potential of these models to tackle complex problems and discover new ways to solve them.</p><p>Frameworks like this will play a crucial role in enhancing the problem-solving capabilities of language models. 
By embracing the messy process of exploration and backtracking, and learning from productive mistakes, language models can develop more robust and adaptable search strategies, paving the way for their application in a wide range of real-world problem-solving scenarios.</p><p>AI that isn&#8217;t able to explore - and learn from mistakes - will always be limited by design. Unless we find a way to create flawless and complete training data for everything that we want it to do in this world - which, of course, is extremely unlikely.</p><div><hr></div><h3>&#128077; If you enjoyed this article, give it a like and share it with your peers.</h3><p>And in case you want to continue reading, here&#8217;s my previous research summary on StructRAG, a framework that combines the best of both worlds - graph-based and standard RAG:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;77b9de3d-6fe5-4b5d-8552-737d868cf80a&quot;,&quot;caption&quot;:&quot;Effectively leveraging external knowledge to enhance the performance of large language models (LLMs) on knowledge-based tasks has become a crucial challenge. Retrieval-augmented generation (RAG) methods have emerged as a promising solution, providing LLMs with relevant information from external sources to improve their factual accuracy and reasoning capabilities. 
However, existing RAG approaches struggle with knowledge-intensive reasoning tasks, where the required information is often scattered across multiple documents, making it difficult to accurately identify and integrate key pieces of information for global reasoning.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;StructRAG: Succeeding Where Graph RAG Fails&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:48405812,&quot;name&quot;:&quot;Pascal Biese&quot;,&quot;bio&quot;:&quot;Data Scientist &amp; ML Engineer writing about Large Language Models (LLMs), Natural Language Processing (NLP) and Artificial Intelligence&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/256e8620-1524-4496-a84d-7943a0edc098_512x512.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-10-15T18:01:06.687Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e992656c-11f0-4fcf-b681-b13a942efac2_503x365.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.llmwatch.com/p/structrag-succeeding-where-graph&quot;,&quot;section_name&quot;:&quot;Research Summaries&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:150204746,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:46,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;LLM Watch&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d95c476-43a7-4447-9081-9298a1fc325a_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://www.llmwatch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">LLM Watch is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI for Science: How to turn your LLM into an Innovation Machine]]></title><description><![CDATA[The Power of Iterative Planning and Search in LLM-Based Scientific Innovation]]></description><link>https://www.llmwatch.com/p/ai-for-science-how-to-turn-your-llm</link><guid isPermaLink="false">https://www.llmwatch.com/p/ai-for-science-how-to-turn-your-llm</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Tue, 22 Oct 2024 15:01:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8i1I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including scientific innovation. By harnessing the power of these models, researchers aim to accelerate the discovery process and generate novel research ideas. However, existing LLM-based methods often struggle to produce truly diverse and innovative concepts due to their limited ability to acquire and integrate external knowledge effectively. 
To address this challenge, a new approach called Nova has been introduced, which combines iterative planning and search to enhance the creative potential of LLM-based systems.</p><h2>The Nova Pipeline: A Three-Stage Approach to Scientific Innovation</h2><p>The Nova pipeline streamlines the research process through three key stages: initial idea generation, iterative refinement, and detailed completion. This systematic approach ensures that the generated ideas are not only novel but also well-developed and feasible.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.llmwatch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">LLM Watch is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8i1I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!8i1I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 424w, https://substackcdn.com/image/fetch/$s_!8i1I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 848w, https://substackcdn.com/image/fetch/$s_!8i1I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 1272w, https://substackcdn.com/image/fetch/$s_!8i1I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8i1I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png" width="1242" height="604" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:604,&quot;width&quot;:1242,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:349762,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!8i1I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 424w, https://substackcdn.com/image/fetch/$s_!8i1I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 848w, https://substackcdn.com/image/fetch/$s_!8i1I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 1272w, https://substackcdn.com/image/fetch/$s_!8i1I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88df1880-ba40-4f24-9ebc-227827251bb4_1242x604.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Nova Pipeline. The Pipeline includes initial seed idea generation, seed idea iteration, and idea completion. <a href="https://arxiv.org/abs/2410.14255">Source</a>.</figcaption></figure></div><h3>Stage 1: Initial Seed Idea Generation</h3><p>The first stage of the Nova pipeline focuses on generating diverse and novel seed ideas based on an input paper. To achieve this, the system employs a multi-source seed idea generation module that leverages the LLM's internal knowledge, related literature, and scientific discovery techniques.</p><p>One of the key components of this module is the knowledge tracking system, which addresses the shortcomings of previous approaches by monitoring the latest publications in the field. By identifying influential recent papers based on user engagement metrics across various platforms, such as social media, forums, and GitHub, Nova ensures that the generated ideas are informed by the most current insights.</p><p>To further increase the diversity of the generated ideas, Nova utilizes 10 fundamental scientific discovery methods derived from <a href="https://en.wikipedia.org/wiki/Paradigm_shift">Kuhn's paradigm</a> of scientific discovery. These methods help identify new research problems by analyzing anomalies in existing approaches, exploring theoretical boundaries, and integrating interdisciplinary knowledge.</p><p>Additionally, Nova employs self-correction mechanics, such as self-check, self-critique, and reflection, to prevent hallucination and improve the logicality of the generated seed ideas. By the end of this stage, the system generates 15 seed ideas for each input paper.</p>
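<p>The knowledge-tracking step can be illustrated with a simple engagement-weighted ranking. The platform fields and weights below are assumptions made for this sketch, not Nova's actual scoring function.</p>

```python
# Engagement-weighted ranking of recent papers, as a stand-in for Nova's
# knowledge-tracking step. The platform fields and weights are assumptions.

def rank_recent_papers(papers, weights=None, top_k=3):
    weights = weights or {"social": 1.0, "forums": 0.5, "github": 2.0}

    def score(paper):
        # Weighted sum of whatever engagement counts the paper record has.
        return sum(w * paper.get(field, 0) for field, w in weights.items())

    return sorted(papers, key=score, reverse=True)[:top_k]

papers = [
    {"title": "A", "social": 120, "github": 4},
    {"title": "B", "social": 10, "forums": 300, "github": 50},
    {"title": "C", "social": 500},
]
top = rank_recent_papers(papers, top_k=2)
```

Whatever the exact metric, the point is the same: surface influential recent work so that seed ideas are grounded in the current state of the field rather than the model's training cutoff.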
      <p>
          <a href="https://www.llmwatch.com/p/ai-for-science-how-to-turn-your-llm">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[StructRAG: Succeeding Where Graph RAG Fails]]></title><description><![CDATA[Enhancing Knowledge-Intensive Reasoning in LLMs through Information Structurization]]></description><link>https://www.llmwatch.com/p/structrag-succeeding-where-graph</link><guid isPermaLink="false">https://www.llmwatch.com/p/structrag-succeeding-where-graph</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Tue, 15 Oct 2024 18:01:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e992656c-11f0-4fcf-b681-b13a942efac2_503x365.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Effectively leveraging external knowledge to enhance the performance of large language models (LLMs) on knowledge-based tasks has become a crucial challenge. Retrieval-augmented generation (RAG) methods have emerged as a promising solution, providing LLMs with relevant information from external sources to improve their factual accuracy and reasoning capabilities. However, existing RAG approaches struggle with knowledge-intensive reasoning tasks, where the required information is often scattered across multiple documents, making it difficult to accurately identify and integrate key pieces of information for global reasoning.</p><p>Inspired by cognitive theories suggesting that humans convert raw information into structured knowledge when tackling complex reasoning tasks, researchers from the Chinese Academy of Sciences and Alibaba Group have proposed a new framework called StructRAG. This framework introduces a hybrid information structurization mechanism that constructs and utilizes structured knowledge in the most suitable format based on the specific requirements of the task at hand. 
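</p>
<p>The routing idea can be caricatured with fixed rules. Note that StructRAG trains a hybrid structure router rather than using keyword matching, so the rules below are purely illustrative.</p>

```python
# Toy router: choose the knowledge format best suited to the question.
# Keyword rules are an illustrative stand-in for StructRAG's trained router.

def route_structure(question):
    q = question.lower()
    if any(w in q for w in ("compare", "statistics", "how many")):
        return "table"   # aggregation and comparison suit tabular knowledge
    if any(w in q for w in ("relationship", "connected", "who knows")):
        return "graph"   # relational questions suit a knowledge graph
    return "chunk"       # default: plain retrieved passages
```

<p>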
By mimicking human-like thinking processes, StructRAG aims to enhance the performance of LLMs on knowledge-intensive reasoning tasks.</p><p>In this article, we will delve into the details of the StructRAG framework, exploring its key components and how they work together to improve RAG performance. We will also discuss the training process of the hybrid structure router, a critical module in StructRAG, and present the experimental results demonstrating the effectiveness of this approach.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tuFC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tuFC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 424w, https://substackcdn.com/image/fetch/$s_!tuFC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 848w, https://substackcdn.com/image/fetch/$s_!tuFC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 1272w, https://substackcdn.com/image/fetch/$s_!tuFC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!tuFC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png" width="990" height="316" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8418326-eb1b-4f86-90c9-458e846840fd_990x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:316,&quot;width&quot;:990,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112260,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tuFC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 424w, https://substackcdn.com/image/fetch/$s_!tuFC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 848w, https://substackcdn.com/image/fetch/$s_!tuFC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 1272w, https://substackcdn.com/image/fetch/$s_!tuFC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8418326-eb1b-4f86-90c9-458e846840fd_990x316.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Overview of the StructRAG framework. <a href="https://arxiv.org/abs/2410.08815">Source.</a></figcaption></figure></div><h2>The Framework</h2><p>The StructRAG framework consists of three main modules that work sequentially to identify the optimal structure type, construct structured knowledge, and utilize that knowledge for accurate reasoning. Let's take a closer look at each of these modules.</p><ol><li><p>Hybrid Structure Router: The hybrid structure router is the core component of StructRAG, responsible for determining the most appropriate structure type for a given task. It takes the question and the core content of the documents as input and outputs the optimal structure type. 
The router considers five candidate structure types: table, graph, algorithm, catalogue, and chunk, each suited for different kinds of knowledge-intensive tasks.</p></li></ol><p>The selection of the optimal structure type is crucial, as it directly impacts the effectiveness of the subsequent modules. To train the router, the authors propose a novel method based on the Direct Preference Optimization (DPO) algorithm, which follows reinforcement learning principles without requiring additional reward models. The training data for the router is generated through a synthesizing-simulating-judging pipeline, which creates high-quality synthetic preference pairs for various tasks and structure types.</p><ol start="2"><li><p>Scattered Knowledge Structurizer: Once the optimal structure type is identified, the scattered knowledge structurizer comes into play. This module is responsible for extracting relevant information scattered across the raw documents and reconstructing it into structured knowledge in the chosen format. The structurizer leverages the powerful understanding and generation capabilities of LLMs to perform this complex task.</p></li></ol><p>The structurizer takes the question, the selected structure type, and each raw document as input. It then extracts structured knowledge from each document and generates a description of that knowledge. The per-document outputs are collected and combined to form the overall structured knowledge for the given task.</p><ol start="3"><li><p>Structured Knowledge Utilizer: The final module in the StructRAG framework is the structured knowledge utilizer, which performs reasoning to answer the question based on the constructed structured knowledge. 
This module is designed to handle complex, combinatorial questions in which relevant information is hard to identify and utilize directly.</p></li></ol><p>The utilizer employs an LLM-based approach to facilitate question decomposition, precise knowledge extraction, and final answer inference. It first breaks down the original question into several simpler sub-questions based on the overall description of the structured knowledge. Then, it extracts precise knowledge for each sub-question from the structured knowledge. Finally, the utilizer integrates all the sub-questions and their corresponding precise knowledge to generate the final answer.</p><h2>Training the Hybrid Structure Router</h2><p>The performance of the hybrid structure router is critical to the overall effectiveness of the StructRAG framework. To train the router, the authors propose a novel method that combines a synthesizing-simulating-judging pipeline for generating training data and the DPO algorithm for preference training.</p>
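The route-structurize-utilize sequence can be sketched end-to-end in a few lines. This is a toy sketch, not the paper's implementation: the keyword heuristic stands in for the DPO-trained router, and the extraction and inference steps are stubbed rather than real LLM calls.

```python
STRUCTURE_TYPES = ["table", "graph", "algorithm", "catalogue", "chunk"]

def route(question: str, core_content: str) -> str:
    """Hybrid structure router. In StructRAG this is a DPO-trained LLM;
    a keyword heuristic stands in here purely for illustration."""
    if "compare" in question:
        return "table"
    if "relation" in question:
        return "graph"
    return "chunk"

def structurize(question: str, structure_type: str, documents: list[str]) -> dict:
    """Scattered knowledge structurizer: per-document extraction into the
    chosen format (stubbed as tagged strings), plus an overall description."""
    knowledge = [f"[{structure_type}] extracted from {doc}" for doc in documents]
    return {"type": structure_type,
            "description": f"{structure_type} built for: {question}",
            "knowledge": knowledge}

def utilize(question: str, structured: dict) -> str:
    """Structured knowledge utilizer: decompose the question, extract precise
    knowledge per sub-question, then integrate into a final answer."""
    sub_questions = [f"{question} (sub-{i})" for i in (1, 2)]
    extracted = {sq: structured["knowledge"] for sq in sub_questions}
    return (f"answer to '{question}' via {len(extracted)} sub-questions "
            f"over a {structured['type']}")

question = "compare accuracy across the two reports"
answer = utilize(question, structurize(question, route(question, ""), ["doc A", "doc B"]))
print(answer)
```

In the real framework each of the three functions is an LLM invocation, and the router's decision is what the DPO preference training described below optimizes.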
      <p>
          <a href="https://www.llmwatch.com/p/structrag-succeeding-where-graph">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[SFR-RAG: How Open AI Can Beat OpenAI]]></title><description><![CDATA[Advancing Contextual Understanding in Large Language Models]]></description><link>https://www.llmwatch.com/p/sfr-rag-how-open-ai-can-beat-openai</link><guid isPermaLink="false">https://www.llmwatch.com/p/sfr-rag-how-open-ai-can-beat-openai</guid><dc:creator><![CDATA[Pascal Biese]]></dc:creator><pubDate>Wed, 25 Sep 2024 15:33:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4247e98a-561a-4ee5-8a86-b5e4b863248f_1037x778.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Retrieval Augmented Generation (RAG) has emerged as a critical paradigm for enhancing the capabilities of large language models (LLMs). The recently introduced SFR-RAG model, developed by researchers at Salesforce AI Research, represents a promising direction we&#8217;ve seen evolving this year: small(er) models closing the performance gap between proprietary and open AI. </p><p>This article will explore the key features of SFR-RAG and its performance on various benchmarks, and give a step-by-step explanation of how it improves RAG performance.</p><h2>Key Features of SFR-RAG</h2><p><a href="https://arxiv.org/abs/2409.09916">SFR-RAG</a> is a 9-billion-parameter language model specifically designed to excel in RAG applications. The model's primary goal is to faithfully and comprehensively understand provided context and user questions, avoid hallucination, handle challenging scenarios, perform complex reasoning, and produce reliable citations. Let's break down the key aspects of SFR-RAG and how it achieves these objectives.</p><h3>Novel Chat Template</h3><p>Traditional LLMs typically use three roles in their conversational structure: System, User, and Assistant. SFR-RAG expands on this by adding two new roles: Thought and Observation. 
</p><p>This comes with the following benefits:</p><p>a) Role Clarification: By introducing separate roles for Thought and Observation, SFR-RAG creates a clearer structure for different types of information. This helps the model distinguish between its internal reasoning process (Thought) and external information (Observation).</p><p>b) Easier Masking During Training: The new template allows for more precise control over which parts of the conversation should be included in the training loss. Specifically, System, User, and Observation turns can be masked out, while Thought and Assistant turns are included in the fine-tuning process.</p><p>c) Enhanced Security: The separation of roles facilitates better instruction hierarchy enforcement. This makes the model more resistant to potential jailbreaks or malicious instructions injected through User or Observation turns.</p><p>d) Improved Developer Control: The new template streamlines the process of building reliable and secure RAG applications. 
Developers can more easily control which parts of the internal processing to display or hide from end-users.</p><p>e) Consistent Function Calling: By designating a specific role (Thought) for internal reasoning and tool use syntax, SFR-RAG avoids the need to parse custom keywords from the Assistant output, leading to more reliable function calling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zr8J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zr8J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 424w, https://substackcdn.com/image/fetch/$s_!Zr8J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 848w, https://substackcdn.com/image/fetch/$s_!Zr8J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 1272w, https://substackcdn.com/image/fetch/$s_!Zr8J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zr8J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png" width="1053" height="601" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:601,&quot;width&quot;:1053,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172609,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zr8J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 424w, https://substackcdn.com/image/fetch/$s_!Zr8J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 848w, https://substackcdn.com/image/fetch/$s_!Zr8J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 1272w, https://substackcdn.com/image/fetch/$s_!Zr8J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8666fe63-01a4-449b-bc27-f351b88a63ab_1053x601.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of the chat format used by SFR-RAG. <a href="https://arxiv.org/abs/2409.09916">Source</a>.</figcaption></figure></div><h3>Comprehensive Fine-tuning Process</h3><p>The model underwent an extensive fine-tuning process designed to enhance its contextual understanding and generation abilities. This process focused on several key capabilities:</p><ol><li><p>Extracting Relevant Information: SFR-RAG is trained to efficiently extract pertinent information from long contexts. This is crucial for RAG applications where the model needs to sift through large amounts of retrieved data.</p></li><li><p>Recognizing Information Gaps: The model is trained to identify when relevant information is lacking in the provided context. 
This helps prevent hallucination by encouraging the model to abstain from generating responses when it lacks sufficient information.</p></li><li><p>Handling Conflicting Information: SFR-RAG is equipped to recognize and deal with potentially conflicting information in contextual passages. This is essential for real-world applications where retrieved information may be inconsistent or contradictory.</p></li><li><p>Resilience to Distractions: The fine-tuning process includes exposure to distracting, counter-intuitive, or out-of-distribution content. This helps the model maintain focus on relevant information even in the presence of noise.</p></li><li><p>Diverse Instruction Following: By using extensive instruction-following data that mimics real-world retrieval question answering applications, SFR-RAG is trained to handle a wide variety of tasks and query types.</p></li></ol><h2>Model Performance</h2><p>To evaluate SFR-RAG's performance, the researchers introduced <a href="https://github.com/SalesforceAIResearch/SFR-RAG/blob/main/README_ContextualBench.md">ContextualBench</a>, a comprehensive evaluation suite comprising seven popular contextual question-answering tasks. This standardized benchmark allows for consistent comparison across different models and studies.</p>
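The five-role template and the masking rule it enables can be illustrated schematically. The role names below mirror the paper's roles, but the representation is hypothetical: the actual serialization and special tokens are defined by the released model, and `loss_mask` is an illustrative helper, not part of SFR-RAG's code.

```python
# A five-role SFR-RAG-style conversation (schematic; real turns use the
# model's own chat tokens, and the citation "[1]" here is made up).
conversation = [
    {"role": "System",      "content": "Answer using only the provided context."},
    {"role": "User",        "content": "Who developed SFR-RAG?"},
    {"role": "Thought",     "content": "Search the retrieved passages for the developer."},
    {"role": "Observation", "content": "[1] SFR-RAG was developed at Salesforce AI Research."},
    {"role": "Assistant",   "content": "Salesforce AI Research [1]."},
]

# Per the fine-tuning setup described above: System, User, and Observation
# turns are masked out of the loss; Thought and Assistant turns are trained on.
MASKED_ROLES = {"System", "User", "Observation"}

def loss_mask(turns):
    """1 = turn contributes to the training loss, 0 = masked out."""
    return [0 if t["role"] in MASKED_ROLES else 1 for t in turns]

print(loss_mask(conversation))  # [0, 0, 1, 0, 1]
```

Keeping Thought in the loss while masking Observation is what lets the model learn its own reasoning and tool-use syntax without being trained to imitate retrieved text.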
      <p>
          <a href="https://www.llmwatch.com/p/sfr-rag-how-open-ai-can-beat-openai">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>