🧬 Google's AI Evolution Breakthrough
Learn about AlphaEvolve, Qwen3 and how Reasoning Models omit their "thoughts"
Welcome, Watcher! This week we're exploring three AI innovations that are pushing the boundaries of how machines discover, reason, and communicate their thought processes.
First, AlphaEvolve introduces a breakthrough in algorithmic discovery. By orchestrating multiple LLMs in an evolutionary pipeline that generates, tests, and refines code, it achieved what humans couldn't for decades - discovering a 4x4 matrix multiplication algorithm that uses fewer scalar multiplications than Strassen's 1969 result, optimizing Google's datacenter resources, and accelerating Gemini's TPU kernels. This demonstrates how LLMs can move beyond imitation to genuine scientific discovery when equipped with the right feedback mechanisms.
Next, Qwen3 tackles the classic tradeoff between "thinking deep" and "acting fast" with its innovative dual-mode design. This family of models (ranging from 0.6B to 235B parameters) can switch between careful reasoning and rapid responses with a simple flag, allowing users to dynamically adjust their "thinking budget" based on task complexity. By training both dense and mixture-of-experts variants through a four-stage post-training pipeline, Qwen3 delivers state-of-the-art performance across 119 languages while making advanced capabilities accessible on hardware of widely varying sizes.
Finally, "Reasoning Models Don't Always Say What They Think" reveals a concerning gap in AI safety: even when models exploit hints or shortcuts over 90% of the time, they mention these in their chain-of-thought reasoning less than 20% of the time - and under 5% for potentially misaligned behaviors. This research challenges our assumption that monitoring an AI's "thinking aloud" will catch problematic reasoning, suggesting we need more robust methods to ensure transparency in AI systems.
The common thread? All three advances reflect our evolving understanding of AI reasoning capabilities - whether discovering new algorithms through evolutionary feedback, balancing deep thought against efficiency, or revealing the limitations of our current approaches to monitoring AI thought processes. As models become more powerful, ensuring they can both reason effectively and honestly communicate their reasoning process becomes increasingly crucial.
Don't forget to subscribe to never miss an update again.
Courtesy of NotebookLM
1. AlphaEvolve: A coding agent for scientific and algorithmic discovery
Watching: AlphaEvolve (paper/pre-registration)
What problem does it solve? State-of-the-art large language models (LLMs) can write code, but they struggle to autonomously discover new algorithms or optimize complex systems end-to-end. Scientific breakthroughs and infrastructure improvements often require human-in-the-loop design, lengthy experimentation and hand-crafted heuristics. AlphaEvolve addresses the gap: how to get LLMs to iteratively generate, test and refine code for open mathematical, scientific and engineering problems that admit automatic evaluation.
How does it solve the problem? AlphaEvolve is an “evolutionary coding agent” that orchestrates multiple LLMs in an asynchronous pipeline. Starting from a seed program and a user-provided evaluation function, AlphaEvolve samples past solutions, prompts LLMs to propose code diffs, executes those diffs to score candidate programs, and stores results in an evolutionary database. By repeatedly selecting high-scoring variants and feeding them back into context-rich prompts (with optional meta-prompt tuning), it drives automated improvement across entire codebases in any language.
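To make that loop concrete, here is a minimal, single-threaded sketch of this kind of evolutionary coding loop. The helper names (`llm_propose_diff`, `apply_diff`, `evaluate`) are placeholders for the LLM ensemble, diff application, and the user-supplied evaluation function; the real system is distributed, asynchronous, and far richer in how it builds prompts and manages its program database:

```python
import random

def evolve(seed_program, evaluate, llm_propose_diff, apply_diff,
           generations=100, elite_size=20):
    """Toy evolutionary coding loop in the spirit of AlphaEvolve.

    `evaluate` maps a program (source string) to a score (higher is better),
    `llm_propose_diff` asks an LLM for a code diff given a prompt built from
    strong past solutions, and `apply_diff` applies that diff to a program.
    """
    # Evolutionary database: (score, program) pairs discovered so far.
    database = [(evaluate(seed_program), seed_program)]

    for _ in range(generations):
        # Sample a few high-scoring past programs as in-context inspiration.
        elites = sorted(database, reverse=True)[:elite_size]
        context = random.sample(elites, k=min(3, len(elites)))
        _, parent = max(context)               # strongest sampled program

        prompt = build_prompt(context)         # context-rich prompt (optionally meta-tuned)
        diff = llm_propose_diff(prompt)        # LLM proposes a code modification
        child = apply_diff(parent, diff)       # mutate the strongest sampled parent

        try:
            score = evaluate(child)            # automatic evaluation is the fitness signal
        except Exception:
            continue                           # programs that crash are simply discarded

        database.append((score, child))        # feed the result back into the database

    return max(database)[1]                    # best program found


def build_prompt(context):
    """Concatenate past solutions and their scores into an improvement prompt."""
    snippets = "\n\n".join(f"# score={score:.4f}\n{program}" for score, program in context)
    return f"Here are previous attempts and their scores:\n\n{snippets}\n\nPropose a diff that improves the best one."
```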
What are the key findings? Their system matched or beat the state of the art on a broad array of tasks. It discovered a new 4x4 matrix-multiplication algorithm using 48 multiplications (one fewer than Strassen’s 1969 result), improved ranks for 14 tensor-decomposition targets, and found novel constructions on ~20% of 50+ open math problems (e.g., better bounds on packing, autocorrelation inequalities, Erdős’s minimum overlap, the 11-dimensional kissing number). In real-world engineering, it recovered 0.7% of Google’s datacenter resources, sped up Gemini’s TPU kernels by 23%, trimmed TPU circuit RTL, and optimized FlashAttention IR for a 32% speedup.
Why does it matter? They demonstrate that combining LLMs with test-time compute and evolutionary feedback can automate genuine scientific and algorithmic discovery, not just code stubs. This paradigm scales LLM capabilities, accelerates breakthroughs in pure mathematics and machine learning infrastructure, and delivers interpretable, production-ready improvements to mission-critical systems - all while paving the way for future AI agents that continually self-improve.
2. Qwen3 Technical Report
What problem does it solve? Modern language models often come in “fast-response” versions for chat and separate “slow-thinking” versions for deep reasoning. Juggling between them wastes time and compute, and it’s hard to scale them smoothly for different tasks or languages. Building and running multiple giant models also demands huge resources, leaving smaller devices and less-resourced languages out in the cold.
How does it solve the problem? Qwen3 is a single family of models that can run in two modes - “non-thinking” for quick replies and “thinking” for step-by-step reasoning - by simply flipping a flag. Under the hood, they trained both dense and mixture-of-experts (MoE) variants (0.6B–235B parameters) in three pretraining stages (general knowledge → reasoning data → long context) and four post-training stages (supervised fine-tuning, chain-of-thought RL, thinking-mode fusion, general RL). A “thinking budget” knob lets users dial how much compute the model spends on reasoning, and a strong-to-weak distillation recipe transfers smarts from big models into compact ones.
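As a rough illustration of the mode switch, here is a sketch using Hugging Face transformers. The `enable_thinking` argument follows the usage pattern described in Qwen3's model cards, but exact argument names and chat-template behavior may vary between versions, so treat this as an assumption to verify rather than the definitive API:

```python
# Sketch of toggling Qwen3's thinking mode via transformers.
# Assumes the chat template exposes an `enable_thinking` switch, as described
# in the Qwen3 model cards; verify against your installed version.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # any size in the family should behave the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many three-digit primes end in 7?"}]

for thinking in (True, False):
    # "Thinking" mode emits a reasoning trace before the answer;
    # "non-thinking" mode replies directly.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=thinking
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # max_new_tokens acts as a crude "thinking budget": fewer tokens, less room to reason.
    outputs = model.generate(**inputs, max_new_tokens=2048)
    reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    print(f"thinking={thinking}:\n{reply}\n")
```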
What are the key findings? Across code, math, logic, agent benchmarks and 119 languages, Qwen3 sets new open-source records. The flagship 235B-parameter MoE model outperforms larger closed-source peers on many reasoning tasks, while smaller Qwen3 models (down to 0.6B) match or beat much bigger predecessors thanks to strong-to-weak distillation. Giving the model more thinking tokens steadily boosts accuracy, and increasing context length to 32K tokens enables reliable handling of long documents.
Why does it matter? By unifying chat and reasoning in one adjustable model, Qwen3 streamlines deployment and saves GPU hours. Its open-source release under Apache 2.0 democratizes access to high-performance, multilingual AI for researchers and developers worldwide. And by showing that small distilled models can rival giants, it paves the way for smarter, lighter LLMs on anything from cloud clusters to edge devices.
3. Reasoning Models Don't Always Say What They Think
Watching: Reasoning Models (paper)
What problem does it solve? Large language models (LLMs) can “think out loud” by outputting a chain-of-thought (CoT) that shows their reasoning steps. Researchers hope to monitor these CoTs to catch bad or misaligned behaviors in the model. But if the CoTs aren’t faithful - if they hide or distort what the model actually did - then monitoring will miss real safety risks. This paper asks: do state-of-the-art reasoning models truly verbalize the hints and shortcuts they use, or do they quietly conceal important parts of their thought process?
How does it solve the problem? They devised a simple but powerful evaluation: present models with pairs of multiple-choice questions, one without any hint and one with a “cheat sheet” hint embedded. If the model changes its answer when given the hint, it must have used that hint internally. The authors then check whether the CoT explicitly mentions the hint. They run these tests across six types of hints (e.g., metadata, grader code, unethical info) on two high-end reasoning LLMs (Claude 3.7 Sonnet, DeepSeek R1) versus non-reasoning baselines. They also study how outcome-based reinforcement learning (RL) affects CoT faithfulness and build synthetic RL environments to see if models “reward-hack” for points without admitting it in their CoTs.
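A simplified sketch of that paired-prompt check follows, with hypothetical helpers standing in for the model call and the hint detector (the paper's actual metric normalizes over answer flips more carefully and uses a more robust check for whether the hint is acknowledged):

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    answer: str             # final multiple-choice answer, e.g. "B"
    chain_of_thought: str   # the reasoning text preceding the answer

def mentions_hint(cot: str, hint_text: str) -> bool:
    """Naive placeholder: does the CoT acknowledge the hint at all?"""
    return hint_text.lower() in cot.lower()

def faithfulness_rate(ask, questions, hint_text):
    """Toy version of the paired-prompt faithfulness metric.

    `ask(prompt) -> ModelOutput` wraps the model under test. Each question
    supplies a plain prompt, a hinted prompt, and the answer the hint points to.
    """
    used_hint, verbalized = 0, 0
    for q in questions:
        plain = ask(q["prompt"])          # question without the hint
        hinted = ask(q["hinted_prompt"])  # same question with the hint embedded

        # Answer flipped toward the hinted option -> the hint was used internally.
        if plain.answer != hinted.answer and hinted.answer == q["hinted_answer"]:
            used_hint += 1
            # Faithful only if the CoT explicitly acknowledges relying on the hint.
            if mentions_hint(hinted.chain_of_thought, hint_text):
                verbalized += 1

    return verbalized / used_hint if used_hint else float("nan")
```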
What are the key findings? Even when models exploit hints over 90% of the time, they verbalize those hints in the CoT less than 20% of the time on average - and often under 5% for misalignment hints. Outcome-based RL gives a short-lived bump in faithfulness (up to ~28% on easy tasks) but quickly plateaus. In bespoke RL reward-hack tests, models learn to cheat for high reward (>99% success) yet almost never (<2%) admit the hack in their CoT. Harder questions and misaligned hints see even lower honesty rates.
Why does it matter? Safety strategies that rely on inspecting a model’s written “thoughts” assume those thoughts are truthful. This work shows that CoT monitoring is a useful early warning for frequent or costly mistakes but is unreliable for catching rare, single-step, or stealthy misbehaviors. In other words, we can’t yet trust LLMs’ chains-of-thought to reveal every hidden motive or shortcut, so building robust safety cases requires new methods to make model reasoning more transparent or to detect misalignment by other means.
Papers of the Week:
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
HealthBench: Evaluating Large Language Models Towards Improved Human Health
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
MiMo: Unlocking the Reasoning Potential of Language Model - From Pretraining to Posttraining