LLM Watch

The Week in AI Agents

AI Agents of the Week: Papers You Should Know About

Get ahead of the curve with LLM Watch

Apr 12, 2026
∙ Paid

Executive Summary

AI agents that spontaneously collude to prevent each other’s shutdown. Web agents that fail two-thirds of everyday online tasks. Skills that evolve across users like a living organism. Autonomous AI is at an inflection point - capable enough to surprise us, brittle enough to humble us, and occasionally deceptive enough to alarm us.

Navigating Complex Web and Physical Environments: Several papers this week push autonomous agents out of controlled sandboxes and into the messy real world. ClawBench evaluates agents on 153 everyday tasks across 144 live production websites - and finds that even Claude Sonnet 4.6 achieves only 33.3% success. MolmoWeb takes a different approach, building fully open visual web agents that navigate using only screenshots, no HTML or APIs required, achieving state-of-the-art results among open-weight models and reaching 94.7% pass@4 on WebVoyager through test-time scaling. Meanwhile, HY-Embodied-0.5 bridges the gap to physical environments with embodied foundation models that outperform similarly sized competitors on 16 of 22 benchmarks spanning spatial reasoning and robotic control.
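MolmoWeb's exact test-time scaling recipe isn't detailed here, but the pass@4 metric itself is easy to make concrete: run up to four independent attempts and count the task as solved if any one succeeds. A minimal sketch (the per-attempt rate below is purely illustrative, not a number from the paper):

```python
def pass_at_k(attempt_success_prob: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - attempt_success_prob) ** k

# Illustrative only: a per-attempt success rate of ~52% already
# yields pass@4 above 94% under the independence assumption.
print(round(pass_at_k(0.52, 4), 3))  # → 0.947
```

This is why test-time scaling is attractive for web agents: modest single-attempt reliability compounds quickly when retries are cheap and success is verifiable.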

The Scaling and Evolution of Agent Skills: As agent tool libraries grow into the thousands, two papers offer complementary - and sometimes competing - visions for managing them. Graph of Skills introduces a structural retrieval layer that improves average reward by 43.6% while cutting input tokens by 37.8%, solving the immediate problem of context window saturation. SkillClaw goes further, arguing that static skill libraries are fundamentally insufficient and proposing a framework where skills continuously evolve through aggregated multi-user interaction data. Together, they suggest that the next generation of agent architectures will need both smarter retrieval and living, self-improving skill repositories.
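To see why a structural retrieval layer can cut context size, consider a toy version of the idea: index skills as graph nodes, seed retrieval with a keyword match, then expand along dependency edges so prerequisites come along for free - instead of dumping the whole library into the prompt. All names and the matching logic below are illustrative assumptions, not Graph of Skills' actual mechanism:

```python
from collections import defaultdict

class SkillGraph:
    """Toy structural skill retrieval (illustrative sketch only)."""

    def __init__(self):
        self.skills = {}               # name -> description
        self.edges = defaultdict(set)  # name -> prerequisite skill names

    def add_skill(self, name, description, depends_on=()):
        self.skills[name] = description
        for dep in depends_on:
            self.edges[name].add(dep)

    def retrieve(self, query, hops=1):
        # Seed: skills whose description shares a content word with the query.
        words = {w for w in query.lower().split() if len(w) > 3}
        seeds = {n for n, d in self.skills.items()
                 if words & set(d.lower().split())}
        # Expand along dependency edges to pull in prerequisites.
        selected, frontier = set(seeds), seeds
        for _ in range(hops):
            frontier = {dep for n in frontier for dep in self.edges[n]}
            selected |= frontier
        return sorted(selected)

g = SkillGraph()
g.add_skill("http_get", "fetch a URL over HTTP")
g.add_skill("parse_html", "parse HTML into a DOM tree")
g.add_skill("scrape_table", "extract a table from a web page",
            depends_on=["http_get", "parse_html"])
print(g.retrieve("extract the pricing table from a page"))
# → ['http_get', 'parse_html', 'scrape_table']
```

Only the matched subgraph reaches the model's context, which is the intuition behind the reported token savings; SkillClaw's contribution would then sit on top of such a structure, rewriting node contents as multi-user interaction data accumulates.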

Foundations of Reasoning and Coordination: Underpinning all of these applied advances are two papers that rethink how agents learn at a fundamental level. Rethinking Generalization in Reasoning SFT challenges the prevailing narrative that supervised finetuning only memorizes, showing that cross-domain generalization follows a “dip-and-recovery” pattern that many teams may be abandoning too early. And Value-Guidance MeanFlow proposes a flow-based framework for offline multi-agent reinforcement learning that treats optimal joint policy learning as conditional behavior cloning, achieving competitive performance with substantially improved training and inference efficiency.
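The practical risk with a dip-and-recovery curve is that standard patience-based early stopping fires during the dip. A minimal sketch with hypothetical numbers (not the paper's data) shows how a run that would later recover gets abandoned:

```python
def would_early_stop(evals, patience=2):
    """Return True if a naive patience-based rule stops on this eval curve."""
    best, stale = evals[0], 0
    for v in evals[1:]:
        if v > best:
            best, stale = v, 0
        else:
            stale += 1
        if stale >= patience:
            return True
    return False

# Hypothetical cross-domain accuracy per checkpoint: dips, then recovers.
curve = [0.62, 0.58, 0.55, 0.57, 0.66, 0.71]
print(would_early_stop(curve))  # → True: stopping fires mid-dip, before 0.71
```

If the dip-and-recovery pattern holds, the fix is as simple as widening patience or tracking the curve's shape rather than its latest value - which is exactly the sense in which teams may be "abandoning too early."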

© 2026 Pascal Biese