LLM Watch

The Week in AI Agents

AI Agents of the Week: Papers You Should Know About

May 17, 2026

Executive Summary

Long-Horizon Reasoning Stability: The most persistent bottleneck for autonomous agents - maintaining coherent reasoning over extended trajectories - saw a significant advance this week. SU-01, a 30B-A3B model trained with a unified scaling recipe of curriculum SFT and two-stage reinforcement learning, achieved gold-medal-level performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 while sustaining stable reasoning over trajectories exceeding 100,000 tokens. Meanwhile, Self-Distilled Agentic Reinforcement Learning (SDAR) tackled the complementary problem of optimization instability in multi-turn agent tasks, demonstrating that dense token-level guidance via gated self-distillation yields substantial gains over standard GRPO - +9.4% on ALFWorld, +7.0% on Search-QA, and +10.2% on WebShop accuracy. Together, these papers suggest that the field is converging on layered training strategies that combine coarse trajectory rewards with fine-grained supervision.
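To make the contrast between coarse trajectory rewards and dense token-level guidance concrete, here is a minimal sketch of the group-relative advantage that GRPO is commonly described as using. It is not SDAR's implementation: the function names, the toy rewards, and the choice of population standard deviation are illustrative assumptions. The point is that a single trajectory-level reward assigns every token in a rollout the same credit, which is the gap that token-level supervision is meant to fill.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's scalar reward against its sampling group.

    Assumption: this follows the commonly described GRPO form
    (reward minus group mean, divided by group standard deviation).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """A trajectory-level advantage gives every token identical credit."""
    return [advantage] * num_tokens


# Hypothetical group of four rollouts for one task: two succeed, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Every token of the first rollout receives the same signal,
# whether it was a decisive action or filler.
token_advantages = broadcast_to_tokens(advantages[0], num_tokens=5)
```

A dense method would replace the constant `token_advantages` list with a per-token signal; how SDAR's gated self-distillation computes that signal is not specified in this summary.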

Multimodal Memory and Visual-Centric Benchmarking: Two papers exposed a critical weakness in how we evaluate whether vision-language agents actually see. MemLens introduced a benchmark of 789 questions across five memory abilities and four context lengths (32K to 256K tokens), revealing that removing evidence images drops frontier model accuracy below 2% on the majority of questions. Yet no existing architecture handles both length stability and visual fidelity. MemEye deepened this critique by showing that many agents “cheat” via textual captions, and proposed a granularity-aware framework measuring visual evidence from scene-level down to pixel-level detail. The implication is clear: multimodal agents need hybrid memory architectures that current designs have yet to deliver.
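The image-removal check described above can be sketched as a simple ablation: score the same questions once with their evidence images and once without, then compare. This is an illustrative harness, not the MemLens or MemEye code. `Item`, `accuracy`, and `toy_model` are hypothetical names, and the toy model stands in for a real vision-language agent, including one deliberately leaked caption to show how text-only shortcuts surface.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    question: str
    answer: str
    images: list = field(default_factory=list)


def accuracy(items, answer_fn, drop_images=False):
    """Exact-match accuracy, optionally with all evidence images removed."""
    correct = 0
    for item in items:
        images = [] if drop_images else item.images
        prediction = answer_fn(item.question, images)
        correct += int(prediction.strip().lower() == item.answer.strip().lower())
    return correct / len(items)


def toy_model(question, images):
    """Hypothetical stand-in: reads the image if present, else any leaked caption."""
    if images:
        return images[0]
    if "caption:" in question:
        return question.split("caption:")[1].strip()
    return "unknown"


# Toy data: images are represented by the string the model would read from them.
items = [
    Item("What color was the mug?", "red", ["red"]),
    Item("Which logo was on the box?", "owl", ["owl"]),
    Item("How many chairs were visible?", "three", ["three"]),
    Item("What animal appeared? caption: cat", "cat", ["cat"]),
]

with_images = accuracy(items, toy_model)
without_images = accuracy(items, toy_model, drop_images=True)
visual_gap = with_images - without_images
```

Questions that stay correct after the images are dropped, like the caption-leak item here, are the ones a visual-centric benchmark would want to filter out or flag.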

From Individual Agents to Collective Intelligence: The challenge of scaling from single-agent to multi-agent systems received both a practical solution and a theoretical framework this week. LC-MAPF introduced a learnable local communication module for multi-agent pathfinding that outperforms existing learning-based solvers without compromising scalability. The LIFE survey provided a broader conceptual map, organizing multi-agent research into four causally linked stages - Lay, Integrate, Find, Evolve - while warning that tighter coordination amplifies error propagation risks that remain largely unsolved.

Deployment-Ready Safety and Efficiency: Two papers addressed the gap between research prototypes and real-world agent deployment. LiSA proposed a lifelong safety adaptation framework that converts sparse, noisy user feedback into reusable policy abstractions, remaining robust even at 20% label-flip noise rates - without requiring model retraining. SANA-WM demonstrated that minute-scale world modeling at 720p resolution is now feasible on a single consumer GPU, achieving 36× higher throughput than prior open-source baselines. These advances lower the barriers to deploying agents that are both physically grounded and contextually safe.
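For readers unfamiliar with the term, a 20% label-flip noise rate means each binary feedback label is inverted independently with probability 0.2. The sketch below simulates that and shows why aggregating several noisy signals can recover the underlying label. It is not LiSA's method: majority voting is a swapped-in baseline chosen for illustration, and the function names, seed, and sample sizes are assumptions.

```python
import random


def flip_labels(labels, flip_rate, rng):
    """Invert each boolean label independently with probability flip_rate."""
    return [(not y) if rng.random() < flip_rate else y for y in labels]


def majority_vote(votes):
    """Return True when more than half of the boolean votes are True."""
    return sum(votes) * 2 > len(votes)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
true_labels = [rng.random() < 0.5 for _ in range(500)]

# One noisy feedback signal per item: roughly 80% agree with the truth.
single = flip_labels(true_labels, 0.2, rng)
single_acc = sum(s == t for s, t in zip(single, true_labels)) / len(true_labels)

# Fifteen noisy signals per item, aggregated by majority vote.
voted = [majority_vote(flip_labels([t] * 15, 0.2, rng)) for t in true_labels]
vote_acc = sum(v == t for v, t in zip(voted, true_labels)) / len(true_labels)
```

Real deployments rarely get fifteen independent signals per case, which is why approaches that work from sparse feedback are of interest.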

© 2026 Pascal Biese