AI Agents of the Week: Papers You Should Know About

Get ahead of the curve with LLM Watch

May 17, 2026

Executive Summary

Long-Horizon Reasoning Stability: The most persistent bottleneck for autonomous agents, maintaining coherent reasoning over extended trajectories, saw a significant advance this week. SU-01, a 30B-A3B model trained with a unified scaling recipe of curriculum SFT and two-stage reinforcement learning, achieved gold-medal-level performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 while sustaining stable reasoning over trajectories exceeding 100,000 tokens. Meanwhile, Self-Distilled Agentic Reinforcement Learning (SDAR) tackled the complementary problem of optimization instability in multi-turn agent tasks, demonstrating that dense token-level guidance via gated self-distillation yields substantial gains over standard GRPO: +9.4% on ALFWorld, +7.0% on Search-QA, and +10.2% on WebShop accuracy. Together, these papers suggest that the field is converging on layered training strategies that combine coarse trajectory rewards with fine-grained supervision.
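To make "coarse trajectory rewards" concrete, here is a minimal sketch of the group-relative advantage used by the standard GRPO baseline mentioned above. It is not SDAR's gated self-distillation, whose details this summary does not give. The function names, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's scalar reward against its own sampling group.

    Assumption: population standard deviation; implementations differ on
    this detail. `eps` guards against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Every token in a rollout receives the same trajectory-level advantage."""
    return [advantage] * num_tokens


# Four rollouts for the same prompt: two succeeded, two failed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A 5-token rollout gets one shared scalar, with no per-token distinction.
token_advantages = broadcast_to_tokens(advantages[0], 5)
```

Because every token in a rollout inherits the same scalar, this signal cannot say which step of a long multi-turn trajectory helped or hurt. That is the gap a dense token-level signal is meant to fill.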

Multimodal Memory and Visual-Centric Benchmarking: Two papers exposed a critical weakness in how we evaluate whether vision-language agents actually see. MemLens introduced a benchmark of 789 questions across five memory abilities and four context lengths (32K to 256K tokens), revealing that removing evidence images drops frontier model accuracy below 2% on the majority of questions. Yet no existing architecture handles both length stability and visual fidelity. MemEye deepened this critique by showing that many agents “cheat” via textual captions, and proposed a granularity-aware framework measuring visual evidence from scene-level down to pixel-level detail. The implication is clear: multimodal agents need hybrid memory architectures that current designs have yet to deliver.

From Individual Agents to Collective Intelligence: The challenge of scaling from single-agent to multi-agent systems received both a practical solution and a theoretical framework this week. LC-MAPF introduced a learnable local communication module for multi-agent pathfinding that outperforms existing learning-based solvers without compromising scalability. The LIFE survey provided a broader conceptual map, organizing multi-agent research into four causally linked stages (Lay, Integrate, Find, Evolve) while warning that tighter coordination amplifies error propagation risks that remain largely unsolved.

Deployment-Ready Safety and Efficiency: Two papers addressed the gap between research prototypes and real-world agent deployment. LiSA proposed a lifelong safety adaptation framework that converts sparse, noisy user feedback into reusable policy abstractions, remaining robust even at 20% label-flip noise rates, all without requiring model retraining. SANA-WM demonstrated that minute-scale world modeling at 720p resolution is now feasible on a single consumer GPU, achieving 36× higher throughput than prior open-source baselines. These advances lower the barriers to deploying agents that are both physically grounded and contextually safe.

Continue reading this post for free, courtesy of Pascal Biese.