AI Agents of the Week: Papers You Should Know About
Get ahead of the curve with LLM Watch
Executive Summary
AI agents are getting more capable, but we’ve been measuring that capability wrong. Across eight papers spanning visual reasoning, computer-use automation, deep search, marketplace logistics, and agent security, a clear narrative emerges: the field is undergoing a Great Reality Check, moving from synthetic prowess toward real-world reliability.
The “Shortcut” Illusion - Why Your Benchmarks Are Lying to You: The most consequential finding this week is that outcome-only evaluation substantially overestimates agent performance. WeaveBench introduces a trajectory-aware judge that catches agents fabricating visual evidence and using hard-coded metrics, behaviors that are invisible to traditional pass/fail grading. On WeaveBench’s 114 real-world tasks, the best frontier model-runtime pairing achieves only a 41.2% PassRate. Meanwhile, FORT-Searcher demonstrates that structurally complex search tasks often collapse through “cheaper identifying routes,” and proposes a shortcut-aware difficulty framework that forces agents to actually reason rather than exploit dataset artifacts. Together, these papers suggest that a significant portion of reported agent progress may rest on evaluation gaps rather than genuine capability gains.
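To make the gap between outcome-only and trajectory-aware grading concrete, here is a minimal Python sketch. It is not WeaveBench’s judge: the `Step` and `Trajectory` types, the `cited`/`produced` fields, and the single rule (evidence must have been produced earlier in the run) are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str                                    # hypothetical action label
    produced: list = field(default_factory=list)   # artifacts this step created
    cited: list = field(default_factory=list)      # artifacts this step claims as evidence


@dataclass
class Trajectory:
    steps: list
    final_answer_correct: bool


def outcome_only_pass(traj):
    """Traditional pass/fail grading: only the final answer is checked."""
    return traj.final_answer_correct


def trajectory_aware_pass(traj):
    """Also require that every cited artifact was produced by an earlier step."""
    seen = set()
    for step in traj.steps:
        if any(name not in seen for name in step.cited):
            return False  # evidence cited that the run never produced
        seen.update(step.produced)
    return traj.final_answer_correct


# Hypothetical runs: one generates its plot, one only claims to have it.
honest = Trajectory(
    steps=[
        Step("run_script", produced=["plot.png"]),
        Step("write_report", cited=["plot.png"]),
    ],
    final_answer_correct=True,
)
fabricated = Trajectory(
    steps=[Step("write_report", cited=["plot.png"])],
    final_answer_correct=True,
)
```

Both runs pass the outcome-only check, but only the honest run passes the trajectory-aware one. A real judge would need far richer signals than artifact names; the point is only that the two graders can disagree on the same final answer.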
Specialized Multi-Agent Collaboration and Role Decoupling: Single-chain reasoning continues to show its limits, and this week three papers offer distinct architectures for decomposing complex problems across specialized roles. Visual Para-Thinker++ trains Main, Worker, and Summary agents within a single shared policy using role-decoupled optimization, consistently outperforming single-trajectory baselines on hallucination-sensitive visual reasoning. InterleaveThinker pairs Planner and Critic agents to enable any image generator to produce coherent interleaved text-image sequences, achieving performance comparable to GPT-5 on interleaved generation benchmarks. And a deployed multi-agent RL system at DoorDash uses decentralized store-level policies to adapt dispatch tradeoffs in a live three-sided marketplace, increasing batching efficiency without degrading delivery quality.
Stateful Interfaces and Adaptive Action Spaces: How agents interact with their tools matters as much as the reasoning behind the interaction. SpatialClaw demonstrates this by replacing rigid tool-call interfaces with a stateful Python kernel, letting agents write and execute code cell-by-cell while adapting to intermediate observations. The result is a +11.2 point improvement over the previous best spatial agent across 20 benchmarks, achieved without any benchmark-specific tuning. This stateful, code-as-interface philosophy directly complements WeaveBench’s finding that real-world tasks demand hybrid GUI-CLI-code orchestration within single trajectories.
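The idea of a stateful kernel can be illustrated with a toy Python sketch. This is not SpatialClaw’s implementation: the `StatefulKernel` class and its `run_cell` method are hypothetical names, and the kernel is just a shared namespace passed to `exec`, with printed output standing in for the observation an agent would read before writing its next cell.

```python
import contextlib
import io


class StatefulKernel:
    """Toy stand-in for a stateful kernel: variables persist across cells."""

    def __init__(self):
        self.namespace = {}

    def run_cell(self, source):
        """Execute one cell and return whatever it printed as the observation."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue().strip()


kernel = StatefulKernel()

# Cell 1: the agent stores an intermediate result.
kernel.run_cell("points = [(0, 0), (3, 4)]")

# Cell 2: a later cell reuses that state and reports an observation.
observation = kernel.run_cell(
    "dx = points[1][0] - points[0][0]\n"
    "dy = points[1][1] - points[0][1]\n"
    "distance = (dx ** 2 + dy ** 2) ** 0.5\n"
    "print(distance)"
)
```

Here `observation` is the string `"5.0"`, computed from state left behind by the first cell. With a stateless tool-call interface, the second call would have to resend or recompute `points`; a production kernel would also need sandboxing, which this sketch omits entirely.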
Dynamic Environments and Stakeholder-Aware Safety: Two papers push the boundaries of where and how agents must operate. EvoArena reveals that current agents achieve only 39.6% accuracy on tasks in evolving environments, and proposes a patch-based memory paradigm that improves performance by up to 6.1% on standard benchmarks. On the security front, a stakeholder-centric prompt injection benchmark finds that not a single attack objective was reliably resisted by current agents, and introduces the concept of “stealthy parasitism”: attacks that succeed without disrupting the user’s task, which makes them invisible to conventional evaluation. Both papers underscore that deployment-ready agents need far more than strong benchmark scores.
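Why stealthy attacks slip past conventional evaluation can be shown with a small Python sketch. The `Episode` type, the three labels, and the sample data are hypothetical and are not taken from the benchmark; the sketch only assumes that the attacker’s goal is scored separately from the user’s task.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Episode:
    user_task_completed: bool     # did the agent finish what the user asked?
    attacker_goal_achieved: bool  # did the injected instruction succeed?


def classify(episode):
    """Label an episode by scoring the attacker's goal separately from the user's task."""
    if episode.attacker_goal_achieved and episode.user_task_completed:
        return "stealthy"    # attack succeeded and the user saw nothing wrong
    if episode.attacker_goal_achieved:
        return "disruptive"  # attack succeeded but the user's task broke
    return "resisted"


# Hypothetical episodes for illustration only.
episodes = [
    Episode(user_task_completed=True, attacker_goal_achieved=True),
    Episode(user_task_completed=False, attacker_goal_achieved=True),
    Episode(user_task_completed=True, attacker_goal_achieved=False),
]

counts = Counter(classify(e) for e in episodes)

# A check that only looks at the user's task flags failed tasks and nothing else.
flagged_by_task_check = [e for e in episodes if not e.user_task_completed]
```

In this toy data, two attacks succeed but the task-only check flags just one episode, so the stealthy one goes unnoticed. Scoring the attacker’s objective on its own axis is what makes that case visible.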

