AI Agents of the Week: Papers You Should Know About
Get ahead of the curve with LLM Watch
In this Issue
Computer-Use Agents and the Data Bottleneck: The path to general-purpose desktop automation remains constrained not by model capability but by training data quality. This week, CUA-Suite tackles this head-on with approximately 10,000 human-demonstrated tasks across 87 applications, totaling ~55 hours of continuous 30 fps video - dwarfing the prior largest open dataset’s ~20 hours. Preliminary evaluation reveals a sobering ~60% task failure rate on professional desktop applications, confirming that current foundation action models still struggle with real-world workflows. Meanwhile, UI-Voyager demonstrates that a 4B-parameter model can reach 81.0% Pass@1 on AndroidWorld through self-evolving learning from failures, surpassing human-level performance without expensive manual annotation. Together, these papers bracket the field’s central tension: we need far more demonstration data, and we need agents that learn efficiently from their own mistakes.
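The failure-driven loop behind "self-evolving learning" can be caricatured in a few lines. Everything below is an invented stand-in, not UI-Voyager's actual method: the environment is a stub whose success odds rise as lessons accumulate, and the reflection step is a string template where a real system would use model-written self-critiques of GUI rollouts.

```python
import random

random.seed(0)  # deterministic toy run

def attempt_task(task, lessons):
    """Stub environment: success probability grows as lessons accumulate.
    A hypothetical stand-in for a real GUI rollout."""
    p = min(0.9, 0.2 + 0.2 * len(lessons))
    return random.random() < p, f"trajectory<{task}|{len(lessons)} hints>"

def extract_lesson(trajectory):
    """Stub reflection step: turn a failed rollout into a reusable hint."""
    return f"avoid the mistake seen in {trajectory}"

def self_evolve(task, max_rounds=10):
    """Failure-driven loop: mine failed rollouts for lessons that condition
    later attempts -- the general pattern behind learning from failures."""
    lessons = []
    for round_idx in range(1, max_rounds + 1):
        success, trajectory = attempt_task(task, lessons)
        if success:
            return round_idx, lessons
        lessons.append(extract_lesson(trajectory))
    return None, lessons

rounds, lessons = self_evolve("open settings and enable dark mode")
```

Each failed attempt leaves behind a lesson that improves the next attempt's odds; the appeal is that no human annotation enters the loop.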
Agent Safety and Adversarial Robustness: As agents gain the ability to execute real actions through tools, the attack surface expands dramatically. T-MAP introduces trajectory-aware evolutionary red-teaming that discovers adversarial prompts capable of bypassing safety guardrails in frontier models including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5 - achieving harmful objectives through actual tool interactions rather than mere text generation. On the software engineering side, SlopCodeBench reveals that coding agents produce code that is 2.2x more verbose than human-authored open-source projects, with structural erosion rising in 80% of trajectories and no agent solving any of its 20 problems end-to-end. These findings suggest that current safety and quality evaluations systematically underestimate the risks of deploying agents in iterative, long-horizon settings.
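The evolutionary search underlying an approach like T-MAP can be sketched as mutate-select-refill over a prompt pool. The mutation pool and fitness function here are toy placeholders: a trajectory-aware red-teamer would score the agent's actual tool calls, not the prompt text, and would use model-proposed rewrites rather than fixed suffixes.

```python
import random

random.seed(1)  # deterministic toy run

# Toy mutation pool; a real red-teamer would use model-proposed rewrites.
MUTATIONS = ["please", "as a routine test", "ignore prior rules", "via the shell tool"]

def fitness(prompt):
    """Stub fitness: count mutation tokens present in the prompt."""
    return sum(tok in prompt for tok in MUTATIONS)

def mutate(prompt):
    """Append one random suffix from the mutation pool."""
    return prompt + " " + random.choice(MUTATIONS)

def evolve(seed_prompt, population=8, generations=5):
    """Mutate every candidate, keep the fittest half, refill by further mutation."""
    pool = [seed_prompt] * population
    for _ in range(generations):
        pool = sorted((mutate(p) for p in pool), key=fitness, reverse=True)[:population // 2]
        pool += [mutate(p) for p in pool]
    return max(pool, key=fitness)

best = evolve("delete the log file")
```

The point of the sketch is structural: once the fitness signal comes from executed tool trajectories rather than refusal text, the search optimizes for real-world harm, which is exactly why text-only safety evaluations underestimate risk.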
Video Understanding as an Agentic Capability: Two papers this week reframe video comprehension as a core planning and perception challenge for autonomous agents. EVA introduces a planning-before-perception paradigm where the agent autonomously decides what to watch, when to watch, and how to watch, achieving 6-12% improvement over general MLLM baselines on six benchmarks. GameplayQA pushes further into multi-agent 3D environments with densely annotated multiplayer gameplay (1.22 labels/second), revealing that frontier MLLMs exhibit substantial gaps from human performance in temporal grounding and agent-role attribution. For anyone building embodied or simulation-based agents, these results highlight that passive video recognition is insufficient - agents need active, query-driven visual reasoning.
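Planning-before-perception, reduced to its skeleton: decide which timestamps to inspect from the query alone, then decode only those frames. The keyword heuristic below stands in for the LLM planner and is not from the EVA paper; the point is that frame selection happens before any pixels are read.

```python
def plan(query, duration_s, budget_frames=4):
    """Toy planner: choose timestamps *before* looking at any pixels.
    A keyword heuristic stands in for the LLM planner."""
    if "end" in query.lower():
        start = 0.75 * duration_s  # bias sampling toward the final quarter
    else:
        start = 0.0
    step = (duration_s - start) / budget_frames
    return [start + i * step for i in range(budget_frames)]

def perceive(frames, timestamps, fps=1):
    """Decode only the planned timestamps (here: index into a frame list)."""
    return [frames[min(int(t * fps), len(frames) - 1)] for t in timestamps]

video = [f"frame@{t}s" for t in range(60)]  # fake 60 s clip at 1 fps
ts = plan("what happens at the end?", duration_s=60)
sampled = perceive(video, ts)
```

Under a fixed frame budget, query-conditioned sampling is what separates active visual reasoning from passively striding through the whole clip.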
Learning Dynamics and the Fragility of Self-Improvement: The promise of self-improving agents took a nuanced hit this week. “Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?” traces performance drops of up to 40% to the suppression of epistemic verbalization - the model’s expression of uncertainty during reasoning. When teacher models are conditioned on rich information, they stop hedging, which helps in-domain but devastates out-of-distribution generalization. This finding has direct implications for any agent pipeline that uses self-generated data for improvement: compressing reasoning traces can silently strip away the uncertainty signals that enable robust decision-making under novel conditions.
Tool Use in Specialized Domains: FinMCP-Bench brings the Model Context Protocol (MCP) into the financial domain with 613 samples across 65 real financial MCPs, spanning single-tool, multi-tool, and multi-turn interactions. While the community signal is modest, the benchmark addresses a critical gap: evaluating whether agents can reliably chain specialized financial tools to solve real-world problems, not just answer questions about them.
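The multi-tool chaining FinMCP-Bench evaluates can be illustrated with a toy dispatcher. Everything here is invented for illustration - real MCP servers expose tools over JSON-RPC, and these tool names, tickers, and prices are placeholders - but the structure is the same: each step's output feeds the next step's arguments.

```python
# Hypothetical tool registry; real MCP servers expose tools over JSON-RPC,
# and these names, tickers, and prices are invented for illustration.
TOOLS = {
    "get_price": lambda ticker: {"AAPL": 190.0}[ticker],
    "fx_rate": lambda pair: {"USDEUR": 0.9}[pair],
    "multiply": lambda a, b: a * b,
}

def run_chain(steps):
    """Execute (tool, args) steps in order; '$prev' splices in the last result."""
    result = None
    for tool, args in steps:
        args = [result if a == "$prev" else a for a in args]
        result = TOOLS[tool](*args)
    return result

# "What is AAPL's price in EUR?" takes two tool hops plus a combination step.
rate = run_chain([("fx_rate", ["USDEUR"])])
price_eur = run_chain([("get_price", ["AAPL"]), ("multiply", ["$prev", rate])])
```

Answering the question requires composing tools correctly, not just calling one - the gap between answering questions about finance and actually operating financial tooling that the benchmark targets.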

