AI Agents of the Week: Papers You Should Know About
Get ahead of the curve with LLM Watch
Executive Summary
What if the entire foundation of AI agent “memory” is built on a category error? This week’s research suggests exactly that - and the implications ripple across every paper in this roundup, from scientific discovery benchmarks where top models score under 10%, to safety-critical systems where non-determinism blocks certification. The papers covered here converge on a sobering but productive conclusion: the agent ecosystem is hitting structural walls that no amount of scaling alone can fix, and the solutions demand rethinking evaluation, architecture, and the very meaning of intelligence in autonomous systems.
The Memory Illusion and Its Consequences: The most provocative claim this week comes from “Contextual Agentic Memory is a Memo, Not True Memory” (Xu et al.), which argues that vector stores, RAG pipelines, scratchpads, and expanding context windows do not implement memory at all - they implement lookup. The authors formalize a provable generalization ceiling: no increase in context size or retrieval quality can overcome the inability of similarity-based retrieval to handle compositionally novel tasks. Drawing on Complementary Learning Systems theory from neuroscience, they show that current agents implement only the fast, hippocampal half of biological memory while entirely missing the slow neocortical weight consolidation that produces genuine expertise. For agent builders, this means that systems accumulating notes indefinitely are not learning - they are hoarding, and they are structurally vulnerable to persistent memory poisoning in the process.
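To make the distinction concrete, here is a minimal sketch of the two mechanisms the paper contrasts. The names (VectorStoreMemo, consolidate) and the toy linear model are our own illustration, not the authors’ formalism: the first component can only ever return the nearest stored entry, while the second actually changes parameters in response to experience.

```python
import numpy as np

class VectorStoreMemo:
    """Similarity-based lookup: it can only return what was stored."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, key_vec, value):
        self.keys.append(np.asarray(key_vec, dtype=float))
        self.values.append(value)

    def read(self, query_vec):
        # Cosine similarity against every stored key; the nearest memo
        # wins even when the query is compositionally novel.
        q = np.asarray(query_vec, dtype=float)
        sims = [q @ k / (np.linalg.norm(q) * np.linalg.norm(k) + 1e-9)
                for k in self.keys]
        return self.values[int(np.argmax(sims))]

def consolidate(weights, X, y, lr=0.1, epochs=200):
    """Slow 'neocortical' learning: experience updates the parameters
    themselves, so later queries generalize rather than match."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    for _ in range(epochs):
        grad = X.T @ (X @ weights - y) / len(X)  # mean-squared-error gradient
        weights -= lr * grad
    return weights
```

A query that falls outside the stored keys still gets the nearest memo back, however wrong it is; the consolidated weights, by contrast, can interpolate from structure in the data. That gap is the generalization ceiling the authors formalize.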
Evaluation Gaps Exposed: A recurring thread this week is that existing benchmarks dramatically overestimate agent capabilities. AutoResearchBench (Xiong et al.) demonstrates this starkly: even the most powerful LLMs, which have largely conquered general web-browsing benchmarks like BrowseComp, achieve only 9.39% accuracy on deep scientific literature discovery and 9.31% IoU (intersection-over-union) on wide research tasks, with many strong baselines falling below 5%. Meanwhile, “Visual Generation in the New Era” (Wu et al.) argues that visual generation evaluations overestimate progress by rewarding perceptual quality while ignoring structural, temporal, and causal failures. Together with ClawGym’s new 200-instance benchmark for multi-step local workflows, these papers make a compelling case that the field needs harder, more honest yardsticks.
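For readers unfamiliar with the metric, IoU here plausibly means the overlap between the papers an agent surfaces and a gold reference set, though AutoResearchBench’s exact scoring protocol may differ from this reading:

```python
def set_iou(retrieved, reference):
    """Intersection-over-union of two item sets, e.g. paper IDs."""
    r, g = set(retrieved), set(reference)
    if not r and not g:
        return 1.0  # both empty: perfect agreement by convention
    return len(r & g) / len(r | g)

# Toy example: an agent surfaces 8 papers and 3 overlap a 10-paper gold set.
# IoU = 3 / (8 + 10 - 3) = 0.2, so scores near 9.31% imply very sparse overlap.
print(set_iou({f"p{i}" for i in range(8)},
              {f"p{i}" for i in range(3)} | {f"g{i}" for i in range(7)}))
```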
Beyond Language as Universal Interface: Two papers this week tackle the limits of text-centric agent design from complementary angles. GLM-5V-Turbo (V Team, Hong, Gu et al.) builds multimodal perception natively into a foundation model’s reasoning, planning, and tool-use pipeline, explicitly rejecting the pattern of bolting vision onto a language model as an afterthought. Heterogeneous Scientific Foundation Model Collaboration (Li, Zou, Fang et al.) takes a different architectural bet: rather than a single integrated model, the Eywa framework uses a language model as a reasoning coordinator that orchestrates domain-specific scientific foundation models over non-linguistic data. Both approaches acknowledge that language alone cannot serve as a universal interface for real-world autonomy, but they disagree sharply on the solution - a tension worth watching as the field matures.
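To illustrate the architectural difference, here is a heavily simplified sketch of the coordinator pattern that Eywa-style systems imply. Every name below (SpecialistModel, ReasoningCoordinator, plan_steps) is hypothetical, and the fixed toy plan stands in for what would actually be an LLM planning call:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class SpecialistModel:
    name: str
    run: Callable[[Any], Any]  # operates on non-linguistic data directly

class ReasoningCoordinator:
    """LLM-as-planner: decompose a goal in language, route each step to a
    domain-specific foundation model, then reason over the results."""
    def __init__(self, specialists: Dict[str, SpecialistModel]):
        self.specialists = specialists  # keyed by data modality

    def plan_steps(self, goal: str) -> List[Tuple[str, Any]]:
        # Stand-in for an LLM call that decomposes the goal into
        # (modality, payload) steps; a real system would generate this.
        return [("molecule", {"smiles": "CCO"}),
                ("protein", {"sequence": "MKWVTFISLL"})]

    def execute(self, goal: str) -> List[Tuple[str, Any]]:
        results = []
        for modality, payload in self.plan_steps(goal):
            model = self.specialists[modality]  # route by modality
            results.append((model.name, model.run(payload)))
        return results  # fed back to the language model as evidence
```

The contrast with GLM-5V-Turbo’s integrated design is then easy to state: one bet keeps perception and reasoning in a single set of weights, the other trades tight cross-modal grounding for the modularity of swapping specialists in and out.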

