LLM Watch

The Week in AI Agents

AI Agents of the Week: Papers You Should Know About

Get ahead of the curve with LLM Watch

Mar 08, 2026

Executive Summary

Memory & Continual Learning Gains: This week brings significant advances in how agents manage knowledge across extended interactions. Memex(RL) introduces an indexed experience memory mechanism that addresses the fundamental context window bottleneck in long-horizon tasks - rather than lossy summarization, it maintains compact indices while storing full-fidelity interactions in an external database, allowing agents to recover exact past evidence on demand. Meanwhile, SkillNet tackles the persistent problem of agents “reinventing the wheel” by providing infrastructure for creating, evaluating, and organizing over 200,000 reusable skills, improving average rewards by 40% and reducing execution steps by 30% across multiple benchmarks. These complementary approaches - one preserving episodic memory, the other accumulating procedural knowledge - represent meaningful progress toward agents that learn cumulatively rather than forgetting everything between sessions.
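The indexed-memory idea above can be sketched in a few lines. This is a hypothetical illustration in the spirit of Memex(RL), not the paper's implementation: the agent's context holds only compact index entries, while full-fidelity interactions live in an external store (here a plain dict standing in for a database) and are fetched verbatim on demand. All class and method names are assumptions.

```python
# Toy sketch of an indexed experience memory: compact index in context,
# full-fidelity records in an external store, exact recall on demand.
from dataclasses import dataclass, field

@dataclass
class ExperienceMemory:
    index: list = field(default_factory=list)   # compact summaries kept in the agent's context
    store: dict = field(default_factory=dict)   # full interactions; stand-in for an external DB

    def write(self, summary: str, full_interaction: str) -> int:
        """Record a compact index entry plus the full-fidelity interaction."""
        entry_id = len(self.index)
        self.index.append((entry_id, summary))
        self.store[entry_id] = full_interaction
        return entry_id

    def recall(self, query: str) -> list[str]:
        """Match the query against the compact index, then fetch exact evidence."""
        hits = [eid for eid, summary in self.index if query.lower() in summary.lower()]
        return [self.store[eid] for eid in hits]

mem = ExperienceMemory()
mem.write("user prefers metric units", "Turn 12: user said 'always use km, not miles'")
mem.write("API key rotated", "Turn 47: ops confirmed the key was rotated at 09:00 UTC")
print(mem.recall("metric"))  # exact past evidence, not a lossy summary
```

The key design point is that only `index` competes for context-window space; `store` can grow without bound because it is consulted selectively.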

Advances in Planning & Environment Interaction: Long-horizon planning with hard constraints remains one of the most challenging problems for autonomous agents, and this week’s research offers concrete solutions. HiMAP-Travel proposes a hierarchical multi-agent framework that splits planning into strategic coordination and parallel day-level execution, achieving a 52.78% validation pass rate on TravelPlanner - an improvement of +8.67 percentage points over sequential baselines while reducing latency 2.5x through parallelization. The framework’s transactional monitor and bargaining protocol demonstrate how architectural choices can prevent the constraint drift that plagues sequential planners on complex tasks. Separately, T2S-Bench reveals that explicit text structuring through its Structure of Thought prompting technique yields a +5.7% average improvement across eight text-processing tasks, with fine-tuning pushing gains to +8.6% - suggesting that how agents organize information internally matters as much as what information they access.
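The strategic/tactical split can be made concrete with a small sketch. This is loosely modeled on the hierarchical decomposition described above and is entirely illustrative: a coordinator fixes the day-to-city skeleton, then day-level planners expand each day concurrently instead of sequentially. The `plan_day` function stands in for an LLM call; none of these names come from the paper.

```python
# Hierarchical planning sketch: strategic coordination up front,
# day-level expansion in parallel.
from concurrent.futures import ThreadPoolExecutor

def plan_day(day: int, city: str) -> dict:
    """Stand-in for an LLM call that fills in one day's itinerary."""
    return {"day": day, "city": city, "activities": [f"visit landmark in {city}"]}

def plan_trip(cities: list[str]) -> list[dict]:
    # Strategic level: assign one city per day (the coordination decision).
    assignments = list(enumerate(cities, start=1))
    # Tactical level: expand each day concurrently rather than one after another.
    with ThreadPoolExecutor() as pool:
        days = list(pool.map(lambda a: plan_day(*a), assignments))
    return sorted(days, key=lambda d: d["day"])

itinerary = plan_trip(["Kyoto", "Osaka", "Nara"])
```

Because day plans only depend on the coordinator's assignments, the parallel expansion cannot introduce cross-day constraint drift - which is the architectural property the latency reduction relies on.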

Multi-Agent Collaboration & Control: The question of how heterogeneous agents can learn from each other without coordinated deployment receives a compelling answer in HACRL, which enables bidirectional mutual learning through verified rollout sharing during training. Its HACPO algorithm outperforms GSPO by an average of 3.3% while using only half the rollout cost - a significant efficiency gain for multi-agent systems. In a different collaborative context, Vivaldi presents a role-structured multi-agent system for interpreting physiological time series, revealing nuanced findings: agentic pipelines improve explanation quality for non-thinking models (+6.9 and +9.7 points on justification and relevance) but can degrade performance for thinking models (a 14-point drop in relevance). This context-dependent picture challenges the assumption that agentic reasoning uniformly improves outcomes.
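The verified-rollout-sharing idea admits a minimal sketch. This is an assumption-laden toy, not HACRL's method: two agents exchange rollouts bidirectionally, and each side admits a peer's rollout into its own training buffer only if a verifier accepts it. The verifier here is a trivial answer check; in practice it would be whatever correctness signal the training setup provides.

```python
# Toy sketch of bidirectional verified rollout sharing between two agents.
def verifier(rollout: dict) -> bool:
    """Accept a rollout only if its final answer checks out."""
    return rollout["answer"] == rollout["ground_truth"]

def share_rollouts(agent_a: list, agent_b: list) -> tuple[list, list]:
    """Each side keeps only the peer's verified rollouts for its own buffer."""
    a_buffer = [r for r in agent_b if verifier(r)]
    b_buffer = [r for r in agent_a if verifier(r)]
    return a_buffer, b_buffer

rollouts_a = [{"answer": 4, "ground_truth": 4}, {"answer": 5, "ground_truth": 7}]
rollouts_b = [{"answer": 9, "ground_truth": 9}]
a_extra, b_extra = share_rollouts(rollouts_a, rollouts_b)
# A gains B's verified rollout; B gains only A's correct one.
```

The efficiency intuition is that every verified rollout trains two policies instead of one, which is consistent with the halved rollout cost reported above.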

Trust, Verification & Safety: Evaluation and reliability emerge as critical themes across this week’s research. AgentVista introduces an ultra-challenging benchmark spanning 25 sub-domains where even the best model (Gemini-3-Pro with tools) achieves only 27.3% overall accuracy, with hard instances requiring more than 25 tool-calling turns. This sobering result highlights how far current agents remain from reliable real-world deployment. The Vivaldi study reinforces the importance of context-aware design, finding that explicit tool-based computation is decisive for codifiable clinical metrics while subjective targets show limited improvement - suggesting that the value of agentic AI lies in selective externalization of computation rather than maximal reasoning complexity.

Tools & Frameworks in Practice: Practical infrastructure for agent development receives substantial attention this week. DARE addresses the underutilization of R’s statistical ecosystem by LLM agents through distribution-aware retrieval, achieving 93.47% NDCG@10 - outperforming state-of-the-art embedding models by up to 17% with substantially fewer parameters. Their RCodingAgent demonstrates significant gains on downstream analysis tasks when integrated with DARE. SkillNet’s release of an interactive platform and Python toolkit alongside their 200,000-skill repository provides immediately usable infrastructure for agent developers. Together with Memex(RL)’s reinforcement learning framework for optimizing memory operations, these contributions offer concrete tools rather than just conceptual advances.
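DARE's headline number is NDCG@10, and since the metric itself is standard, a self-contained reference implementation may help readers interpret it. The sample relevance judgments below are made up for illustration; only the formula (discounted cumulative gain normalized by the ideal ordering) is standard.

```python
# Reference implementation of NDCG@k, the retrieval metric cited for DARE.
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k: discounted cumulative gain over the top-k ranked results,
    normalized by the gain of an ideally ordered ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of retrieved items, in the order the system ranked them.
ranking = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(ranking, k=10), 4))
```

A perfect ranking scores 1.0, so DARE's 93.47% means its rankings are, on average, very close to the ideal ordering of R-package documentation for each query.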

© 2026 Pascal Biese