LLM Watch

The Week in AI Agents

AI Agents of the Week: Papers You Should Know About

Get ahead of the curve with LLM Watch

Feb 22, 2026

Executive Summary

Memory & Continual Learning Gains: This week’s research demonstrates significant advances in how agents maintain coherent behavior over extended interactions. IntentCUA introduces intent-level representations that abstract raw interaction traces into reusable skills, achieving a 74.83% task success rate with a Step Efficiency Ratio of 0.91 on desktop automation tasks.
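The core idea of abstracting raw interaction traces into reusable, intent-level skills can be illustrated with a toy sketch. This is not IntentCUA's actual implementation; the `Skill` and `SkillMemory` names and the trace format are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """A reusable intent-level skill distilled from raw interaction traces."""
    intent: str
    steps: list   # abstracted action sequence, e.g. ["open_app", "type_query"]
    uses: int = 0


class SkillMemory:
    """Toy skill library: collapse low-level event traces into named intents
    so later episodes can reuse the abstracted plan instead of replaying raw events."""

    def __init__(self):
        self.skills: dict = {}

    def abstract(self, intent: str, trace: list) -> Skill:
        # Keep only the action type of each raw event (clicks, keystrokes, ...),
        # discarding coordinates and timing that don't generalize across episodes.
        steps = [event["action"] for event in trace]
        return self.skills.setdefault(intent, Skill(intent, steps))

    def recall(self, intent: str):
        skill = self.skills.get(intent)
        if skill:
            skill.uses += 1  # track reuse frequency for later pruning/ranking
        return skill
```

The payoff of this pattern is the Step Efficiency Ratio reported above: replaying an abstracted skill takes fewer, more deliberate steps than re-deriving the behavior from raw traces each time.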

Advances in Planning & Environment Interaction: Planning under uncertainty received substantial attention, with two papers addressing how agents navigate complex, dynamic environments. IntentCUA coordinates a Planner, Plan-Optimizer, and Critic over shared memory to stabilize long-horizon execution, while AgentConductor introduces reinforcement learning-optimized topology evolution for multi-agent code generation, achieving up to 14.6% improvement in pass@1 accuracy over baselines. The latter's density-aware layered DAG construction reduces token costs by 68% while improving performance, a notable efficiency gain for compute-constrained deployments.
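To make "density-aware layered DAG construction" concrete, here is a minimal sketch of the general idea: agents are partitioned into layers, messages flow only forward, and a density parameter prunes edges to cap communication (and thus token) cost. This is an illustrative reconstruction of the concept, not AgentConductor's actual algorithm; the function name and round-robin partitioning are assumptions.

```python
import itertools


def layered_dag(agents, layers, density):
    """Build a layered DAG over agents.

    Nodes in layer i may only message layer i+1; `density` in (0, 1]
    keeps that fraction of the possible cross-layer edges, trading
    coordination richness for lower token cost.
    """
    # Round-robin partition of agents into layers (toy choice).
    buckets = [agents[i::layers] for i in range(layers)]

    edges = []
    for src, dst in zip(buckets, buckets[1:]):
        all_pairs = list(itertools.product(src, dst))
        keep = max(1, int(len(all_pairs) * density))
        edges.extend(all_pairs[:keep])  # deterministic pruning for the demo
    return buckets, edges
```

In the paper's framing, an RL policy would adapt `density` (and the layering itself) to task difficulty rather than fixing it up front, which is how a 13% density reduction can coexist with accuracy gains.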

Multi-Agent Collaboration & Control: The coordination of multiple specialized agents emerged as a key theme. AgentConductor demonstrates that dynamically adapting interaction topologies to task difficulty outperforms fixed communication graphs, with density reductions of 13% alongside accuracy improvements. AutoNumerics applies multi-agent orchestration to scientific computing, autonomously designing and verifying PDE solvers across 24 canonical problems. These systems highlight that the architecture of agent collaboration, not just individual agent capability, determines system-level performance.

Trust, Verification & Safety: Ensuring reliable agent behavior under real-world conditions featured prominently this week. Wink presents a production-deployed system for recovering from coding agent misbehaviors, finding that Specification Drift, Reasoning Problems, and Tool Call Failures occur in approximately 30% of all agent trajectories. Their lightweight self-intervention system resolves 90% of single-intervention misbehaviors and achieved statistically significant reductions in engineer interventions in live A/B testing. CowCorpus contributes a taxonomy of human intervention patterns, enabling models to predict when users will intervene, improving over baselines by 61.4-63.4%.

Tools & Frameworks in Practice: "How AI Coding Agents Communicate" analyzes pull request characteristics across five AI coding agents, revealing that presentation style correlates with reviewer engagement and merge outcomes.

© 2026 Pascal Biese