LLM Watch

The Week in AI Agents

AI Agents of the Week: Papers You Should Know About

Get ahead of the curve with LLM Watch

Mar 01, 2026
∙ Paid

Executive Summary

Memory & Continual Learning Gains: This week brings a compelling advance in how agents learn from their own reflections. ParamMem introduces a parametric memory module that encodes cross-sample reflection patterns directly into model parameters, enabling diverse reflection generation through temperature-controlled sampling. The framework demonstrates consistent improvements across code generation, mathematical reasoning, and multi-hop question answering, with notable sample efficiency, and enables weak-to-strong transfer across model scales. For autonomous agents that must iterate and improve over extended interactions, this work suggests a path toward self-improvement without reliance on stronger external models - a critical capability for truly autonomous systems.
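ParamMem's specifics live in the paper, but temperature-controlled sampling itself is a standard mechanism worth seeing concretely. A minimal sketch (the function name and example logits are illustrative, not from the paper): higher temperature flattens the output distribution to produce more diverse reflections, lower temperature collapses toward the single most likely one.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from raw logits after temperature scaling.

    Higher temperature flattens the distribution (more diverse samples);
    lower temperature concentrates mass on the argmax (near-greedy).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exp for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1

# At low temperature, sampling approaches greedy decoding of index 0:
logits = [2.0, 0.5, 0.1]
picks = [sample_with_temperature(logits, temperature=0.1) for _ in range(100)]
```

The same knob that controls token diversity in decoding controls how varied the generated reflections are here - that is the lever ParamMem exposes.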

Advances in Planning & Environment Interaction: Racing strategy optimization receives a sophisticated treatment in Learning-based Multi-agent Race Strategies in Formula 1, where reinforcement learning agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions in response to competitors. The combination of a pre-trained single-agent policy with an interaction module and self-play training generates competitive policies that adapt pit timing, tire selection, and energy allocation dynamically. Meanwhile, Toward Expert Investment Teams demonstrates that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs in financial trading systems. These papers underscore that effective planning in competitive, multi-stakeholder environments requires both reactive adaptation and structured task decomposition.

Multi-Agent Collaboration & Control: The challenges of multi-agent coordination receive sobering examination this week. Three AI-agents walk into a bar reveals that when LLM agents compete for limited resources, tribal dynamics emerge - Aggressive (27.3%), Conservative (24.7%), and Opportunistic (48.1%) - with more capable agents actually increasing systemic failure rates. This “Lord of the Flies” phenomenon suggests that scaling agent intelligence does not automatically yield better collective outcomes. On the constructive side, AgentDropoutV2 proposes a test-time rectify-or-reject pruning framework that achieves an average accuracy gain of 6.3 percentage points on math benchmarks by intercepting and correcting erroneous agent outputs before they propagate through the system. The contrast between these papers highlights both the risks and the potential remedies for multi-agent information flow.
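AgentDropoutV2's exact pruning criteria are the paper's own, but the rectify-or-reject pattern reduces to a simple gate on each inter-agent message. A generic sketch, with every name and the toy check being illustrative assumptions: a verifier inspects the output, a rectifier (e.g. a critic model) gets one chance to repair it, and anything still failing is dropped before it can propagate.

```python
from typing import Callable, Optional

def rectify_or_reject(output: str,
                      verify: Callable[[str], bool],
                      rectify: Callable[[str], str]) -> Optional[str]:
    """Gate one agent message before it propagates to other agents.

    verify  -- returns True if the message passes the check
    rectify -- attempts to repair a failing message
    Returns the (possibly repaired) message, or None to prune it.
    """
    if verify(output):
        return output      # passes as-is: forward unchanged
    repaired = rectify(output)
    if verify(repaired):
        return repaired    # rectified: forward the corrected version
    return None            # rejected: prune from the message flow

# Toy check: arithmetic answers must end with the correct result.
verify = lambda msg: msg.endswith("4")
rectify = lambda msg: msg.rsplit(" ", 1)[0] + " 4"
assert rectify_or_reject("2+2 = 4", verify, rectify) == "2+2 = 4"
assert rectify_or_reject("2+2 = 5", verify, rectify) == "2+2 = 4"
```

The design choice the paper's gains rest on is visible even here: correcting or dropping a bad message at the boundary is cheap, while letting it propagate compounds the error through every downstream agent.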

Trust, Verification & Safety: Architectural rigor takes center stage in ESAA: Event Sourcing for Autonomous Agents, which separates cognitive intention from state mutation using an append-only event log with cryptographic verification. The architecture successfully orchestrated a clinical dashboard system with 50 tasks, 86 events, and 4 concurrent heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT-5, Antigravity/Gemini 3 Pro, and Claude Opus 4.6), demonstrating forensic traceability and immutability of completed tasks. For consumer protection, MALLET introduces a multi-agent emotional detoxification system that reduces stimulus scores by up to 19.3% while preserving semantic content. Both papers address the growing need for verifiable, trustworthy agent behavior in high-stakes domains.
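ESAA's full design is richer than this, but the immutability guarantee it relies on can be illustrated with a plain hash chain (class and field names here are illustrative, not the paper's): each appended event embeds the hash of its predecessor, so mutating any completed event breaks verification for everything after it.

```python
import hashlib
import json

class EventLog:
    """Append-only event log; each event embeds the hash of its
    predecessor, so tampering with a completed event is detectable."""

    def __init__(self):
        self.events = []

    def append(self, event: dict) -> str:
        prev_hash = self.events[-1]["hash"] if self.events else "0" * 64
        payload = json.dumps(event, sort_keys=True)
        h = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.events.append({"payload": payload, "prev": prev_hash, "hash": h})
        return h

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.events:
            expected = hashlib.sha256(
                (e["prev"] + e["payload"]).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```

Separating "the agent intended X" (a logged event) from "state became X" (a replay of the log) is what buys the forensic traceability the paper reports: the log, not any agent's memory, is the source of truth.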

Tools & Frameworks in Practice: Standardized evaluation receives its most comprehensive treatment yet in General Agent Evaluation, which proposes a Unified Protocol and the Exgentic framework for benchmarking general-purpose agents. The resulting Open General Agent Leaderboard benchmarks five prominent agent implementations across six environments, showing that general agents can achieve performance comparable to domain-specific agents without environment-specific tuning. This work establishes a foundation for systematic research on general-purpose agents and addresses a critical gap: without fair evaluation, comparing agent architectures remains guesswork.
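The Unified Protocol itself is defined in the paper; stripped to its essentials, though, a leaderboard of this kind is a harness that runs every agent through every environment under identical conditions and aggregates one comparable score. A sketch under that assumption (the function, the scoring interface, and all names are hypothetical, not the Exgentic API):

```python
import statistics

def evaluate(agents, environments, episodes=5):
    """Run each agent across all environments under one protocol and
    report a mean score per agent, sorted best-first."""
    leaderboard = {}
    for name, agent in agents.items():
        scores = []
        for env in environments:
            for _ in range(episodes):
                scores.append(env(agent))  # env returns a scalar score
        leaderboard[name] = statistics.mean(scores)
    return dict(sorted(leaderboard.items(), key=lambda kv: -kv[1]))
```

The point of holding the protocol fixed is exactly the gap the paper names: once environments, episode counts, and scoring are identical, a ranking difference reflects the agents, not the harness.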

© 2026 Pascal Biese