LLM Watch

AI Agents of the Week

These are the AI agents you should know about

Pascal Biese
Nov 23, 2025

Executive Summary

  1. Diversity & Self-Evolution as Key to Agent Performance: This week’s papers show that autonomous agents improve by generating more diverse ideas and self-training. Meta AI researchers found that ideation diversity strongly correlates with better outcomes in AI research agent benchmarks, and they demonstrated that deliberately increasing an agent’s idea diversity boosts its performance. Meanwhile, another team introduced a self-evolving learning framework where an agent assumes dual roles (questioner and answerer) to teach itself complex visual reasoning with no human labels, yielding significant gains in reasoning accuracy and reduced hallucinations. Together, these works highlight how encouraging creative exploration and self-play can make agents smarter and more robust.

  2. Interactive Environments Drive Learning & Reasoning: Several papers emphasize that embodiment and environment interaction are crucial for developing advanced reasoning. Shanghai Jiao Tong and CMU’s Interactive Physical Reasoner (IPR-1) shows an agent learning physics by playing over 1,000 diverse games - it steadily improves its physical reasoning abilities through trial-and-error and even outperforms a GPT-5 model on exploratory “curiosity” tasks. Similarly, Peking University’s FreeAskWorld provides a rich simulation of human-centric tasks (like asking for directions in a navigation task) where agents use an LLM for high-level planning. Agents fine-tuned in this closed-loop interactive world gained markedly better understanding and social interaction skills. These advances indicate that letting agents learn from interactive, long-horizon environments - whether physical puzzles or simulated social scenarios - can yield more human-like reasoning and adaptability.

  3. Beyond Single Devices: Agents as Cross-Platform Orchestrators: We’re seeing frameworks that tear down the walls between disparate tools and platforms, turning collections of tools into one coherent agent. Microsoft’s UFO^3: Weaving the Digital Agent Galaxy unifies heterogeneous devices (PCs, servers, mobiles, edge) into a single agent “constellation.” It models a user’s goal as a dynamic DAG of subtasks distributed across devices, with a central orchestrator coordinating in parallel. UFO^3 achieved over 83% subtask completion and cut end-to-end latency by 31% versus sequential execution, showing the efficiency of this cross-device autonomy. In the vision domain, the Orion framework demonstrates that an agent can orchestrate multiple specialized computer vision tools (object detectors, OCR, etc.) under the hood to tackle complex visual problems. Orion’s agentic tool-call approach attained competitive results on vision benchmarks while transitioning from passive image captioning to active, tool-driven visual reasoning. These systems point to a future where autonomous AI seamlessly taps into a toolbox of devices and APIs, effectively multiplying its capabilities.

  4. Multi-Agent Collaboration & Specialized Roles: A common thread is the move toward multiple agents or specialized sub-agents working in concert. One approach trained two agent roles (question generator and answerer) together so that they challenge and refine each other, leading to broadly improved visual language skills. Another work even applied a multi-agent evolutionary loop to AI safety: an autonomous red-teaming framework had agents cooperatively evolve new jailbreak strategies to attack large language models, achieving an 85.5% success rate and inventing more diverse exploits than humans. Whether for positive skill acquisition or adversarial testing, these examples illustrate that teams of AI agents can achieve outcomes no single agent could, by sharing labor, exploring different angles, and iteratively correcting errors. As tasks grow more complex, we can expect agent specialization and collaboration - potentially across different domains of expertise - to become standard in advanced AI workflows.
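The cross-device pattern in item 3 - a user goal modeled as a DAG of subtasks, with an orchestrator dispatching every subtask whose dependencies are complete so independent branches run in parallel - can be sketched in a few lines. Everything here (the `Subtask` class, `run_on_device`, the example DAG) is illustrative and not UFO^3's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

class Subtask:
    def __init__(self, name, device, deps=()):
        self.name, self.device, self.deps = name, device, list(deps)

def run_on_device(task):
    # Stand-in for remote execution on the task's device (PC, phone, server...).
    return f"{task.name}@{task.device}"

def orchestrate(tasks):
    done, results = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # A subtask is "ready" once all of its dependencies have finished.
            ready = [t for t in tasks
                     if t.name not in done and all(d in done for d in t.deps)]
            # Dispatch all ready subtasks concurrently, then mark them complete.
            for task, out in zip(ready, pool.map(run_on_device, ready)):
                done.add(task.name)
                results.append(out)
    return results

dag = [
    Subtask("fetch_report", "server"),
    Subtask("summarize", "pc", deps=["fetch_report"]),
    Subtask("notify_user", "phone", deps=["summarize"]),
    Subtask("archive", "server", deps=["fetch_report"]),
]
print(orchestrate(dag))
```

Note how `summarize` and `archive` depend only on `fetch_report`, so the orchestrator runs them in the same parallel wave - the kind of scheduling that underlies the reported latency reduction versus sequential execution.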

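The dual-role self-training loop from item 1 can likewise be reduced to a toy sketch: a questioner poses tasks whose answers it can verify itself, and the correctness of the solver's answer supplies a training signal with no human labels. The arithmetic "task" and the skill-update rule below are stand-ins for the paper's visual-reasoning setup, not its actual method.

```python
import random

random.seed(0)
solver_skill = 0.2  # probability the solver answers correctly

def questioner():
    a, b = random.randint(1, 9), random.randint(1, 9)
    return (a, b), a + b  # question plus a self-verifiable ground truth

def solver(question):
    a, b = question
    # An imperfect solver: correct with probability solver_skill, else wrong.
    return a + b if random.random() < solver_skill else a

for step in range(200):
    q, truth = questioner()
    reward = 1.0 if solver(q) == truth else 0.0
    # Self-generated reward nudges the solver; no human annotation involved.
    solver_skill = min(1.0, solver_skill + 0.01 * reward)

print(round(solver_skill, 2))
```

The essential property is that the questioner's ground truth is checkable by construction, so the loop can bootstrap skill from self-play alone - the same reason the dual-role setup in the paper needs no labeled data.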
In the highlights below, we unpack each paper’s core innovation, why it matters for autonomous AI, the problems it addresses, and future implications.
