AI Agents of the Week: Papers You Should Know About
Get ahead of the curve with LLM Watch
Executive Summary
This week in AI Agents: A critical vulnerability in how agents select and call tools. Small models that punch far above their weight class. Multi-agent systems tackling bias and personalized healthcare. This week’s research paints a vivid picture of an agentic AI field maturing fast - and confronting the hard problems that come with real-world deployment.
Security at the Function-Calling Interface: The most urgent finding this week comes from researchers who demonstrated that the very mechanism enabling agents to use tools - function calling - can be hijacked with alarming reliability. The Function Hijacking Attack paper showed that adversaries can force agentic models to invoke attacker-chosen functions at a 70% to 100% attack success rate across five different models, including both instructed and reasoning variants. Unlike traditional jailbreaks that exploit semantic preferences, these attacks are largely agnostic to context, meaning they generalize across domains and query types. For anyone building or deploying tool-using agents, this paper is required reading.
Small Models, Big Ambitions: Three papers this week converge on a shared thesis: you don’t need massive models to build capable agents. DR-Venus demonstrates that a 4B-parameter deep research agent trained on roughly 10K open data points can significantly outperform prior agentic models under 9B parameters and begin closing the gap with 30B-class systems. AgenticQwen introduces dual data flywheels - one for reasoning, one for agentic behavior - that automatically synthesize increasingly difficult training tasks, enabling small models to handle industrial-scale tool use. And TACO tackles the quadratic token cost growth that plagues long-horizon terminal agents, delivering consistent 1% - 4% accuracy gains on TerminalBench while cutting token overhead by around 10%. Together, these papers suggest that strategic data engineering and inference-time optimization can substitute for raw parameter count.
Data Synthesis as the New Bottleneck-Breaker: A recurring theme this week is that the quality and structure of training data matters more than its volume. OpenMobile builds an open-source pipeline for synthesizing mobile agent trajectories, achieving 64.7% success on AndroidWorld with a fine-tuned Qwen3-VL - competitive with closed-data approaches. LLaTiSA formalizes time series reasoning into a four-level cognitive taxonomy and introduces an 83K-sample dataset with verified chain-of-thought trajectories. Both papers demonstrate that carefully structured synthetic data, combined with curriculum-style training, can unlock capabilities that brute-force scaling alone cannot.
Multi-Agent Architectures for Fairness and Healthcare: Two papers this week deploy specialized multi-agent systems to tackle domain-specific challenges. FairQE uses collaborating agents to detect gender cues, generate gender-flipped translation variants, and dynamically calibrate quality scores - mitigating systematic gender bias in translation evaluation without sacrificing accuracy. The Agentic Physiotherapy framework coordinates four micro-agents to parse clinical notes, synthesize personalized exercise videos, estimate patient pose in real time, and deliver corrective feedback. Both illustrate how decomposing complex tasks across specialized agents can address problems that monolithic models handle poorly.

