Topic

AI Agents

Agentic systems, multi-agent coordination, and task planning.

37 papers · latest 2026-04-14

Yijuan Liang, Xinghao Chen, Yifan Ge et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

A unified 22k-tool, 390k-example tool-use stack that standardizes data and evaluation and lets an 8B model beat major commercial models on hard, distractor-heavy tool calling.

Haoran Ding, Zhaoguo Wang, Haibo Chen

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.SE · cs.AI

This brings Hoare-style reasoning to 143k-line systems by inferring specs from caller intent, surfacing 522 new bugs in already-tested codebases.

Xiaomeng Hu, Yinger Zhang, Fei Huang et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents · World Models
cs.CL

OccuBench is a 100-scenario benchmark for professional agents across 65 domains that also injects hidden environment faults, exposing how brittle frontier models still are in real work settings.

Jinhua Wang, Biswa Sengupta

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.SE · cs.AI

This benchmark-driven translation of a production AI coding agent from Rust to Python shows how LLMs can migrate large systems continuously while staying competitive on real agent benchmarks.

CocoaBench Team, Shibo Hao, Zhining Zhang et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI

CocoaBench is a strong reality check for unified digital agents, with long-horizon tasks that force systems to combine vision, search, and coding in one workflow.

Lei Xiong, Huaying Yuan, Zheng Liu et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

PaperScope evaluates agentic deep research across multiple scientific papers, tables, and figures, exposing how hard real multi-document synthesis still is.

S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.

cs.AI

By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.
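The role-reuse idea can be sketched generically: one model endpoint answers under three different role prompts, with the reviewer seeing only the proposed action so its judgment stays isolated from the agent's context. This is a minimal illustration under assumed prompts and control flow, not the paper's actual scaffold; `call_model` is a stub standing in for a single small-model endpoint.

```python
def call_model(role_prompt: str, content: str) -> str:
    """Stand-in for one small local model serving every role.
    A real scaffold would send role_prompt + content to the same endpoint."""
    return f"[{role_prompt[:12]}] {content[:40]}"

ROLES = {
    "agent":      "You act in the environment. Propose the next action.",
    "reviewer":   "You only review code. Approve or reject the action.",
    "summarizer": "You compress the interaction history into a brief state.",
}

def step(history: list[str], observation: str) -> str:
    # One model, three roles: summarize history, act, then review in isolation.
    summary = call_model(ROLES["summarizer"], " | ".join(history))
    action = call_model(ROLES["agent"], f"{summary}\nObs: {observation}")
    # The reviewer sees only the proposed action, not the full context.
    verdict = call_model(ROLES["reviewer"], action)
    return action if "reject" not in verdict.lower() else "noop"
```

The isolation matters: because the reviewer role never sees the agent's reasoning, it cannot be anchored by it, which is the property the scaffold relies on.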

Ningyan Zhu, Huacan Wang, Jie Zhou et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

SemaClaw frames harness engineering as the real differentiator for personal AI agents, focusing on the infrastructure layer that turns raw models into auditable systems.

Xiaozhe Li, Tianyi Lyu, Yizhao Yang et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

A small RL-trained ContextCurator learns to trim noisy history while preserving reasoning anchors, boosting long-horizon agents and slashing token use up to 8x.

Xing Zhang, Guanghui Wang, Yanwei Cui et al.

significant · 🟢 Beginner · Reasoning & Agents · AI Agents
cs.AI · cs.CL

A rare large-scale study of CLAUDE.md-style rules finds that negative constraints help coding agents while many positive instructions quietly hurt them.

Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Hodoscope uses unsupervised behavior monitoring to surface novel agent exploits and cut review effort by 6x to 23x, making it a practical safety layer for red teams and benchmark maintainers.

Dhruv Atreja, Julia White, Nikhil Nayak et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI · cs.CL · cs.LG

Pioneer Agent turns small-model adaptation into an automated closed loop that diagnoses failures, curates new data, retrains under regression constraints, and materially improves production-style tasks.

Kaiyang Qian, Xinmin Fang, Zhengxiong Li

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.MA · cs.AI

MPAC proposes a real coordination protocol for multi-owner agent systems, adding structured conflict handling and governance so agents can safely share state instead of silently clobbering each other.

Tiantian He, Yihang Chen, Keyue Jiang et al.

significant · 🔴 Advanced · Reasoning & Agents · Tool Use · AI Agents
cs.AI

EE-MCP shows how MCP-plus-GUI agents can self-improve by generating environments, synthesizing gap tasks, and accumulating reusable experience, with clear gains across desktop apps.

Siyuan Xu, Shiyang Li, Xin Liu et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

COVERT turns synthetic tool-use data into reward-checkable RL environments, making it much easier to harden agent tool calling against ambiguity, distractor tools, and noisy outputs.

Yucheng Shen, Jiulong Wu, Jizhou Huang et al.

significant · 🔴 Advanced · Reasoning & Agents · RAG · AI Agents
cs.CV · cs.AI

VISOR pushes visual RAG toward real agent behavior with iterative search, evidence-space tracking, and drift control for long-horizon multimodal question answering over documents.

Mohamed Elfeki, Tu Trinh, Kelvin Luu et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

HiL-Bench measures whether agents know when to ask for missing information, exposing a major reliability gap that standard pass/fail coding benchmarks mostly hide.

Yushi Feng, Junye Du, Qifan Wang et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.LG · cs.AI

CORA adds conformal risk control to mobile GUI agents so teams can set explicit harm budgets and abstain before risky clicks instead of trusting heuristic guardrails.
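The abstention idea can be sketched simply, assuming a scalar risk score per candidate action: calibrate a threshold on held-out actions so the empirical harm rate among accepted actions stays within the budget, then abstain above it at runtime. This is a simplified illustration; it omits the finite-sample correction that real conformal risk control uses and is not CORA's exact procedure.

```python
import math

def calibrate_threshold(cal_scores, cal_harm, budget=0.05):
    """Pick the loosest score threshold whose empirical harm rate on the
    calibration set stays within budget.
    cal_scores: risk scores for calibration actions (higher = riskier)
    cal_harm:   1 if the calibration action actually caused harm, else 0
    """
    paired = sorted(zip(cal_scores, cal_harm))
    best = -math.inf  # if no threshold is safe, abstain on everything
    harms = 0
    for i, (score, harm) in enumerate(paired, start=1):
        harms += harm
        # Acting on every action scored <= `score` would have yielded
        # harms/i harm rate on the calibration set.
        if harms / i <= budget:
            best = score
    return best

def act_or_abstain(score, threshold):
    # At runtime: act only when the score clears the calibrated threshold.
    return "act" if score <= threshold else "abstain"
```

For example, with calibration scores `[0.1, 0.2, 0.9]`, harm labels `[0, 0, 1]`, and a 5% budget, the threshold lands at 0.2, so a runtime score of 0.5 triggers abstention.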

Jingyu Zhang, Tianjian Li, William Jurayj et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI

Many-Tier Instruction Hierarchy shows today's agents break down when instruction privilege gets more granular, making it a useful stress test for serious multi-tool and multi-role deployments.

Suhana Bedi, Ryan Welch, Ethan Steinberg et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

HealthAdminBench gives computer-use agents a rare end-to-end GUI benchmark in a real workflow domain and shows that strong subtask scores still collapse into poor task completion.

Tanmay Gupta, Piper Wolters, Zixian Ma et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CV

Open 4B and 8B visual web agents, plus a large mixed training set, that beat comparable open agents and some larger closed systems, giving builders a reproducible browser-automation stack with no dependence on HTML or accessibility trees.

Yuxuan Zhang, Yubo Wang, Yipeng Zhu et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI

A live-web benchmark across 144 production sites and everyday tasks, showing frontier agents still complete only a small slice of real user workflows and giving builders a far more realistic yardstick than sandboxed browser evals.

Shilin Yan, Jintao Tong, Hongwei Xue et al.

cs.CV · cs.AI

Act Wisely separates task accuracy from tool-efficiency rewards so multimodal agents learn when not to call tools, cutting unnecessary invocations by orders of magnitude while improving accuracy, latency, and cost.

Boyang Zhang, Sebastián G. Acosta, Preston Carlson et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CV

ParseBench is a 2,000-page enterprise document benchmark that scores tables, charts, formatting, faithfulness, and grounding the way agents actually need them, exposing why text-similarity metrics miss business-critical parsing failures.

Tongbo Chen, Zhengxi Lu, Zhan Xu et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

KnowU-Bench evaluates personalized mobile agents in live GUI environments, including when to ask, act, or stay silent, which is much closer to real assistant behavior than static preference benchmarks.

Khushal Sethi

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL · cs.MA

TrACE spends extra rollouts only on uncertain agent steps, matching fixed self-consistency accuracy with far fewer model calls and offering an easy path to cheaper agent inference.
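The uncertainty-gated idea can be sketched generically: draw a few rollouts for a step and spend more only when they disagree. This is a minimal illustration of adaptive self-consistency under assumed parameters, not TrACE's actual gating rule; `answer_step` is a random stub standing in for a model call.

```python
import random
from collections import Counter

def answer_step(prompt: str) -> str:
    # Stand-in for a model call; returns a candidate action for one step.
    return random.choice(["A", "A", "A", "B"])

def adaptive_consistency(prompt, n_min=3, n_max=9, agree_frac=1.0):
    """Majority-vote a step's answer, sampling beyond n_min rollouts
    only when the initial samples fall below the agreement threshold."""
    samples = [answer_step(prompt) for _ in range(n_min)]
    top, count = Counter(samples).most_common(1)[0]
    if count / len(samples) >= agree_frac:
        return top, len(samples)  # confident step: stop early, save calls
    # Uncertain step: spend the remaining rollout budget here.
    samples += [answer_step(prompt) for _ in range(n_max - n_min)]
    top, _ = Counter(samples).most_common(1)[0]
    return top, len(samples)
```

On mostly easy steps this stops at `n_min` calls and only pays the full `n_max` on contested ones, which is where the claimed savings over fixed self-consistency come from.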

Guo Gan, Yuxuan Ding, Cong Chen et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.LG · cs.AI

Reframes online agent RL as single-state multi-action learning, boosting Android agent success while reducing expensive emulator waste, which is useful for training UI agents under tight latency and budget constraints.

Yu Li, Sizhe Tang, Tian Lan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI · cs.LG

Builds a cognitive tree across multi-turn trajectories to assign credit at the step level, improving policy optimization for reasoning, planning, and interactive agents with long sparse-reward chains.

Seongwoo Jeong, Seonil Son

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL

Shows explicit world models and symbolic reflection do most of the work in a self-revising agent, suggesting many agent stacks can trade extra model calls for better runtime structure.

Eranga Bandara, Ross Gore, Sachin Shetty et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

An agentic AI system automates end-to-end retail supply chains with real-world coordination, reducing manual labor at scale and showing that LLM agents can reliably drive high-stakes operational workflows.

Bowen Ye, Rang Li, Qibin Yang et al.

cs.AI

Claw-Eval introduces transparent, safety-aware, multimodal evaluation for autonomous agents, addressing benchmarking gaps that matter for building trustworthy, real-world AI agents.

Maria Nesterova, Mikhail Kolosov, Anton Andreychuk et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

A single GPT-based model learns diverse MARL tasks, eliminating task-specific architectures and enabling scalable, generalizable multi-agent systems without retraining for each environment.

Wang Yang, Chaoda Song, Xinpeng Li et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL

ACE-Bench reduces agent evaluation overhead by 41% with controllable, scalable tasks, enabling reliable, repeatable benchmarking of LLM agents for real-world deployment.

Nirajan Acharya, Gaurav Kumar Gupta

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Tool Use
cs.CR · cs.AI

First formal security framework for MCP-based AI agents, defining threats and verifiable defenses. Essential for builders deploying LLM agents with external tool access in production environments.

Guan-Ting Lin, Chen Chen, Zhehuai Chen et al.

significant · 🟡 Intermediate · Reasoning & Agents · Tool Use · AI Agents
cs.CL

Voice agents often fail when users stutter, pause, or interrupt, leading to broken API calls and frustrated users. This benchmark uses real human speech to reveal exactly how top models handle these messy realities. It allows developers to test if their voice systems can actually execute tasks reliably in natural conversation.

Rafael O. Jarczewski, Gabriel U. Talasso, Leandro Villas et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.MA · cs.AI

Agentic Federated Learning uses AI agents to dynamically manage distributed training across unreliable devices. This matters because it makes privacy-preserving AI training faster and more reliable in real-world settings like mobile networks or hospitals with spotty connectivity.

Chenxi Wang, Zhuoyun Yu, Xin Xie et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI · cs.IR

SkillX creates a shared knowledge base of skills that allows AI agents to learn from each other's experiences rather than starting from scratch. This prevents redundant exploration and speeds up the development of capable agents. Builders can reuse these skills across different projects, significantly cutting down training time and costs.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted.