Field

Reasoning & Agents

Reasoning, planning, tool use, and agentic workflows.

78 papers · latest 2026-04-23

Common topics in this field

AI Agents (56) · LLM Reasoning (15) · Alignment & Safety (8) · Tool Use (4) · Embodied Agents (3) · Efficient Inference (2)

Interval POMDP Shielding for Imperfect-Perception Agents

William Scarbro, Ravi Mangal

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Provides safety shielding for autonomous agents with imperfect perception, using confidence intervals to block potentially unsafe actions.
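
A minimal sketch of the idea in this summary, assuming the agent has a confidence interval on the probability that each action is unsafe: the shield blocks any action whose upper bound exceeds a risk budget. The names `Interval`, `shield`, and `max_risk`, and all numbers, are hypothetical; the paper's actual construction over interval POMDPs is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Confidence interval on the probability that an action is unsafe."""
    lo: float
    hi: float


def shield(actions, unsafe_prob, max_risk):
    """Keep only actions whose worst-case (upper-bound) risk fits the budget."""
    return [a for a in actions if unsafe_prob[a].hi <= max_risk]


# Hypothetical intervals produced by an imperfect perception module.
unsafe_prob = {
    "advance": Interval(0.02, 0.30),
    "slow": Interval(0.00, 0.04),
    "stop": Interval(0.00, 0.01),
}

allowed = shield(["advance", "slow", "stop"], unsafe_prob, max_risk=0.05)
print(allowed)
```

Judging an action by the upper end of its interval is the conservative choice: a wide interval (high perception uncertainty) blocks the action even when its lower bound looks safe.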

ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

Jan-Philipp Schmidt

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI · cs.CL

Presents ActuBench, a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, enabling automated, curriculum-aligned assessment item creation and validation.

Stateless Decision Memory for Enterprise AI Agents

Vasundra Srinivasan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Proposes stateless decision memory for regulated enterprise AI agents. Enables scalable, auditable, and compliant long-horizon decision-making in sensitive domains.

EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation

Aimin Zhang, Jiajing Guo, Fuwei Jia et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Presents EvoAgent, an evolvable LLM agent framework with structured skill learning and hierarchical delegation that enables continuous capability improvement through user feedback and multi-agent collaboration.

Explicit Trait Inference for Multi-Agent Coordination

Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Efficient Inference

cs.AI · cs.MA

ETI improves multi-agent coordination by modeling psychological traits of partners, reducing goal drift and errors. Builders should integrate it to create reliable, human-like agent teams for complex collaborative tasks.

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

Ruibing Hou, Mingyue Zhou, Yuwei Gui et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning · Diffusion Models

cs.CV

EgoMotion introduces the first diffusion-based framework for egocentric vision-language motion generation, enabling realistic 3D human motion synthesis from first-person views—critical for immersive VR, robotics, and human-robot interaction systems.

How Adversarial Environments Mislead Agentic AI?

Zhonghao Zhan, Huichi Zhou, Zhenhao Li et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Introduces the 'Trust Gap' in agentic AI, revealing that tools can be weaponized to mislead agents—demanding new evaluation standards that test skepticism, not just competence, for real-world deployment safety.

AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

Jiaqi Li, Lvyang Zhang, Yang Zhao et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

AIT Academy proposes the first principled curriculum for holistic agent development, addressing systemic gaps in current agent training—vital for builders aiming for general-purpose AI agents.

Using large language models for embodied planning introduces systematic safety risks

Tao Zhang, Kaixian Qu, Zhibin Li et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety · Embodied Agents

cs.AI · cs.LG · cs.RO

DESPITE reveals that even highly accurate LLM planners can systematically fail safety-critical tasks, exposing a critical gap between planning accuracy and real-world safety—essential for deploying robots in human environments.

From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS

Mina Gabriel, Pei Wang

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.AI

Presents a neuro-symbolic pipeline translating natural language into Narsese, enabling interpretable, uncertainty-aware reasoning—vital for building trustworthy AI systems requiring explicit logic over LLM hallucinations.

Human-Guided Harm Recovery for Computer Use Agents

Christy Li, Sky CH-Wang, Andi Peng et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI · cs.CL

Human-Guided Harm Recovery introduces the first formal framework for correcting harmful agent actions post-execution, enabling safe, real-world deployment of AI agents with human-aligned recovery protocols.

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Zhenwen Liang, Yujun Zhou, Sidi Lu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.LG

CUTS solves RL mode collapse in saturated reasoning by sampling from constrained top-K outputs, enabling continued learning even when models are already correct—vital for improving LLM reasoning robustness without manual data curation.

Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants

Sankalp Gilda, Shlok Gilda

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.AI · cs.LG · cs.LO

Embeds Peircean reasoning as algebraic invariants in LLMs, enforcing logical structure—vital for builders of reliable reasoning agents where correctness, not just fluency, is non-negotiable.

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · LLM Reasoning

cs.CV · cs.AI

Reveals CoT prompting harms visual spatial reasoning in multimodal LLMs—forcing a rethink of reasoning paradigms in robotics, AR/VR, and vision-language systems where spatial accuracy is non-negotiable.

Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

Eren Unlu

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Proposes SSTA-32, a diagnostic framework to evaluate if agents can diagnose task blockers before acting—critical for building trustworthy autonomous systems that avoid costly errors in open-ended environments.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

Bhaskar Gurram

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI · cs.CL · cs.MA

Reveals critical flaws in automated LLM agent evaluation and provides a human-validated benchmark with runtime mitigation, essential for building reliable tool-using agents in production systems.

Certified Program Synthesis with a Multi-Modal Verifier

Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.SE · cs.AI · cs.PL

This work introduces a multi-modal verifier that dynamically adjusts LLM-generated specs to be both implementable and formally sound—enabling trustworthy, automated code generation for safety-critical systems.

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Hikaru Shindo, Hanzhao Lin, Lukas Helff et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Embodied Agents

cs.AI · cs.LG · cs.MA

SocialGrid provides the first benchmark for social reasoning in embodied multi-agent systems, exposing critical gaps in LLM agents' planning and deception detection—essential for building trustworthy autonomous agents.

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

Zihan Liang, Yufei Ma, Ben Chen et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.AI · cs.CL · cs.IR

IG-Search introduces step-level information gain rewards to precisely guide LLM search queries in reasoning tasks, avoiding gradient collapse—critical for building reliable search-augmented agents that avoid redundant or vague queries.

Autogenesis: A Self-Evolving Agent Protocol

Wentao Zhang, Zhe Zhao, Haibin Wen et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Autogenesis introduces a self-evolving agent protocol with lifecycle and versioning control, enabling scalable, maintainable multi-agent systems—essential for production AI ecosystems that require autonomous updates without brittleness.

Mind DeepResearch Technical Report

MindDR Team, Li Auto Inc

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety

cs.AI

Demonstrates leading deep research performance with 30B models via a novel three-agent architecture and specialized training—proving high capability doesn't require trillion-parameter models, reshaping cost-efficiency in autonomous AI systems.

Intermediate Layers Encode Optimal Biological Representations in Single-Cell Foundation Models

Vincenzo Yuto Civale, Roberto Semeraro, Andrew David Bagdanov et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Shows that the best representations in single-cell foundation models sit in task-dependent intermediate layers rather than in the final layer, which changes how features should be extracted for biological AI and improves prediction accuracy in research systems.

Scaling Test-Time Compute for Agentic Coding

Joongwon Kim, Wannan Yang, Kelvin Niu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Efficient Inference

cs.SE · cs.AI · cs.CL

Introduces trajectory-based evaluation for scaling test-time compute in agentic coding, enabling meaningful refinement of long-horizon code agents, which is key for autonomous dev tools.

Mamba-SSM with LLM Reasoning for Biomarker Discovery: Causal Feature Refinement via Chain-of-Thought Gene Evaluation

Pushpa Kumar Balan, Aijing Feng

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.AI

Mamba-SSM + LLM CoT filters confounding genes via causal reasoning, boosting biomarker specificity—enabling reliable, interpretable genomic discovery without manual curation, directly impacting precision medicine pipelines.

Coalition Formation in LLM Agent Networks: Stability Analysis and Convergence Guarantees

Dongxin Guo, Jikun Wu, Siu-Ming Yiu

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.GT · cs.AI

This work formally models LLM agent coalitions using hedonic game theory, providing the first stability and convergence guarantees—critical for deploying reliable, cooperative multi-agent systems in real-world environments.

MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

Jiahang Lin, Kai Hu, Binghai Wang et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · RAG

cs.CL

Introduces a multi-turn RL agent for visual QA over long documents, enabling iterative retrieval and synthesis—transforming RAG from static lookup to dynamic reasoning for complex document systems.

AIBuildAI: An AI Agent for Automatically Building AI Models

Ruiyi Zhang, Peijia Qin, Qi Cao et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Introduces an AI agent that autonomously builds AI models end-to-end, reducing expert dependency—game-changing for practitioners needing rapid, scalable model development without manual tuning.

Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety

Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety

cs.LG

Presents what it describes as the first DRL system to integrate real-time drowsiness detection with adaptive braking to enhance road safety; relevant to practitioners building life-critical AI systems that respond to human state.

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Xixun Lin, Yang Liu, Yancheng Chen et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.CR · cs.AI

SafeHarness is the first lifecycle-integrated security architecture for LLM agents, closing critical attack vectors in tool orchestration—essential for trustworthy, production-grade agent systems.

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

You Qin, Linqing Wang, Hao Fei et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety

cs.LG · cs.AI

SOAR closes the SFT-RL gap in diffusion models by enabling self-correction during inference, improving alignment and robustness—critical for deploying safe, reliable generative systems under real-world distribution shifts.

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

Myungchul Kim, Kwanyong Park, Junmo Kim et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.CV · cs.AI · cs.MA

ARGOS frames person search as an interactive agent task with questioning and reasoning—enabling real-world surveillance systems to operate under ambiguity with minimal human input.

CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

Yongxuan Wu, Xixun Lin, He Zhang et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

First demonstration that LLM agent communication topologies can be inferred via black-box queries—exposing critical privacy risks and demanding new architectural safeguards in multi-agent deployments.

Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models

Lei Lin, Jizhao Zhu, Yong Liu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.AI

HCoT injects expert system heuristics into LLM reasoning, replacing stochastic sampling with structured, deterministic planning—transforming LLMs into reliable agents for high-stakes decision systems.

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Yijuan Liang, Xinghao Chen, Yifan Ge et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI

A unified 22k-tool, 390k-example tool-use stack that standardizes data and evaluation and lets an 8B model beat major commercial models on hard distractor-heavy calling.

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

Haoran Ding, Zhaoguo Wang, Haibo Chen

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.SE · cs.AI

This brings Hoare-style reasoning to 143k-line systems by inferring specs from caller intent, surfacing 522 new bugs in already-tested codebases.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Xiaomeng Hu, Yinger Zhang, Fei Huang et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents · World Models

cs.CL

OccuBench is a 100-scenario benchmark for professional agents across 65 domains that also injects hidden environment faults, exposing how brittle frontier models still are in real work settings.

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

Jinhua Wang, Biswa Sengupta

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.SE · cs.AI

This benchmark-driven translation of a production AI coding agent from Rust to Python shows how LLMs can migrate large systems continuously while staying competitive on real agent benchmarks.

CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench Team, Shibo Hao, Zhining Zhang et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.CL · cs.AI

CocoaBench is a strong reality check for unified digital agents, with long-horizon tasks that force systems to combine vision, search, and coding in one workflow.

PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers

Lei Xiong, Huaying Yuan, Zheng Liu et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI

PaperScope evaluates agentic deep research across multiple scientific papers, tables, and figures, exposing how hard real multi-document synthesis still is.

SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering

Ningyan Zhu, Huacan Wang, Jie Zhou et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI

SemaClaw frames harness engineering as the real differentiator for personal AI agents, focusing on the infrastructure layer that turns raw models into auditable systems.

Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

Xiaozhe Li, Tianyi Lyu, Yizhao Yang et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

A small RL-trained ContextCurator learns to trim noisy history while preserving reasoning anchors, boosting long-horizon agents and slashing token use up to 8x.

Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

Xing Zhang, Guanghui Wang, Yanwei Cui et al.

significant · 🟢 Beginner · Reasoning & Agents · AI Agents

cs.AI · cs.CL

A rare large-scale study of CLAUDE.md-style rules finds that negative constraints help coding agents while many positive instructions quietly hurt them.

Hodoscope: Unsupervised Monitoring for AI Misbehaviors

Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

Hodoscope uses unsupervised behavior monitoring to surface novel agent exploits and cut review effort by 6x to 23x, making it a practical safety layer for red teams and benchmark maintainers.

Pioneer Agent: Continual Improvement of Small Language Models in Production

Dhruv Atreja, Julia White, Nikhil Nayak et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI · cs.CL · cs.LG

Pioneer Agent turns small-model adaptation into an automated closed loop that diagnoses failures, curates new data, retrains under regression constraints, and materially improves production-style tasks.

MPAC: A Multi-Principal Agent Coordination Protocol for Interoperable Multi-Agent Collaboration

Kaiyang Qian, Xinmin Fang, Zhengxiong Li

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.MA · cs.AI

MPAC proposes a real coordination protocol for multi-owner agent systems, adding structured conflict handling and governance so agents can safely share state instead of silently clobbering each other.

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

Nastaran Darabi, Amit Ranjan Trivedi

significant · 🔴 Advanced · Reasoning & Agents · Embodied Agents · Alignment & Safety

cs.RO · cs.CL · cs.CV

ProGAL-VLA adds verified grounding and prospective sub-goals to VLA robots, sharply improving instruction sensitivity, ambiguity handling, and robustness under perturbation.

EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

Tiantian He, Yihang Chen, Keyue Jiang et al.

significant · 🔴 Advanced · Reasoning & Agents · Tool Use · AI Agents

cs.AI

EE-MCP shows how MCP-plus-GUI agents can self-improve by generating environments, synthesizing gap tasks, and accumulating reusable experience, with clear gains across desktop apps.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

Siyuan Xu, Shiyang Li, Xin Liu et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

COVERT turns synthetic tool-use data into reward-checkable RL environments, making it much easier to harden agent tool calling against ambiguity, distractor tools, and noisy outputs.

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Yucheng Shen, Jiulong Wu, Jizhou Huang et al.

significant · 🔴 Advanced · Reasoning & Agents · RAG · AI Agents

cs.CV · cs.AI

VISOR pushes visual RAG toward real agent behavior with iterative search, evidence-space tracking, and drift control for long-horizon multimodal question answering over documents.

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Mohamed Elfeki, Tu Trinh, Kelvin Luu et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI

HiL-Bench measures whether agents know when to ask for missing information, exposing a major reliability gap that standard pass/fail coding benchmarks mostly hide.

CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation

Yushi Feng, Junye Du, Qifan Wang et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.LG · cs.AI

CORA adds conformal risk control to mobile GUI agents so teams can set explicit harm budgets and abstain before risky clicks instead of trusting heuristic guardrails.
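
To make "explicit harm budgets" concrete, here is a minimal sketch of the generic conformal risk control recipe, assuming each calibration step has a risk score and a binary harm label, and that the agent acts only when the score is at or below a threshold. This is not CORA's implementation; `calibrate_threshold`, the scores, and the labels are hypothetical.

```python
def calibrate_threshold(scores, harmful, alpha):
    """Pick the largest observed risk-score threshold that fits the harm budget.

    The agent acts when score <= threshold and abstains otherwise. A step counts
    as a loss only if it is both executed and harmful. Following conformal risk
    control with losses bounded by 1, a threshold is accepted while
    (executed harmful steps + 1) / (n + 1) <= alpha.
    Returns None when no threshold qualifies (always abstain).
    """
    n = len(scores)
    pairs = sorted(zip(scores, harmful))
    executed_harm = 0
    threshold = None
    i = 0
    while i < n:
        score = pairs[i][0]
        # Group ties so equal scores are accepted or rejected together.
        j, harm_here = i, 0
        while j < n and pairs[j][0] == score:
            harm_here += int(pairs[j][1])
            j += 1
        if executed_harm + harm_here + 1 > alpha * (n + 1):
            break
        executed_harm += harm_here
        threshold = score
        i = j
    return threshold


# Hypothetical calibration data: risk scores and whether the click was harmful.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
harmful = [0, 0, 0, 0, 1, 0, 1, 1, 1]

threshold = calibrate_threshold(scores, harmful, alpha=0.25)
print(threshold)
```

The `+ 1` in the numerator accounts for the unseen test step as a worst case, which is why very small calibration sets can force the agent to abstain on everything.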

Many-Tier Instruction Hierarchy in LLM Agents

Jingyu Zhang, Tianjian Li, William Jurayj et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.CL · cs.AI

Many-Tier Instruction Hierarchy shows today's agents break down when instruction privilege gets more granular, making it a useful stress test for serious multi-tool and multi-role deployments.

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

Suhana Bedi, Ryan Welch, Ethan Steinberg et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI

HealthAdminBench gives computer-use agents a rare end-to-end GUI benchmark in a real workflow domain and shows that strong subtask scores still collapse into poor task completion.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Tanmay Gupta, Piper Wolters, Zixian Ma et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.CV

An open 4B and 8B visual web agent plus large mixed training set that beats comparable open agents and some larger closed systems, giving builders a reproducible browser-automation stack without HTML or accessibility-tree dependence.

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.CL · cs.AI

A live-web benchmark across 144 production sites and everyday tasks, showing frontier agents still complete only a small slice of real user workflows and giving builders a far more realistic yardstick than sandboxed browser evals.

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan, Jintao Tong, Hongwei Xue et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Multimodal Understanding

cs.CV · cs.AI

Act Wisely separates task accuracy from tool-efficiency rewards so multimodal agents learn when not to call tools, cutting unnecessary invocations by orders of magnitude while improving accuracy, latency, and cost.

ParseBench: A Document Parsing Benchmark for AI Agents

Boyang Zhang, Sebastián G. Acosta, Preston Carlson et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.CV

ParseBench is a 2,000-page enterprise document benchmark that scores tables, charts, formatting, faithfulness, and grounding the way agents actually need them, exposing why text-similarity metrics miss business-critical parsing failures.

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.AI · cs.LG

SUPERNOVA turns natural-instruction datasets into RL-ready supervision for general reasoning, delivering large gains beyond math and code and giving post-training teams a practical recipe for broader reasoning improvement.

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Wenbo Hu, Xin Chen, Yan Gao-Tian et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.CV · cs.AI · cs.CL

OpenVLThinkerV2 introduces a more stable RL objective and task-shaping recipe for open multimodal reasoning, helping a generalist model balance perception with multi-step thinking across 18 visual benchmarks.

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Tongbo Chen, Zhengxi Lu, Zhan Xu et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

KnowU-Bench evaluates personalized mobile agents in live GUI environments, including when to ask, act, or stay silent, which is much closer to real assistant behavior than static preference benchmarks.

Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Khushal Sethi

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI · cs.CL · cs.MA

TrACE spends extra rollouts only on uncertain agent steps, matching fixed self-consistency accuracy with far fewer model calls and offering an easy path to cheaper agent inference.
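
A minimal sketch of agreement-gated rollouts, assuming the agent can sample a candidate action repeatedly at each step: sampling stops as soon as the leading action holds a large enough share of the votes, so easy steps cost few model calls. This illustrates the general idea only, not TrACE's actual rule; `choose_action`, its parameters, and the action names are hypothetical.

```python
from collections import Counter


def choose_action(sample_action, min_rollouts=2, max_rollouts=8, agreement=0.75):
    """Sample actions until the leading one reaches the agreement share.

    Returns the majority action and the number of rollouts spent on this step.
    """
    votes = Counter()
    used = 0
    while used < max_rollouts:
        votes[sample_action()] += 1
        used += 1
        action, count = votes.most_common(1)[0]
        if used >= min_rollouts and count / used >= agreement:
            break  # rollouts agree: stop spending compute on this step
    return action, used


# Stand-ins for a stochastic policy: fixed sequences keep the example deterministic.
easy = iter(["click"] * 8)
hard = iter(["click", "type", "click", "click", "click", "click", "click", "click"])

easy_result = choose_action(lambda: next(easy))
hard_result = choose_action(lambda: next(hard))
print(easy_result, hard_result)
```

Fixed self-consistency would spend `max_rollouts` calls on both steps; here the step with immediate agreement stops at the minimum and only the contested step draws extra samples.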

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.CV · cs.AI

Faithful GRPO adds consistency and grounding constraints to multimodal RL training, sharply reducing unfaithful visual reasoning traces while also improving final spatial reasoning accuracy.

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Guo Gan, Yuxuan Ding, Cong Chen et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.LG · cs.AI

Reframes online agent RL as single-state multi-action learning, boosting Android agent success while reducing expensive emulator waste—useful for training UI agents under tight latency and budget constraints.

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Yu Li, Sizhe Tang, Tian Lan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI · cs.LG

Builds a cognitive tree across multi-turn trajectories to assign credit at the step level, improving policy optimization for reasoning, planning, and interactive agents with long sparse-reward chains.

How Much LLM Does a Self-Revising Agent Actually Need?

Seongwoo Jeong, Seonil Son

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI · cs.CL

Shows explicit world models and symbolic reflection do most of the work in a self-revising agent, suggesting many agent stacks can trade extra model calls for better runtime structure.

Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

Eranga Bandara, Ross Gore, Sachin Shetty et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI

Uses agentic AI to automate end-to-end retail supply chain operations with real-world coordination, reducing manual labor at scale and showing that LLM agents can reliably drive high-stakes operational workflows.

CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments

Gustav Keppler, Moritz Gstür, Veit Hagenmeyer

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.CR · cs.AI

CritBench is the first benchmark evaluating LLM agents on OT protocols like IEC 61850, exposing critical cybersecurity gaps in industrial systems. Essential for deploying LLMs in critical infrastructure safely.

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents · Alignment & Safety

cs.AI

Claw-Eval introduces transparent, safety-aware, multimodal evaluation for autonomous agents, addressing critical gaps in benchmarking—essential for building trustworthy, real-world AI agents.

MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

Maria Nesterova, Mikhail Kolosov, Anton Andreychuk et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.AI

A single GPT-based model learns diverse MARL tasks, eliminating task-specific architectures—enabling scalable, generalizable multi-agent systems without retraining for each environment.

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Wang Yang, Chaoda Song, Xinpeng Li et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.AI · cs.CL

ACE-Bench reduces agent evaluation overhead by 41% with controllable, scalable tasks—enabling reliable, repeatable benchmarking of LLM agents for real-world deployment.

A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms

Nirajan Acharya, Gaurav Kumar Gupta

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Tool Use

cs.CR · cs.AI

First formal security framework for MCP-based AI agents, defining threats and verifiable defenses. Essential for builders deploying LLM agents with external tool access in production environments.

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

LM-Provers, Yuxiao Qu, Amrith Setlur et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning

cs.AI · cs.CL · cs.LG

QED-Nano proves complex math theorems using a tiny, open model—no giant AI needed. This matters because it makes high-level reasoning accessible to anyone, enabling reproducible, affordable AI that can be inspected, improved, and deployed without cloud costs.

Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Guan-Ting Lin, Chen Chen, Zhehuai Chen et al.

significant · 🟡 Intermediate · Reasoning & Agents · Tool Use · AI Agents

cs.CL

Voice agents often fail when users stutter, pause, or interrupt, leading to broken API calls and frustrated users. This benchmark uses real human speech to reveal exactly how top models handle these messy realities. It allows developers to test if their voice systems can actually execute tasks reliably in natural conversation.

Agentic Federated Learning: The Future of Distributed Training Orchestration

Rafael O. Jarczewski, Gabriel U. Talasso, Leandro Villas et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents

cs.MA · cs.AI

Agentic Federated Learning uses AI agents to dynamically manage distributed training across unreliable devices. This matters because it makes privacy-preserving AI training faster and more reliable in real-world settings like mobile networks or hospitals with spotty connectivity.

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents

cs.CL · cs.AI · cs.IR

SkillX creates a shared knowledge base of skills that allows AI agents to learn from each other's experiences rather than starting from scratch. This prevents redundant exploration and speeds up the development of capable agents. Builders can reuse these skills across different projects, significantly cutting down training time and costs.

Discovering Failure Modes in Vision-Language Models using RL

Kanishk Jain, Qian Yang, Shravan Nayak et al.

significant · 🟡 Intermediate · Reasoning & Agents · LLM Reasoning

cs.CV · cs.AI

Finding specific weaknesses in vision-language models usually requires slow, manual testing. This paper uses reinforcement learning to automatically discover scenarios where models fail, such as spatial reasoning errors. This automation allows teams to rapidly identify and fix blind spots that human testers might miss.

AI Assistance Reduces Persistence and Hurts Independent Performance

Grace Liu, Brian Christian, Tsvetomira Dumbalska et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety

cs.AI

AI assistants that always answer quickly make users dependent and worse at thinking alone. This is the first solid evidence that good AI should sometimes say 'figure it out'—a wake-up call for designers building educational or productivity tools.

Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices

Alexis Burgon, Berkman Sahiner, Nicholas A Petrick et al.

significant · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety

cs.AI · cs.PF

This work introduces a standardized framework to evaluate AI medical devices that learn and adapt over time, solving a major regulatory bottleneck. It provides clear metrics to distinguish between a model actually improving versus just memorizing new data, which is critical for getting adaptive AI approved for clinical use.
