Field

Reasoning & Agents

Reasoning, planning, tool use, and agentic workflows.

78 papers · latest 2026-04-23

William Scarbro, Ravi Mangal

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Provides safety shielding for autonomous agents with imperfect perception, using confidence intervals to block potentially unsafe actions.

Jan-Philipp Schmidt

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL

Presents ActuBench, a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, enabling automated, curriculum-aligned assessment item creation and validation.

Vasundra Srinivasan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Proposes stateless decision memory for regulated enterprise AI agents. Enables scalable, auditable, and compliant long-horizon decision-making in sensitive domains.

Aimin Zhang, Jiajing Guo, Fuwei Jia et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Presents EvoAgent, an evolvable LLM agent framework with structured skill learning and hierarchical delegation that enables continuous capability improvement through user feedback and multi-agent collaboration.

Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina et al.

cs.AI · cs.MA

ETI improves multi-agent coordination by modeling psychological traits of partners, reducing goal drift and errors. Builders should integrate it to create reliable, human-like agent teams for complex collaborative tasks.

Ruibing Hou, Mingyue Zhou, Yuwei Gui et al.

cs.CV

EgoMotion introduces the first diffusion-based framework for egocentric vision-language motion generation, enabling realistic 3D human motion synthesis from first-person views—critical for immersive VR, robotics, and human-robot interaction systems.

Zhonghao Zhan, Huichi Zhou, Zhenhao Li et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Introduces the 'Trust Gap' in agentic AI, revealing that tools can be weaponized to mislead agents—demanding new evaluation standards that test skepticism, not just competence, for real-world deployment safety.

Jiaqi Li, Lvyang Zhang, Yang Zhao et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

AIT Academy proposes the first principled curriculum for holistic agent development, addressing systemic gaps in current agent training—vital for builders aiming for general-purpose AI agents.

Tao Zhang, Kaixian Qu, Zhibin Li et al.

cs.AI · cs.LG · cs.RO

DESPITE reveals that even highly accurate LLM planners can systematically fail safety-critical tasks, exposing a critical gap between planning accuracy and real-world safety—essential for deploying robots in human environments.

Mina Gabriel, Pei Wang

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI

Presents a neuro-symbolic pipeline translating natural language into Narsese, enabling interpretable, uncertainty-aware reasoning—vital for building trustworthy AI systems requiring explicit logic over LLM hallucinations.

Christy Li, Sky CH-Wang, Andi Peng et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI · cs.CL

Human-Guided Harm Recovery introduces the first formal framework for correcting harmful agent actions post-execution, enabling safe, real-world deployment of AI agents with human-aligned recovery protocols.

Zhenwen Liang, Yujun Zhou, Sidi Lu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.LG

CUTS solves RL mode collapse in saturated reasoning by sampling from constrained top-K outputs, enabling continued learning even when models are already correct—vital for improving LLM reasoning robustness without manual data curation.

Sankalp Gilda, Shlok Gilda

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.LG · cs.LO

Embeds Peircean reasoning as algebraic invariants in LLMs, enforcing logical structure—vital for builders of reliable reasoning agents where correctness, not just fluency, is non-negotiable.

Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI

Reveals CoT prompting harms visual spatial reasoning in multimodal LLMs—forcing a rethink of reasoning paradigms in robotics, AR/VR, and vision-language systems where spatial accuracy is non-negotiable.

Eren Unlu

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Proposes SSTA-32, a diagnostic framework to evaluate if agents can diagnose task blockers before acting—critical for building trustworthy autonomous systems that avoid costly errors in open-ended environments.

Bhaskar Gurram

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL · cs.MA

Reveals critical flaws in automated LLM agent evaluation and provides a human-validated benchmark with runtime mitigation, essential for building reliable tool-using agents in production systems.

Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.SE · cs.AI · cs.PL

This work introduces a multi-modal verifier that dynamically adjusts LLM-generated specs to be both implementable and formally sound—enabling trustworthy, automated code generation for safety-critical systems.

Hikaru Shindo, Hanzhao Lin, Lukas Helff et al.

cs.AI · cs.LG · cs.MA

SocialGrid provides the first benchmark for social reasoning in embodied multi-agent systems, exposing critical gaps in LLM agents' planning and deception detection—essential for building trustworthy autonomous agents.

Zihan Liang, Yufei Ma, Ben Chen et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.CL · cs.IR

IG-Search introduces step-level information gain rewards to precisely guide LLM search queries in reasoning tasks, avoiding gradient collapse—critical for building reliable search-augmented agents that avoid redundant or vague queries.

Wentao Zhang, Zhe Zhao, Haibin Wen et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Autogenesis introduces a self-evolving agent protocol with lifecycle and versioning control, enabling scalable, maintainable multi-agent systems—essential for production AI ecosystems that require autonomous updates without brittleness.

MindDR Team, Li Auto Inc

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety
cs.AI

Demonstrates leading deep research performance with 30B models via a novel three-agent architecture and specialized training—proving high capability doesn't require trillion-parameter models, reshaping cost-efficiency in autonomous AI systems.

Vincenzo Yuto Civale, Roberto Semeraro, Andrew David Bagdanov et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Optimal representations in single-cell models are not in final layers but task-dependent intermediate ones—revolutionizing how to extract features for biological AI, directly improving prediction accuracy in research systems.

Joongwon Kim, Wannan Yang, Kelvin Niu et al.

cs.SE · cs.AI · cs.CL

Scaling test-time compute for agentic coding introduces trajectory-based evaluation, enabling meaningful refinement of long-horizon code agents—key for autonomous dev tools.

Pushpa Kumar Balan, Aijing Feng

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI

Mamba-SSM + LLM CoT filters confounding genes via causal reasoning, boosting biomarker specificity—enabling reliable, interpretable genomic discovery without manual curation, directly impacting precision medicine pipelines.

Dongxin Guo, Jikun Wu, Siu-Ming Yiu

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.GT · cs.AI

This work formally models LLM agent coalitions using hedonic game theory, providing the first stability and convergence guarantees—critical for deploying reliable, cooperative multi-agent systems in real-world environments.

Jiahang Lin, Kai Hu, Binghai Wang et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · RAG
cs.CL

Introduces a multi-turn RL agent for visual QA over long documents, enabling iterative retrieval and synthesis—transforming RAG from static lookup to dynamic reasoning for complex document systems.

Ruiyi Zhang, Peijia Qin, Qi Cao et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Introduces an AI agent that autonomously builds AI models end-to-end, reducing expert dependency—game-changing for practitioners needing rapid, scalable model development without manual tuning.

Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety
cs.LG

First DRL system integrating real-time drowsiness detection with adaptive braking, directly enhancing road safety—practitioners should adopt this to build life-critical AI systems that respond to human state.

Xixun Lin, Yang Liu, Yancheng Chen et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.CR · cs.AI

SafeHarness is the first lifecycle-integrated security architecture for LLM agents, closing critical attack vectors in tool orchestration—essential for trustworthy, production-grade agent systems.

You Qin, Linqing Wang, Hao Fei et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety
cs.LG · cs.AI

SOAR closes the SFT-RL gap in diffusion models by enabling self-correction during inference, improving alignment and robustness—critical for deploying safe, reliable generative systems under real-world distribution shifts.

Myungchul Kim, Kwanyong Park, Junmo Kim et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.CV · cs.AI · cs.MA

ARGOS frames person search as an interactive agent task with questioning and reasoning—enabling real-world surveillance systems to operate under ambiguity with minimal human input.

Yongxuan Wu, Xixun Lin, He Zhang et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

First demonstration that LLM agent communication topologies can be inferred via black-box queries—exposing critical privacy risks and demanding new architectural safeguards in multi-agent deployments.

Lei Lin, Jizhao Zhu, Yong Liu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI

HCoT injects expert system heuristics into LLM reasoning, replacing stochastic sampling with structured, deterministic planning—transforming LLMs into reliable agents for high-stakes decision systems.

Yijuan Liang, Xinghao Chen, Yifan Ge et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

A unified 22k-tool, 390k-example tool-use stack that standardizes data and evaluation and lets an 8B model beat major commercial models on hard distractor-heavy calling.

Haoran Ding, Zhaoguo Wang, Haibo Chen

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.SE · cs.AI

This brings Hoare-style reasoning to 143k-line systems by inferring specs from caller intent, surfacing 522 new bugs in already-tested codebases.

Xiaomeng Hu, Yinger Zhang, Fei Huang et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents · World Models
cs.CL

OccuBench is a 100-scenario benchmark for professional agents across 65 domains that also injects hidden environment faults, exposing how brittle frontier models still are in real work settings.

Jinhua Wang, Biswa Sengupta

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.SE · cs.AI

This benchmark-driven translation of a production AI coding agent from Rust to Python shows how LLMs can migrate large systems continuously while staying competitive on real agent benchmarks.

CocoaBench Team, Shibo Hao, Zhining Zhang et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI

CocoaBench is a strong reality check for unified digital agents, with long-horizon tasks that force systems to combine vision, search, and coding in one workflow.

Lei Xiong, Huaying Yuan, Zheng Liu et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

PaperScope evaluates agentic deep research across multiple scientific papers, tables, and figures, exposing how hard real multi-document synthesis still is.

Ningyan Zhu, Huacan Wang, Jie Zhou et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

SemaClaw frames harness engineering as the real differentiator for personal AI agents, focusing on the infrastructure layer that turns raw models into auditable systems.

Xiaozhe Li, Tianyi Lyu, Yizhao Yang et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

A small RL-trained ContextCurator learns to trim noisy history while preserving reasoning anchors, boosting long-horizon agents and slashing token use up to 8x.

Xing Zhang, Guanghui Wang, Yanwei Cui et al.

significant · 🟢 Beginner · Reasoning & Agents · AI Agents
cs.AI · cs.CL

A rare large-scale study of CLAUDE.md-style rules finds that negative constraints help coding agents while many positive instructions quietly hurt them.

Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

Hodoscope uses unsupervised behavior monitoring to surface novel agent exploits and cut review effort by 6x to 23x, making it a practical safety layer for red teams and benchmark maintainers.

Dhruv Atreja, Julia White, Nikhil Nayak et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI · cs.CL · cs.LG

Pioneer Agent turns small-model adaptation into an automated closed loop that diagnoses failures, curates new data, retrains under regression constraints, and materially improves production-style tasks.

Kaiyang Qian, Xinmin Fang, Zhengxiong Li

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.MA · cs.AI

MPAC proposes a real coordination protocol for multi-owner agent systems, adding structured conflict handling and governance so agents can safely share state instead of silently clobbering each other.

Nastaran Darabi, Amit Ranjan Trivedi

cs.RO · cs.CL · cs.CV

ProGAL-VLA adds verified grounding and prospective sub-goals to VLA robots, sharply improving instruction sensitivity, ambiguity handling, and robustness under perturbation.

Tiantian He, Yihang Chen, Keyue Jiang et al.

significant · 🔴 Advanced · Reasoning & Agents · Tool Use · AI Agents
cs.AI

EE-MCP shows how MCP-plus-GUI agents can self-improve by generating environments, synthesizing gap tasks, and accumulating reusable experience, with clear gains across desktop apps.

Siyuan Xu, Shiyang Li, Xin Liu et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

COVERT turns synthetic tool-use data into reward-checkable RL environments, making it much easier to harden agent tool calling against ambiguity, distractor tools, and noisy outputs.

Yucheng Shen, Jiulong Wu, Jizhou Huang et al.

significant · 🔴 Advanced · Reasoning & Agents · RAG · AI Agents
cs.CV · cs.AI

VISOR pushes visual RAG toward real agent behavior with iterative search, evidence-space tracking, and drift control for long-horizon multimodal question answering over documents.

Mohamed Elfeki, Tu Trinh, Kelvin Luu et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

HiL-Bench measures whether agents know when to ask for missing information, exposing a major reliability gap that standard pass/fail coding benchmarks mostly hide.

Yushi Feng, Junye Du, Qifan Wang et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.LG · cs.AI

CORA adds conformal risk control to mobile GUI agents so teams can set explicit harm budgets and abstain before risky clicks instead of trusting heuristic guardrails.

Jingyu Zhang, Tianjian Li, William Jurayj et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI

Many-Tier Instruction Hierarchy shows today's agents break down when instruction privilege gets more granular, making it a useful stress test for serious multi-tool and multi-role deployments.

Suhana Bedi, Ryan Welch, Ethan Steinberg et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

HealthAdminBench gives computer-use agents a rare end-to-end GUI benchmark in a real workflow domain and shows that strong subtask scores still collapse into poor task completion.

Tanmay Gupta, Piper Wolters, Zixian Ma et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CV

An open 4B and 8B visual web agent plus large mixed training set that beats comparable open agents and some larger closed systems, giving builders a reproducible browser-automation stack without HTML or accessibility-tree dependence.

Yuxuan Zhang, Yubo Wang, Yipeng Zhu et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI

A live-web benchmark across 144 production sites and everyday tasks, showing frontier agents still complete only a small slice of real user workflows and giving builders a far more realistic yardstick than sandboxed browser evals.

Shilin Yan, Jintao Tong, Hongwei Xue et al.

cs.CV · cs.AI

Act Wisely separates task accuracy from tool-efficiency rewards so multimodal agents learn when not to call tools, cutting unnecessary invocations by orders of magnitude while improving accuracy, latency, and cost.

Boyang Zhang, Sebastián G. Acosta, Preston Carlson et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CV

ParseBench is a 2,000-page enterprise document benchmark that scores tables, charts, formatting, faithfulness, and grounding the way agents actually need them, exposing why text-similarity metrics miss business-critical parsing failures.

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.LG

SUPERNOVA turns natural-instruction datasets into RL-ready supervision for general reasoning, delivering large gains beyond math and code and giving post-training teams a practical recipe for broader reasoning improvement.

Wenbo Hu, Xin Chen, Yan Gao-Tian et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI · cs.CL

OpenVLThinkerV2 introduces a more stable RL objective and task-shaping recipe for open multimodal reasoning, helping a generalist model balance perception with multi-step thinking across 18 visual benchmarks.

Tongbo Chen, Zhengxi Lu, Zhan Xu et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

KnowU-Bench evaluates personalized mobile agents in live GUI environments, including when to ask, act, or stay silent, which is much closer to real assistant behavior than static preference benchmarks.

Khushal Sethi

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL · cs.MA

TrACE spends extra rollouts only on uncertain agent steps, matching fixed self-consistency accuracy with far fewer model calls and offering an easy path to cheaper agent inference.

Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI

Faithful GRPO adds consistency and grounding constraints to multimodal RL training, sharply reducing unfaithful visual reasoning traces while also improving final spatial reasoning accuracy.

Guo Gan, Yuxuan Ding, Cong Chen et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.LG · cs.AI

Reframes online agent RL as single-state multi-action learning, boosting Android agent success while reducing expensive emulator waste—useful for training UI agents under tight latency and budget constraints.

Yu Li, Sizhe Tang, Tian Lan

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI · cs.LG

Builds a cognitive tree across multi-turn trajectories to assign credit at the step level, improving policy optimization for reasoning, planning, and interactive agents with long sparse-reward chains.

Seongwoo Jeong, Seonil Son

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL

Shows explicit world models and symbolic reflection do most of the work in a self-revising agent, suggesting many agent stacks can trade extra model calls for better runtime structure.

Eranga Bandara, Ross Gore, Sachin Shetty et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI

Agentic AI automates end-to-end retail supply chains with real-world coordination, reducing manual labor at scale and proving that LLM agents can drive high-stakes operational workflows reliably.

Gustav Keppler, Moritz Gstür, Veit Hagenmeyer

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.CR · cs.AI

CritBench is the first benchmark evaluating LLM agents on OT protocols like IEC 61850, exposing critical cybersecurity gaps in industrial systems. Essential for deploying LLMs in critical infrastructure safely.

Bowen Ye, Rang Li, Qibin Yang et al.

cs.AI

Claw-Eval introduces transparent, safety-aware, multimodal evaluation for autonomous agents, addressing critical gaps in benchmarking—essential for building trustworthy, real-world AI agents.

Maria Nesterova, Mikhail Kolosov, Anton Andreychuk et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.AI

A single GPT-based model learns diverse MARL tasks, eliminating task-specific architectures—enabling scalable, generalizable multi-agent systems without retraining for each environment.

Wang Yang, Chaoda Song, Xinpeng Li et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.AI · cs.CL

ACE-Bench reduces agent evaluation overhead by 41% with controllable, scalable tasks—enabling reliable, repeatable benchmarking of LLM agents for real-world deployment.

Nirajan Acharya, Gaurav Kumar Gupta

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Tool Use
cs.CR · cs.AI

First formal security framework for MCP-based AI agents, defining threats and verifiable defenses. Essential for builders deploying LLM agents with external tool access in production environments.

LM-Provers, Yuxiao Qu, Amrith Setlur et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.CL · cs.LG

QED-Nano proves complex math theorems using a tiny, open model—no giant AI needed. This matters because it makes high-level reasoning accessible to anyone, enabling reproducible, affordable AI that can be inspected, improved, and deployed without cloud costs.

Guan-Ting Lin, Chen Chen, Zhehuai Chen et al.

significant · 🟡 Intermediate · Reasoning & Agents · Tool Use · AI Agents
cs.CL

Voice agents often fail when users stutter, pause, or interrupt, leading to broken API calls and frustrated users. This benchmark uses real human speech to reveal exactly how top models handle these messy realities. It allows developers to test if their voice systems can actually execute tasks reliably in natural conversation.

Rafael O. Jarczewski, Gabriel U. Talasso, Leandro Villas et al.

significant · 🔴 Advanced · Reasoning & Agents · AI Agents
cs.MA · cs.AI

Agentic Federated Learning uses AI agents to dynamically manage distributed training across unreliable devices. This matters because it makes privacy-preserving AI training faster and more reliable in real-world settings like mobile networks or hospitals with spotty connectivity.

Chenxi Wang, Zhuoyun Yu, Xin Xie et al.

significant · 🟡 Intermediate · Reasoning & Agents · AI Agents
cs.CL · cs.AI · cs.IR

SkillX creates a shared knowledge base of skills that allows AI agents to learn from each other's experiences rather than starting from scratch. This prevents redundant exploration and speeds up the development of capable agents. Builders can reuse these skills across different projects, significantly cutting down training time and costs.

Kanishk Jain, Qian Yang, Shravan Nayak et al.

significant · 🟡 Intermediate · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI

Finding specific weaknesses in vision-language models usually requires slow, manual testing. This paper uses reinforcement learning to automatically discover scenarios where models fail, such as spatial reasoning errors. This automation allows teams to rapidly identify and fix blind spots that human testers might miss.

Grace Liu, Brian Christian, Tsvetomira Dumbalska et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety
cs.AI

AI assistants that always answer quickly make users dependent and worse at thinking alone. This is the first solid evidence that good AI should sometimes say 'figure it out'—a wake-up call for designers building educational or productivity tools.

Alexis Burgon, Berkman Sahiner, Nicholas A Petrick et al.

significant · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety
cs.AI · cs.PF

This work introduces a standardized framework to evaluate AI medical devices that learn and adapt over time, solving a major regulatory bottleneck. It provides clear metrics to distinguish between a model actually improving versus just memorizing new data, which is critical for getting adaptive AI approved for clinical use.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted