AI Research Highlights
Tuesday, April 14, 2026
Yijuan Liang, Xinghao Chen, Yifan Ge et al.
A unified tool-use stack spanning 22k tools and 390k examples that standardizes data and evaluation, letting an 8B model beat major commercial models on hard, distractor-heavy tool calling.
Haoran Ding, Zhaoguo Wang, Haibo Chen
This brings Hoare-style reasoning to 143k-line systems by inferring specs from caller intent, surfacing 522 new bugs in already-tested codebases.
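The core move, inferring a function's specification from how its callers use it, can be sketched minimally: treat the properties that (almost) every observed call site guarantees as a candidate precondition, then flag the call sites that break it. Everything below is a toy illustration with hypothetical predicates, not the paper's actual analysis, which operates on real systems code.

```python
# Toy sketch of caller-intent spec inference (hypothetical predicates).
# Infer a candidate precondition from observed call arguments, then
# flag the call sites that violate it as likely bugs.

def infer_precondition(call_args, threshold=0.9):
    """Return predicates holding at >= threshold of observed calls."""
    predicates = {
        "non_null": lambda a: a is not None,
        "non_negative": lambda a: isinstance(a, int) and a >= 0,
    }
    inferred = {}
    for name, pred in predicates.items():
        support = sum(pred(a) for a in call_args) / len(call_args)
        if support >= threshold:
            inferred[name] = pred
    return inferred

def find_violations(call_args, spec):
    """Indices of call sites that break the inferred spec."""
    return [i for i, a in enumerate(call_args)
            if not all(pred(a) for pred in spec.values())]

calls = [3, 7, 0, 12, 5, 9, 4, 8, 2, -1]   # one caller passes -1
spec = infer_precondition(calls)
print(find_violations(calls, spec))         # → [9]
```

The threshold keeps a single buggy caller from erasing the very precondition it violates, which is what lets the analysis surface bugs in already-tested code.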
Xiaomeng Hu, Yinger Zhang, Fei Huang et al.
OccuBench is a 100-scenario benchmark for professional agents across 65 domains that also injects hidden environment faults, exposing how brittle frontier models still are in real work settings.
Jinhua Wang, Biswa Sengupta
This benchmark-driven translation of a production AI coding agent from Rust to Python shows how LLMs can migrate large systems continuously while staying competitive on real agent benchmarks.
CocoaBench Team, Shibo Hao, Zhining Zhang et al.
CocoaBench is a strong reality check for unified digital agents, with long-horizon tasks that force systems to combine vision, search, and coding in one workflow.
Liujie Zhang, Benzhe Ning, Rui Yang et al.
Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.
Lei Xiong, Huaying Yuan, Zheng Liu et al.
PaperScope evaluates agentic deep research across multiple scientific papers, tables, and figures, exposing how hard real multi-document synthesis still is.
Bo Li, Mingda Wang, Gexiang Fang et al.
GRIP turns retrieval into a native decoding action so the model can decide when to search, rewrite queries, and stop inside one reasoning trace instead of bolting on a controller.
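The "retrieval as a decoding action" idea can be sketched with a toy runtime loop: the model is free to emit a search action mid-trace, the runtime resolves it and splices the result back in, and decoding continues until the model emits an answer. The tag names, stub model, and corpus below are invented for illustration; GRIP's actual action interface may differ.

```python
# Toy sketch of retrieval as a native decoding action (names
# hypothetical): the "model" emits either a <search>...</search>
# action or a final answer, all inside one reasoning trace.

CORPUS = {"grip": "GRIP makes retrieval a decoding action."}

def retrieve(query):
    return CORPUS.get(query.lower(), "no result")

def toy_model(trace):
    # Stand-in for an LLM policy: search once, then answer.
    if "<result>" not in trace:
        return "<search>GRIP</search>"
    return "<answer>retrieval happens inside decoding</answer>"

def run(prompt, max_steps=4):
    trace = prompt
    for _ in range(max_steps):
        step = toy_model(trace)
        trace += step
        if step.startswith("<search>"):
            q = step[len("<search>"):-len("</search>")]
            trace += f"<result>{retrieve(q)}</result>"
        elif step.startswith("<answer>"):
            break
    return trace

print(run("Q: what does GRIP do? "))
```

Because the search, its result, and the stopping decision all live in one trace, there is no external controller to train or synchronize with the model.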
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.
By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.
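The one-model, three-role scaffold can be sketched as a thin routing layer: the same model instance answers under different system prompts, and the reviewer role is isolated by construction because it is handed only the code, never the agent's history. The class, method names, and stub model below are hypothetical.

```python
# Sketch of a one-model, three-role scaffold (hypothetical names):
# one small model serves as summarizer, acting agent, and an
# isolated code reviewer that never sees the agent's full history.

def small_model(system, user):
    # Stub for a single locally hosted model; a real version would
    # call one model served on a single 24GB GPU.
    return f"[{system}] response to: {user[:40]}"

class Scaffold:
    def __init__(self):
        self.history = []

    def summarize(self, text):
        return small_model("You are a summarizer.", text)

    def act(self, task):
        ctx = self.summarize(" ".join(self.history)) if self.history else ""
        out = small_model("You are an agent.", ctx + task)
        self.history.append(out)
        return out

    def review(self, code):
        # Isolation: the reviewer sees only the code, not self.history.
        return small_model("You are a strict code reviewer.", code)

s = Scaffold()
s.act("book a flight in AppWorld")
print(s.review("def pay(): charge_card()"))
```

The inference-time trick is that each role adds capability without adding parameters: one checkpoint in memory, three behaviors via prompting and context isolation.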
Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning
Xiaozhe Li, Tianyi Lyu, Yizhao Yang et al.
A small RL-trained ContextCurator learns to trim noisy history while preserving reasoning anchors, boosting long-horizon agents and slashing token use by up to 8x.
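The curation policy's job can be illustrated with a simple heuristic stand-in for the RL-trained model: drop bulky tool output from the history, but always keep designated reasoning anchors and the most recent turns. The anchor prefixes and example history below are invented for illustration.

```python
# Minimal sketch of active context curation (a hand-written heuristic
# standing in for the paper's RL-trained ContextCurator): drop noisy
# history but keep "reasoning anchors" and recent turns.

ANCHORS = ("goal:", "plan:", "constraint:")

def curate(history, keep_recent=2):
    kept = []
    for i, msg in enumerate(history):
        is_anchor = msg.lower().startswith(ANCHORS)
        is_recent = i >= len(history) - keep_recent
        if is_anchor or is_recent:
            kept.append(msg)
    return kept

history = [
    "goal: refactor the billing module",
    "tool output: 400 lines of logs ...",
    "plan: split invoice logic first",
    "tool output: stack trace ...",
    "ok, running tests now",
    "tests passed",
]
print(curate(history))
# keeps the goal/plan anchors plus the two most recent turns
```

An RL-trained curator replaces the fixed prefix list with a learned judgment of which lines are load-bearing, which is where the claimed token savings come from without losing the anchors the agent still reasons over.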
Artem Gadzhiev, Andrew Kislov
Synthius-Mem replaces retrieval-heavy agent memory with structured persona memory, improving both long-term recall and adversarial robustness against invented facts.
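The contrast with retrieval-heavy memory can be sketched as typed slots plus a verification gate: facts live in a structured store, and a claim is only accepted if memory actually supports it, which is what blocks invented facts. The schema and class below are hypothetical, not Synthius-Mem's real design.

```python
# Sketch of structured persona memory (hypothetical schema): instead
# of retrieving over raw chat logs, facts live in typed slots, and
# claims absent from memory are rejected as possibly invented.

class PersonaMemory:
    def __init__(self):
        self.slots = {}                  # e.g. {"hometown": "Lyon"}

    def write(self, slot, value):
        self.slots[slot] = value

    def recall(self, slot):
        return self.slots.get(slot)

    def verify(self, slot, claimed):
        # Adversarial robustness: only accept what memory supports.
        return self.slots.get(slot) == claimed

mem = PersonaMemory()
mem.write("hometown", "Lyon")
print(mem.verify("hometown", "Lyon"))    # → True
print(mem.verify("hometown", "Paris"))   # → False: invented fact
```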
Ningyan Zhu, Huacan Wang, Jie Zhou et al.
SemaClaw frames harness engineering as the real differentiator for personal AI agents, focusing on the infrastructure layer that turns raw models into auditable systems.
Solomon Messing
This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and gaming surface.
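The sensitivity audit can be illustrated with synthetic scores: sweep a grid of prompt wordings, judges, and temperatures, and measure how often the ranking between two models flips across configurations. The score function below is a stand-in with made-up numbers, not real eval data.

```python
# Sketch of an eval-sensitivity audit (synthetic scores): vary prompt
# wording, judge, and temperature, and count how often the ranking of
# two models flips across configurations.

import itertools
import random

def score(model, prompt, judge, temp, rng):
    base = {"model_a": 0.62, "model_b": 0.60}[model]
    noise = rng.gauss(0, 0.02 + 0.05 * temp)  # temp inflates variance
    return base + noise

def flip_rate(seed=0):
    rng = random.Random(seed)
    grid = itertools.product(["p1", "p2", "p3"],       # prompt wordings
                             ["judge_x", "judge_y"],   # judge models
                             [0.0, 0.7, 1.0])          # temperatures
    flips = total = 0
    for prompt, judge, temp in grid:
        a = score("model_a", prompt, judge, temp, rng)
        b = score("model_b", prompt, judge, temp, rng)
        flips += a < b          # the "worse" model wins this config
        total += 1
    return flips / total

print(flip_rate())
```

A budget-aware recipe in this spirit would spend limited eval calls across the axes that move the flip rate most, rather than on more samples of a single configuration.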
Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan
Hodoscope uses unsupervised behavior monitoring to surface novel agent exploits and cut review effort by 6x to 23x, making it a practical safety layer for red teams and benchmark maintainers.
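The unsupervised-monitoring idea can be sketched with toy features: represent each agent trace as an action-count vector, then surface traces far from the population mean so humans review only the flagged few, which is where the review-effort reduction comes from. The feature set and threshold below are invented for illustration.

```python
# Sketch of unsupervised behavior monitoring (toy features standing
# in for the paper's method): embed agent traces as action-count
# vectors and flag statistical outliers for human review.

from collections import Counter
import math

ACTIONS = ["read", "write", "exec", "net"]

def featurize(trace):
    c = Counter(trace)
    return [c[a] for a in ACTIONS]

def outliers(traces, k=2.0):
    feats = [featurize(t) for t in traces]
    mean = [sum(col) / len(feats) for col in zip(*feats)]
    dists = [math.dist(f, mean) for f in feats]
    mu = sum(dists) / len(dists)
    sd = (sum((d - mu) ** 2 for d in dists) / len(dists)) ** 0.5
    return [i for i, d in enumerate(dists) if d > mu + k * sd]

traces = [["read", "write"]] * 20 + [["exec"] * 9 + ["net"] * 6]
print(outliers(traces))   # → [20], the exploit-like trace
```

Because nothing here is trained on known exploits, genuinely novel misbehavior still stands out, as long as it is behaviorally unusual.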
Xing Zhang, Guanghui Wang, Yanwei Cui et al.
A rare large-scale study of CLAUDE.md-style rules finds that negative constraints help coding agents while many positive instructions quietly hurt them.