AI Research Highlights
Tuesday, April 14, 2026
Yijuan Liang, Xinghao Chen, Yifan Ge et al.
A unified tool-use stack spanning 22k tools and 390k examples that standardizes data and evaluation, letting an 8B model beat major commercial models on hard, distractor-heavy tool calling.
Haoran Ding, Zhaoguo Wang, Haibo Chen
This brings Hoare-style reasoning to 143k-line systems by inferring specs from caller intent, surfacing 522 new bugs in already-tested codebases.
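The core move, inferring a function's specification from how its callers use it, can be sketched minimally: treat the properties that (almost) every observed call site guarantees as a candidate precondition, then flag the call sites that break it. Everything below is a toy illustration with hypothetical predicates, not the paper's actual analysis, which operates on real systems code.

```python
# Toy sketch of caller-intent spec inference (hypothetical predicates).
# Infer a candidate precondition from observed call arguments, then
# flag the call sites that violate it as likely bugs.

def infer_precondition(call_args, threshold=0.9):
    """Return predicates holding at >= threshold of observed calls."""
    predicates = {
        "non_null": lambda a: a is not None,
        "non_negative": lambda a: isinstance(a, int) and a >= 0,
    }
    inferred = {}
    for name, pred in predicates.items():
        support = sum(pred(a) for a in call_args) / len(call_args)
        if support >= threshold:
            inferred[name] = pred
    return inferred

def find_violations(call_args, spec):
    """Indices of call sites that break the inferred spec."""
    return [i for i, a in enumerate(call_args)
            if not all(pred(a) for pred in spec.values())]

calls = [3, 7, 0, 12, 5, 9, 4, 8, 2, -1]   # one caller passes -1
spec = infer_precondition(calls)
print(find_violations(calls, spec))         # → [9]
```

The threshold keeps a single buggy caller from erasing the very precondition it violates, which is what lets the analysis surface bugs in already-tested code.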
Xiaomeng Hu, Yinger Zhang, Fei Huang et al.
OccuBench is a 100-scenario benchmark for professional agents across 65 domains that also injects hidden environment faults, exposing how brittle frontier models still are in real work settings.
Jinhua Wang, Biswa Sengupta
This benchmark-driven translation of a production AI coding agent from Rust to Python shows how LLMs can migrate large systems continuously while staying competitive on real agent benchmarks.
CocoaBench Team, Shibo Hao, Zhining Zhang et al.
CocoaBench is a strong reality check for unified digital agents, with long-horizon tasks that force systems to combine vision, search, and coding in one workflow.
Liujie Zhang, Benzhe Ning, Rui Yang et al.
Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.
Lei Xiong, Huaying Yuan, Zheng Liu et al.
PaperScope evaluates agentic deep research across multiple scientific papers, tables, and figures, exposing how hard real multi-document synthesis still is.
Bo Li, Mingda Wang, Gexiang Fang et al.
GRIP turns retrieval into a native decoding action so the model can decide when to search, rewrite queries, and stop inside one reasoning trace instead of bolting on a controller.
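The "retrieval as a decoding action" idea can be sketched with a toy runtime loop: the model is free to emit a search action mid-trace, the runtime resolves it and splices the result back in, and decoding continues until the model emits an answer. The tag names, stub model, and corpus below are invented for illustration; GRIP's actual action interface may differ.

```python
# Toy sketch of retrieval as a native decoding action (names
# hypothetical): the "model" emits either a <search>...</search>
# action or a final answer, all inside one reasoning trace.

CORPUS = {"grip": "GRIP makes retrieval a decoding action."}

def retrieve(query):
    return CORPUS.get(query.lower(), "no result")

def toy_model(trace):
    # Stand-in for an LLM policy: search once, then answer.
    if "<result>" not in trace:
        return "<search>GRIP</search>"
    return "<answer>retrieval happens inside decoding</answer>"

def run(prompt, max_steps=4):
    trace = prompt
    for _ in range(max_steps):
        step = toy_model(trace)
        trace += step
        if step.startswith("<search>"):
            q = step[len("<search>"):-len("</search>")]
            trace += f"<result>{retrieve(q)}</result>"
        elif step.startswith("<answer>"):
            break
    return trace

print(run("Q: what does GRIP do? "))
```

Because the search, its result, and the stopping decision all live in one trace, there is no external controller to train or synchronize with the model.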
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.
By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.
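The one-model, three-role scaffold can be sketched as a thin routing layer: the same model instance answers under different system prompts, and the reviewer role is isolated by construction because it is handed only the code, never the agent's history. The class, method names, and stub model below are hypothetical.

```python
# Sketch of a one-model, three-role scaffold (hypothetical names):
# one small model serves as summarizer, acting agent, and an
# isolated code reviewer that never sees the agent's full history.

def small_model(system, user):
    # Stub for a single locally hosted model; a real version would
    # call one model served on a single 24GB GPU.
    return f"[{system}] response to: {user[:40]}"

class Scaffold:
    def __init__(self):
        self.history = []

    def summarize(self, text):
        return small_model("You are a summarizer.", text)

    def act(self, task):
        ctx = self.summarize(" ".join(self.history)) if self.history else ""
        out = small_model("You are an agent.", ctx + task)
        self.history.append(out)
        return out

    def review(self, code):
        # Isolation: the reviewer sees only the code, not self.history.
        return small_model("You are a strict code reviewer.", code)

s = Scaffold()
s.act("book a flight in AppWorld")
print(s.review("def pay(): charge_card()"))
```

The inference-time trick is that each role adds capability without adding parameters: one checkpoint in memory, three behaviors via prompting and context isolation.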
Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning
Xiaozhe Li, Tianyi Lyu, Yizhao Yang et al.
A small RL-trained ContextCurator learns to trim noisy history while preserving reasoning anchors, boosting long-horizon agents and slashing token use by up to 8x.
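The curation policy's job can be illustrated with a simple heuristic stand-in for the RL-trained model: drop bulky tool output from the history, but always keep designated reasoning anchors and the most recent turns. The anchor prefixes and example history below are invented for illustration.

```python
# Minimal sketch of active context curation (a hand-written heuristic
# standing in for the paper's RL-trained ContextCurator): drop noisy
# history but keep "reasoning anchors" and recent turns.

ANCHORS = ("goal:", "plan:", "constraint:")

def curate(history, keep_recent=2):
    kept = []
    for i, msg in enumerate(history):
        is_anchor = msg.lower().startswith(ANCHORS)
        is_recent = i >= len(history) - keep_recent
        if is_anchor or is_recent:
            kept.append(msg)
    return kept

history = [
    "goal: refactor the billing module",
    "tool output: 400 lines of logs ...",
    "plan: split invoice logic first",
    "tool output: stack trace ...",
    "ok, running tests now",
    "tests passed",
]
print(curate(history))
# keeps the goal/plan anchors plus the two most recent turns
```

An RL-trained curator replaces the fixed prefix list with a learned judgment of which lines are load-bearing, which is where the claimed token savings come from without losing the anchors the agent still reasons over.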
Artem Gadzhiev, Andrew Kislov
Synthius-Mem replaces retrieval-heavy agent memory with structured persona memory, improving both long-term recall and adversarial robustness against invented facts.
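The contrast with retrieval-heavy memory can be sketched as typed slots plus a verification gate: facts live in a structured store, and a claim is only accepted if memory actually supports it, which is what blocks invented facts. The schema and class below are hypothetical, not Synthius-Mem's real design.

```python
# Sketch of structured persona memory (hypothetical schema): instead
# of retrieving over raw chat logs, facts live in typed slots, and
# claims absent from memory are rejected as possibly invented.

class PersonaMemory:
    def __init__(self):
        self.slots = {}                  # e.g. {"hometown": "Lyon"}

    def write(self, slot, value):
        self.slots[slot] = value

    def recall(self, slot):
        return self.slots.get(slot)

    def verify(self, slot, claimed):
        # Adversarial robustness: only accept what memory supports.
        return self.slots.get(slot) == claimed

mem = PersonaMemory()
mem.write("hometown", "Lyon")
print(mem.verify("hometown", "Lyon"))    # → True
print(mem.verify("hometown", "Paris"))   # → False: invented fact
```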
Ningyan Zhu, Huacan Wang, Jie Zhou et al.
SemaClaw frames harness engineering as the real differentiator for personal AI agents, focusing on the infrastructure layer that turns raw models into auditable systems.
Solomon Messing
This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and gaming surface.
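The sensitivity audit can be illustrated with synthetic scores: sweep a grid of prompt wordings, judges, and temperatures, and measure how often the ranking between two models flips across configurations. The score function below is a stand-in with made-up numbers, not real eval data.

```python
# Sketch of an eval-sensitivity audit (synthetic scores): vary prompt
# wording, judge, and temperature, and count how often the ranking of
# two models flips across configurations.

import itertools
import random

def score(model, prompt, judge, temp, rng):
    base = {"model_a": 0.62, "model_b": 0.60}[model]
    noise = rng.gauss(0, 0.02 + 0.05 * temp)  # temp inflates variance
    return base + noise

def flip_rate(seed=0):
    rng = random.Random(seed)
    grid = itertools.product(["p1", "p2", "p3"],       # prompt wordings
                             ["judge_x", "judge_y"],   # judge models
                             [0.0, 0.7, 1.0])          # temperatures
    flips = total = 0
    for prompt, judge, temp in grid:
        a = score("model_a", prompt, judge, temp, rng)
        b = score("model_b", prompt, judge, temp, rng)
        flips += a < b          # the "worse" model wins this config
        total += 1
    return flips / total

print(flip_rate())
```

A budget-aware recipe in this spirit would spend limited eval calls across the axes that move the flip rate most, rather than on more samples of a single configuration.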
Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan
Hodoscope uses unsupervised behavior monitoring to surface novel agent exploits and cut review effort by 6x to 23x, making it a practical safety layer for red teams and benchmark maintainers.
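The unsupervised-monitoring idea can be sketched with toy features: represent each agent trace as an action-count vector, then surface traces far from the population mean so humans review only the flagged few, which is where the review-effort reduction comes from. The feature set and threshold below are invented for illustration.

```python
# Sketch of unsupervised behavior monitoring (toy features standing
# in for the paper's method): embed agent traces as action-count
# vectors and flag statistical outliers for human review.

from collections import Counter
import math

ACTIONS = ["read", "write", "exec", "net"]

def featurize(trace):
    c = Counter(trace)
    return [c[a] for a in ACTIONS]

def outliers(traces, k=2.0):
    feats = [featurize(t) for t in traces]
    mean = [sum(col) / len(feats) for col in zip(*feats)]
    dists = [math.dist(f, mean) for f in feats]
    mu = sum(dists) / len(dists)
    sd = (sum((d - mu) ** 2 for d in dists) / len(dists)) ** 0.5
    return [i for i, d in enumerate(dists) if d > mu + k * sd]

traces = [["read", "write"]] * 20 + [["exec"] * 9 + ["net"] * 6]
print(outliers(traces))   # → [20], the exploit-like trace
```

Because nothing here is trained on known exploits, genuinely novel misbehavior still stands out, as long as it is behaviorally unusual.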
Xing Zhang, Guanghui Wang, Yanwei Cui et al.
A rare large-scale study of CLAUDE.md-style rules finds that negative constraints help coding agents while many positive instructions quietly hurt them.