Latest AI Research Highlights
Tuesday, April 14, 2026
Scores indicate our recommendation strength, not a quality judgment.
Landmark guides for long-term learning
Daily picks tell you what matters now. Landmark guides tell you which papers built a field over time. Start with the guide closest to what you want to build, then come back to the daily feed with better judgment.
Landmark Guide
12 Papers That Built Modern LLMs
A beginner-friendly map of the papers that shaped transformers, scaling, alignment, and the open-source LLM era.
Landmark Guide
12 Papers That Shaped Modern AI Agents
A beginner-friendly map of the ideas behind tool use, planning, memory, multi-agent workflows, and software agents.
Landmark Guide
12 Papers That Shaped Modern RAG
A beginner-friendly map of the ideas behind dense retrieval, retrieval-augmented generation, self-correction, and structured retrieval.
Landmark Guide
12 Papers That Shaped Modern Computer Use Agents
A beginner-friendly map of the papers behind web agents, GUI grounding, smartphone control, and full computer-use benchmarks.
Landmark Guide
12 Papers That Shaped Modern AI Coding Agents
A beginner-friendly map of code LLMs, repo grounding, software-engineering benchmarks, and modern SWE agents.
Yijuan Liang, Xinghao Chen, Yifan Ge et al.
A unified 22k-tool, 390k-example tool-use stack that standardizes data and evaluation, letting an 8B model beat major commercial models on hard, distractor-heavy tool-calling tasks.
Haoran Ding, Zhaoguo Wang, Haibo Chen
This brings Hoare-style reasoning to 143k-line systems by inferring specs from caller intent, surfacing 522 new bugs in already-tested codebases.
Xiaomeng Hu, Yinger Zhang, Fei Huang et al.
OccuBench is a 100-scenario benchmark for professional agents across 65 domains that also injects hidden environment faults, exposing how brittle frontier models still are in real work settings.
Jinhua Wang, Biswa Sengupta
This benchmark-driven translation of a production AI coding agent from Rust to Python shows how LLMs can migrate large systems continuously while staying competitive on real agent benchmarks.
CocoaBench Team, Shibo Hao, Zhining Zhang et al.
CocoaBench is a strong reality check for unified digital agents, with long-horizon tasks that force systems to combine vision, search, and coding in one workflow.
Liujie Zhang, Benzhe Ning, Rui Yang et al.
Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.
Lei Xiong, Huaying Yuan, Zheng Liu et al.
PaperScope evaluates agentic deep research across multiple scientific papers, tables, and figures, exposing how hard real multi-document synthesis still is.
Bo Li, Mingda Wang, Gexiang Fang et al.
GRIP turns retrieval into a native decoding action so the model can decide when to search, rewrite queries, and stop inside one reasoning trace instead of bolting on a controller.
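The core idea, retrieval as a native decoding action rather than an external controller, can be sketched as a generation loop. This is an illustrative toy, not GRIP's actual interface: the token names (`<search>`, `<stop>`), the `fake_retriever`, and the `model_step` callable are all assumptions for the sake of the example.

```python
SEARCH, STOP = "<search>", "<stop>"

def fake_retriever(query: str) -> str:
    # Stand-in for a real search backend.
    return f"[doc about {query}]"

def decode_with_retrieval(model_step, prompt: str, max_steps: int = 10) -> str:
    """Let the model itself decide when to search and when to stop."""
    trace = prompt
    for _ in range(max_steps):
        action = model_step(trace)           # model emits next segment/action
        if action == STOP:
            break
        if action.startswith(SEARCH):
            # Model rewrote the query itself; splice results into the trace.
            query = action[len(SEARCH):].strip()
            trace += f"\n{SEARCH} {query}\n{fake_retriever(query)}\n"
        else:
            trace += action                  # ordinary reasoning text
    return trace
```

The point of the sketch is that search, query rewriting, and stopping all live inside one decoding trace, with no separate orchestration layer deciding when to retrieve.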
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.
By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.
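The single-model, multi-role pattern can be sketched in a few lines. Role names, prompts, and the acceptance check below are assumptions for illustration, not the paper's actual scaffold; the one property taken from the summary is that the reviewer is isolated, seeing only the proposed code and never the agent's working history.

```python
def scaffold(llm, task: str, history: list[str]) -> str:
    """One small model called in three roles: summarizer, agent, reviewer."""
    summary = llm(role="summarizer",
                  prompt="Condense this history:\n" + "\n".join(history))
    proposal = llm(role="agent",
                   prompt=f"Task: {task}\nContext: {summary}\nWrite code.")
    # Isolated reviewer: its prompt contains only the code, not the history.
    verdict = llm(role="reviewer",
                  prompt=f"Review this code for bugs:\n{proposal}")
    if "OK" in verdict:
        return proposal
    return llm(role="agent",
               prompt=f"Task: {task}\nFix per review: {verdict}\n"
                      f"Code:\n{proposal}")
```

Because all three roles share one set of weights, the whole loop fits on a single GPU; the scaffold spends extra inference-time calls instead of extra parameters.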
Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning
Xiaozhe Li, Tianyi Lyu, Yizhao Yang et al.
A small RL-trained ContextCurator learns to trim noisy history while preserving reasoning anchors, boosting long-horizon agents and slashing token use by up to 8x.
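The curation interface the summary describes can be sketched as a filter over conversation turns. The scoring function here is a trivial stand-in (the paper trains the scorer with RL), and the `anchor` flag is an assumed representation of "reasoning anchors":

```python
def curate(history, score, keep_anchors=True, threshold=0.5):
    """Keep high-scoring turns; always preserve pinned anchor turns."""
    kept = []
    for turn in history:
        if keep_anchors and turn.get("anchor"):
            kept.append(turn)        # reasoning anchors survive trimming
        elif score(turn) >= threshold:
            kept.append(turn)        # curator judged this turn useful
    return kept
```

Everything that fails the check is dropped before the next model call, which is where the token savings come from: the agent re-reads a curated history instead of the full transcript at every step.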
Artem Gadzhiev, Andrew Kislov
Synthius-Mem replaces retrieval-heavy agent memory with structured persona memory, improving both long-term recall and adversarial robustness against invented facts.
Ningyan Zhu, Huacan Wang, Jie Zhou et al.
SemaClaw frames harness engineering as the real differentiator for personal AI agents, focusing on the infrastructure layer that turns raw models into auditable systems.
Solomon Messing
This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and the surface for metric gaming.
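The sensitivity problem is easy to measure directly. This is a generic illustration, not the paper's recipe: run the same eval over every combination of prompt template, judge, and temperature, and report how far the score moves.

```python
import itertools
import statistics

def eval_sensitivity(run_eval, prompts, judges, temps):
    """Score every (prompt, judge, temperature) combination and report
    the mean and the max-min spread of the results."""
    scores = [run_eval(p, j, t)
              for p, j, t in itertools.product(prompts, judges, temps)]
    return {"mean": statistics.mean(scores),
            "spread": max(scores) - min(scores)}
```

A spread comparable to the gap between two models on a leaderboard is a sign the reported ranking is an artifact of eval configuration rather than model quality.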
Ziqian Zhong, Shashwat Saxena, Aditi Raghunathan
Hodoscope uses unsupervised behavior monitoring to surface novel agent exploits and cut review effort by 6x to 23x, making it a practical safety layer for red teams and benchmark maintainers.
Xing Zhang, Guanghui Wang, Yanwei Cui et al.
A rare large-scale study of CLAUDE.md-style rules finds that negative constraints help coding agents while many positive instructions quietly hurt them.