Landmark Guide

12 Papers That Shaped Modern AI Agents

A beginner-friendly map of the ideas behind tool use, planning, memory, multi-agent workflows, and software agents.

If you are trying to understand why AI agents suddenly became a real product category, this is the shortest serious reading path we know. We are not trying to cover the full history of symbolic agents or reinforcement learning here. This page focuses on the modern LLM-agent stack: systems that reason in language, call tools, interact with environments, and keep working across multi-step tasks.

Timeline: 2021–2024 · Core papers: 12 · Builder bonus: 5

If you only read 3 papers, start here

Most readers do not need every benchmark or framework on day one. If you want the shortest path to modern AI agents, start with the agent loop, tool use, and one paper that shows what real task execution looks like.

The shortest useful history of AI agents

Modern AI agents did not start with humanoid robots. They emerged when language models stopped being answer engines and started browsing, calling tools, reflecting on failures, and operating inside real environments.

From Answers to Actions

Agents became interesting once language models started browsing, choosing actions, and coordinating with external tools instead of only producing text.

Tools, Plans, and Feedback Loops

The field learned that better agents need tool use, explicit search, and mechanisms for correcting themselves after bad actions.

Memory and Long-Horizon Behavior

Researchers pushed agents past single-turn prompting into persistent worlds, skill accumulation, and richer notions of memory.

Teams, Environments, and Real Work

Multi-agent systems, realistic benchmarks, and software engineering tasks turned the agent story into something builders could actually measure.

The 12 core papers

These are the papers we would keep if we had to explain modern AI agents to a smart newcomer in one page. They are the papers that changed what people built next.

From Answers to Actions

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano et al.

2021 · Must read · Intermediate

A language model becomes more useful when it can browse the web, quote sources, and get feedback on its actions.

Why it matters

WebGPT is one of the clearest early signals that browsing itself could be part of the product, not just a hidden implementation detail behind a model.

What changed after this

Browser use, citation-first QA, and web-task agents became much easier to imagine as real product categories.

Who should read

Anyone who wants the bridge from closed-book LLMs to web-using agents.

From Answers to Actions

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao et al.

2022 · Must read · Beginner

Reasoning traces become much more useful when the model can act, observe results, and keep going.

Why it matters

ReAct gave the field a simple, memorable loop for modern agents: think, act, observe, then repeat.

What changed after this

A huge amount of agent work inherited the ReAct template, directly or indirectly, from browser agents to coding agents.

Who should read

Everyone. This is the best first paper for understanding how modern agent loops are structured.
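The think-act-observe loop is easy to see in code. Below is a minimal sketch of a ReAct-style loop, assuming a placeholder `call_llm` and a toy tool registry; none of these names come from the paper or any real framework.

```python
def call_llm(history):
    # Stand-in for a real model call; here it just answers immediately.
    return "Thought: I know the answer.\nFinal Answer: 42"

TOOLS = {"search": lambda query: f"results for {query!r}"}

def react_loop(question, max_steps=5):
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        output = call_llm(history)            # think
        history.append(output)
        if "Final Answer:" in output:         # done
            return output.split("Final Answer:")[1].strip()
        if "Action:" in output:               # act
            tool, _, arg = output.split("Action:")[1].strip().partition(" ")
            observation = TOOLS[tool](arg)    # observe
            history.append(f"Observation: {observation}")
    return None
```

In a real system the model output would alternate between thoughts, tool actions, and observations for several turns; the loop structure stays exactly this simple.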

From Answers to Actions

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick et al.

2023 · Must read · Intermediate

Tool use is not just orchestration glue; it can be part of the model’s learned behavior.

Why it matters

Toolformer made tool invocation feel central to the model stack and helped move agents beyond the idea of a purely text-only LLM.

What changed after this

Function calling, retrieval calls, calculator use, and API-triggered workflows became easier to reason about as first-class design patterns.

Who should read

Builders who want to understand why tool use sits at the center of modern agent systems.
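The pattern Toolformer learns end to end can be sketched in a few lines: the model emits a structured call, the scaffold executes it, and the result is spliced back into the generation. This is an illustrative shape only; the tool names and JSON format here are assumptions, not the paper's actual interface.

```python
import json

TOOLS = {
    # A single toy tool; eval is sandboxed to arithmetic-only for the demo.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_tool_call(model_output):
    # Expect something like: {"tool": "calculator", "input": "2+2"}
    call = json.loads(model_output)
    return TOOLS[call["tool"]](call["input"])
```

Modern function-calling APIs are essentially this dispatch step, standardized: the model names a tool and arguments, and the scaffold runs it and returns the observation.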

Tools, Plans, and Feedback Loops

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil et al.

2023 · Should know · Intermediate

Agents get more valuable when they can reliably choose from large tool catalogs instead of a few hand-wired calls.

Why it matters

Gorilla pushed the tool-use story toward realistic API ecosystems, which is much closer to how production agent systems actually behave.

What changed after this

API schema grounding, tool selection quality, and execution reliability became major concerns for builder-facing agent stacks.

Who should read

Builders working on assistants that need to call real services, not toy tools.

Tools, Plans, and Feedback Loops

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Shunyu Yao et al.

2023 · Must read · Intermediate

A stronger agent sometimes needs to search over multiple candidate plans instead of committing to the first thought.

Why it matters

Tree of Thoughts made planning and branching feel explicit, showing that better agents may need search, not just longer prompts.

What changed after this

Planning modules, deliberation loops, and multi-trajectory search became standard ideas in agent design.

Who should read

Anyone building agents for tasks where one bad early step can ruin the whole run.
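The core idea is a beam search over partial plans rather than a single greedy rollout. Here is a toy sketch under stated assumptions: `propose` stands in for the model suggesting candidate next thoughts, and `score` for the model rating a partial plan; both are placeholder heuristics, not the paper's code.

```python
def propose(state):
    # Stand-in for the model proposing candidate next thoughts.
    return [state + [step] for step in ("a", "b")]

def score(state):
    # Stand-in for the model rating how promising a partial plan looks.
    return state.count("a")

def tree_of_thoughts(depth=3, beam=2):
    frontier = [[]]                      # start from the empty plan
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        # Keep only the `beam` best partial plans instead of committing early.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]
```

The payoff of the branching is exactly the point at L157 above: a bad first thought can be pruned instead of dooming the whole run.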

Tools, Plans, and Feedback Loops

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn et al.

2023 · Must read · Intermediate

Agents can improve by writing down lessons from failure instead of only changing weights.

Why it matters

Reflexion captured one of the most practical agent insights: many improvements come from better memory and critique loops, not retraining.

What changed after this

Self-critique, retry policies, and verbal feedback buffers became common patterns in agent scaffolding.

Who should read

Builders who care about agent reliability over repeated trials.
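The Reflexion pattern is a retry loop with a verbal feedback buffer. This sketch assumes placeholder `attempt` and `critique` functions standing in for the agent run and the self-critique step; only the loop shape reflects the paper.

```python
def attempt(task, memory):
    # Stand-in for an agent run; here it succeeds once a lesson exists.
    return "correct" if memory else "wrong"

def critique(task, result):
    # Stand-in for self-critique that turns a failure into a written lesson.
    return f"Last attempt on {task!r} returned {result!r}; avoid that path."

def reflexion(task, evaluate, max_trials=3):
    memory = []                          # verbal lessons, not weight updates
    for _ in range(max_trials):
        result = attempt(task, memory)
        if evaluate(result):
            return result, memory
        memory.append(critique(task, result))  # learn from the failure
    return None, memory
```

The key design choice: improvement lives in the memory that is fed back into the next attempt, so no retraining is needed between trials.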

Memory and Long-Horizon Behavior

Generative Agents: Interactive Simulacra of Human Behavior

Joon Sung Park et al.

2023 · Must read · Beginner

Persistent memory and planning can make agents feel less like chat turns and more like continuing entities.

Why it matters

Generative Agents shaped how people talk about memory, reflection, and believable long-horizon behavior in LLM-driven systems.

What changed after this

Memory architectures, reflection summaries, and simulation-style agent environments became mainstream design patterns.

Who should read

PMs, researchers, and builders who want the cleanest paper on agent memory and behavior loops.

Memory and Long-Horizon Behavior

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang et al.

2023 · Should know · Intermediate

Long-horizon agents get stronger when they can accumulate skills instead of starting from zero each time.

Why it matters

Voyager turned agents into something closer to a growing skill library and showed why iterative learning matters for hard environments.

What changed after this

Skill memory, reusable tool trajectories, and long-horizon task decomposition became more concrete parts of agent engineering.

Who should read

Builders interested in persistent agents rather than one-shot task completion.

Teams, Environments, and Real Work

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Guohao Li et al.

2023 · Must read · Intermediate

Multiple role-based agents can cooperate by talking to each other instead of packing every behavior into one prompt.

Why it matters

CAMEL helped popularize the modern multi-agent framing and gave builders a simple story for decomposing roles across collaborating agents.

What changed after this

Role prompts, agent societies, and multi-agent decomposition became a major branch of the field.

Who should read

Readers who want to understand why multi-agent systems took off so quickly in 2023.

Teams, Environments, and Real Work

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu et al.

2023 · Must read · Beginner

Multi-agent systems became much easier to build once conversation itself became the programming model.

Why it matters

AutoGen translated multi-agent ideas into a concrete developer framework, which is one reason the agent wave became so implementation-heavy so quickly.

What changed after this

Agent orchestration libraries, conversation graphs, and role-specialized pipelines became mainstream builder patterns.

Who should read

Builders who want the bridge from research papers to practical agent framework design.

Teams, Environments, and Real Work

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou et al.

2023 · Must read · Intermediate

Agent progress only becomes believable when you test agents in realistic environments instead of cherry-picked demos.

Why it matters

WebArena gave the field a concrete environment for evaluating web agents on tasks that look closer to real user workflows.

What changed after this

Benchmarking shifted toward browser environments, reproducible tasks, and less hand-wavy claims about autonomy.

Who should read

Anyone who wants to judge agent quality with more than anecdotes.

Teams, Environments, and Real Work

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang et al.

2024 · Must read · Beginner

AI agents started feeling commercially real once they could operate inside valuable workflows like software engineering.

Why it matters

SWE-agent is one of the clearest examples of agent ideas turning into a concrete, high-value task domain with measurable outcomes.

What changed after this

Coding agents, task-specific agent tooling, and interface design for agent workspaces all accelerated rapidly.

Who should read

Everyone who wants to understand why the market suddenly cares so much about agents.

5 builder bonus papers

The core 12 explain the shape of the field. These bonus papers explain routing, grounding, orchestration, and evaluation patterns that serious builders keep rediscovering.

Builder Bonus

PAL: Program-aided Language Models

Luyu Gao et al.

2022 · Must read · Intermediate

Sometimes the right move is not more reasoning text but handing subproblems to an external runtime.

Why it matters

PAL is a crisp example of why tool execution matters: the model delegates the exact part it should not fake.

What changed after this

Program execution, calculator calls, and code-assisted reasoning became natural ingredients in agent systems.

Who should read

Builders who want to understand delegation as an engineering pattern.
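The delegation pattern is small enough to show directly: the model writes a short program for the exact arithmetic, and a Python runtime executes it. The model call is faked here with a hardcoded snippet; only the delegation shape reflects PAL.

```python
def fake_model(question):
    # Stand-in: in a real PAL prompt, the LLM would emit this code itself.
    return "answer = (23 * 17) + 5"

def program_aided_answer(question):
    code = fake_model(question)
    namespace = {}
    exec(code, namespace)        # the runtime does the math, not the model
    return namespace["answer"]
```

The model never guesses at the arithmetic token by token; it hands the subproblem it should not fake to an interpreter that cannot get it wrong.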

Builder Bonus

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen et al.

2023 · Should know · Beginner

A strong general agent can act like a controller that routes work to specialized models.

Why it matters

HuggingGPT gave builders a vivid orchestration picture: one LLM as planner, many specialist models as executors.

What changed after this

Model routing, tool registries, and planner-executor splits became easier to explain and prototype.

Who should read

Readers who care about orchestrating heterogeneous tools and models.

Builder Bonus

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan)

Michael Ahn et al.

2022 · Reference · Advanced

Language planning gets much more useful when the agent also knows what actions are actually possible.

Why it matters

SayCan is a valuable reminder that real agents need grounded action choices, not just plausible language plans.

What changed after this

Feasibility-aware planning and environment-grounded action selection became more explicit parts of the agent conversation.

Who should read

Builders who want to connect LLM planning with real-world constraints.

Builder Bonus

AgentBench: Evaluating LLMs as Agents

Xiao Liu et al.

2023 · Must read · Intermediate

If you want better agents, you need benchmarks that test action, memory, and adaptation instead of only next-token quality.

Why it matters

AgentBench helped push the conversation from impressive demos toward broader, more systematic evaluation of agent behavior.

What changed after this

The field became more serious about measuring agent performance across environments rather than relying on isolated examples.

Who should read

Anyone building agent infrastructure, evals, or product claims.

Builder Bonus

MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework

Sirui Hong et al.

2023 · Should know · Beginner

Multi-agent workflows get more legible when you treat roles, artifacts, and handoffs as an explicit process.

Why it matters

MetaGPT made multi-agent collaboration feel closer to a software production workflow than an open-ended chat experiment.

What changed after this

Artifact-driven multi-agent design, role specialization, and structured handoffs became more common in applied agent stacks.

Who should read

Builders interested in agent teams for real business workflows.

How to read this page

A beginner, a research engineer, and a product builder should not read the same papers in the same order. Pick the path that matches the kind of system you want to build.

Where to go after AI agents

Once you understand the agent loop, the next question is usually what the agent sits on top of: stronger base models, better retrieval, or richer computer-use environments.

FAQ

Answers for the questions readers are most likely to search before or after landing on this page.

What is the most important AI agent paper for beginners?

If you only read one paper, start with ReAct. It explains the think-act-observe loop that sits underneath a huge amount of modern agent work.

Are AI agents just tool-calling LLMs?

Not quite. Tool calling is part of the stack, but modern agents usually add state, retries, planning, memory, or environment interaction on top of simple tool invocation.

Why is SWE-agent on this list if it is about coding?

Because it shows why the agent category matters commercially. Coding is a valuable workflow with clear feedback loops, which makes it one of the best places to see agent ideas turn into real products.

What should I read after this page?

If you need stronger foundations, go to the LLM guide. If you care about grounded answers, go to RAG. If you care about richer action spaces, keep an eye on computer-use and multimodal agents.

Want the daily version of this judgment?

This guide explains the long arc. Our daily feed explains what matters now.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted.