Landmark Guide

12 Papers That Shaped Modern AI Coding Agents

A beginner-friendly map of code LLMs, repo grounding, software-engineering benchmarks, and modern SWE agents.

If you are trying to understand why coding agents suddenly became one of the most commercially important AI categories, this is the shortest serious reading path we know. We are not trying to cover the full history of program synthesis or software engineering automation here. This page focuses on the modern LLM coding-agent stack: code-native language models, repository context, issue-resolution benchmarks, and agents that can operate inside real software workflows.

Timeline: 2021–2024 · Core papers: 12 · Builder bonus: 4

If you only read 3 papers, start here

Most readers do not need every code benchmark or agent variant on day one. If you want the shortest path to modern coding agents, start with the first breakout code model, the benchmark that made real-world SWE measurable, and the agent paper that turned the category into a product story.

The shortest useful history of AI coding agents

Modern coding agents did not emerge from one code model. They emerged when code LLMs became useful, repository context became tractable, and issue-resolution benchmarks turned software engineering into a serious agent task.

Code LLMs Become Usable

The first wave was about showing that large language models could actually write code, solve challenges, and support serious programming tasks.

Repositories Become Context

Snippet-level generation was not enough. The field had to learn how to retrieve, rank, and reason over repository-scale context.

Real-World SWE Becomes Benchmarkable

Coding agents became a real field once issue resolution and repo-level tasks turned into public benchmarks instead of cherry-picked demos.

Agents Move Beyond Autocomplete

The latest shift is from code generation toward end-to-end software engineering: navigating repos, choosing actions, and validating fixes.

The 12 core papers

These are the papers we would keep if we had to explain modern AI coding agents to a smart newcomer in one page. They are the papers that changed what people built next.

Code LLMs Become Usable

Evaluating Large Language Models Trained on Code

Mark Chen et al.

2021 · Must read · Beginner

Code-specialized LLMs became useful enough to change how developers think about programming assistance.

Why it matters

This is the Codex paper, the cleanest starting point for understanding why code LLMs exploded into real developer workflows.

What changed after this

Autocomplete, prompt-based coding help, and the entire modern coding-assistant wave accelerated from here.

Who should read

Everyone. This is the base of the whole modern coding-agent story.

Code LLMs Become Usable

Measuring Coding Challenge Competence With APPS

Dan Hendrycks et al.

2021 · Should know · Beginner

You cannot improve coding models seriously without harder problem-solving benchmarks than toy next-token metrics.

Why it matters

APPS gave the field a much more realistic benchmark for multi-step code problem solving and exposed how far models still had to go.

What changed after this

Code-model evaluation got harder, and competitive problem-solving became a serious proving ground for code LLMs.

Who should read

Readers who want to understand how the field measured progress before repo-level agents took over.

Code LLMs Become Usable

Competitive programming with AlphaCode

Yujia Li et al.

2022 · Must read · Intermediate

Large models can tackle hard coding problems when generation, filtering, and search are combined carefully.

Why it matters

AlphaCode pushed code generation far beyond autocomplete and helped show how search and ranking matter in coding systems.

What changed after this

The field started taking code-generation search pipelines, execution filtering, and harder coding tasks more seriously.

Who should read

Anyone who wants to understand why coding systems are not just “generate one answer and pray.”
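The "generate, filter, and search" recipe described above can be sketched in a few lines. Everything here is an illustrative stand-in, not AlphaCode's actual system: `generate_candidates` fakes large-scale model sampling with three hard-coded behaviors, and the clustering step simply groups candidates by their behavior on probe inputs.

```python
import random
from collections import defaultdict

def generate_candidates(n):
    """Hypothetical stand-in for sampling many programs from a code model."""
    # Three behaviors: a correct doubling program, an off-by-one, and a crash.
    pool = [lambda x: 2 * x, lambda x: 2 * x + 1, lambda x: x / 0]
    return [random.choice(pool) for _ in range(n)]

def passes_examples(fn, examples):
    """Filter: keep only candidates that reproduce the public I/O examples."""
    try:
        return all(fn(x) == y for x, y in examples)
    except Exception:
        return False

def select_by_clustering(candidates, probe_inputs):
    """Cluster survivors by behavior on probe inputs and pick the largest
    cluster, on the assumption that correct programs tend to agree."""
    clusters = defaultdict(list)
    for fn in candidates:
        try:
            signature = tuple(fn(x) for x in probe_inputs)
        except Exception:
            continue
        clusters[signature].append(fn)
    return max(clusters.values(), key=len)[0]

random.seed(0)
examples = [(1, 2), (3, 6)]  # public I/O examples from the problem statement
survivors = [f for f in generate_candidates(200) if passes_examples(f, examples)]
best = select_by_clustering(survivors, probe_inputs=[5, 10])
print(best(7))  # the doubling behavior survives filtering
```

The point of the sketch is the shape of the pipeline: sample far more programs than you keep, let execution kill the wrong ones, and break ties by behavioral agreement rather than by trusting any single sample.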

Code LLMs Become Usable

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Yue Wang et al.

2023 · Should know · Beginner

Open code LLMs made serious code understanding and generation accessible beyond a few closed labs.

Why it matters

CodeT5+ matters because it helped widen the coding-model ecosystem and gave builders stronger open foundations for software tasks.

What changed after this

Open code models, fine-tuning, and broader experimentation on code tasks became easier for more teams.

Who should read

Builders who care about open code-model foundations instead of only proprietary assistants.

Repositories Become Context

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

Fengji Zhang et al.

2023 · Must read · Intermediate

Coding quality jumps when the model can iteratively retrieve relevant repository context instead of guessing from one file.

Why it matters

RepoCoder is a key bridge from snippet-level code LLMs to repo-aware systems that understand dependency and context.

What changed after this

Repository retrieval, iterative context gathering, and codebase-aware prompting became central patterns in coding agents.

Who should read

Anyone who wants to understand why repo context is not optional for serious coding agents.
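The iterate-retrieve-generate idea can be shown in a minimal sketch. This is not RepoCoder's actual pipeline: the toy repo, the keyword retriever, and `fake_model` (a stand-in for an LLM completion call) are all hypothetical; the point is that the model's own draft enriches the retrieval query on the next round.

```python
# Toy repository: file path -> contents.
REPO = {
    "utils/auth.py": "def hash_password(pw): ...",
    "utils/tokens.py": "def issue_token(user): ...",
    "db/users.py": "def find_user(name): ...",
}

def retrieve(query, k=2):
    """Rank files by naive keyword overlap with the query."""
    words = set(query.lower().split())
    def score(path, text):
        doc = set((path + " " + text).lower().replace("/", " ").replace(".", " ").split())
        return len(words & doc)
    ranked = sorted(REPO.items(), key=lambda kv: score(*kv), reverse=True)
    return ranked[:k]

def fake_model(prompt):
    """Hypothetical stand-in for an LLM completion call."""
    return "def login(name, pw): user = find_user(name); return issue_token(user)"

def iterative_complete(task, rounds=2):
    query, draft = task, ""
    for _ in range(rounds):
        context = retrieve(query)           # retrieve with the current query
        prompt = f"{context}\n# task: {task}\n{draft}"
        draft = fake_model(prompt)          # generate a draft completion
        query = task + " " + draft          # the draft enriches the next query
    return draft

print(iterative_complete("implement login"))
```

On the first round the retriever only sees the task description; on later rounds the draft's identifiers (`find_user`, `issue_token`) pull in the files that actually define them.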

Repositories Become Context

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu et al.

2023 · Must read · Intermediate

Repo-level coding systems need repo-level benchmarks, not just single-file completion scores.

Why it matters

RepoBench made repository-scale understanding a first-class evaluation target, which is crucial for coding agents.

What changed after this

Teams started benchmarking code systems against multi-file and repository tasks instead of only local completions.

Who should read

Builders and researchers working on repo-aware tooling rather than standalone code snippets.

Real-World SWE Becomes Benchmarkable

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez et al.

2024 · Must read · Beginner

Coding agents only became a serious field once real GitHub issues became a benchmark instead of a demo source.

Why it matters

SWE-bench changed the conversation from “can models write code?” to “can systems actually resolve real repository issues?”

What changed after this

Issue-resolution benchmarks became the main proving ground for modern SWE agents.

Who should read

Everyone following coding agents. This is the benchmark that redefined the category.
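The evaluation idea behind issue-resolution benchmarks fits in a few lines: a fix counts only if applying the system's patch makes a previously failing test pass. The in-memory "repo" and string-replace patch format below are illustrative stand-ins, not the benchmark's real harness.

```python
# Buggy toy repo: area() adds instead of multiplies.
repo = {"calc.py": "def area(w, h):\n    return w + h\n"}

def apply_patch(repo, path, old, new):
    """Apply a patch by substring replacement (toy stand-in for a real diff)."""
    patched = dict(repo)
    patched[path] = patched[path].replace(old, new)
    return patched

def fail_to_pass_test(repo):
    """The issue's reproduction test: fails before the fix, passes after."""
    ns = {}
    exec(repo["calc.py"], ns)
    return ns["area"](3, 4) == 12

model_patch = ("calc.py", "return w + h", "return w * h")
resolved = (not fail_to_pass_test(repo)) and fail_to_pass_test(apply_patch(repo, *model_patch))
print("resolved" if resolved else "unresolved")
```

The double check matters: verifying the test fails *before* the patch is what stops trivially passing tests from inflating scores.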

Agents Move Beyond Autocomplete

AutoCodeRover: Autonomous Program Improvement

Yuntong Zhang et al.

2024 · Must read · Intermediate

A coding agent needs more than generation: it needs search, navigation, and repair behavior over real codebases.

Why it matters

AutoCodeRover is one of the clearest early systems papers on autonomous program improvement over real software tasks.

What changed after this

More teams started thinking in terms of end-to-end issue resolution rather than single-shot patch generation.

Who should read

Researchers and builders who want the first serious picture of repo navigation plus repair.

Agents Move Beyond Autocomplete

Executable Code Actions Elicit Better LLM Agents

Xingyao Wang et al.

2024 · Must read · Intermediate

Coding agents get stronger when code execution is part of the action space, not just the output format.

Why it matters

This paper is a strong articulation of why executable actions matter for robust coding agents and agent loops more broadly.

What changed after this

Action spaces, execution feedback, and tool-mediated code interaction became even more central to coding-agent design.

Who should read

Anyone who wants to understand why coding agents are a natural home for tool-using LLMs.
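The core loop is small enough to sketch: the agent's action *is* executable code, and whatever execution produces (output or a traceback) becomes the next observation. The scripted `policy` below is a hypothetical stand-in for an LLM; the first action contains a deliberate bug so the repair step has something to react to.

```python
import contextlib
import io
import traceback

def execute(code, env):
    """Run an action in a shared namespace; return its output or the error."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
        return buf.getvalue() or "(ok)"
    except Exception:
        return traceback.format_exc(limit=1)

def policy(observation, step):
    """Hypothetical agent: first action has a type bug, second repairs it."""
    if step == 0:
        return "total = sum([1, 2, '3'])"             # fails: int + str
    return "total = sum([1, 2, int('3')]); print(total)"

env, obs = {}, "solve: add 1, 2, '3'"
for step in range(2):
    action = policy(obs, step)
    obs = execute(action, env)   # the observation is execution feedback
print(obs.strip())
```

Because state lives in the shared `env` namespace, later actions can build on earlier ones, which is what makes code a richer action space than fixed tool schemas.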

Agents Move Beyond Autocomplete

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia et al.

2024 · Must read · Beginner

More agent complexity is not automatically better; simpler pipelines can compete surprisingly well on real SWE tasks.

Why it matters

Agentless is important because it forces the field to ask whether “agentic” complexity is always necessary for software tasks.

What changed after this

The community got more serious about ablations, simplification, and when multi-step agents are actually worth the overhead.

Who should read

Builders who want to distinguish necessary agent structure from fashion-driven complexity.

Agents Move Beyond Autocomplete

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang et al.

2024 · Must read · Beginner

Coding agents became commercially legible once they could operate tooling and repositories through purpose-built interfaces.

Why it matters

SWE-agent is one of the clearest papers showing how coding agents become real systems, not just code generators.

What changed after this

The industry shifted hard toward repo agents, issue agents, and engineering workflows, making coding one of the first major agent businesses.

Who should read

Everyone who wants to understand why coding agents became such a big category so quickly.

Agents Move Beyond Autocomplete

Introducing SWE-bench Verified

Neil Chowdhury et al.

2024 · Should know · Beginner

Benchmark credibility matters: if evaluation is noisy, claimed coding-agent progress can be badly overstated.

Why it matters

SWE-bench Verified matters because it tightened the benchmark itself, which is essential in a field moving this quickly.

What changed after this

Evaluation quality, verified subsets, and more careful result interpretation became much harder to ignore.

Who should read

Anyone using benchmark numbers to judge coding-agent progress.

4 builder bonus papers

The core 12 explain how coding agents formed. These four bonus papers explain debugging, code reasoning, broader code benchmarks, and how close the field is getting to real paid engineering work.

Builder Bonus

Teaching Large Language Models to Self-Debug

Xinyun Chen et al.

2023 · Must read · Beginner

A better coding system often improves by debugging its own outputs instead of only generating new ones from scratch.

Why it matters

Self-debugging is one of the cleanest practical tricks in coding systems and helped normalize feedback loops over raw generation.

What changed after this

Retry, critique, and execution-based refinement became much more common coding-agent patterns.

Who should read

Builders who care about turning “almost right” code into usable code.
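The self-debugging pattern is a loop, not a single call: generate code, run the test, and on failure feed the error back for another attempt. The scripted `DRAFTS` and `draft_model` below are hypothetical stand-ins for an LLM; a real system would condition the next draft on the feedback string.

```python
DRAFTS = [
    "def add(a, b): return a - b",   # first attempt: wrong operator
    "def add(a, b): return a + b",   # repaired attempt
]

def draft_model(feedback, attempt):
    """Hypothetical model call; real systems condition on the feedback."""
    return DRAFTS[min(attempt, len(DRAFTS) - 1)]

def run_test(code):
    """Execute the candidate and check it against a unit test."""
    ns = {}
    exec(code, ns)
    try:
        assert ns["add"](2, 3) == 5
        return True, "pass"
    except AssertionError:
        return False, "add(2, 3) != 5"

feedback = ""
for attempt in range(3):
    code = draft_model(feedback, attempt)
    ok, feedback = run_test(code)
    if ok:
        break
print(attempt, feedback)
```

The retry budget (here, three attempts) is the knob that trades latency and cost against the chance of turning "almost right" code into passing code.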

Builder Bonus

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Yuhang Lai et al.

2023 · Should know · Intermediate

Coding agents do not live only in algorithms and interview questions; data-science workflows expose a different class of failure modes.

Why it matters

DS-1000 broadens the picture of code capability beyond toy coding tasks and into realistic library-heavy work.

What changed after this

People became more careful about claiming broad coding competence from narrow benchmarks.

Who should read

Builders whose users live in notebooks, analytics stacks, or library-heavy scripting workflows.

Builder Bonus

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Alex Gu et al.

2024 · Must read · Intermediate

A model that can emit code is not necessarily a model that truly understands what the code does.

Why it matters

CRUXEval is a strong reminder that code reasoning, execution understanding, and repair ability are different skills.

What changed after this

The field became more careful about separating code generation quality from code understanding quality.

Who should read

Anyone benchmarking coding systems who wants a better mental model than “passes unit tests = understands code.”

Builder Bonus

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Samuel Miserendino et al.

The coding-agent question is no longer only “can it solve a benchmark?” but “can it do economically valuable engineering work?”

Why it matters

SWE-Lancer is one of the clearest papers connecting coding-agent progress to real market value and real paid work.

What changed after this

The conversation moved even closer to business outcomes, labor substitution, and where coding agents make or lose money.

Who should read

Anyone thinking about the business side of coding agents rather than only their leaderboard scores.

How to read this page

A beginner, repo-tool builder, and SWE-agent researcher should not read the same subset in the same order. Pick the path that matches the kind of coding workflow you care about.

Where to go after AI coding agents

Once you understand coding agents, the next question is usually what gives them leverage: stronger general agent loops, richer computer-use interfaces, or retrieval and memory over codebases.

FAQ

Answers for the questions readers are most likely to search before or after landing on this page.

What is the most important paper for understanding AI coding agents?

If you only read one, start with SWE-bench or SWE-agent. SWE-bench defines the real-world task; SWE-agent shows what an actual software agent system looks like.

Are coding agents just better code autocomplete?

No. Modern coding agents usually need repository context, issue understanding, tool actions, execution feedback, and validation loops. That is much broader than autocomplete.

Why are benchmarks so central in this field?

Because coding agents are easy to over-demo. Benchmarks like APPS, RepoBench, SWE-bench, and SWE-bench Verified are what turned the field into something comparable and investable.

What should I read after this page?

If you want the broader systems picture, go to AI Agents. If you care about interfaces, go to Computer Use Agents. If you care about memory over repos and docs, go to RAG.

Want the daily version of this judgment?

This guide explains the long arc. Our daily feed explains what matters now.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted.