Landmark Guide

12 Papers That Built Modern LLMs

A beginner-friendly map of the papers that shaped transformers, scaling, alignment, and the open-source LLM era.

If you are trying to understand where modern large language models came from, this is the shortest serious reading path we know. Instead of giving you a flat archive of old papers, this page highlights the handful of papers that changed the direction of the field — and explains why they still matter today.

Timeline: 2013–2023 · Core papers: 12 · Builder bonus: 5

If you only read 3 papers, start here

Most readers do not need the full history on day one. If you want the shortest path to understanding modern LLMs, start with these three papers first.

The shortest useful history of LLMs

Modern LLMs did not appear all at once. They emerged in layers: first better language representations, then encoder-decoder generation, then attention, then the Transformer, then large-scale pretraining, and finally alignment and open-weight ecosystems.

Before Transformers

Language modeling moved from embeddings to sequence-to-sequence generation and then to attention.

Transformer Takes Over

Self-attention replaced recurrence and became the default architecture for modern language AI.

Scale Becomes the Strategy

Model size, data size, and compute budget became strategic levers rather than afterthoughts.

From Research to Products

Instruction tuning, RLHF, and open weights turned language models into usable developer platforms.

The 12 core papers

These are the papers we would keep if we had to explain the rise of modern LLMs to a smart newcomer in one page. They are the papers that changed what people built next.

Before Transformers

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov et al.

2013 · Must read · Beginner

Dense word vectors turned language into a geometry that models could actually learn from.

Why it matters

This paper made pretrained word representations practical at scale and pushed the field toward reusable language representations instead of task-specific features.

What changed after this

Embeddings became the default starting point for NLP, and the field got comfortable with the idea that language structure could be learned from raw text.

Who should read

Beginners who want the prehistory of modern language models.

Before Transformers

Sequence to Sequence Learning with Neural Networks

Ilya Sutskever et al.

2014 · Must read · Intermediate

One neural network can learn to map one sequence into another.

Why it matters

This paper made end-to-end neural generation real and gave the field a clean encoder-decoder recipe for translation and text generation.

What changed after this

It set the template for neural machine translation, summarization, and later generative modeling workflows.

Who should read

Builders and researchers who want to understand the road to generative LLMs.

Before Transformers

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau et al.

2014 · Must read · Intermediate

Attention solved the bottleneck of compressing a whole sentence into one fixed vector.

Why it matters

This is the moment attention became a first-class idea in sequence modeling, making long-context generation much more viable.

What changed after this

Attention stopped being a side trick and became the core mechanism that the Transformer later built around.

Who should read

Anyone who wants to understand why attention was such a big deal before Transformers existed.
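The core move can be sketched in a few lines of numpy. One caveat: Bahdanau et al. scored source positions with a small additive network over RNN states, while the sketch below uses simpler dot-product scoring; the essential idea is the same, turning "which source words matter right now?" into a softmax-weighted average of context vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    """One decoding step: score every source position against the query,
    normalize the scores into weights, and return a weighted mix of the
    value vectors. (Dot-product scoring here is a simplification of the
    paper's additive scoring network.)"""
    scores = keys @ query / np.sqrt(len(query))  # one score per source position
    weights = softmax(scores)                    # non-negative, sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))    # 5 source positions, dimension 8
values = rng.normal(size=(5, 8))
query = rng.normal(size=8)        # current decoder state
context, weights = attend(query, keys, values)
```

Because the context vector is rebuilt at every step, the decoder no longer has to squeeze the whole sentence into one fixed vector, which is exactly the bottleneck this paper removed.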

Transformer Takes Over

Attention Is All You Need

Ashish Vaswani et al.

2017 · Must read · Intermediate

Self-attention can replace recurrence and become the new default architecture.

Why it matters

This paper introduced the Transformer, the architecture behind modern LLMs, BERT, GPT, ViT, and much of multimodal AI.

What changed after this

Language modeling became easier to parallelize, easier to scale, and dramatically more capable over long contexts.

Who should read

Everyone. This is the single most important architecture paper on the page.

Transformer Takes Over

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin et al.

2018 · Must read · Intermediate

Bidirectional transformer pretraining rewrote the NLP benchmark landscape.

Why it matters

BERT showed that massive pretraining plus light task-specific fine-tuning could outperform older supervised pipelines across a wide range of NLP tasks.

What changed after this

Pretrain-then-finetune became the default NLP recipe, and encoder-style transformers became the standard for understanding tasks.

Who should read

Readers who want to understand the branch of transformer history that shaped search, classification, and language understanding.

Transformer Takes Over

Language Models are Unsupervised Multitask Learners

Alec Radford et al.

2019 · Must read · Intermediate

Large autoregressive language models can generalize across tasks without explicit supervised training for each one.

Why it matters

GPT-2 made generative pretraining feel like a general-purpose engine rather than just another benchmark technique.

What changed after this

The decoder-only branch became the direct line to today’s chat models and API-first LLM products.

Who should read

Anyone who wants to understand why GPT-style models became the center of the industry.

Transformer Takes Over

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel et al.

2020 · Should know · Intermediate

If every task becomes text-to-text, the whole NLP stack becomes easier to unify.

Why it matters

T5 made task formatting itself part of the modeling breakthrough and strongly influenced later instruction-style interfaces.

What changed after this

It normalized the idea that one model could serve many tasks through a single textual interface, which later fit naturally with prompting and instruction tuning.

Who should read

Builders and researchers who care about model interfaces, transfer, and task unification.

Scale Becomes the Strategy

Scaling Laws for Neural Language Models

Jared Kaplan et al.

2020 · Must read · Intermediate

Model performance improves in predictable ways as you scale model size, data, and compute.

Why it matters

This paper turned scaling from “try bigger models and hope” into a semi-systematic engineering strategy.

What changed after this

Large-model training became more deliberate, more capital-intensive, and much more central to frontier model planning.

Who should read

Researchers, infra engineers, and anyone who wants to understand why scale became the dominant LLM playbook.
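The paper's headline result is that loss follows a power law in model size (and similar laws in data and compute). A toy sketch of the parameter-count law, using the rough fitted constants the paper reports for its setup (treat them as illustrative, not universal):

```python
# Power law from the paper: L(N) = (N_c / N) ** alpha_N, where N is the
# non-embedding parameter count. Constants are the paper's rough fits for
# its specific setup, shown here only for illustration.
N_C = 8.8e13
ALPHA_N = 0.076

def loss_from_params(n_params):
    """Predicted cross-entropy loss as a function of model size alone."""
    return (N_C / n_params) ** ALPHA_N

# Key property: every doubling of N cuts loss by the SAME factor, no matter
# where you start. That predictability is what made scaling a plannable
# engineering strategy instead of trial and error.
factor_small = loss_from_params(2e8) / loss_from_params(1e8)
factor_large = loss_from_params(2e10) / loss_from_params(1e10)
```

The constant doubling factor falls out of the algebra: L(2N) / L(N) = (1/2) ** alpha_N, independent of N.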

Scale Becomes the Strategy

Language Models are Few-Shot Learners

Tom Brown et al.

2020 · Must read · Intermediate

Large models can learn new tasks from prompts and examples instead of gradient updates.

Why it matters

GPT-3 moved LLMs from an academic research line into a general-purpose platform story.

What changed after this

Prompting, API products, in-context learning, and the commercial LLM wave all accelerated from here.

Who should read

Everyone, especially product-minded readers who want the moment LLMs became real.
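"Learning from prompts instead of gradient updates" is easy to show concretely. In a few-shot prompt, the task specification lives entirely in the context window; the helper below is a toy illustration (the sea otter translation pair echoes an example from the paper), not any real API:

```python
def few_shot_prompt(examples, query):
    """Specify a task purely by demonstration: each (input, output) pair is
    one 'shot', and the model is expected to continue the pattern for the
    final query. No weights change anywhere."""
    shots = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
    return f"{shots}\nEnglish: {query}\nFrench:"

prompt = few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "bread",
)
```

Swapping the demonstrations swaps the task, which is why one frozen model behind one API could suddenly serve translation, QA, and arithmetic alike.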

From Research to Products

Training Language Models to Follow Instructions with Human Feedback

Long Ouyang et al.

2022 · Must read · Intermediate

Useful assistants are not just scaled models; they are models trained to follow human intent.

Why it matters

InstructGPT operationalized instruction tuning and RLHF, explaining why product assistants behave differently from raw base models.

What changed after this

Alignment, preference training, safety behavior, and chat-style assistant UX became core parts of the LLM stack.

Who should read

Builders, PMs, and researchers who care about why ChatGPT-like systems feel usable.

Scale Becomes the Strategy

Training Compute-Optimal Large Language Models

Jordan Hoffmann et al.

2022 · Must read · Intermediate

Bigger models are not enough; training is only efficient when model size and data scale are balanced.

Why it matters

Chinchilla reset the conversation from “just make the model larger” to “train the right-sized model on enough data.”

What changed after this

Frontier labs and open-model builders had to rethink training economics, data budgets, and what efficient scaling really means.

Who should read

Anyone trying to understand the economics and strategy of LLM training.
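The result is often compressed into a folk rule of thumb: roughly 20 training tokens per parameter at compute-optimal scale. The paper fits the tradeoff empirically and more carefully; the sketch below is only that popular approximation:

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Folk summary of the Chinchilla result: for a fixed compute budget,
    train on roughly ~20 tokens per parameter. The paper derives the exact
    tradeoff from fitted loss curves; 20 is the commonly cited shorthand."""
    return tokens_per_param * n_params

# Under this rule of thumb, a 70B-parameter model wants on the order of
# 1.4 trillion training tokens, far more data than earlier large models used.
tokens_for_70b = compute_optimal_tokens(70e9)
```

The practical upshot: many pre-Chinchilla models were oversized for their data budgets, and data collection became as strategic as parameter count.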

From Research to Products

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al.

2023 · Must read · Beginner

Strong open-weight LLMs changed who gets to build, experiment, and fine-tune.

Why it matters

LLaMA kicked off the modern open-weight LLM wave and made the ecosystem much broader than a few frontier closed labs.

What changed after this

Open-source fine-tuning, local deployment, model merging, community benchmarks, and the builder ecosystem all accelerated.

Who should read

Everyone, especially anyone who cares about the open model ecosystem rather than only API-based LLMs.

5 builder bonus papers

The core 12 explain where LLMs came from. These five bonus papers explain why modern LLM products work the way they do in practice: prompting, retrieval, adaptation, quantization, and tool use.

Builder Bonus

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei et al.

2022 · Must read · Beginner

Sometimes better reasoning does not require a new model — it requires a better prompt.

Why it matters

This paper showed that step-by-step prompting can unlock much stronger reasoning behavior from large models without changing weights.

What changed after this

Prompt engineering became a serious product and evaluation layer rather than a bag of tricks.

Who should read

Builders and PMs who want to understand why prompt design matters so much in real LLM systems.
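The intervention itself is tiny: the demonstrations in the prompt include worked intermediate steps instead of bare answers. The snippet below just contrasts the two prompt shapes (the questions are made up for illustration; they are not from the paper):

```python
# Same target question, two prompt styles. In chain-of-thought prompting the
# demonstration shows its intermediate reasoning, and large models tend to
# imitate that structure on the new question.
question = "Q: A pen costs $2 and a notebook costs three times as much. Total?"

direct_shot = "Q: I have 3 apples and buy 2 more. How many?\nA: 5."
cot_shot = (
    "Q: I have 3 apples and buy 2 more. How many?\n"
    "A: I start with 3 apples. Buying 2 more gives 3 + 2 = 5. The answer is 5."
)

direct_prompt = f"{direct_shot}\n{question}\nA:"
cot_prompt = f"{cot_shot}\n{question}\nA:"
```

Only the demonstration changed; the model and the question did not, which is why this counts as a prompting result rather than a modeling one.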

Builder Bonus

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis et al.

2020 · Must read · Intermediate

A language model becomes much more useful when it can look things up instead of only remembering them.

Why it matters

This paper laid the conceptual foundation for today’s RAG systems, which power knowledge assistants, document QA, and enterprise copilots.

What changed after this

External retrieval became a standard answer to freshness, factuality, and domain-specific knowledge gaps in LLMs.

Who should read

Anyone building assistants on private data, documents, or changing knowledge bases.
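The retrieve-then-generate loop reduces to two steps: rank documents against the query, then splice the winners into the prompt. Everything below is a toy stand-in (word overlap instead of a real vector index, no actual model call), but the shape matches what production RAG stacks do:

```python
# Toy corpus; in practice this would be a vector index over your documents.
docs = {
    "d1": "The Transformer was introduced in 2017.",
    "d2": "LoRA adapts models with low-rank updates.",
    "d3": "Chinchilla balanced model size against training tokens.",
}

def retrieve(query, k=1):
    """Rank documents by naive word overlap with the query, a stand-in for
    embedding similarity search."""
    q_words = set(query.lower().split())
    def overlap(doc_id):
        return len(q_words & set(docs[doc_id].lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(query):
    """Inject retrieved text so the model can read the answer from context
    instead of recalling it from weights."""
    context = " ".join(docs[d] for d in retrieve(query))
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

prompt = build_prompt("When was the Transformer introduced?")
```

Because the knowledge lives in the corpus rather than the weights, updating the system means updating documents, not retraining, which is the whole appeal for freshness and private data.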

Builder Bonus

LoRA: Low-Rank Adaptation of Large Language Models

Edward Hu et al.

2021 · Must read · Intermediate

You do not always need to fine-tune every parameter to adapt a large model well.

Why it matters

LoRA made LLM adaptation dramatically cheaper and more practical, which is one reason the open model ecosystem became so productive so quickly.

What changed after this

Fine-tuning moved from expensive full-model retraining toward lighter adapter-style workflows that normal teams could actually run.

Who should read

Builders who want to customize models without frontier-lab budgets.
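The trick is to freeze the pretrained weight W and learn only a low-rank update B @ A. A minimal numpy sketch of one adapted layer, following the paper's setup (shapes and the alpha scaling are as described there; the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                       # hidden size, adapter rank (r << d)
W = rng.normal(size=(d, d))        # frozen pretrained weight, never updated

# Trainable low-rank pair. B starts at zero, so fine-tuning begins exactly
# at the base model's behavior and moves away from it gradually.
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))
alpha = 8.0                        # scaling hyperparameter from the paper

def lora_forward(x):
    """Effective weight is W + (alpha / r) * B @ A, but the update is stored
    as two skinny matrices instead of a full d x d delta."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(2, d))
trainable = A.size + B.size        # 2*r*d parameters vs d*d for full tuning
```

Here that is 512 trainable parameters against 4,096 frozen ones; at real model scale the ratio is far more dramatic, which is why single-layer adapters made fine-tuning affordable.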

Builder Bonus

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers et al.

2023 · Must read · Intermediate

High-quality LLM adaptation became possible on much smaller hardware budgets.

Why it matters

QLoRA turned open-weight model customization into something a much broader builder audience could actually try.

What changed after this

Single-GPU and low-cost LLM fine-tuning became mainstream, accelerating open-source experimentation and product prototyping.

Who should read

Builders who care about cost, local experimentation, and practical model adaptation.

Builder Bonus

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick et al.

2023 · Must read · Intermediate

Language models get much more powerful when they stop being closed text machines and start calling external tools.

Why it matters

Toolformer is one of the cleanest bridges from plain LLMs to tool-using systems and, later, agentic workflows.

What changed after this

Tool use, API calling, calculator access, retrieval calls, and agent-style orchestration all became much easier to reason about as first-class system design patterns.

Who should read

Anyone interested in copilots, workflows, and the bridge from LLMs to AI agents.
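In Toolformer's setup, the model emits inline markers like [Calculator(400/1400)] inside its text, and a runtime executes them and splices the results back in. The sketch below is a toy reconstruction of that runtime side only (the tool registry, regex, and splicing logic are illustrative, not the paper's implementation):

```python
import re

# Toy tool registry: name -> callable. A restricted eval stands in for a
# real calculator; a production system would use a safe expression parser.
TOOLS = {"Calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def execute_tool_calls(text):
    """Find every [Tool(args)] marker the model emitted, run the named tool,
    and replace the marker with the tool's output."""
    def run(match):
        name, args = match.group(1), match.group(2)
        return TOOLS[name](args)
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, text)

out = execute_tool_calls("The ratio is [Calculator(400/1400)].")
```

The paper's contribution is on the other side of this loop, teaching the model when and how to emit such calls from self-supervised data, but this executor half is what later tool-use and agent frameworks generalized.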

How to read this page

Different readers need different paths. A beginner does not need the same reading order as a research engineer or a product builder.

Where to go after LLMs

Once you understand the transformer and LLM story, the next branches become much easier to navigate.

FAQ

Answers for the questions readers are most likely to search before or after landing on this page.

What paper introduced transformers?

The standard answer is Attention Is All You Need (2017), which introduced the Transformer architecture and replaced recurrence with self-attention.

What are the most important LLM papers for beginners?

A strong beginner set is Attention Is All You Need, GPT-3, InstructGPT, and LLaMA. Together they explain architecture, scaling, alignment, and the open-weight era.

Do I need to read BERT if I only care about ChatGPT-style models?

Yes, at least at a high level. BERT represents the other major branch of transformer pretraining and helps you understand why encoder-style and decoder-style models evolved differently.

What should I read after this page?

If you care about products, go to RAG or AI Agents. If you care about research direction, continue into multimodal models, reasoning, and alignment.

Want the daily version of this judgment?

This guide explains the long arc. Our daily feed explains what matters now.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted · Privacy · Terms