Landmark Guide

12 Papers That Built Modern LLMs

A beginner-friendly map of the papers that shaped transformers, scaling, alignment, and the open-source LLM era.

If you are trying to understand where modern large language models came from, this is the shortest serious reading path we know. Instead of giving you a flat archive of old papers, this page highlights the handful of papers that changed the direction of the field — and explains why they still matter today.

Timeline: 2013–2023 · Core papers: 12 · Builder bonus: 5

If you only read 3 papers, start here

Most readers do not need the full history on day one. If you want the shortest path to understanding modern LLMs, start with these three papers first.

The shortest useful history of LLMs

Modern LLMs did not appear all at once. They emerged in layers: first better language representations, then encoder-decoder generation, then attention, then the Transformer, then large-scale pretraining, and finally alignment and open-weight ecosystems.

Before Transformers

Language modeling moved from embeddings to sequence-to-sequence generation and then to attention.

Transformer Takes Over

Self-attention replaced recurrence and became the default architecture for modern language AI.

Scale Becomes the Strategy

Model size, data size, and compute budget became strategic levers rather than afterthoughts.

From Research to Products

Instruction tuning, RLHF, and open weights turned language models into usable developer platforms.

The 12 core papers

These are the papers we would keep if we had to explain the rise of modern LLMs to a smart newcomer in one page. They are the papers that changed what people built next.

Before Transformers

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov et al.

2013 · Must read · Beginner

Dense word vectors turned language into a geometry that models could actually learn from.

Why it matters

This paper made pretrained word representations practical at scale and pushed the field toward reusable language representations instead of task-specific features.

What changed after this

Embeddings became the default starting point for NLP, and the field got comfortable with the idea that language structure could be learned from raw text.

Who should read

Beginners who want the prehistory of modern language models.

Before Transformers

Sequence to Sequence Learning with Neural Networks

Ilya Sutskever et al.

2014 · Must read · Intermediate

One neural network can learn to map one sequence into another.

Why it matters

This paper made end-to-end neural generation real and gave the field a clean encoder-decoder recipe for translation and text generation.

What changed after this

It set the template for neural machine translation, summarization, and later generative modeling workflows.

Who should read

Builders and researchers who want to understand the road to generative LLMs.

Before Transformers

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau et al.

2014 · Must read · Intermediate

Attention solved the bottleneck of compressing a whole sentence into one fixed vector.

Why it matters

This is the moment attention became a first-class idea in sequence modeling, making long-context generation much more viable.

What changed after this

Attention stopped being a side trick and became the core mechanism that the Transformer later built around.

Who should read

Anyone who wants to understand why attention was such a big deal before Transformers existed.
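The core move can be sketched in a few lines of numpy. One caveat: Bahdanau et al. scored source positions with a small additive network over RNN states, while the sketch below uses simpler dot-product scoring; the essential idea is the same, turning "which source words matter right now?" into a softmax-weighted average of context vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    """One decoding step: score every source position against the query,
    normalize the scores into weights, and return a weighted mix of the
    value vectors. (Dot-product scoring here is a simplification of the
    paper's additive scoring network.)"""
    scores = keys @ query / np.sqrt(len(query))  # one score per source position
    weights = softmax(scores)                    # non-negative, sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))    # 5 source positions, dimension 8
values = rng.normal(size=(5, 8))
query = rng.normal(size=8)        # current decoder state
context, weights = attend(query, keys, values)
```

Because the context vector is rebuilt at every step, the decoder no longer has to squeeze the whole sentence into one fixed vector, which is exactly the bottleneck this paper removed.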

Transformer Takes Over

Attention Is All You Need

Ashish Vaswani et al.

2017 · Must read · Intermediate

Self-attention can replace recurrence and become the new default architecture.

Why it matters

This paper introduced the Transformer, the architecture behind modern LLMs, BERT, GPT, ViT, and much of multimodal AI.

What changed after this

Language modeling became easier to parallelize, easier to scale, and dramatically more capable over long contexts.

Who should read

Everyone. This is the single most important architecture paper on the page.

Transformer Takes Over

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin et al.

2018 · Must read · Intermediate

Bidirectional transformer pretraining rewrote the NLP benchmark landscape.

Why it matters

BERT showed that massive pretraining plus light task-specific fine-tuning could outperform older supervised pipelines across a wide range of NLP tasks.

What changed after this

Pretrain-then-finetune became the default NLP recipe, and encoder-style transformers became the standard for understanding tasks.

Who should read

Readers who want to understand the branch of transformer history that shaped search, classification, and language understanding.

Transformer Takes Over

Language Models are Unsupervised Multitask Learners

Alec Radford et al.

2019 · Must read · Intermediate

Large autoregressive language models can generalize across tasks without explicit supervised training for each one.

Why it matters

GPT-2 made generative pretraining feel like a general-purpose engine rather than just another benchmark technique.

What changed after this

The decoder-only branch became the direct line to today’s chat models and API-first LLM products.

Who should read

Anyone who wants to understand why GPT-style models became the center of the industry.

Transformer Takes Over

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel et al.

2020 · Should know · Intermediate

If every task becomes text-to-text, the whole NLP stack becomes easier to unify.

Why it matters

T5 made task formatting itself part of the modeling breakthrough and strongly influenced later instruction-style interfaces.

What changed after this

It normalized the idea that one model could serve many tasks through a single textual interface, which later fit naturally with prompting and instruction tuning.

Who should read

Builders and researchers who care about model interfaces, transfer, and task unification.

Scale Becomes the Strategy

Scaling Laws for Neural Language Models

Jared Kaplan et al.

2020 · Must read · Intermediate

Model performance improves in predictable ways as you scale model size, data, and compute.

Why it matters

This paper turned scaling from “try bigger models and hope” into a semi-systematic engineering strategy.

What changed after this

Large-model training became more deliberate, more capital-intensive, and much more central to frontier model planning.

Who should read

Researchers, infra engineers, and anyone who wants to understand why scale became the dominant LLM playbook.
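The paper's headline result is that loss follows a power law in model size (and similar laws in data and compute). A toy sketch of the parameter-count law, using the rough fitted constants the paper reports for its setup (treat them as illustrative, not universal):

```python
# Power law from the paper: L(N) = (N_c / N) ** alpha_N, where N is the
# non-embedding parameter count. Constants are the paper's rough fits for
# its specific setup, shown here only for illustration.
N_C = 8.8e13
ALPHA_N = 0.076

def loss_from_params(n_params):
    """Predicted cross-entropy loss as a function of model size alone."""
    return (N_C / n_params) ** ALPHA_N

# Key property: every doubling of N cuts loss by the SAME factor, no matter
# where you start. That predictability is what made scaling a plannable
# engineering strategy instead of trial and error.
factor_small = loss_from_params(2e8) / loss_from_params(1e8)
factor_large = loss_from_params(2e10) / loss_from_params(1e10)
```

The constant doubling factor falls out of the algebra: L(2N) / L(N) = (1/2) ** alpha_N, independent of N.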

Scale Becomes the Strategy

Language Models are Few-Shot Learners

Tom Brown et al.

2020 · Must read · Intermediate

Large models can learn new tasks from prompts and examples instead of gradient updates.

Why it matters

GPT-3 moved LLMs from an academic research line into a general-purpose platform story.

What changed after this

Prompting, API products, in-context learning, and the commercial LLM wave all accelerated from here.

Who should read

Everyone, especially product-minded readers who want the moment LLMs became real.
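"Learning from prompts instead of gradient updates" is easy to show concretely. In a few-shot prompt, the task specification lives entirely in the context window; the helper below is a toy illustration (the sea otter translation pair echoes an example from the paper), not any real API:

```python
def few_shot_prompt(examples, query):
    """Specify a task purely by demonstration: each (input, output) pair is
    one 'shot', and the model is expected to continue the pattern for the
    final query. No weights change anywhere."""
    shots = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
    return f"{shots}\nEnglish: {query}\nFrench:"

prompt = few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "bread",
)
```

Swapping the demonstrations swaps the task, which is why one frozen model behind one API could suddenly serve translation, QA, and arithmetic alike.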

From Research to Products

Training Language Models to Follow Instructions with Human Feedback

Long Ouyang et al.

2022 · Must read · Intermediate

Useful assistants are not just scaled models; they are models trained to follow human intent.

Why it matters

InstructGPT operationalized instruction tuning and RLHF, explaining why product assistants behave differently from raw base models.

What changed after this

Alignment, preference training, safety behavior, and chat-style assistant UX became core parts of the LLM stack.

Who should read

Builders, PMs, and researchers who care about why ChatGPT-like systems feel usable.

Scale Becomes the Strategy

Training Compute-Optimal Large Language Models

Jordan Hoffmann et al.

2022 · Must read · Intermediate

Bigger models are not enough; training is only efficient when model size and data scale are balanced.

Why it matters

Chinchilla reset the conversation from “just make the model larger” to “train the right-sized model on enough data.”

What changed after this

Frontier labs and open-model builders had to rethink training economics, data budgets, and what efficient scaling really means.

Who should read

Anyone trying to understand the economics and strategy of LLM training.
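The result is often compressed into a folk rule of thumb: roughly 20 training tokens per parameter at compute-optimal scale. The paper fits the tradeoff empirically and more carefully; the sketch below is only that popular approximation:

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Folk summary of the Chinchilla result: for a fixed compute budget,
    train on roughly ~20 tokens per parameter. The paper derives the exact
    tradeoff from fitted loss curves; 20 is the commonly cited shorthand."""
    return tokens_per_param * n_params

# Under this rule of thumb, a 70B-parameter model wants on the order of
# 1.4 trillion training tokens, far more data than earlier large models used.
tokens_for_70b = compute_optimal_tokens(70e9)
```

The practical upshot: many pre-Chinchilla models were oversized for their data budgets, and data collection became as strategic as parameter count.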

From Research to Products

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al.

2023 · Must read · Beginner

Strong open-weight LLMs changed who gets to build, experiment, and fine-tune.

Why it matters

LLaMA kicked off the modern open-weight LLM wave and made the ecosystem much broader than a few frontier closed labs.

What changed after this

Open-source fine-tuning, local deployment, model merging, community benchmarks, and the builder ecosystem all accelerated.

Who should read

Everyone, especially anyone who cares about the open model ecosystem rather than only API-based LLMs.

5 builder bonus papers

The core 12 explain where LLMs came from. These five bonus papers explain why modern LLM products work the way they do in practice: prompting, retrieval, adaptation, quantization, and tool use.

Builder Bonus

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei et al.

2022 · Must read · Beginner

Sometimes better reasoning does not require a new model — it requires a better prompt.

Why it matters

This paper showed that step-by-step prompting can unlock much stronger reasoning behavior from large models without changing weights.

What changed after this

Prompt engineering became a serious product and evaluation layer rather than a bag of tricks.

Who should read

Builders and PMs who want to understand why prompt design matters so much in real LLM systems.
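The intervention itself is tiny: the demonstrations in the prompt include worked intermediate steps instead of bare answers. The snippet below just contrasts the two prompt shapes (the questions are made up for illustration; they are not from the paper):

```python
# Same target question, two prompt styles. In chain-of-thought prompting the
# demonstration shows its intermediate reasoning, and large models tend to
# imitate that structure on the new question.
question = "Q: A pen costs $2 and a notebook costs three times as much. Total?"

direct_shot = "Q: I have 3 apples and buy 2 more. How many?\nA: 5."
cot_shot = (
    "Q: I have 3 apples and buy 2 more. How many?\n"
    "A: I start with 3 apples. Buying 2 more gives 3 + 2 = 5. The answer is 5."
)

direct_prompt = f"{direct_shot}\n{question}\nA:"
cot_prompt = f"{cot_shot}\n{question}\nA:"
```

Only the demonstration changed; the model and the question did not, which is why this counts as a prompting result rather than a modeling one.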

Builder Bonus

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis et al.

2020 · Must read · Intermediate

A language model becomes much more useful when it can look things up instead of only remembering them.

Why it matters

This paper laid the conceptual foundation for today’s RAG systems, which power knowledge assistants, document QA, and enterprise copilots.

What changed after this

External retrieval became a standard answer to freshness, factuality, and domain-specific knowledge gaps in LLMs.

Who should read

Anyone building assistants on private data, documents, or changing knowledge bases.
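The retrieve-then-generate loop reduces to two steps: rank documents against the query, then splice the winners into the prompt. Everything below is a toy stand-in (word overlap instead of a real vector index, no actual model call), but the shape matches what production RAG stacks do:

```python
# Toy corpus; in practice this would be a vector index over your documents.
docs = {
    "d1": "The Transformer was introduced in 2017.",
    "d2": "LoRA adapts models with low-rank updates.",
    "d3": "Chinchilla balanced model size against training tokens.",
}

def retrieve(query, k=1):
    """Rank documents by naive word overlap with the query, a stand-in for
    embedding similarity search."""
    q_words = set(query.lower().split())
    def overlap(doc_id):
        return len(q_words & set(docs[doc_id].lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(query):
    """Inject retrieved text so the model can read the answer from context
    instead of recalling it from weights."""
    context = " ".join(docs[d] for d in retrieve(query))
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

prompt = build_prompt("When was the Transformer introduced?")
```

Because the knowledge lives in the corpus rather than the weights, updating the system means updating documents, not retraining, which is the whole appeal for freshness and private data.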

Builder Bonus

LoRA: Low-Rank Adaptation of Large Language Models

Edward Hu et al.

2021 · Must read · Intermediate

You do not always need to fine-tune every parameter to adapt a large model well.

Why it matters

LoRA made LLM adaptation dramatically cheaper and more practical, which is one reason the open model ecosystem became so productive so quickly.

What changed after this

Fine-tuning moved from expensive full-model retraining toward lighter adapter-style workflows that normal teams could actually run.

Who should read

Builders who want to customize models without frontier-lab budgets.
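The trick is to freeze the pretrained weight W and learn only a low-rank update B @ A. A minimal numpy sketch of one adapted layer, following the paper's setup (shapes and the alpha scaling are as described there; the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                       # hidden size, adapter rank (r << d)
W = rng.normal(size=(d, d))        # frozen pretrained weight, never updated

# Trainable low-rank pair. B starts at zero, so fine-tuning begins exactly
# at the base model's behavior and moves away from it gradually.
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))
alpha = 8.0                        # scaling hyperparameter from the paper

def lora_forward(x):
    """Effective weight is W + (alpha / r) * B @ A, but the update is stored
    as two skinny matrices instead of a full d x d delta."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(2, d))
trainable = A.size + B.size        # 2*r*d parameters vs d*d for full tuning
```

Here that is 512 trainable parameters against 4,096 frozen ones; at real model scale the ratio is far more dramatic, which is why single-layer adapters made fine-tuning affordable.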

Builder Bonus

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers et al.

2023 · Must read · Intermediate

High-quality LLM adaptation became possible on much smaller hardware budgets.

Why it matters

QLoRA turned open-weight model customization into something a much broader builder audience could actually try.

What changed after this

Single-GPU and low-cost LLM fine-tuning became mainstream, accelerating open-source experimentation and product prototyping.

Who should read

Builders who care about cost, local experimentation, and practical model adaptation.

Builder Bonus

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick et al.

2023 · Must read · Intermediate

Language models get much more powerful when they stop being closed text machines and start calling external tools.

Why it matters

Toolformer is one of the cleanest bridges from plain LLMs to tool-using systems and, later, agentic workflows.

What changed after this

Tool use, API calling, calculator access, retrieval calls, and agent-style orchestration all became much easier to reason about as first-class system design patterns.

Who should read

Anyone interested in copilots, workflows, and the bridge from LLMs to AI agents.
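In Toolformer's setup, the model emits inline markers like [Calculator(400/1400)] inside its text, and a runtime executes them and splices the results back in. The sketch below is a toy reconstruction of that runtime side only (the tool registry, regex, and splicing logic are illustrative, not the paper's implementation):

```python
import re

# Toy tool registry: name -> callable. A restricted eval stands in for a
# real calculator; a production system would use a safe expression parser.
TOOLS = {"Calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def execute_tool_calls(text):
    """Find every [Tool(args)] marker the model emitted, run the named tool,
    and replace the marker with the tool's output."""
    def run(match):
        name, args = match.group(1), match.group(2)
        return TOOLS[name](args)
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, text)

out = execute_tool_calls("The ratio is [Calculator(400/1400)].")
```

The paper's contribution is on the other side of this loop, teaching the model when and how to emit such calls from self-supervised data, but this executor half is what later tool-use and agent frameworks generalized.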

How to read this page

Different readers need different paths. A beginner does not need the same reading order as a research engineer or a product builder.

Where to go after LLMs

Once you understand the transformer and LLM story, the next branches become much easier to navigate.

FAQ

Answers for the questions readers are most likely to search before or after landing on this page.

What paper introduced transformers?

The standard answer is Attention Is All You Need (2017), which introduced the Transformer architecture and replaced recurrence with self-attention.

What are the most important LLM papers for beginners?

A strong beginner set is Attention Is All You Need, GPT-3, InstructGPT, and LLaMA. Together they explain architecture, scaling, alignment, and the open-weight era.

Do I need to read BERT if I only care about ChatGPT-style models?

Yes, at least at a high level. BERT represents the other major branch of transformer pretraining and helps you understand why encoder-style and decoder-style models evolved differently.

What should I read after this page?

If you care about products, go to RAG or AI Agents. If you care about research direction, continue into multimodal models, reasoning, and alignment.

Want the daily version of this judgment?

This guide explains the long arc. Our daily feed explains what matters now.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted · Privacy · Terms