Before Transformers
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov et al.
Dense word vectors turned language into a geometry that models could actually learn from.
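That geometry can be sketched in a few lines. The vectors below are invented purely for illustration (real word2vec embeddings are learned from large corpora and have hundreds of dimensions), but they show the two properties the paper made famous: related words sit close together under cosine similarity, and analogies like king - man + woman ≈ queen fall out of simple vector arithmetic.

```python
# Toy sketch of dense word vectors as geometry. These 3-d vectors are
# invented for demonstration; real embeddings are learned from text.
from math import sqrt

def cosine(u, v):
    """Cosine similarity: closeness of two vectors by angle."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings; dimensions loosely readable as
# (royalty, gender, food) just to make the geometry visible.
vec = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

# Related words score higher than unrelated ones.
print(cosine(vec["king"], vec["queen"]) > cosine(vec["king"], vec["apple"]))  # True

# The classic analogy: king - man + woman lands nearest to queen.
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max(vec, key=lambda w: cosine(target, vec[w]))
print(best)  # queen
```

The point is not the toy numbers but the mechanism: once words are points in a vector space, similarity and analogy become geometry, which downstream models can exploit directly.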
Why it matters
This paper made pretrained word representations practical at scale: skip-gram training with negative sampling made it cheap to learn useful embeddings from billions of words, pushing the field toward reusable language representations instead of task-specific features.
What changed after this
Embeddings became the default starting point for NLP, and the field got comfortable with the idea that language structure could be learned from raw text.
Who should read
Beginners who want the prehistory of modern language models.