Landmark Guide

12 Papers That Shaped Modern Computer Use Agents

A beginner-friendly map of the papers behind web agents, GUI grounding, smartphone control, and full computer-use benchmarks.

If you are trying to understand why people suddenly believe models can operate browsers and computers directly, this is the shortest serious reading path we know. We are not trying to cover the full history of HCI or robotics here. This page focuses on the modern browser, GUI, and OS-agent line: systems that see interfaces, choose actions, and execute multi-step tasks in realistic computer environments.

Timeline: 2018–2024 · Core papers: 12 · Builder bonus: 4

If you only read 3 papers, start here

Most readers do not need every dataset and benchmark on day one. If you want the shortest path to modern computer-use agents, start with WebGPT (the browser bridge), WebArena (the realistic web benchmark), and OSWorld (the full computer benchmark).

The shortest useful history of computer-use agents

Modern computer-use agents did not emerge from one model release. They emerged when web control became benchmarkable, multimodal models got better at grounding interface elements, and researchers moved from website demos to full operating-system environments.

From Browser Assistants to Agents

Language models first became computer-use candidates when they started interacting with browsers instead of only answering from memory.

The Web Becomes a Benchmark

The field matured once web interaction became measurable with realistic tasks, datasets, and knowledge-work settings.

Grounding Moves to Pixels

Multimodal models made it possible to reason over screenshots and GUI elements instead of depending only on clean HTML or backend access.

From Websites to Full Computers

The field now cares about desktop, mobile, and OS-level tasks where action spaces are larger and success is much harder to fake.

The 12 core papers

These are the papers we would keep if we had to explain modern browser, GUI, and OS agents to a smart newcomer in one page. They are the papers that changed what people built next.

From Browser Assistants to Agents

MiniWoB++: A Benchmark Suite for Multi-Step Web Interaction

Evan Zheran Liu et al.

2018 · Should know · Beginner

Web agents became easier to study once researchers had a reusable benchmark for multi-step browser interaction.

Why it matters

MiniWoB++ is useful prehistory because it shows the web-control problem existed before today’s multimodal model wave.

What changed after this

Web interaction stopped being only a robotics or scripted automation problem and became a benchmarkable agent task.

Who should read

Readers who want the minimum benchmark prehistory before the LLM-era story starts.

From Browser Assistants to Agents

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano et al.

2021 · Must read · Beginner

A language model becomes more useful when it can browse, inspect sources, and act in a browser instead of answering closed-book.

Why it matters

WebGPT is one of the clearest early demonstrations that browser interaction could be part of the model product itself.

What changed after this

Web browsing, source-backed answers, and browser control started to feel like plausible agent capabilities rather than one-off demos.

Who should read

Anyone who wants the bridge from LLM assistants to browser-using agents.

From Browser Assistants to Agents

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao et al.

2022 · Must read · Beginner

Modern computer-use agents are easier to understand when you see them as loops of thought, action, and observation.

Why it matters

ReAct gave the field a simple loop that many web and computer-use agents still inherit, directly or indirectly.

What changed after this

Agent design moved from single prompts toward iterative control loops that can recover from bad steps.

Who should read

Everyone. This is still the cleanest mental model for how many agents operate.
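The thought-action-observation loop described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's implementation; `model` and `browser` are hypothetical stand-ins for an LLM call and an environment that executes actions.

```python
def react_loop(task, model, browser, max_steps=10):
    """Alternate thought -> action -> observation until the agent finishes."""
    trajectory = [f"Task: {task}"]
    for _ in range(max_steps):
        # 1. Thought: the model reasons over everything seen so far.
        thought = model.think("\n".join(trajectory))
        trajectory.append(f"Thought: {thought}")

        # 2. Action: the model commits to one concrete step (click, type, ...).
        action = model.act("\n".join(trajectory))
        trajectory.append(f"Action: {action}")
        if action.startswith("finish"):
            return trajectory

        # 3. Observation: the environment reports what actually happened,
        #    which is what lets the agent notice and recover from bad steps.
        observation = browser.execute(action)
        trajectory.append(f"Observation: {observation}")
    return trajectory
```

The key design point is that observations feed back into the next thought, so a failed click becomes evidence the agent can reason about rather than a silent dead end.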

The Web Becomes a Benchmark

Mind2Web: Towards a Generalist Agent for the Web

Xiang Deng et al.

2023 · Must read · Intermediate

Web agents need broad, realistic demonstrations if they are supposed to generalize beyond toy websites.

Why it matters

Mind2Web made the web-agent problem feel more like a serious generalization challenge and less like a hand-built demo track.

What changed after this

Generalist web-agent datasets, offline learning setups, and more serious comparisons across websites became possible.

Who should read

Builders who want to understand where web-agent data and evaluation got more realistic.

The Web Becomes a Benchmark

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou et al.

2024 · Must read · Beginner

Web agents only became believable once they were tested on realistic sites with realistic tasks.

Why it matters

WebArena is one of the papers that made web agents feel measurable and comparable instead of anecdotal.

What changed after this

Live-ish environments, reproducible tasks, and real website structures became the standard way to discuss web-agent capability.

Who should read

Anyone who wants a serious benchmark for what web agents can and cannot do.

The Web Becomes a Benchmark

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin et al.

2024 · Must read · Intermediate

The web-agent story matters more once the tasks look like real knowledge work instead of synthetic click puzzles.

Why it matters

WorkArena tied web agents to enterprise-style workflows, which makes the field much more commercially legible.

What changed after this

The conversation shifted toward practical workplace tasks, longer workflows, and business-relevant web automation.

Who should read

PMs and builders who care about whether web agents can do valuable work, not just win benchmarks.

Grounding Moves to Pixels

GPT-4V(ision) is a Generalist Web Agent, if Grounded

Boyuan Zheng et al.

2024 · Must read · Intermediate

The bottleneck for computer-use agents is often not raw model intelligence but reliable grounding on interface elements.

Why it matters

This paper captures a central truth of the field: frontier multimodal models become much stronger computer users once grounding is handled carefully.

What changed after this

Grounding, coordinate selection, and screenshot-based action planning became much more central to agent design.

Who should read

Anyone trying to understand why GUI agents live or die on perception and grounding quality.
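The grounding step described above — turning a natural-language target like "the Submit button" into pixel coordinates — can be sketched minimally. This is a toy illustration, assuming the element list (labels plus bounding boxes) comes from an accessibility tree or screenshot parser; real systems use learned matching, not word overlap.

```python
def ground_to_click(target, elements):
    """Pick the element whose label best overlaps the target text,
    then return the center of its bounding box as the click point."""
    target_words = set(target.lower().split())

    def score(element):
        label_words = set(element["label"].lower().split())
        return len(target_words & label_words)

    best = max(elements, key=score)
    if score(best) == 0:
        return None  # refuse to click when nothing plausibly matches
    x1, y1, x2, y2 = best["box"]
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```

Even this toy version shows why grounding quality dominates: a correct plan still fails if this function returns the wrong coordinates, and returning `None` on a weak match is often safer than guessing.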

Grounding Moves to Pixels

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Hongliang He et al.

2024 · Must read · Intermediate

Large multimodal models can drive an end-to-end web agent when perception, planning, and evaluation are wired together carefully.

Why it matters

WebVoyager made the multimodal web-agent story more concrete and helped show what end-to-end web control could look like.

What changed after this

Live web-agent demos, multimodal grounding stacks, and automated evaluation protocols became more common.

Who should read

Builders who care about real web execution rather than HTML-only or offline settings.

From Websites to Full Computers

AppAgent: Multimodal Agents as Smartphone Users

Chi Zhang et al.

2023 · Must read · Intermediate

Once agents can act like smartphone users, computer use stops being only a web-browser story.

Why it matters

AppAgent widened the field from websites toward app ecosystems and showed that UI control matters beyond the browser tab.

What changed after this

Mobile control, app exploration, and knowledge-base-driven interface learning became clearer agent directions.

Who should read

Anyone who wants to see computer-use agents move beyond websites into general application control.

From Websites to Full Computers

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Keen You et al.

2024 · Must read · Intermediate

Before an agent can use an interface well, it has to understand UI elements as grounded objects rather than vague pixels.

Why it matters

Ferret-UI is one of the strongest papers on the perception layer that modern GUI agents depend on.

What changed after this

UI grounding datasets, element understanding, and multimodal interface parsing became much more important building blocks.

Who should read

Builders who care about the perception side of GUI automation, not only planning loops.

From Websites to Full Computers

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Raghav Kapoor et al.

2024 · Should know · Intermediate

Computer-use agents need benchmarks that span both desktop and web, not one interface surface at a time.

Why it matters

OmniACT is a strong marker of the field broadening from web agents toward more general computer tasks.

What changed after this

Desktop-and-web benchmarks, executable-program supervision, and multimodal generalist evaluation became easier to discuss as one problem.

Who should read

Researchers who want to understand the move toward more general computer-use evaluation.

From Websites to Full Computers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie et al.

2024 · Must read · Beginner

The field became a true computer-use story once agents were evaluated on open-ended tasks inside real operating-system environments.

Why it matters

OSWorld is one of the clearest benchmark papers showing that browser control is only one slice of the broader computer-use problem.

What changed after this

Desktop tasks, larger action spaces, and real OS environments became the next serious frontier for agent evaluation.

Who should read

Everyone who wants to understand why “computer use” is now a distinct category beyond web agents.
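To make "larger action spaces" concrete: a web agent mostly clicks DOM elements, while a desktop agent must also cover the keyboard, hotkeys, and often a shell. The sketch below is illustrative only; the action names are hypothetical and are not OSWorld's actual API.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:
    x: int
    y: int
    button: str = "left"

@dataclass
class TypeText:
    text: str

@dataclass
class Hotkey:
    keys: tuple  # e.g. ("ctrl", "s")

@dataclass
class RunCommand:
    command: str  # shell access widens the space far beyond any browser

Action = Union[Click, TypeText, Hotkey, RunCommand]

def describe(action: Action) -> str:
    """Render an action as a log line an evaluator could check."""
    if isinstance(action, Click):
        return f"click({action.x}, {action.y}, {action.button})"
    if isinstance(action, TypeText):
        return f"type({action.text!r})"
    if isinstance(action, Hotkey):
        return "hotkey(" + "+".join(action.keys) + ")"
    return f"shell({action.command!r})"
```

Every extra action type multiplies the ways a trajectory can go wrong, which is one reason OS-level success rates lag far behind web-only ones.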

4 builder bonus papers

The core 12 explain how the field formed. These four bonus papers explain harder enterprise tasks, GUI action models, adjacent terminal workflows, and fast ways to improve agents on real computer benchmarks.

Builder Bonus

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

Léo Boisvert et al.

2024 · Must read · Intermediate

Enterprise-style web agents get much harder once tasks demand composition, planning, and reasoning instead of short scripted flows.

Why it matters

WorkArena++ is useful because it explains why current web agents still struggle on the kinds of tasks buyers actually care about.

What changed after this

Compositional enterprise benchmarks, longer task chains, and planning-heavy evaluation became more prominent.

Who should read

Builders who want a more realistic picture of where web agents still break in knowledge-work settings.

Builder Bonus

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin et al.

2024 · Should know · Intermediate

Computer-use agents increasingly want a tighter fusion of seeing the screen and deciding the next action.

Why it matters

ShowUI is a strong signal that the field is moving toward more integrated vision-language-action models for GUI control.

What changed after this

Researchers started thinking more about unified action models instead of stitching together separate perception and control stages.

Who should read

Anyone building next-generation GUI agents rather than only prompt-orchestrated wrappers.

Builder Bonus

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang et al.

2024 · Must read · Beginner

Not all computer-use agents act on GUIs; some of the most valuable ones act in high-value tool environments like software engineering.

Why it matters

SWE-agent belongs here as an adjacent lesson: once the interface and feedback loop are good enough, computer-use agents become commercially legible very fast.

What changed after this

People took agent-computer interfaces much more seriously as a product surface, not just a research curiosity.

Who should read

Builders who care about where computer-use agents start turning into businesses.

Builder Bonus

GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Jiho Park et al.

2024 · Reference · Advanced

One practical way to improve GUI agents is to augment them with outside demonstrations instead of only changing the base model.

Why it matters

GUIDE is a useful builder paper because it shows a plug-and-play path to better agent behavior on hard desktop tasks without retraining everything.

What changed after this

Retrieval-augmented agent improvement, demonstration reuse, and environment-specific augmentation became more attractive ideas for computer-use systems.

Who should read

Researchers and builders who want to improve GUI agents without assuming endless fine-tuning budget.
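The retrieval-augmented idea above can be sketched in miniature: before acting, fetch the stored demonstration most similar to the current task and prepend it to the agent's context. This is a toy illustration in the spirit of the approach; the word-overlap similarity and the demo-store format are stand-ins, not GUIDE's actual pipeline.

```python
def retrieve_demo(task, demo_store):
    """Return the stored demonstration with the highest word overlap."""
    task_words = set(task.lower().split())

    def overlap(demo):
        return len(task_words & set(demo["task"].lower().split()))

    best = max(demo_store, key=overlap)
    return best if overlap(best) > 0 else None

def build_prompt(task, demo_store):
    """Prepend a retrieved demonstration to the task, if one matches."""
    demo = retrieve_demo(task, demo_store)
    prefix = f"Example:\n{demo['steps']}\n\n" if demo else ""
    return prefix + f"Now complete: {task}"
```

The appeal for builders is that the demo store can grow independently of the model: adding one good demonstration can fix a whole class of failures without any fine-tuning.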

How to read this page

A beginner, web-agent builder, and OS-agent researcher should not read the same subset in the same order. Pick the path that matches the interface surface you care about.

Where to go after computer-use agents

Once you understand computer-use agents, the next question is usually what powers them underneath: stronger agent loops, better retrieval and memory, or richer multimodal model foundations.

FAQ

Answers for the questions readers are most likely to search before or after landing on this page.

What is the most important paper for understanding computer-use agents?

If you only read one paper, start with WebArena or OSWorld depending on your interest. WebArena explains realistic web agents; OSWorld explains the leap from web agents to full computer-use environments.

Are browser agents and computer-use agents the same thing?

Not quite. Browser agents operate inside websites, while computer-use agents span broader surfaces like desktop apps, system dialogs, mobile apps, and operating-system workflows.

Why are there so many benchmark papers on this page?

Because this field matured through benchmarks. Benchmarks are what turned flashy demos into something researchers and builders could compare, measure, and productize.

What should I read after this page?

If you want the wider systems story, go to AI Agents. If you care about memory and grounding, go to RAG. If you care about raw model capability, go to multimodal AI.

Want the daily version of this judgment?

This guide explains the long arc. Our daily feed explains what matters now.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted.