OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho

Recommendation Score

breakthrough🟡 IntermediateReasoning & Agents AI Agents World ModelsBenchmarkUseful for both

Research context

Primary field

Reasoning & Agents

Reasoning, planning, tool use, and agentic workflows.

Topics

AI Agents, World Models

Paper type

Benchmark

Best for

Useful for both

arXiv categories

cs.CLcs.CL

Why It Matters

OccuBench is a 100-scenario benchmark for professional agents across 65 domains that also injects hidden environment faults, exposing how brittle frontier models still are in real work settings.

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.

More in Reasoning & Agents → More on AI Agents →

View on arXiv → Download PDF →

Published April 13, 2026