ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon

Recommendation Score

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Benchmark · Useful for both

Research context

Primary field

Reasoning & Agents

Reasoning, planning, tool use, and agentic workflows.

Topics

AI Agents

Paper type

Benchmark

Best for

Useful for both

arXiv categories

cs.CV, cs.AI, cs.MA

Why It Matters

ARGOS frames multi-camera person search as an interactive agent task built on questioning and reasoning. It targets real-world surveillance settings, where a system must operate under ambiguity with minimal human input.

Abstract

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
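To make the STTG idea in the abstract concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes the graph can be modeled as directed camera-to-camera edges, each with a minimum and maximum transition time, and that a candidate is eliminated when a sighting falls outside that window. The camera names, time windows, and the names `Sighting`, `is_feasible`, and `eliminate` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Hypothetical topology: (source camera, destination camera) -> (min, max) seconds.
# These values are invented for illustration, not taken from the ARGOS benchmark.
STTG = {
    ("cam_A", "cam_B"): (20.0, 60.0),
    ("cam_B", "cam_C"): (30.0, 90.0),
}


@dataclass(frozen=True)
class Sighting:
    candidate_id: str
    camera: str
    timestamp: float  # seconds on a shared clock


def is_feasible(graph, src_cam, src_time, sighting):
    """Return True if the sighting is reachable from (src_cam, src_time).

    Assumption: only direct edges count; cameras with no edge between
    them are treated as unreachable.
    """
    window = graph.get((src_cam, sighting.camera))
    if window is None:
        return False
    lo, hi = window
    elapsed = sighting.timestamp - src_time
    return lo <= elapsed <= hi


def eliminate(graph, src_cam, src_time, sightings):
    """Keep only the sightings consistent with the transition windows."""
    return [s for s in sightings if is_feasible(graph, src_cam, src_time, s)]


# Target last seen at cam_A at t=100 s; three candidate sightings follow.
sightings = [
    Sighting("p1", "cam_B", 130.0),  # 30 s later, inside the 20-60 s window
    Sighting("p2", "cam_B", 105.0),  # 5 s later, too fast to be the target
    Sighting("p3", "cam_C", 140.0),  # no direct cam_A -> cam_C edge
]
survivors = eliminate(STTG, "cam_A", 100.0, sightings)
print([s.candidate_id for s in survivors])  # ['p1']
```

In this toy setting a single spatio-temporal check removes two of the three candidates, which illustrates why the abstract treats spatial and temporal tools as something the agent chooses when to invoke rather than as a fixed preprocessing step.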



Published April 14, 2026