ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search
Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon
Research context
Topics: AI Agents
Paper type: Benchmark
Best for: Useful for both
Why It Matters
ARGOS frames multi-camera person search as an interactive agent task that involves questioning and reasoning, which could let real-world surveillance systems operate under ambiguity with minimal human input.
Abstract
We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
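To make the role of the STTG concrete, the following is a minimal sketch of how a graph of camera connectivity and transition times could be used to eliminate candidates. It is not the authors' implementation: the class name `TopologyGraph`, the function `eliminate`, the camera names, and the time windows are all hypothetical, and the representation of each edge as a (min, max) window in seconds is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TopologyGraph:
    """Hypothetical stand-in for an STTG: directed camera-to-camera edges,
    each with an assumed (min_seconds, max_seconds) transition window."""
    edges: dict = field(default_factory=dict)

    def add_transition(self, src, dst, min_s, max_s):
        # Record that a person can walk from src to dst in min_s..max_s seconds.
        self.edges[(src, dst)] = (min_s, max_s)

    def is_feasible(self, src, t_src, dst, t_dst):
        # A sighting is feasible only if the cameras are connected and the
        # elapsed time falls inside the edge's transition window.
        window = self.edges.get((src, dst))
        if window is None:
            return False
        min_s, max_s = window
        return min_s <= (t_dst - t_src) <= max_s


def eliminate(graph, anchor, candidates):
    """Return ids of candidates whose sighting is reachable from the anchor."""
    cam0, t0 = anchor
    return [
        c["id"]
        for c in candidates
        if graph.is_feasible(cam0, t0, c["camera"], c["time"])
    ]


# Illustrative data only (not taken from the benchmark).
graph = TopologyGraph()
graph.add_transition("cam_A", "cam_B", 20, 60)

anchor = ("cam_A", 100)  # target last seen at cam_A at t = 100 s
candidates = [
    {"id": "p1", "camera": "cam_B", "time": 130},  # 30 s later: inside window
    {"id": "p2", "camera": "cam_B", "time": 105},  # 5 s later: too fast
    {"id": "p3", "camera": "cam_C", "time": 140},  # no edge from cam_A
]
survivors = eliminate(graph, anchor, candidates)
print(survivors)
```

In this sketch, only `p1` survives: `p2` arrives faster than the assumed minimum transition time, and `p3` appears at a camera with no recorded connection to `cam_A`. A real agent would combine such a check with the questioning and semantic steps described in the abstract.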