← Back to archive day

QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

Jeremy Qin, Maksym Andriushchenko

36

Recommendation Score

breakthrough🟡 IntermediateNLPLLM ReasoningBenchmarkUseful for both

Research context

Primary field

NLP

Language understanding, generation, extraction, and evaluation.

Topics

LLM Reasoning

Paper type

Benchmark

Best for

Useful for both

arXiv categories

cs.LGcs.AIcs.LG

Why It Matters

Introduces the first benchmark for evaluating LLMs on continuous numerical forecasting with prediction intervals, exposing critical gaps in real-world reasoning—essential for deploying LLMs in finance, healthcare, and policy decision systems.

Abstract

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90\% coverage target, with the top performers Gemini 3.1 Pro (79.1\%), Grok 4 (76.4\%), and GPT-5.4 (75.3\%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.

Published April 17, 2026
© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assistedPrivacyTerms