← Back to archive day

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang, Claire Cardie

35

Recommendation Score

breakthrough🔴 AdvancedNLPLLM ReasoningSystemBest for builders

Research context

Primary field

NLP

Language understanding, generation, extraction, and evaluation.

Topics

LLM Reasoning

Paper type

System

Best for

Best for builders

arXiv categories

cs.CLcs.LGcs.CL

Why It Matters

Introduces rubric-based self-play on pre-training text to bootstrap LLM reasoning without external reward models—enabling cost-efficient, scalable improvement of open-ended task performance with minimal supervision.

Abstract

Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.

Published April 21, 2026
© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assistedPrivacyTerms