Topic
LLM Reasoning
Papers about structured reasoning, proof solving, and long-chain problem solving.
23 papers · latest 2026-04-14
Liujie Zhang, Benzhe Ning, Rui Yang et al.
Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.
Solomon Messing
This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and the surface area for benchmark gaming.
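One standard way to check whether a reported benchmark gap survives evaluation noise is a paired bootstrap over per-item scores. This is a generic sketch of that idea, not the paper's recipe; the function name and interface are illustrative assumptions:

```python
import random

def bootstrap_gap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired-bootstrap confidence interval for the accuracy gap
    between two models scored 0/1 on the same benchmark items."""
    rng = random.Random(seed)
    n = len(scores_a)
    gaps = []
    for _ in range(n_boot):
        # Resample item indices once, apply to both models (paired).
        idx = [rng.randrange(n) for _ in range(n)]
        gap_a = sum(scores_a[i] for i in idx) / n
        gap_b = sum(scores_b[i] for i in idx) / n
        gaps.append(gap_a - gap_b)
    gaps.sort()
    lo = gaps[int(alpha / 2 * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the interval straddles zero, the "win" may be an artifact of item sampling rather than a real capability difference.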
Hadas Orgad, Boyi Wei, Kaden Zheng et al.
This mechanistic safety paper argues harmful generation is concentrated in a compact, reusable weight subspace, offering a concrete explanation for why narrow fine-tuning can trigger broad misalignment.
Chenhao Ye, Huaizheng Zhang, Mingcong Han et al.
TensorHub attacks a painful RL-systems bottleneck by serving model weights from replicas already resident on GPUs, dramatically reducing rollout stalls in elastic and cross-datacenter training.
Peng Ding
LLM-Rosetta introduces a neutral intermediate representation for major LLM APIs, giving builders a credible path away from brittle one-off provider adapters and vendor lock-in.
Runpeng Geng, Chenlong Yin, Yanting Wang et al.
A unified prompt-injection evaluation platform with adaptive attacks; it exposes how brittle many current defenses remain across tasks, making it useful core infrastructure for teams shipping tool-using or retrieval-augmented agents.
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh et al.
SUPERNOVA turns natural-instruction datasets into RL-ready supervision for general reasoning, delivering large gains beyond math and code and giving post-training teams a practical recipe for broader reasoning improvement.
Wenbo Hu, Xin Chen, Yan Gao-Tian et al.
OpenVLThinkerV2 introduces a more stable RL objective and task-shaping recipe for open multimodal reasoning, helping a generalist model balance perception with multi-step thinking across 18 visual benchmarks.
Addison J. Wu, Ryan Liu, Shuyue Stella Li et al.
This paper turns chatbot advertising into a concrete alignment problem, probing how model behavior shifts when user benefit and platform revenue diverge.
Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha et al.
Faithful GRPO adds consistency and grounding constraints to multimodal RL training, sharply reducing unfaithful visual reasoning traces while also improving final spatial reasoning accuracy.
Jianhui Liu, Haoze Sun, Wenbo Li et al.
An open-source data engine and 3M-sample dataset for spatial intelligence that lifts performance across multiple benchmarks, giving multimodal and robotics builders a reusable foundation instead of task-by-task data silos.
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al.
The first benchmark for mid-trajectory agent safety shows tool-calling guardrails often fail for structural reasons like JSON handling, not just refusal behavior, giving agent builders a more realistic red-team harness.
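The "structural failure" pattern is easy to illustrate in general terms: a guardrail that string-matches on the raw tool-call can be bypassed by JSON escaping that the executor later decodes. This is a generic toy example, not a scenario from the benchmark; the guardrail function and blocked list are invented for illustration:

```python
import json

BLOCKED = ["rm -rf"]

def naive_guardrail(raw_call: str) -> bool:
    """Flag a raw tool-call string if it contains a blocked substring."""
    return any(b in raw_call for b in BLOCKED)

# Unicode-escaping one character hides the payload from the substring
# check, but json.loads still decodes it into the dangerous command.
raw = '{"tool": "shell", "cmd": "\\u0072m -rf /tmp/x"}'
assert not naive_guardrail(raw)            # guardrail misses it
assert "rm -rf" in json.loads(raw)["cmd"]  # executor sees it anyway
```

The fix is structural, not behavioral: validate tool calls after parsing, on the decoded arguments the executor will actually run.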
Ryan Lingo, Rajeev Chhajer
A simple API-only recipe for synthetic data generation that combines memory, deduplication, and prompt evolution to stop cross-batch mode collapse and keep large generation jobs diverse.
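The deduplication leg of such a recipe can be as simple as persisting hashes of normalized samples across batches, so later generation calls cannot silently repeat earlier outputs. A minimal sketch of that idea only (the paper's pipeline also involves memory and prompt evolution, which this does not attempt); the class name and normalization are assumptions:

```python
import hashlib

class CrossBatchDeduper:
    """Persist hashes of normalized samples so later batches
    cannot repeat (near-verbatim) earlier ones."""

    def __init__(self):
        self.seen = set()

    def _key(self, text: str) -> str:
        # Cheap normalization: lowercase, collapse whitespace.
        norm = " ".join(text.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def filter_new(self, batch):
        fresh = []
        for sample in batch:
            k = self._key(sample)
            if k not in self.seen:
                self.seen.add(k)
                fresh.append(sample)
        return fresh
```

Exact-match hashing only catches verbatim repeats; real pipelines typically add fuzzy or embedding-based similarity on top.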
Nathan Lambert, Florian Brand
Maps the open-model ecosystem across downloads, derivatives, inference share, and performance, making it a useful guide to which model families are winning real adoption rather than just benchmark scores.
Tom A. Lamb, Desi R. Ivanova, Philip H. S. Torr et al.
Shows token-level temperature scaling can materially improve semantic calibration and discrimination in QA, giving builders a low-friction way to make LLM confidence scores more trustworthy.
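For context, classic (sequence-level) temperature scaling fits a single scalar T on held-out data to soften or sharpen logits before softmax; the paper's token-level variant builds on this idea. A dependency-free sketch of the classic version using grid search, not the paper's method; function names are assumptions:

```python
import math

def softmax(logits, T):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fit_temperature(logit_sets, labels, grid=None):
    """Grid-search a single temperature T that minimizes
    negative log-likelihood on held-out (logits, label) pairs."""
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(26)]  # 0.5 .. 3.0
    best_T, best_nll = 1.0, float("inf")
    for T in grid:
        nll = 0.0
        for logits, y in zip(logit_sets, labels):
            p = softmax(logits, T)[y]
            nll -= math.log(max(p, 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

Overconfident models fit T > 1 (softening), well-calibrated ones fit T near 1; the scaling changes confidence scores without changing the argmax prediction.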
Gustav Keppler, Moritz Gstür, Veit Hagenmeyer
CritBench is the first benchmark evaluating LLM agents on OT protocols like IEC 61850, exposing serious cybersecurity gaps in industrial systems and laying essential groundwork for deploying LLMs safely around critical infrastructure.
Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.
Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment, enabling robust, nuanced handling of conflicting human preferences without sacrificing performance.
Hamed Jelodar, Samita Bai, Tochukwu Emmanuel Nwankwo et al.
LLM4CodeRE adapts LLMs specifically for malware decompilation, significantly improving reverse-engineering accuracy on obfuscated code, a critical capability for automated threat analysis in cybersecurity operations.
Xiaojie Gu, Ziying Huang, Weicong Hong et al.
Exposes how LLMs can mimic knowledge edits without true memory updates, a dangerous form of surface compliance; vital for builders deploying knowledge-editing tools where factual reliability is non-negotiable.
Tianyi Zhao, Yinhan He, Wendy Zheng et al.
MCircKE mechanistically edits LLM knowledge to fix reasoning gaps, ensuring edited facts propagate in multi-step chains for reliable deployments.
LM-Provers, Yuxiao Qu, Amrith Setlur et al.
QED-Nano proves complex math theorems with a tiny, open model, no giant frontier system required. That matters because it makes high-level reasoning accessible to anyone, enabling reproducible, affordable AI that can be inspected, improved, and deployed without cloud costs.
Yang Li, Qiang Sheng, Zhengjia Wang et al.
The first system that can distinguish text written by a human, written by an LLM, edited by an LLM, or polished by a human, which is critical for content moderation and legal compliance. Simple 'AI or human' detectors no longer suffice; this provides real nuance.
Kanishk Jain, Qian Yang, Shravan Nayak et al.
Finding specific weaknesses in vision-language models usually requires slow, manual testing. This paper uses reinforcement learning to automatically discover scenarios where models fail, such as spatial reasoning errors. This automation allows teams to rapidly identify and fix blind spots that human testers might miss.