
Topic

LLM Reasoning

Papers about structured reasoning, proof solving, and long-chain problem solving.

63 papers · latest 2026-04-23


Nattavudh Powdthavee

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI · cs.HC

LLMs detect fraud better than humans and resist investor bias, challenging assumptions about AI limitations. This means AI advisors could be more reliable in high-stakes financial decisions.

Mikko Lempinen, Joni Kemppainen, Niklas Raesalmi

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.CR · cs.AI · cs.CL

Provides a modular framework for identifying and evaluating AI security vulnerabilities, helping developers build more robust and safer AI systems in critical applications.

Jingyi Zheng, Tianyi Hu, Yule Liu et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.AI · cs.CL

Creates the first benchmark dataset for detecting covert advertisements on social media, addressing a critical gap in content moderation and enabling better evaluation of multimodal AI systems.

Ryo Tamura, Haruhiko Morito, Yuna Oikawa et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.AI

Demonstrates LLMs can guide high-throughput experiments for phase diagram construction, significantly accelerating materials discovery workflows.

Yilun Liu, Chunguang Zhao, Mengyao Piao et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

Comprehensive benchmark evaluating LLM multilingual and multicultural capabilities with deep cultural analysis, essential for developing globally competent AI systems.

Chenxi Zhou, Pengfei Cao, Jiang Li et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression
cs.CL · cs.AI · cs.LG

Uncovers two distinct failure modes in 2-bit LLM quantization—enabling builders to diagnose and mitigate performance cliffs, crucial for efficient deployment of compressed models.

Yusuf Kesmen, Fay Elhassan, Jiayi Ma et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.AI · cs.CL

Separates LLM dialogue from probabilistic reasoning via BMBE, enabling reliable medical diagnostics by decoupling language fluency from clinical inference—essential for safe AI-assisted healthcare systems.

Yadong Li, Guoxin Wu, Haiping Hou et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI · cs.SD

UAF unifies full-duplex speech processing into a single audio LLM, eliminating pipeline latency and error propagation—transformative for building truly natural, real-time conversational AI with minimal latency and high fidelity.

Ruibing Hou, Mingyue Zhou, Yuwei Gui et al.

cs.CV

EgoMotion introduces the first diffusion-based framework for egocentric vision-language motion generation, enabling realistic 3D human motion synthesis from first-person views—critical for immersive VR, robotics, and human-robot interaction systems.

Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL · cs.LG

Introduces rubric-based self-play on pre-training text to bootstrap LLM reasoning without external reward models—enabling cost-efficient, scalable improvement of open-ended task performance with minimal supervision.

Yiwen Qiu, Linjuan Wu, Yizhou Liu et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

Introduces inferential boundary awareness to prevent LLMs from fabricating answers under incomplete inputs—critical for builders deploying reliable reasoning systems in real-world applications where hallucinations risk safety and trust.

Ziyang Liu

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL · cs.AI

Copy-as-Decode replaces full regeneration in LLM editing with grammar-constrained copy-gen operations, cutting latency and improving precision. This matters for real-time code and text editing systems.

Chenming Tang, Hsiu-Yuan Huang, Weijie Liu et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.CL

TRUSTEE trains tool-calling agents without labeled data or commercial models, using dynamic environment synthesis with only an 8B LLM—democratizing powerful agent training for any builder with minimal resources.

Dongxin Guo, Jikun Wu, Siu Ming Yiu

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning · Alignment & Safety
cs.LG · cs.AI

SafeAnchor reveals LLM safety is fragile and erodes cumulatively during domain adaptation. Practitioners must now actively preserve safety across updates—this is the first method to do so systematically in continual settings.

Eranga Bandara, Asanga Gunaratna, Ross Gore et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI

First on-device LLM deployment for psychiatric decision support that eliminates cloud egress, preserving patient privacy in high-risk settings. Enables real-time, compliant mental health AI without data leakage risks.

Mina Gabriel, Pei Wang

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI

Presents a neuro-symbolic pipeline translating natural language into Narsese, enabling interpretable, uncertainty-aware reasoning—vital for building trustworthy AI systems requiring explicit logic over LLM hallucinations.

Zhenwen Liang, Yujun Zhou, Sidi Lu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.LG

CUTS solves RL mode collapse in saturated reasoning by sampling from constrained top-K outputs, enabling continued learning even when models are already correct—vital for improving LLM reasoning robustness without manual data curation.

Sankalp Gilda, Shlok Gilda

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.LG · cs.LO

Embeds Peircean reasoning as algebraic invariants in LLMs, enforcing logical structure—vital for builders of reliable reasoning agents where correctness, not just fluency, is non-negotiable.

Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI

Reveals CoT prompting harms visual spatial reasoning in multimodal LLMs—forcing a rethink of reasoning paradigms in robotics, AR/VR, and vision-language systems where spatial accuracy is non-negotiable.

Jeremy Qin, Maksym Andriushchenko

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.AI

Introduces the first benchmark for evaluating LLMs on continuous numerical forecasting with prediction intervals, exposing critical gaps in real-world reasoning—essential for deploying LLMs in finance, healthcare, and policy decision systems.

Yueyang Feng, Dipesh Kafle, Vladimir Gladshtein et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.SE · cs.AI · cs.PL

This work introduces a multi-modal verifier that dynamically adjusts LLM-generated specs to be both implementable and formally sound—enabling trustworthy, automated code generation for safety-critical systems.

Xidong Wu, Yukuan Zhang, Yuqiong Ji et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CR · cs.AI

Introduces privacy-preserving LLM routing using MPC, preventing data exposure during model selection—essential for enterprises deploying multi-provider LLM APIs under strict compliance regimes.

Yao Chen, Jiawei Sheng, Wenyuan Zhang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression
cs.CL

Proposes stepwise attention distillation to transfer dynamic reasoning focus from large to small models, significantly improving small-model reasoning without requiring larger architectures—key for efficient deployment in resource-constrained systems.

Yang Wu, Jinhong Yu, Jingwei Xiong et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI · cs.HC

CoLabScience introduces proactive LLM collaboration in science, autonomously suggesting insights—transforming how researchers interact with AI, moving beyond reactive queries to true co-discovery.

Walaa Amer, Uday das, Fadi Kurdahi

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.LG · cs.CL

ConfLayers dynamically skips LLM layers based on confidence, accelerating speculative decoding without quality loss. This directly reduces inference cost for production LLM systems, making real-time reasoning more scalable and efficient.

Yang Li, Zirui Zhang, Yang Liu et al.

breakthrough · 🔴 Advanced · NLP · Fine-tuning & PEFT · LLM Reasoning
cs.AI

LACE enables LLM reasoning paths to share insights via cross-thread attention, dramatically reducing redundant failures and improving solution quality—essential for building robust, scalable reasoning systems.

Zixuan Weng, Jinghuai Zhang, Kunlin Cai et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Efficient Inference
cs.LG · cs.AI · cs.CL

FineSteer enables precise, adaptive steering of LLM behavior at inference time without retraining, offering a unified, utility-preserving method to fix hallucinations and safety issues—critical for deploying reliable AI in production.

Manan Gupta, Inderjeet Nair, Lu Wang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI · cs.CL · cs.LG

Exposes how LLM judges are manipulated by stakes signaling, undermining automated evaluation reliability—essential for anyone building or trusting LLM benchmarks, as evaluation integrity is now proven fragile.

Zihan Liang, Yufei Ma, Ben Chen et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.CL · cs.IR

IG-Search introduces step-level information gain rewards to precisely guide LLM search queries in reasoning tasks, avoiding gradient collapse—critical for building reliable search-augmented agents that avoid redundant or vague queries.

Jack Wei Lun Shi, Minghao Dang, Wawan Solihin et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.CL · cs.AI · cs.LG

First perturbation-based attribution analysis of LLMs in code compliance, revealing how fine-tuning strategies alter interpretability—essential for building trustworthy, auditable code-review AI systems.

Pushpa Kumar Balan, Aijing Feng

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI

Combines a Mamba state-space model (SSM) with LLM chain-of-thought (CoT) reasoning to filter confounding genes via causal reasoning, boosting biomarker specificity. This enables reliable, interpretable genomic discovery without manual curation, directly impacting precision medicine pipelines.

Zerun Ma, Guoqiang Wang, Xinchen Xie et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · AI Agents
cs.AI · cs.CL

TREX automates end-to-end LLM fine-tuning using multi-agent collaboration, eliminating manual hyperparameter tuning and workflow design—critical for teams scaling LLM deployment without expert ML engineers.

Haoran Lou, Ziyan Liu, Chunxiao Fan et al.

breakthrough · 🔴 Advanced · NLP · RAG · LLM Reasoning
cs.CV

SLQ enables retrieval with frozen MLLMs via shared latent queries, preserving pre-trained knowledge while avoiding costly fine-tuning. This supports scalable, stable multimodal retrieval systems.

Aviral Dawar, Roshan Karanth, Vikram Goyal et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI · cs.DB

First multilingual Text-to-SQL benchmark for Indian languages with real-world schemas, exposing critical LLM biases and enabling equitable NLP deployment in underrepresented linguistic contexts.

Songping Peng, Zhiheng Zhang, Daojian Zeng et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.AI

Coupled weight-activation constraints prevent safety drift during LLM fine-tuning, offering a theoretically grounded defense—essential for deploying reliable, safe LLMs in production without unintended harmful behavior emergence.

Lei Lin, Jizhao Zhu, Yong Liu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI

HCoT injects expert system heuristics into LLM reasoning, replacing stochastic sampling with structured, deterministic planning—transforming LLMs into reliable agents for high-stakes decision systems.

Daniil Gurgurov, Tom Röhr, Sebastian von Rohrscheidt et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

ReasonXL enables non-English LLMs to reason natively in their target language without performance loss—essential for global deployment of reasoning agents.

Xu Zhang, Xudong Gong, Jiacheng Qin et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI

Replaces single LLM scores with a 35-dimension diagnostic taxonomy for fine-grained ability analysis—essential for researchers and engineers needing to diagnose and select models based on specific cognitive strengths.

Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI

A single banned token can collapse LLM helpfulness—revealing dangerous fragility in instruction-tuned models. Practitioners must harden prompts and test for lexical vulnerabilities before deployment.

Joongmin Shin, Chanjun Park, Jeongbae Park et al.

breakthrough · 🟡 Intermediate · NLP · RAG · Multimodal Understanding
cs.AI · cs.CL

MultiDocFusion integrates vision and text to preserve structural context in long industrial documents, dramatically improving RAG accuracy—essential for enterprises relying on precise QA from complex PDFs, manuals, and reports.

Liujie Zhang, Benzhe Ning, Rui Yang et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.

Solomon Messing

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL

This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and gaming surface.

Hadas Orgad, Boyi Wei, Kaden Zheng et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL · cs.AI · cs.LG

This mechanistic safety paper argues harmful generation is concentrated in a compact, reusable weight subspace, offering a concrete explanation for why narrow fine-tuning can trigger broad misalignment.

Chenhao Ye, Huaizheng Zhang, Mingcong Han et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.DC · cs.AI

TensorHub attacks a painful RL-systems bottleneck by serving model weights from replicas already resident on GPUs, dramatically reducing rollout stalls in elastic and cross-datacenter training.

Peng Ding

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.SE · cs.AI

LLM-Rosetta introduces a neutral intermediate representation for major LLM APIs, giving builders a credible path away from brittle one-off provider adapters and vendor lock-in.

Runpeng Geng, Chenlong Yin, Yanting Wang et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CR · cs.AI · cs.CL

A unified prompt-injection evaluation platform with adaptive attacks that exposes how brittle many current defenses remain across tasks, making it useful core infrastructure for teams shipping tool-using or retrieval-augmented agents.

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.LG

SUPERNOVA turns natural-instruction datasets into RL-ready supervision for general reasoning, delivering large gains beyond math and code and giving post-training teams a practical recipe for broader reasoning improvement.

Wenbo Hu, Xin Chen, Yan Gao-Tian et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI · cs.CL

OpenVLThinkerV2 introduces a more stable RL objective and task-shaping recipe for open multimodal reasoning, helping a generalist model balance perception with multi-step thinking across 18 visual benchmarks.

Addison J. Wu, Ryan Liu, Shuyue Stella Li et al.

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.AI · cs.CL · cs.CY

This paper turns chatbot advertising into a concrete alignment problem, probing how model behavior shifts when user benefit and platform revenue diverge.

Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha et al.

significant · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI

Faithful GRPO adds consistency and grounding constraints to multimodal RL training, sharply reducing unfaithful visual reasoning traces while also improving final spatial reasoning accuracy.

Jianhui Liu, Haoze Sun, Wenbo Li et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL

An open-source data engine and 3M-sample dataset for spatial intelligence that lifts performance across multiple benchmarks, giving multimodal and robotics builders a reusable foundation instead of task-by-task data silos.

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CR · cs.AI · cs.CL

The first benchmark for mid-trajectory agent safety shows tool-calling guardrails often fail for structural reasons like JSON handling, not just refusal behavior, giving agent builders a more realistic red-team harness.

Ryan Lingo, Rajeev Chhajer

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI · cs.LG

A simple API-only recipe for synthetic data generation that combines memory, deduplication, and prompt evolution to stop cross-batch mode collapse and keep large generation jobs diverse.

Nathan Lambert, Florian Brand

significant · 🟢 Beginner · NLP · LLM Reasoning
cs.CY · cs.AI · cs.LG

Maps the open-model ecosystem across downloads, derivatives, inference share, and performance, useful for choosing which families are winning real adoption rather than just benchmarks.

Tom A. Lamb, Desi R. Ivanova, Philip H. S. Torr et al.

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG

Shows token-level temperature scaling can materially improve semantic calibration and discrimination in QA, giving builders a low-friction way to make LLM confidence scores more trustworthy.

Gustav Keppler, Moritz Gstür, Veit Hagenmeyer

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.CR · cs.AI

CritBench is the first benchmark evaluating LLM agents on OT protocols like IEC 61850, exposing critical cybersecurity gaps in industrial systems. Essential for deploying LLMs in critical infrastructure safely.

Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.

breakthrough · 🔴 Advanced · NLP · Alignment & Safety · LLM Reasoning
cs.AI

Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment—enables robust, nuanced value alignment without sacrificing performance on conflicting human preferences.

Hamed Jelodar, Samita Bai, Tochukwu Emmanuel Nwankwo et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CR · cs.AI

LLM4CodeRE adapts LLMs specifically for malware decompilation, significantly improving reverse engineering accuracy on obfuscated code—critical for automated threat analysis in cybersecurity operations.

Xiaojie Gu, Ziying Huang, Weicong Hong et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.CL · cs.AI · cs.LG

Exposes how LLMs mimic edits without true memory updates, revealing dangerous surface compliance—vital for builders deploying knowledge-editing tools where factual reliability is non-negotiable.

Tianyi Zhao, Yinhan He, Wendy Zheng et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

MCircKE mechanistically edits LLM knowledge to fix reasoning gaps, ensuring edited facts propagate in multi-step chains for reliable deployments.

LM-Provers, Yuxiao Qu, Amrith Setlur et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · LLM Reasoning
cs.AI · cs.CL · cs.LG

QED-Nano proves complex math theorems using a tiny, open model—no giant AI needed. This matters because it makes high-level reasoning accessible to anyone, enabling reproducible, affordable AI that can be inspected, improved, and deployed without cloud costs.

Yang Li, Qiang Sheng, Zhengjia Wang et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL

This is the first system that can tell if text was written by a human, edited by an LLM, written by an LLM, or polished by a human—critical for content moderation and legal compliance. You can no longer rely on simple 'AI or human' detectors; this gives you real nuance.

Kanishk Jain, Qian Yang, Shravan Nayak et al.

significant · 🟡 Intermediate · Reasoning & Agents · LLM Reasoning
cs.CV · cs.AI

Finding specific weaknesses in vision-language models usually requires slow, manual testing. This paper uses reinforcement learning to automatically discover scenarios where models fail, such as spatial reasoning errors. This automation allows teams to rapidly identify and fix blind spots that human testers might miss.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted