Topic
LLM Reasoning
Papers about structured reasoning, proof solving, and long-chain problem solving.
23 papers · latest 2026-04-14
Liujie Zhang, Benzhe Ning, Rui Yang et al.
Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.
Solomon Messing
This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and the surface area for benchmark gaming.
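One standard way to check whether a reported benchmark gap survives evaluation noise is a paired bootstrap over per-item scores. This is a generic sketch of that idea, not the paper's recipe; the function name and interface are illustrative assumptions:

```python
import random

def bootstrap_gap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired-bootstrap confidence interval for the accuracy gap
    between two models scored 0/1 on the same benchmark items."""
    rng = random.Random(seed)
    n = len(scores_a)
    gaps = []
    for _ in range(n_boot):
        # Resample item indices once, apply to both models (paired).
        idx = [rng.randrange(n) for _ in range(n)]
        gap_a = sum(scores_a[i] for i in idx) / n
        gap_b = sum(scores_b[i] for i in idx) / n
        gaps.append(gap_a - gap_b)
    gaps.sort()
    lo = gaps[int(alpha / 2 * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the interval straddles zero, the "win" may be an artifact of item sampling rather than a real capability difference.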
Hadas Orgad, Boyi Wei, Kaden Zheng et al.
This mechanistic safety paper argues harmful generation is concentrated in a compact, reusable weight subspace, offering a concrete explanation for why narrow fine-tuning can trigger broad misalignment.
Chenhao Ye, Huaizheng Zhang, Mingcong Han et al.
TensorHub attacks a painful RL-systems bottleneck by serving model weights from replicas already resident on GPUs, dramatically reducing rollout stalls in elastic and cross-datacenter training.
Peng Ding
LLM-Rosetta introduces a neutral intermediate representation for major LLM APIs, giving builders a credible path away from brittle one-off provider adapters and vendor lock-in.
Runpeng Geng, Chenlong Yin, Yanting Wang et al.
A unified prompt-injection evaluation platform with adaptive attacks; it exposes how brittle many current defenses remain across tasks, making it useful core infrastructure for teams shipping tool-using or retrieval-augmented agents.
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh et al.
SUPERNOVA turns natural-instruction datasets into RL-ready supervision for general reasoning, delivering large gains beyond math and code and giving post-training teams a practical recipe for broader reasoning improvement.
Wenbo Hu, Xin Chen, Yan Gao-Tian et al.
OpenVLThinkerV2 introduces a more stable RL objective and task-shaping recipe for open multimodal reasoning, helping a generalist model balance perception with multi-step thinking across 18 visual benchmarks.
Addison J. Wu, Ryan Liu, Shuyue Stella Li et al.
This paper turns chatbot advertising into a concrete alignment problem, probing how model behavior shifts when user benefit and platform revenue diverge.
Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha et al.
Faithful GRPO adds consistency and grounding constraints to multimodal RL training, sharply reducing unfaithful visual reasoning traces while also improving final spatial reasoning accuracy.
Jianhui Liu, Haoze Sun, Wenbo Li et al.
An open-source data engine and 3M-sample dataset for spatial intelligence that lifts performance across multiple benchmarks, giving multimodal and robotics builders a reusable foundation instead of task-by-task data silos.
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al.
The first benchmark for mid-trajectory agent safety shows tool-calling guardrails often fail for structural reasons like JSON handling, not just refusal behavior, giving agent builders a more realistic red-team harness.
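The "structural failure" pattern is easy to illustrate in general terms: a guardrail that string-matches on the raw tool-call can be bypassed by JSON escaping that the executor later decodes. This is a generic toy example, not a scenario from the benchmark; the guardrail function and blocked list are invented for illustration:

```python
import json

BLOCKED = ["rm -rf"]

def naive_guardrail(raw_call: str) -> bool:
    """Flag a raw tool-call string if it contains a blocked substring."""
    return any(b in raw_call for b in BLOCKED)

# Unicode-escaping one character hides the payload from the substring
# check, but json.loads still decodes it into the dangerous command.
raw = '{"tool": "shell", "cmd": "\\u0072m -rf /tmp/x"}'
assert not naive_guardrail(raw)            # guardrail misses it
assert "rm -rf" in json.loads(raw)["cmd"]  # executor sees it anyway
```

The fix is structural, not behavioral: validate tool calls after parsing, on the decoded arguments the executor will actually run.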
Ryan Lingo, Rajeev Chhajer
A simple API-only recipe for synthetic data generation that combines memory, deduplication, and prompt evolution to stop cross-batch mode collapse and keep large generation jobs diverse.
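The deduplication leg of such a recipe can be as simple as persisting hashes of normalized samples across batches, so later generation calls cannot silently repeat earlier outputs. A minimal sketch of that idea only (the paper's pipeline also involves memory and prompt evolution, which this does not attempt); the class name and normalization are assumptions:

```python
import hashlib

class CrossBatchDeduper:
    """Persist hashes of normalized samples so later batches
    cannot repeat (near-verbatim) earlier ones."""

    def __init__(self):
        self.seen = set()

    def _key(self, text: str) -> str:
        # Cheap normalization: lowercase, collapse whitespace.
        norm = " ".join(text.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def filter_new(self, batch):
        fresh = []
        for sample in batch:
            k = self._key(sample)
            if k not in self.seen:
                self.seen.add(k)
                fresh.append(sample)
        return fresh
```

Exact-match hashing only catches verbatim repeats; real pipelines typically add fuzzy or embedding-based similarity on top.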
Nathan Lambert, Florian Brand
Maps the open-model ecosystem across downloads, derivatives, inference share, and performance, making it a useful guide to which model families are winning real adoption rather than just benchmark scores.
Tom A. Lamb, Desi R. Ivanova, Philip H. S. Torr et al.
Shows token-level temperature scaling can materially improve semantic calibration and discrimination in QA, giving builders a low-friction way to make LLM confidence scores more trustworthy.
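For context, classic (sequence-level) temperature scaling fits a single scalar T on held-out data to soften or sharpen logits before softmax; the paper's token-level variant builds on this idea. A dependency-free sketch of the classic version using grid search, not the paper's method; function names are assumptions:

```python
import math

def softmax(logits, T):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fit_temperature(logit_sets, labels, grid=None):
    """Grid-search a single temperature T that minimizes
    negative log-likelihood on held-out (logits, label) pairs."""
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(26)]  # 0.5 .. 3.0
    best_T, best_nll = 1.0, float("inf")
    for T in grid:
        nll = 0.0
        for logits, y in zip(logit_sets, labels):
            p = softmax(logits, T)[y]
            nll -= math.log(max(p, 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

Overconfident models fit T > 1 (softening), well-calibrated ones fit T near 1; the scaling changes confidence scores without changing the argmax prediction.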
Gustav Keppler, Moritz Gstür, Veit Hagenmeyer
CritBench is the first benchmark evaluating LLM agents on OT protocols like IEC 61850, exposing serious cybersecurity gaps in industrial systems and laying essential groundwork for deploying LLMs safely around critical infrastructure.
Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.
Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment, enabling robust, nuanced handling of conflicting human preferences without sacrificing performance.
Hamed Jelodar, Samita Bai, Tochukwu Emmanuel Nwankwo et al.
LLM4CodeRE adapts LLMs specifically for malware decompilation, significantly improving reverse-engineering accuracy on obfuscated code, a critical capability for automated threat analysis in cybersecurity operations.
Xiaojie Gu, Ziying Huang, Weicong Hong et al.
Exposes how LLMs can mimic knowledge edits without true memory updates, a dangerous form of surface compliance; vital for builders deploying knowledge-editing tools where factual reliability is non-negotiable.
Tianyi Zhao, Yinhan He, Wendy Zheng et al.
MCircKE mechanistically edits LLM knowledge to fix reasoning gaps, ensuring edited facts propagate in multi-step chains for reliable deployments.
LM-Provers, Yuxiao Qu, Amrith Setlur et al.
QED-Nano proves complex math theorems with a tiny, open model, no giant frontier system required. That matters because it makes high-level reasoning accessible to anyone, enabling reproducible, affordable AI that can be inspected, improved, and deployed without cloud costs.
Yang Li, Qiang Sheng, Zhengjia Wang et al.
The first system that can distinguish text written by a human, written by an LLM, edited by an LLM, or polished by a human, which is critical for content moderation and legal compliance. Simple 'AI or human' detectors no longer suffice; this provides real nuance.
Kanishk Jain, Qian Yang, Shravan Nayak et al.
Finding specific weaknesses in vision-language models usually requires slow, manual testing. This paper uses reinforcement learning to automatically discover scenarios where models fail, such as spatial reasoning errors. This automation allows teams to rapidly identify and fix blind spots that human testers might miss.