Field

NLP

Language understanding, generation, extraction, and evaluation.

62 papers · latest 2026-04-23

Nattavudh Powdthavee

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI · cs.HC

LLMs detect fraud better than humans and resist investor bias, challenging assumptions about AI limitations. This means AI advisors could be more reliable in high-stakes financial decisions.

Mikko Lempinen, Joni Kemppainen, Niklas Raesalmi

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.CR · cs.AI · cs.CL

Provides a modular framework for identifying and evaluating AI security vulnerabilities, helping developers build more robust and safer AI systems in critical applications.

Jingyi Zheng, Tianyi Hu, Yule Liu et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.AI · cs.CL

Creates the first benchmark dataset for detecting covert advertisements on social media, addressing a critical gap in content moderation and enabling better evaluation of multimodal AI systems.

Peng Peng, Weiwei Lin, Wentai Wu et al.

significant · 🔴 Advanced · NLP · RAG
cs.IR · cs.CL

Proposes HaS, a speculative retrieval method that accelerates RAG systems by leveraging homology-aware caching, reducing latency without accuracy loss in large-scale knowledge retrieval.

Ryo Tamura, Haruhiko Morito, Yuna Oikawa et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.AI

Demonstrates LLMs can guide high-throughput experiments for phase diagram construction, significantly accelerating materials discovery workflows.

Yilun Liu, Chunguang Zhao, Mengyao Piao et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

Comprehensive benchmark evaluating LLM multilingual and multicultural capabilities with deep cultural analysis, essential for developing globally competent AI systems.

Chenxi Zhou, Pengfei Cao, Jiang Li et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression
cs.CL · cs.AI · cs.LG

Uncovers two distinct failure modes in 2-bit LLM quantization—enabling builders to diagnose and mitigate performance cliffs, crucial for efficient deployment of compressed models.

Yusuf Kesmen, Fay Elhassan, Jiayi Ma et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.AI · cs.CL

Separates LLM dialogue from probabilistic reasoning via BMBE, enabling reliable medical diagnostics by decoupling language fluency from clinical inference—essential for safe AI-assisted healthcare systems.

Yadong Li, Guoxin Wu, Haiping Hou et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI · cs.SD

UAF unifies full-duplex speech processing into a single audio LLM, removing pipeline latency and error propagation. This is a step toward natural, real-time, high-fidelity conversational AI.

Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL · cs.LG

Introduces rubric-based self-play on pre-training text to bootstrap LLM reasoning without external reward models—enabling cost-efficient, scalable improvement of open-ended task performance with minimal supervision.

Yiwen Qiu, Linjuan Wu, Yizhou Liu et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

Introduces inferential boundary awareness to prevent LLMs from fabricating answers under incomplete inputs—critical for builders deploying reliable reasoning systems in real-world applications where hallucinations risk safety and trust.

Ziyang Liu

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL · cs.AI

Copy-as-Decode revolutionizes LLM editing by replacing full regeneration with grammar-constrained copy-gen operations, slashing latency and improving precision—critical for real-time code/text editing systems.

Chenming Tang, Hsiu-Yuan Huang, Weijie Liu et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.CL

TRUSTEE trains tool-calling agents without labeled data or commercial models, using dynamic environment synthesis with only an 8B LLM—democratizing powerful agent training for any builder with minimal resources.

Dongxin Guo, Jikun Wu, Siu Ming Yiu

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning · Alignment & Safety
cs.LG · cs.AI

SafeAnchor reveals LLM safety is fragile and erodes cumulatively during domain adaptation. Practitioners must now actively preserve safety across updates—this is the first method to do so systematically in continual settings.

Eranga Bandara, Asanga Gunaratna, Ross Gore et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI

First on-device LLM deployment for psychiatric decision support that eliminates cloud egress, preserving patient privacy in high-risk settings. Enables real-time, compliant mental health AI without data leakage risks.

Wentao Zhang, Yan Zhuang, ZhuHang Zheng et al.

breakthrough · 🔴 Advanced · NLP · RAG
cs.CR · cs.AI

DEJA exposes stealthy RAG failures that mimic valid responses, forcing a paradigm shift in security evaluation—essential for deploying reliable RAG systems that must detect subtle, non-obvious degradation.

Jeremy Qin, Maksym Andriushchenko

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG · cs.AI

Introduces the first benchmark for evaluating LLMs on continuous numerical forecasting with prediction intervals, exposing critical gaps in real-world reasoning—essential for deploying LLMs in finance, healthcare, and policy decision systems.

Hyunseok Park, Jihyeon Kim, Jongeun Kim et al.

breakthrough · 🟡 Intermediate · NLP · RAG
cs.CL

CHOP reduces RAG hallucinations by iteratively chunking and reassembling documents with LLMs—directly improving factual accuracy in production systems without requiring retraining or new embeddings.

Xidong Wu, Yukuan Zhang, Yuqiong Ji et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CR · cs.AI

Introduces privacy-preserving LLM routing using MPC, preventing data exposure during model selection—essential for enterprises deploying multi-provider LLM APIs under strict compliance regimes.

Yao Chen, Jiawei Sheng, Wenyuan Zhang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression
cs.CL

Proposes stepwise attention distillation to transfer dynamic reasoning focus from large to small models, significantly improving small-model reasoning without requiring larger architectures—key for efficient deployment in resource-constrained systems.

Yang Wu, Jinhong Yu, Jingwei Xiong et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI · cs.HC

CoLabScience introduces proactive LLM collaboration in science, autonomously suggesting insights—transforming how researchers interact with AI, moving beyond reactive queries to true co-discovery.

Walaa Amer, Uday Das, Fadi Kurdahi

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.LG · cs.CL

ConfLayers dynamically skips LLM layers based on confidence, accelerating speculative decoding without quality loss. This directly reduces inference cost for production LLM systems, making real-time reasoning more scalable and efficient.

Yang Li, Zirui Zhang, Yang Liu et al.

breakthrough · 🔴 Advanced · NLP · Fine-tuning & PEFT · LLM Reasoning
cs.AI

LACE enables LLM reasoning paths to share insights via cross-thread attention, dramatically reducing redundant failures and improving solution quality—essential for building robust, scalable reasoning systems.

Zixuan Weng, Jinghuai Zhang, Kunlin Cai et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Efficient Inference
cs.LG · cs.AI · cs.CL

FineSteer enables precise, adaptive steering of LLM behavior at inference time without retraining, offering a unified, utility-preserving method to fix hallucinations and safety issues—critical for deploying reliable AI in production.

Manan Gupta, Inderjeet Nair, Lu Wang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI · cs.CL · cs.LG

Exposes how LLM judges are manipulated by stakes signaling, undermining automated evaluation reliability—essential for anyone building or trusting LLM benchmarks, as evaluation integrity is now proven fragile.

Jack Wei Lun Shi, Minghao Dang, Wawan Solihin et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.CL · cs.AI · cs.LG

First perturbation-based attribution analysis of LLMs in code compliance, revealing how fine-tuning strategies alter interpretability—essential for building trustworthy, auditable code-review AI systems.

Zerun Ma, Guoqiang Wang, Xinchen Xie et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · AI Agents
cs.AI · cs.CL

TREX automates end-to-end LLM fine-tuning using multi-agent collaboration, eliminating manual hyperparameter tuning and workflow design—critical for teams scaling LLM deployment without expert ML engineers.

Zekai Lin, Chao Xue, Di Liang et al.

breakthrough · 🔴 Advanced · NLP · Fine-tuning & PEFT
cs.LG · cs.CL

Demonstrates parameter importance evolves during fine-tuning, introducing dynamic isolation that outperforms static PEFT methods—essential for efficient, stable LLM adaptation in production.

Haoran Lou, Ziyan Liu, Chunxiao Fan et al.

breakthrough · 🔴 Advanced · NLP · RAG · LLM Reasoning
cs.CV

SLQ enables retrieval with frozen MLLMs via shared latent queries—preserving pre-trained knowledge while avoiding costly fine-tuning, a game-changer for scalable, stable multimodal retrieval systems.

Aviral Dawar, Roshan Karanth, Vikram Goyal et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI · cs.DB

First multilingual Text-to-SQL benchmark for Indian languages with real-world schemas, exposing critical LLM biases and enabling equitable NLP deployment in underrepresented linguistic contexts.

Sunkyung Lee, Jihye Back, Donghyeon Jeon et al.

breakthrough · 🟡 Intermediate · NLP · RAG
cs.IR · cs.CL

Introduces authority-aware generation in retrieval, directly improving trustworthiness in high-stakes domains by biasing LLMs toward credible sources—not just relevance—enabling safer deployment in healthcare and finance.

Zihao Liu, Hantao Zhou, Jiguo Li et al.

breakthrough · 🟡 Intermediate · NLP · Alignment & Safety
cs.CL

MUSE delivers consistent, multi-domain Chinese user simulations via self-evolving profiles. Practitioners building chat systems for Chinese markets can now train and evaluate agents at scale with realistic personas.

Sohyun An, Hayeon Lee, Shuibenyang Yuan et al.

breakthrough · 🔴 Advanced · NLP · RAG
cs.IR · cs.AI

FRESCO introduces dynamic evaluation for RAG re-rankers under evolving data, exposing severe performance drops in static benchmarks. Builders must test re-rankers with temporal drift to ensure real-world reliability.

Songping Peng, Zhiheng Zhang, Daojian Zeng et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.AI

Coupled weight-activation constraints prevent safety drift during LLM fine-tuning, offering a theoretically grounded defense—essential for deploying reliable, safe LLMs in production without unintended harmful behavior emergence.

Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian et al.

breakthrough · 🔴 Advanced · NLP · Alignment & Safety
cs.CL · cs.AI

HETA introduces the first Hessian-based attribution method for autoregressive LLMs, capturing non-linear causal dependencies in token generation—essential for building reliable, interpretable generative systems in production.

Daniil Gurgurov, Tom Röhr, Sebastian von Rohrscheidt et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

ReasonXL enables non-English LLMs to reason natively in their target language without performance loss—essential for global deployment of reasoning agents.

Xu Zhang, Xudong Gong, Jiacheng Qin et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.AI

Replaces single LLM scores with a 35-dimension diagnostic taxonomy for fine-grained ability analysis—essential for researchers and engineers needing to diagnose and select models based on specific cognitive strengths.

Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI

A single banned token can collapse LLM helpfulness—revealing dangerous fragility in instruction-tuned models. Practitioners must harden prompts and test for lexical vulnerabilities before deployment.

Joongmin Shin, Chanjun Park, Jeongbae Park et al.

breakthrough · 🟡 Intermediate · NLP · RAG · Multimodal Understanding
cs.AI · cs.CL

MultiDocFusion integrates vision and text to preserve structural context in long industrial documents, dramatically improving RAG accuracy—essential for enterprises relying on precise QA from complex PDFs, manuals, and reports.

Liujie Zhang, Benzhe Ning, Rui Yang et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.

Bo Li, Mingda Wang, Gexiang Fang et al.

significant · 🔴 Advanced · NLP · RAG
cs.CL · cs.AI

GRIP turns retrieval into a native decoding action so the model can decide when to search, rewrite queries, and stop inside one reasoning trace instead of bolting on a controller.

Artem Gadzhiev, Andrew Kislov

significant · 🟡 Intermediate · NLP · RAG
cs.CL · cs.AI · cs.LG

Synthius-Mem replaces retrieval-heavy agent memory with structured persona memory, improving both long-term recall and adversarial robustness against invented facts.

Solomon Messing

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL

This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and gaming surface.

Kyle Whitecross, Negin Rahimi

significant · 🔴 Advanced · NLP · RAG
cs.CL · cs.AI · cs.IR

RecaLLM tackles the lost-in-thought problem by interleaving reasoning with explicit in-context retrieval, giving long-context models a practical way to stay grounded at up to 128K tokens.

Hadas Orgad, Boyi Wei, Kaden Zheng et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CL · cs.AI · cs.LG

This mechanistic safety paper argues harmful generation is concentrated in a compact, reusable weight subspace, offering a concrete explanation for why narrow fine-tuning can trigger broad misalignment.

Chenhao Ye, Huaizheng Zhang, Mingcong Han et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.DC · cs.AI

TensorHub attacks a painful RL-systems bottleneck by serving model weights from replicas already resident on GPUs, dramatically reducing rollout stalls in elastic and cross-datacenter training.

Peng Ding

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.SE · cs.AI

LLM-Rosetta introduces a neutral intermediate representation for major LLM APIs, giving builders a credible path away from brittle one-off provider adapters and vendor lock-in.

Runpeng Geng, Chenlong Yin, Yanting Wang et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CR · cs.AI · cs.CL

A unified prompt-injection evaluation platform with adaptive attacks that exposes how brittle many current defenses remain across tasks, making it useful core infrastructure for teams shipping tool-using or retrieval-augmented agents.

Addison J. Wu, Ryan Liu, Shuyue Stella Li et al.

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.AI · cs.CL · cs.CY

This paper turns chatbot advertising into a concrete alignment problem, probing how model behavior shifts when user benefit and platform revenue diverge.

Jianhui Liu, Haoze Sun, Wenbo Li et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL

An open-source data engine and 3M-sample dataset for spatial intelligence that lifts performance across multiple benchmarks, giving multimodal and robotics builders a reusable foundation instead of task-by-task data silos.

Qiyao Ma, Dechen Gao, Rui Cai et al.

breakthrough · 🟡 Intermediate · NLP · Alignment & Safety
cs.CL · cs.LG

A benchmark for personalized reward modeling that tracks downstream BoN and PPO performance, showing today's reward models still struggle to capture user-specific preferences that matter for aligned products.

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CR · cs.AI · cs.CL

The first benchmark for mid-trajectory agent safety shows tool-calling guardrails often fail for structural reasons like JSON handling, not just refusal behavior, giving agent builders a more realistic red-team harness.

Ryan Lingo, Rajeev Chhajer

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL · cs.AI · cs.LG

A simple API-only recipe for synthetic data generation that combines memory, deduplication, and prompt evolution to stop cross-batch mode collapse and keep large generation jobs diverse.

Nathan Lambert, Florian Brand

significant · 🟢 Beginner · NLP · LLM Reasoning
cs.CY · cs.AI · cs.LG

Maps the open-model ecosystem across downloads, derivatives, inference share, and performance, useful for choosing which families are winning real adoption rather than just benchmarks.

Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek et al.

significant · 🟡 Intermediate · NLP · RAG · Alignment & Safety
cs.IR · cs.CV

Shows multimodal retrieval is often a query-alignment problem, not an encoder problem, and beats strong baselines by rewriting image-text queries into retrieval-optimized text.

Tom A. Lamb, Desi R. Ivanova, Philip H. S. Torr et al.

significant · 🟡 Intermediate · NLP · LLM Reasoning
cs.LG

Shows token-level temperature scaling can materially improve semantic calibration and discrimination in QA, giving builders a low-friction way to make LLM confidence scores more trustworthy.

Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman et al.

incremental · 🟡 Intermediate · NLP · RAG
cs.CL · cs.AI · cs.LG

A careful 40-setting RAG study shows dense retrieval, query reformulation, and reranking matter more than many heavyweight choices, offering practical tuning guidance that extends beyond medical QA.

Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.

breakthrough · 🔴 Advanced · NLP · Alignment & Safety · LLM Reasoning
cs.AI

Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment—enables robust, nuanced value alignment without sacrificing performance on conflicting human preferences.

Hamed Jelodar, Samita Bai, Tochukwu Emmanuel Nwankwo et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning
cs.CR · cs.AI

LLM4CodeRE adapts LLMs specifically for malware decompilation, significantly improving reverse engineering accuracy on obfuscated code—critical for automated threat analysis in cybersecurity operations.

Xiaojie Gu, Ziying Huang, Weicong Hong et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.CL · cs.AI · cs.LG

Exposes how LLMs mimic edits without true memory updates, revealing dangerous surface compliance—vital for builders deploying knowledge-editing tools where factual reliability is non-negotiable.

Tianyi Zhao, Yinhan He, Wendy Zheng et al.

significant · 🔴 Advanced · NLP · LLM Reasoning
cs.CL

MCircKE mechanistically edits LLM knowledge to fix reasoning gaps, ensuring edited facts propagate in multi-step chains for reliable deployments.

Yang Li, Qiang Sheng, Zhengjia Wang et al.

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning
cs.CL

This is the first system that can tell whether text was written by a human, edited by an LLM, written by an LLM, or polished by a human, which matters for content moderation and legal compliance. Simple 'AI or human' detectors miss these mixed cases.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted