NLP
Language understanding, generation, extraction, and evaluation.
23 papers · latest 2026-04-14
Liujie Zhang, Benzhe Ning, Rui Yang et al.
Relax is an open asynchronous RL engine for omni-modal post-training that doubles throughput on Qwen3-Omni-scale runs without sacrificing convergence.
Bo Li, Mingda Wang, Gexiang Fang et al.
GRIP turns retrieval into a native decoding action so the model can decide when to search, rewrite queries, and stop inside one reasoning trace instead of bolting on a controller.
Artem Gadzhiev, Andrew Kislov
Synthius-Mem replaces retrieval-heavy agent memory with structured persona memory, improving both long-term recall and adversarial robustness against invented facts.
Solomon Messing
This work shows how prompt wording, judge choice, and temperature can flip LLM eval results, then gives a budget-aware recipe that materially reduces benchmark noise and gaming surface.
Kyle Whitecross, Negin Rahimi
RecaLLM tackles the lost-in-thought problem by interleaving reasoning with explicit in-context retrieval, giving long-context models a practical way to stay grounded at up to 128K tokens.
Hadas Orgad, Boyi Wei, Kaden Zheng et al.
This mechanistic safety paper argues harmful generation is concentrated in a compact, reusable weight subspace, offering a concrete explanation for why narrow fine-tuning can trigger broad misalignment.
Chenhao Ye, Huaizheng Zhang, Mingcong Han et al.
TensorHub attacks a painful RL-systems bottleneck by serving model weights from replicas already resident on GPUs, dramatically reducing rollout stalls in elastic and cross-datacenter training.
Peng Ding
LLM-Rosetta introduces a neutral intermediate representation for major LLM APIs, giving builders a credible path away from brittle one-off provider adapters and vendor lock-in.
Runpeng Geng, Chenlong Yin, Yanting Wang et al.
A unified prompt-injection evaluation platform with adaptive attacks that exposes how brittle many current defenses remain across tasks, making it useful core infrastructure for teams shipping tool-using or retrieval-augmented agents.
Addison J. Wu, Ryan Liu, Shuyue Stella Li et al.
This paper turns chatbot advertising into a concrete alignment problem, probing how model behavior shifts when user benefit and platform revenue diverge.
Jianhui Liu, Haoze Sun, Wenbo Li et al.
An open-source data engine and 3M-sample dataset for spatial intelligence that lifts performance across multiple benchmarks, giving multimodal and robotics builders a reusable foundation instead of task-by-task data silos.
Qiyao Ma, Dechen Gao, Rui Cai et al.
A benchmark for personalized reward modeling that tracks downstream best-of-N (BoN) sampling and PPO performance, showing today's reward models still struggle to capture the user-specific preferences that matter for aligned products.
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al.
The first benchmark for mid-trajectory agent safety shows tool-calling guardrails often fail for structural reasons like JSON handling, not just refusal behavior, giving agent builders a more realistic red-team harness.
Ryan Lingo, Rajeev Chhajer
A simple API-only recipe for synthetic data generation that combines memory, deduplication, and prompt evolution to stop cross-batch mode collapse and keep large generation jobs diverse.
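The deduplication piece of such a recipe can be sketched in a few lines. This is an illustrative hash-based cross-batch filter under my own assumptions (the `DedupMemory` name and normalization scheme are hypothetical), not the paper's actual pipeline, which also layers in memory and prompt evolution:

```python
import hashlib

class DedupMemory:
    """Cross-batch deduplication via normalized-text hashing.

    Keeps a persistent set of content hashes so that later generation
    batches can be filtered against everything produced so far.
    """

    def __init__(self):
        self.seen = set()

    def _key(self, text):
        # Normalize case and whitespace so trivial variants collide.
        norm = " ".join(text.lower().split())
        return hashlib.sha256(norm.encode("utf-8")).hexdigest()

    def filter_new(self, batch):
        """Return only samples not seen in any earlier batch."""
        fresh = []
        for sample in batch:
            k = self._key(sample)
            if k not in self.seen:
                self.seen.add(k)
                fresh.append(sample)
        return fresh

mem = DedupMemory()
batch1 = mem.filter_new(["Example A", "Example B"])
batch2 = mem.filter_new(["example  a", "Example C"])  # "example  a" normalizes to a repeat
```

Exact-hash matching only catches near-verbatim repeats; a production version would likely pair it with semantic similarity to catch paraphrased mode collapse.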
Nathan Lambert, Florian Brand
Maps the open-model ecosystem across downloads, derivatives, inference share, and performance, making it useful for identifying which model families are winning real adoption rather than just benchmark scores.
Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek et al.
Shows multimodal retrieval is often a query-alignment problem, not an encoder problem, and beats strong baselines by rewriting image-text queries into retrieval-optimized text.
Tom A. Lamb, Desi R. Ivanova, Philip H. S. Torr et al.
Shows token-level temperature scaling can materially improve semantic calibration and discrimination in QA, giving builders a low-friction way to make LLM confidence scores more trustworthy.
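The basic mechanism of temperature scaling is simple enough to show directly. This is a minimal sketch of plain temperature-scaled softmax, not the paper's token-level method (which fits temperatures per token rather than globally):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: T < 1 sharpens the distribution,
    T > 1 flattens it, reducing overconfidence."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharpened = softmax(logits, temperature=0.5)  # more peaked, higher top probability
flattened = softmax(logits, temperature=1.5)  # softer, better calibrated when overconfident
```

In calibration work, the temperature is typically fit on a held-out set by minimizing negative log-likelihood, leaving the model's argmax predictions unchanged while adjusting its confidence.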
A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman et al.
A careful 40-setting RAG study shows dense retrieval, query reformulation, and reranking matter more than many heavyweight choices, offering practical tuning guidance that extends beyond medical QA.
Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.
Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment, enabling robust, nuanced value alignment without sacrificing performance when human preferences conflict.
Hamed Jelodar, Samita Bai, Tochukwu Emmanuel Nwankwo et al.
LLM4CodeRE adapts LLMs specifically for malware decompilation, significantly improving reverse-engineering accuracy on obfuscated code, which matters for automated threat analysis in cybersecurity operations.
Xiaojie Gu, Ziying Huang, Weicong Hong et al.
Exposes how LLMs can mimic knowledge edits without true memory updates, a form of surface compliance that matters for builders deploying knowledge-editing tools where factual reliability is non-negotiable.
Tianyi Zhao, Yinhan He, Wendy Zheng et al.
MCircKE mechanistically edits LLM knowledge to fix reasoning gaps, ensuring edited facts propagate in multi-step chains for reliable deployments.
Yang Li, Qiang Sheng, Zhengjia Wang et al.
The first system to distinguish fine-grained authorship categories: human-written, human text polished by an LLM, LLM-written, and LLM text polished by a human. That nuance matters for content moderation and legal compliance, where simple binary "AI or human" detectors no longer suffice.