Topic

Alignment & Safety

Alignment, preference learning, robustness, and safe deployment.

17 papers · latest 2026-04-21

Most active fields for this topic

NLP · 9, Reasoning & Agents · 8

SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

Dongxin Guo, Jikun Wu, Siu Ming Yiu

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning · Alignment & Safety

cs.LG · cs.AI

SafeAnchor shows that LLM safety alignment is fragile and erodes cumulatively across successive rounds of domain adaptation. Presented as the first method to systematically preserve safety in continual settings, it implies that practitioners need to actively protect safety across updates.

Using large language models for embodied planning introduces systematic safety risks

Tao Zhang, Kaixian Qu, Zhibin Li et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety · Embodied Agents

cs.AI · cs.LG · cs.RO

DESPITE reveals that even highly accurate LLM planners can systematically fail safety-critical tasks, exposing a gap between planning accuracy and real-world safety that must be addressed before deploying robots in human environments.

Mind DeepResearch Technical Report

MindDR Team, Li Auto Inc

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety

cs.AI

Reports leading deep-research performance with 30B models, using a novel three-agent architecture and specialized training. The results suggest that high capability does not require trillion-parameter models, which improves the cost-efficiency of autonomous AI systems.

LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

Jack Wei Lun Shi, Minghao Dang, Wawan Solihin et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety

cs.CL · cs.AI · cs.LG

First perturbation-based attribution analysis of LLMs for automated code compliance, revealing how fine-tuning strategy and model scale alter interpretability. The findings are relevant to building trustworthy, auditable compliance-checking systems.

Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety

Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety

cs.LG

Presented as the first deep reinforcement learning (DRL) system to integrate real-time drowsiness detection with adaptive braking, with the aim of improving road safety. It is a useful reference for practitioners building safety-critical systems that respond to the driver's state.

MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment

Zihao Liu, Hantao Zhou, Jiguo Li et al.

breakthrough · 🟡 Intermediate · NLP · Alignment & Safety

cs.CL

MUSE delivers consistent, multi-domain Chinese user simulations via self-evolving profiles. Practitioners building chat systems for Chinese markets can now train and evaluate agents at scale with realistic personas.

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

You Qin, Linqing Wang, Hao Fei et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety

cs.LG · cs.AI

SOAR closes the SFT-RL gap in diffusion models by enabling self-correction during inference, improving alignment and robustness. This matters for deploying safe, reliable generative systems under real-world distribution shift.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng, Zhiheng Zhang, Daojian Zeng et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety

cs.AI

Coupled weight and activation constraints prevent safety drift during LLM fine-tuning, offering a theoretically grounded defense against the unintended emergence of harmful behavior in production LLMs.

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian et al.

breakthrough · 🔴 Advanced · NLP · Alignment & Safety

cs.CL · cs.AI

HETA is presented as the first Hessian-based attribution method for autoregressive LLMs. It captures non-linear causal dependencies in token generation, which supports building reliable, interpretable generative systems in production.

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

Nastaran Darabi, Amit Ranjan Trivedi

significant · 🔴 Advanced · Reasoning & Agents · Embodied Agents · Alignment & Safety

cs.RO · cs.CL · cs.CV

ProGAL-VLA adds verified grounding and prospective sub-goals to VLA robots, sharply improving instruction sensitivity, ambiguity handling, and robustness under perturbation.

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma, Dechen Gao, Rui Cai et al.

breakthrough · 🟡 Intermediate · NLP · Alignment & Safety

cs.CL · cs.LG

A benchmark for personalized reward modeling that tracks downstream best-of-N (BoN) sampling and PPO performance, showing that today's reward models still struggle to capture the user-specific preferences that matter for aligned products.

BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek et al.

significant · 🟡 Intermediate · NLP · RAG · Alignment & Safety

cs.IR · cs.CV

Shows multimodal retrieval is often a query-alignment problem, not an encoder problem, and beats strong baselines by rewriting image-text queries into retrieval-optimized text.

Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.

breakthrough · 🔴 Advanced · NLP · Alignment & Safety · LLM Reasoning

cs.AI

Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment, enabling robust, nuanced value alignment without sacrificing performance when human preferences conflict.

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · AI Agents · Alignment & Safety

cs.AI

Claw-Eval introduces transparent, safety-aware, multimodal evaluation for autonomous agents, addressing critical gaps in benchmarking. Such evaluation is a prerequisite for building trustworthy, real-world AI agents.

The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models

Xiaojie Gu, Ziying Huang, Weicong Hong et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety

cs.CL · cs.AI · cs.LG

Exposes how LLMs can mimic knowledge edits without truly updating what they have stored, a failure mode the paper calls surface compliance. This matters for builders deploying knowledge-editing tools where factual reliability is non-negotiable.

AI Assistance Reduces Persistence and Hurts Independent Performance

Grace Liu, Brian Christian, Tsvetomira Dumbalska et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety

cs.AI

AI assistants that always answer quickly reduce users' persistence and hurt their performance when they later work alone. The study offers evidence that a good assistant should sometimes withhold the direct answer, an important consideration for designers of educational and productivity tools.

Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices

Alexis Burgon, Berkman Sahiner, Nicholas A Petrick et al.

significant · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety

cs.AI · cs.PF

This work introduces a standardized framework for evaluating AI medical devices that learn and adapt over time, addressing a major regulatory bottleneck. It provides clear metrics to distinguish a model that is actually improving from one that is merely memorizing new data, which is critical for getting adaptive AI approved for clinical use.
