Topic

Alignment & Safety

Alignment, preference learning, robustness, and safe deployment.

17 papers · latest 2026-04-21

Dongxin Guo, Jikun Wu, Siu Ming Yiu

breakthrough · 🟡 Intermediate · NLP · LLM Reasoning · Alignment & Safety
cs.LG · cs.AI

SafeAnchor shows that LLM safety alignment is fragile and erodes cumulatively during domain adaptation, so practitioners need to actively preserve safety across updates. It is presented as the first method to do so systematically in continual settings.

Tao Zhang, Kaixian Qu, Zhibin Li et al.

cs.AI · cs.LG · cs.RO

DESPITE shows that even highly accurate LLM planners can systematically fail safety-critical tasks, exposing a gap between planning accuracy and real-world safety that matters for deploying robots in human environments.

MindDR Team, Li Auto Inc

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety
cs.AI

Demonstrates leading deep-research performance with 30B models via a three-agent architecture and specialized training, suggesting that high capability does not require trillion-parameter models, with implications for the cost-efficiency of autonomous AI systems.

Jack Wei Lun Shi, Minghao Dang, Wawan Solihin et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.CL · cs.AI · cs.LG

First perturbation-based attribution analysis of LLMs in code compliance, revealing how fine-tuning strategies alter interpretability. The findings are relevant to building trustworthy, auditable code-review AI systems.

Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety
cs.LG

First deep reinforcement learning (DRL) system to integrate real-time drowsiness detection with adaptive braking, aimed at improving road safety. It is relevant to practitioners building life-critical AI systems that respond to human state.

Zihao Liu, Hantao Zhou, Jiguo Li et al.

breakthrough · 🟡 Intermediate · NLP · Alignment & Safety
cs.CL

MUSE delivers consistent, multi-domain Chinese user simulations via self-evolving profiles. Practitioners building chat systems for Chinese markets can now train and evaluate agents at scale with realistic personas.

You Qin, Linqing Wang, Hao Fei et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · Alignment & Safety
cs.LG · cs.AI

SOAR closes the SFT-RL gap in diffusion models by enabling self-correction during inference, improving alignment and robustness. This matters for deploying safe, reliable generative systems under real-world distribution shifts.

Songping Peng, Zhiheng Zhang, Daojian Zeng et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.AI

Coupled weight-activation constraints prevent safety drift during LLM fine-tuning, offering a theoretically grounded defense for deploying reliable, safe LLMs in production without harmful behavior emerging unintentionally.

Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian et al.

breakthrough · 🔴 Advanced · NLP · Alignment & Safety
cs.CL · cs.AI

HETA introduces the first Hessian-based attribution method for autoregressive LLMs, capturing non-linear causal dependencies in token generation—essential for building reliable, interpretable generative systems in production.

Nastaran Darabi, Amit Ranjan Trivedi

cs.RO · cs.CL · cs.CV

ProGAL-VLA adds verified grounding and prospective sub-goals to VLA robots, sharply improving instruction sensitivity, ambiguity handling, and robustness under perturbation.

Qiyao Ma, Dechen Gao, Rui Cai et al.

breakthrough · 🟡 Intermediate · NLP · Alignment & Safety
cs.CL · cs.LG

A benchmark for personalized reward modeling that tracks downstream best-of-N (BoN) and PPO performance, showing that today's reward models still struggle to capture the user-specific preferences that matter for aligned products.

Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek et al.

significant · 🟡 Intermediate · NLP · RAG · Alignment & Safety
cs.IR · cs.CV

Shows multimodal retrieval is often a query-alignment problem, not an encoder problem, and beats strong baselines by rewriting image-text queries into retrieval-optimized text.

Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.

breakthrough · 🔴 Advanced · NLP · Alignment & Safety · LLM Reasoning
cs.AI

Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment, enabling robust, nuanced value alignment without sacrificing performance on conflicting human preferences.

Bowen Ye, Rang Li, Qibin Yang et al.

cs.AI

Claw-Eval introduces transparent, safety-aware, multimodal evaluation for autonomous agents, addressing gaps in current benchmarking that matter for building trustworthy, real-world AI agents.

Xiaojie Gu, Ziying Huang, Weicong Hong et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Alignment & Safety
cs.CL · cs.AI · cs.LG

Exposes how LLMs mimic edits without true memory updates, revealing dangerous surface compliance. This is vital for builders deploying knowledge-editing tools where factual reliability is non-negotiable.

Grace Liu, Brian Christian, Tsvetomira Dumbalska et al.

breakthrough · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety
cs.AI

AI assistants that always answer quickly can make users dependent and worse at thinking on their own. The paper offers evidence that a good assistant should sometimes say 'figure it out', a point designers of educational or productivity tools should take seriously.

Alexis Burgon, Berkman Sahiner, Nicholas A Petrick et al.

significant · 🟡 Intermediate · Reasoning & Agents · Alignment & Safety
cs.AI · cs.PF

This work introduces a standardized framework to evaluate AI medical devices that learn and adapt over time, solving a major regulatory bottleneck. It provides clear metrics to distinguish between a model actually improving versus just memorizing new data, which is critical for getting adaptive AI approved for clinical use.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted