Topic
Alignment & Safety
Alignment, preference learning, robustness, and safe deployment.
8 papers · latest 2026-04-13
Nastaran Darabi, Amit Ranjan Trivedi
ProGAL-VLA adds verified grounding and prospective sub-goals to VLA robots, sharply improving instruction sensitivity, ambiguity handling, and robustness under perturbation.
Qiyao Ma, Dechen Gao, Rui Cai et al.
A benchmark for personalized reward modeling that tracks downstream BoN and PPO performance, showing today's reward models still struggle to capture user-specific preferences that matter for aligned products.
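The Best-of-N (BoN) reranking this benchmark tracks can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `reward` are hypothetical placeholders for a real LLM sampler and a personalized reward model, and `user_profile` stands in for whatever user-preference signal the reward model conditions on.

```python
def generate(prompt: str, n: int) -> list[str]:
    # Placeholder: a real system would sample n completions from an LLM.
    return [f"{prompt} :: candidate {i}" for i in range(n)]

def reward(prompt: str, response: str, user_profile: dict) -> float:
    # Placeholder: a personalized reward model would score the response
    # conditioned on this specific user's preferences.
    score = 1.0
    if user_profile.get("prefers_brevity"):
        score -= 0.01 * len(response)
    return score

def best_of_n(prompt: str, user_profile: dict, n: int = 8) -> str:
    # BoN reranking: sample n candidates, keep the one the reward
    # model rates highest for this user.
    candidates = generate(prompt, n)
    return max(candidates, key=lambda r: reward(prompt, r, user_profile))
```

A reward model that misjudges user-specific preferences picks the wrong candidate here even when a good one was sampled, which is why the benchmark measures downstream BoN (and PPO) performance rather than reward accuracy alone.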
Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek et al.
Shows multimodal retrieval is often a query-alignment problem, not an encoder problem, and beats strong baselines by rewriting image-text queries into retrieval-optimized text.
Renxuan Tan, Rongpeng Li, Zhifeng Zhao et al.
Introduces Pareto-lenient consensus to avoid premature convergence in multi-preference LLM alignment—enables robust, nuanced value alignment without sacrificing performance on conflicting human preferences.
Bowen Ye, Rang Li, Qibin Yang et al.
Claw-Eval introduces transparent, safety-aware, multimodal evaluation for autonomous agents, addressing critical gaps in benchmarking—essential for building trustworthy, real-world AI agents.
Xiaojie Gu, Ziying Huang, Weicong Hong et al.
Exposes how LLMs mimic edits without true memory updates, revealing dangerous surface compliance—vital for builders deploying knowledge-editing tools where factual reliability is non-negotiable.
Grace Liu, Brian Christian, Tsvetomira Dumbalska et al.
AI assistants that always answer immediately foster dependence and erode users' ability to reason on their own. This is the first solid evidence that a good AI assistant should sometimes withhold the answer and say 'figure it out yourself': a wake-up call for designers of educational and productivity tools.
Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices
Alexis Burgon, Berkman Sahiner, Nicholas A Petrick et al.
This work introduces a standardized framework to evaluate AI medical devices that learn and adapt over time, solving a major regulatory bottleneck. It provides clear metrics to distinguish between a model actually improving versus just memorizing new data, which is critical for getting adaptive AI approved for clinical use.