SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Recommendation Score

breakthrough, 🟡 Intermediate, NLP, LLM Reasoning, Alignment & Safety, Benchmark, Useful for both

Research context

Primary field

NLP

Language understanding, generation, extraction, and evaluation.

Topics

LLM Reasoning, Alignment & Safety

Paper type

Benchmark

Best for

Useful for both

arXiv categories

cs.LG, cs.AI

Why It Matters

SafeAnchor shows that LLM safety alignment is fragile and erodes cumulatively as a model is adapted to one domain after another. Practitioners therefore need to preserve safety actively at every update. According to the authors, this is the first method to do so systematically in the continual setting.

Abstract

Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with threshold-triggered corrective replay. Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, SafeAnchor retains 93.2% of original safety alignment, outperforming all baselines by 18-42 points, while matching unconstrained fine-tuning to within 1.5 points on domain tasks.
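
The three steps named in the abstract (subspace identification, orthogonal projection of updates, threshold-triggered replay) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a flattened parameter vector instead of LoRA matrices, an empirical Fisher matrix built from per-example safety gradients, and synthetic data. The names safety_subspace, project_out, and needs_replay, and the threshold value, are hypothetical.

```python
import numpy as np

def safety_subspace(safety_grads, rank):
    """Top-`rank` eigenvectors of the empirical Fisher F = (1/n) * sum_i g_i g_i^T.

    Assumption: `safety_grads` is an (n, d) array of per-example gradients
    computed on safety data; the paper's exact Fisher estimate may differ.
    """
    fisher = safety_grads.T @ safety_grads / safety_grads.shape[0]
    _, eigvecs = np.linalg.eigh(fisher)   # eigenvalues in ascending order
    return eigvecs[:, -rank:]             # (d, rank), orthonormal columns

def project_out(grad, basis):
    """Project `grad` onto the orthogonal complement of span(basis)."""
    return grad - basis @ (basis.T @ grad)

def needs_replay(safety_score, baseline, threshold):
    """Hypothetical drift monitor: trigger replay when the drop exceeds threshold."""
    return (baseline - safety_score) > threshold

# Synthetic demo: safety gradients that live in a 2-D subspace of a 16-D space.
rng = np.random.default_rng(0)
d = 16
true_dirs = np.linalg.qr(rng.normal(size=(d, 2)))[0]      # orthonormal (d, 2)
safety_grads = rng.normal(size=(200, 2)) @ true_dirs.T    # (200, d)

basis = safety_subspace(safety_grads, rank=2)
domain_grad = rng.normal(size=d)
constrained = project_out(domain_grad, basis)
```

In this toy setting the constrained update has no component along the directions that carry the safety gradients, so a gradient step with it leaves those directions unchanged to first order. The replay check is a simple comparison against a baseline safety score, and how that score is measured is left open here.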

Published April 20, 2026