Topic

Model Compression

Quantization, pruning, distillation, and smaller deployment footprints.

8 papers · latest 2026-04-23

Most active fields for this topic

Machine Learning 5 · NLP 2 · Multimodal 1

AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

An T. Le, Vien Ngo

significant · 🟡 Intermediate · Machine Learning · Model Compression

cs.AI · cs.LG · cs.RO

Presents AAC, a differentiable landmark compressor for ALT heuristics that guarantees admissibility by design, enabling reliable pathfinding without calibration or convergence requirements.

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

Chenxi Zhou, Pengfei Cao, Jiang Li et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression

cs.CL · cs.AI · cs.LG

Uncovers two distinct failure modes in 2-bit LLM quantization, signal degradation and computation collapse, helping builders diagnose and mitigate performance cliffs when deploying compressed models.

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

Yujie Chen, Tailai Chen, Yifeng Gao et al.

breakthrough · 🔴 Advanced · Machine Learning · Model Compression

cs.AI

Introduces delta attention selective halting, which detects semantic fixing points to skip redundant token processing during long-context prefilling, giving hardware-compatible efficiency gains in LLMs without sacrificing accuracy.

MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

Libo Sun, Peixiong He, Po-Wei Harn et al.

significant · 🔴 Advanced · Machine Learning · Model Compression · Efficient Inference

cs.LG · cs.CL

MoE-nD tailors KV cache compression per layer, boosting accuracy over uniform methods. Practitioners should care because it enables longer-context inference with minimal memory overhead and no retraining.

Cross-Modal Bayesian Low-Rank Adaptation for Uncertainty-Aware Multimodal Learning

Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

breakthrough · 🔴 Advanced · Multimodal · Multimodal Understanding · Model Compression

cs.LG · cs.AI

CALIBER introduces cross-modal Bayesian low-rank adaptation for uncertainty-aware multimodal learning, enabling robust, efficient fine-tuning in low-resource settings. This matters for builders deploying reliable multimodal systems under data scarcity.

Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

Yao Chen, Jiawei Sheng, Wenyuan Zhang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression

cs.CL

Proposes stepwise attention distillation to transfer dynamic reasoning focus from large to small models, significantly improving small-model reasoning without requiring larger architectures, which is key for efficient deployment in resource-constrained systems.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Jiayuan Ye, Vitaly Feldman, Kunal Talwar

significant · 🟡 Intermediate · Machine Learning · Model Compression

cs.CL

Pruning and rebalancing pretraining data can improve factual memorization enough for a 110M model to match a 1.3B baseline on entity facts, highlighting data mix as a real scaling lever.

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference · Model Compression

cs.CV

This paper cuts memory use for on-device LLMs by quantizing the KV cache dynamically instead of at a fixed precision. For anyone deploying LLMs on phones or edge devices, this could mean 2x longer context or 50% smaller models without accuracy loss.
