Topic
Model Compression
Quantization, pruning, distillation, and smaller deployment footprints.
8 papers · latest 2026-04-23
An T. Le, Vien Ngo
Presents AAC, a differentiable landmark compressor for ALT heuristics that guarantees admissibility by design, enabling reliable pathfinding without calibration or convergence requirements.
Chenxi Zhou, Pengfei Cao, Jiang Li et al.
Uncovers two distinct failure modes in 2-bit LLM quantization, helping builders diagnose and mitigate performance cliffs when deploying compressed models.
Yujie Chen, Tailai Chen, Yifeng Gao et al.
Introduces delta attention halting, which detects semantic fixing points to skip redundant token processing. This gives hardware-compatible efficiency gains in long-context LLMs without sacrificing accuracy, which matters for deploying scalable inference.
Libo Sun, Peixiong He, Po-Wei Harn et al.
MoE-nD tailors KV cache compression per layer, boosting accuracy over uniform methods. Practitioners should care because it enables longer context inference with minimal memory overhead without retraining.
Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin
CALIBER introduces Bayesian low-rank adaptation for uncertainty-aware multimodal learning, enabling robust, efficient fine-tuning in low-resource settings. Useful for builders deploying multimodal systems that must stay reliable under data scarcity.
Yao Chen, Jiawei Sheng, Wenyuan Zhang et al.
Proposes stepwise attention distillation to transfer dynamic reasoning focus from large to small models, significantly improving small-model reasoning without requiring larger architectures. Relevant for efficient deployment in resource-constrained systems.
Jiayuan Ye, Vitaly Feldman, Kunal Talwar
Pruning and rebalancing pretraining data can improve factual memorization enough for a 110M model to match a 1.3B baseline on entity facts, highlighting data mix as a real scaling lever.
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.
This paper cuts memory use for on-device LLMs by dynamically quantizing the KV cache instead of storing it at one fixed precision. For anyone deploying LLMs on phones or edge devices, this could mean 2x longer context or 50% smaller models without accuracy loss.
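To make the idea of quantizing a KV cache at a variable precision concrete, here is a minimal sketch in plain Python. It is not the method from any paper listed above: it uses ordinary symmetric uniform quantization, and the `choose_bits` policy, its `tolerance` parameter, and the candidate bit widths are hypothetical choices made only for illustration.

```python
def quantize(values, bits):
    """Symmetric uniform quantization of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0  # avoid dividing by zero on all-zero input
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def max_error(values, bits):
    """Largest absolute reconstruction error at the given bit width."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


def choose_bits(values, tolerance, candidates=(2, 4, 8)):
    """Hypothetical policy: smallest bit width whose error stays within tolerance.

    Falls back to the widest candidate if none meets the tolerance.
    """
    for bits in candidates:
        if max_error(values, bits) <= tolerance:
            return bits
    return candidates[-1]


if __name__ == "__main__":
    # Stand-in for one row of cached key or value activations (assumed data).
    cache_row = [0.5, -1.0, 0.25, 0.0]
    for tol in (0.1, 0.01):
        print(f"tolerance={tol}: {choose_bits(cache_row, tol)} bits")
```

The design point is that rows which tolerate coarse rounding get fewer bits, so memory is spent only where precision is needed. A real system would apply such a policy per layer, head, or token group and would measure error on model outputs, not on raw values.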