Topic

Model Compression

Quantization, pruning, distillation, and smaller deployment footprints.

8 papers · latest 2026-04-23

An T. Le, Vien Ngo

significant · 🟡 Intermediate · Machine Learning · Model Compression
cs.AI · cs.LG · cs.RO

Presents AAC, a differentiable landmark compressor for ALT heuristics that guarantees admissibility by design, enabling reliable pathfinding without calibration or convergence requirements.

Chenxi Zhou, Pengfei Cao, Jiang Li et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression
cs.CL · cs.AI · cs.LG

Uncovers two distinct failure modes in 2-bit LLM quantization, helping builders diagnose and mitigate performance cliffs when deploying compressed models.

Yujie Chen, Tailai Chen, Yifeng Gao et al.

breakthrough · 🔴 Advanced · Machine Learning · Model Compression
cs.AI

Introduces delta attention halting, which detects semantic fixing points to skip redundant token processing. This yields hardware-compatible efficiency gains in long-context LLMs without sacrificing accuracy, which matters for deploying inference at scale.

Libo Sun, Peixiong He, Po-Wei Harn et al.

cs.LG · cs.CL

MoE-nD tailors KV cache compression per layer, improving accuracy over uniform methods. For practitioners, it enables longer-context inference with minimal memory overhead and no retraining.

Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

cs.LG · cs.AI

CALIBER introduces Bayesian low-rank adaptation for uncertainty-aware multimodal learning, enabling robust, efficient fine-tuning in low-resource settings. This is useful for builders deploying reliable multimodal systems under data scarcity.

Yao Chen, Jiawei Sheng, Wenyuan Zhang et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Model Compression
cs.CL

Proposes stepwise attention distillation to transfer dynamic reasoning focus from large to small models, significantly improving small-model reasoning without requiring larger architectures. This supports efficient deployment in resource-constrained systems.

Jiayuan Ye, Vitaly Feldman, Kunal Talwar

significant · 🟡 Intermediate · Machine Learning · Model Compression
cs.CL

Pruning and rebalancing pretraining data can improve factual memorization enough for a 110M model to match a 1.3B baseline on entity facts, highlighting data mix as a real scaling lever.

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.

cs.CV

Cuts memory use for on-device LLMs by quantizing the KV cache dynamically rather than holding it at one fixed precision. For anyone deploying LLMs on phones or edge devices, this could mean 2x longer context or 50% smaller models without accuracy loss.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted · Privacy · Terms