Model Compression
Quantization, pruning, distillation, and smaller deployment footprints.
2 papers · latest 2026-04-10
Jiayuan Ye, Vitaly Feldman, Kunal Talwar
cs.CL
Pruning and rebalancing pretraining data can improve factual memorization enough for a 110M model to match a 1.3B baseline on entity facts, highlighting data mix as a real scaling lever.
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.
cs.CV
This paper cuts memory use for on-device LLMs by dynamically quantizing the KV cache instead of allocating a fixed precision for every entry. For anyone deploying LLMs on phones or edge devices, this could mean 2x longer context or 50% smaller models without accuracy loss.
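The core idea behind dynamic KV-cache quantization can be sketched in a few lines. This is an illustrative NumPy example, not the paper's actual method: it stores KV tensors as int8 with a per-row scale computed at runtime from the observed values (no calibration pass), then dequantizes on read. Shapes, function names, and the 8-bit choice are assumptions for the sketch.

```python
import numpy as np

def quantize_kv(x, bits=8):
    # Dynamic quantization: the scale is computed from the actual tensor
    # values at write time, so no fixed precision is wasted on small rows.
    # x: float32 KV tensor of shape (heads, seq, head_dim) -- assumed layout.
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero rows
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    # Recover an approximate float32 tensor for attention computation.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.normal(size=(4, 16, 64)).astype(np.float32)  # toy key cache
q, s = quantize_kv(k)
k_hat = dequantize_kv(q, s)

print(q.nbytes / k.nbytes)              # int8 payload is 4x smaller than fp32
print(float(np.abs(k - k_hat).max()))   # worst-case reconstruction error
```

The per-row scales add a small overhead on top of the int8 payload, and real systems typically quantize per head or per channel and keep the most recent tokens in higher precision; this sketch only shows the basic write-quantize / read-dequantize loop.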