Topic
Efficient Inference
Latency, serving, cache efficiency, and practical inference speed.
10 papers · latest 2026-04-14
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.
By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.
Vasilis Kontonis, Yuchen Zeng, Shivam Garg et al.
MEMENTO trains reasoning models to summarize their own working state into reusable memory blocks, cutting KV-cache costs about 2.5x and boosting throughput without giving up math, science, or coding accuracy.
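The core idea — trading exact cached state for compact summaries — can be illustrated with a toy cache-compaction routine. This is a hedged sketch, not MEMENTO's actual mechanism (the paper trains the model to write its own summaries; here we just mean-pool old entries): the function name, `keep_recent`, and `block` are illustrative choices.

```python
import numpy as np

def compress_cache(kv: np.ndarray, keep_recent: int = 64, block: int = 16) -> np.ndarray:
    """Toy KV-cache compaction: mean-pool older entries into summary
    blocks while keeping the most recent tokens exact.
    kv: (n_tokens, d) array of cached states."""
    if len(kv) <= keep_recent:
        return kv
    old, recent = kv[:-keep_recent], kv[-keep_recent:]
    n = (len(old) // block) * block          # full blocks only
    pooled = old[:n].reshape(-1, block, old.shape[1]).mean(axis=1)
    tail = old[n:]                            # partial block kept as-is
    return np.concatenate([pooled, tail, recent], axis=0)
```

With 320 cached tokens, `keep_recent=64`, and `block=16`, the 256 older entries collapse to 16 summary rows, shrinking the cache from 320 to 80 rows — the same order of reduction the blurb cites, though the learned summaries in the paper preserve far more task-relevant information than mean pooling.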
Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev et al.
This study shows popular KV-cache offloading schemes break on context-intensive workloads like structured extraction, then offers a simpler strategy that preserves far more accuracy for long-context production inference.
Roberto Vercellino, Jared Willard, Gustavo Campos et al.
Provides public H100 power traces for training, fine-tuning, and vLLM inference, then links them to whole-facility planning—useful for sizing clusters, power delivery, and microgrid strategies.
Sam Gunn
Introduces a data-deletion scheme that approximates how a model would behave if specific training data were removed, an important building block for unlearning, auditing, and data attribution.
David Picard, Nicolas Dufour, Lucas Degeorge et al.
PoM replaces attention with a linear-time polynomial mixer, maintaining universal approximation while slashing compute — a promising route to scaling vision and language models on edge devices.
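Why polynomial mixing can be linear-time where attention is quadratic: instead of every token attending to every other token (O(n²)), each token interacts with a fixed number of global polynomial moments of the sequence (O(n)). The sketch below is an illustrative second-order mixer under that assumption, not PoM's actual parameterization; `W1` and `W2` are hypothetical projection matrices.

```python
import numpy as np

def poly_mix(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Toy linear-time token mixer. x: (n_tokens, d).
    Each token sees global first- and second-order moments of the
    sequence instead of pairwise attention scores."""
    m1 = x.mean(axis=0)          # first moment, O(n*d)
    m2 = (x * x).mean(axis=0)    # elementwise second moment, O(n*d)
    # Broadcast projected global context back to every token.
    return x + m1 @ W1 + m2 @ W2
```

Doubling the sequence length doubles the work of the two moment reductions, whereas full attention would quadruple it — which is what makes this style of mixing attractive on compute-constrained edge hardware.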
Guhao Feng, Shengjie Luo, Kai Hua et al.
In-Place Test-Time Training enables LLMs to adapt weights during inference, overcoming static deployment limits—vital for real-time systems needing continuous learning from streaming data without retraining.
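The idea of updating weights during inference can be sketched with a single self-supervised gradient step on a toy linear layer. This is a minimal illustration of the test-time-training pattern in general, not the paper's in-place algorithm; the reconstruction objective and learning rate are assumptions.

```python
import numpy as np

def ttt_step(W: np.ndarray, x: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One test-time update on a toy linear model W (d, d).
    Self-supervised objective: reconstruct the incoming token
    representations x (n, d) from themselves, i.e. minimize ||xW - x||^2."""
    pred = x @ W
    grad = x.T @ (pred - x) / len(x)   # gradient of the reconstruction loss
    return W - lr * grad               # weights adapt to the live stream
```

Each incoming batch of streaming tokens supplies its own supervision signal, so the model keeps adapting without labels or a retraining pipeline — the property the blurb highlights for real-time systems.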
Yulin Zou, Yan Chen, Wenyan Chen et al.
CoStream jointly optimizes video codec and multimodal inference to cut computational costs by 40%+—enabling scalable, real-time video analytics without sacrificing accuracy on vision-language models.
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.
This paper cuts memory use for on-device LLMs by dynamically quantizing the KV cache—no more fixed precision waste. For anyone deploying LLMs on phones or edge devices, this could mean 2x longer context or 50% smaller models without accuracy loss.
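The memory arithmetic behind KV-cache quantization is easy to see in a toy sketch: storing int8 values plus one float scale per block cuts a float32 cache roughly 4x. This is a generic per-block symmetric quantizer for illustration, not the paper's dynamic scheme; block granularity and the symmetric-scale choice are assumptions.

```python
import numpy as np

def quantize_kv(block: np.ndarray):
    """Symmetric int8 quantization of one KV-cache block.
    Returns int8 values plus a single float scale (computed
    dynamically from the block's own range, not fixed in advance)."""
    scale = float(np.max(np.abs(block))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale reconstructs exactly
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values for use in attention."""
    return q.astype(np.float32) * scale
```

Because the scale is recomputed per block at runtime, precision tracks the actual value range of each cache segment — the "no fixed precision waste" point in the blurb — while the int8 payload is a quarter the size of the float32 original.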
Mateusz Papierz, Asel Sagingalieva, Alix Benoit et al.
HQ-LP-FNO cuts the size and cost of AI models that simulate laser processing by using quantum-inspired mixing, making real-time simulation feasible on standard hardware. This lets manufacturers rapidly test laser parameters without waiting hours for physics simulations.