Topic

Efficient Inference

Latency, serving, cache efficiency, and practical inference speed.

10 papers · latest 2026-04-14

S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.

cs.AI

By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.
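
A minimal sketch of the multi-role pattern described above, assuming a single locally served chat model behind a `generate` call; the role prompts, the retry loop, and the `generate` stub are illustrative placeholders, not the paper's scaffold.

```python
# Sketch: one small model reused in three roles via different system prompts.
# `generate` is a placeholder for a single locally hosted model (the paper
# runs everything on one 24GB GPU); the canned return keeps the demo runnable.

ROLE_PROMPTS = {
    "summarizer": "Condense the interaction history into a short state summary.",
    "agent": "Given the task and state summary, propose the next action.",
    "reviewer": "Review only this proposed action. Reply APPROVE or list defects.",
}

def generate(system_prompt: str, user_prompt: str) -> str:
    return "APPROVE: looks consistent"  # wire this to your model server

def run_step(task: str, history: list[str], max_retries: int = 3) -> str:
    action = ""
    for _ in range(max_retries):
        state = generate(ROLE_PROMPTS["summarizer"], "\n".join(history))
        action = generate(ROLE_PROMPTS["agent"], f"Task: {task}\nState: {state}")
        # The reviewer is isolated: it sees the candidate action, not the history.
        verdict = generate(ROLE_PROMPTS["reviewer"], action)
        if verdict.startswith("APPROVE"):
            return action
        history = history + [f"Reviewer: {verdict}"]
    return action

print(run_step("navigate an AppWorld task", []))
```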

Vasilis Kontonis, Yuchen Zeng, Shivam Garg et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference
cs.AI · cs.LG

MEMENTO trains reasoning models to summarize their own working state into reusable memory blocks, cutting KV-cache costs about 2.5x and boosting throughput without giving up math, science, or coding accuracy.
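
A rough sketch of the summarize-and-continue idea as the summary states it: once the live context exceeds a token budget, the model writes a compact memory block and decoding continues from that block alone, so the old KV cache can be freed. The `generate` stub, budget, and prompt wording are assumptions, not MEMENTO's actual training or API.

```python
# Sketch of summarize-and-continue KV compression: when the context passes a
# budget, replace it with a model-written memory block so the cache shrinks.
# `generate` is a stub for any causal LM; names are illustrative, not MEMENTO's.

KV_BUDGET_TOKENS = 2048

def generate(prompt: str, max_new_tokens: int) -> str:
    return " step" * min(max_new_tokens, 8)  # dummy output; call a real LM here

def reason(question: str, n_rounds: int = 8) -> str:
    context = question
    for _ in range(n_rounds):
        context += generate(context, max_new_tokens=256)
        if len(context.split()) > KV_BUDGET_TOKENS:  # crude token count proxy
            # Compress working state into a reusable memory block; the KV
            # entries for the dropped text are freed on the next forward pass.
            memory = generate(
                context + "\n\nSummarize your working state so you can resume.",
                max_new_tokens=128,
            )
            context = f"{question}\n\nMemory: {memory}"
    return context
```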

Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev et al.

significant · 🟡 Intermediate · Machine Learning · Efficient Inference
cs.LG · cs.AI · cs.CL

This study shows popular KV-cache offloading schemes break on context-intensive workloads like structured extraction, then offers a simpler strategy that preserves far more accuracy for long-context production inference.

Roberto Vercellino, Jared Willard, Gustavo Campos et al.

significant · 🟡 Intermediate · Machine Learning · Efficient Inference
cs.DC · cs.LG

Provides public H100 power traces for training, fine-tuning, and vLLM inference, then links them to whole-facility planning—useful for sizing clusters, power delivery, and microgrid strategies.

Sam Gunn

significant · 🔴 Advanced · Machine Learning · Efficient Inference
cs.LG

Introduces a data-deletion scheme that approximates how a model would behave if specific training data were removed, an important building block for unlearning, auditing, and data attribution.

David Picard, Nicolas Dufour, Lucas Degeorge et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference
cs.CV · cs.AI

PoM replaces attention with a linear-time polynomial mixer, preserving universal approximation while avoiding attention's quadratic compute, which makes scaling vision and language models to edge devices far more practical.
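
For intuition, a toy degree-2 polynomial token mixer that runs in linear time: each token is modulated by a learned projection of a global mean-pooled summary, giving second-order token interactions without the O(n²) attention matrix. This is an illustrative stand-in, not the paper's PoM layer.

```python
import torch
import torch.nn as nn

class PolyMixer(nn.Module):
    """Toy degree-2 polynomial token mixer, O(n) in sequence length.

    Each token interacts with a global mean-pooled summary through a product
    of two linear forms -- a second-order polynomial in the inputs, with no
    O(n^2) attention matrix. Illustrative only; not the paper's PoM layer.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_token = nn.Linear(dim, dim)
        self.proj_global = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        g = x.mean(dim=1, keepdim=True)                   # O(n) global summary
        mixed = self.proj_token(x) * self.proj_global(g)  # degree-2 term
        return self.out(mixed)

x = torch.randn(2, 1024, 256)
print(PolyMixer(256)(x).shape)  # torch.Size([2, 1024, 256])
```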

Guhao Feng, Shengjie Luo, Kai Hua et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference
cs.LG · cs.AI · cs.CL

In-Place Test-Time Training lets LLMs adapt their weights during inference, easing the limits of static deployment; valuable for real-time systems that must keep learning from streaming data without offline retraining.
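
A minimal sketch of the generic test-time-training loop the summary gestures at: between requests, the deployed model takes a few gradient steps on the next-token loss of the incoming stream itself. The toy embedding-plus-linear "model", optimizer, and step count are placeholders; the paper's in-place update rule will differ.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the deployed LLM: embedding + linear head. The point is the
# loop, not the model -- weights are updated in place on the incoming stream.
VOCAB, DIM = 1000, 64
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def adapt_on_stream(token_ids: torch.Tensor, steps: int = 2) -> None:
    """Take a few next-token-prediction gradient steps on streamed tokens."""
    inputs, targets = token_ids[:-1], token_ids[1:]
    for _ in range(steps):
        loss = F.cross_entropy(model(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

adapt_on_stream(torch.randint(0, VOCAB, (128,)))  # adapt, then serve as usual
```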

Yulin Zou, Yan Chen, Wenyan Chen et al.

breakthrough · 🟡 Intermediate · Machine Learning · Efficient Inference
cs.DC · cs.CV · cs.LG

CoStream jointly optimizes the video codec and multimodal inference to cut compute costs by more than 40%, enabling scalable, real-time video analytics without sacrificing vision-language model accuracy.

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.

cs.CV

This paper cuts memory use for on-device LLMs by quantizing the KV cache dynamically rather than at a fixed precision. For anyone deploying LLMs on phones or edge devices, that could mean roughly 2x longer context, or half the cache memory, without accuracy loss.
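
A small sketch of dynamic (runtime, per-row) KV quantization under common int8 conventions: each appended key/value row gets its own scale, so precision tracks the data instead of being fixed up front. The symmetric int8 scheme here is an assumption, not necessarily the paper's.

```python
import torch

def quantize_rows(x: torch.Tensor):
    """Symmetric per-row int8 quantization; x is (tokens, head_dim)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).to(torch.int8)
    return q, scale  # store both: ~4x smaller than fp32 per cached row

def dequantize_rows(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

keys = torch.randn(16, 128)            # 16 cached tokens, head_dim 128
q, s = quantize_rows(keys)
print((dequantize_rows(q, s) - keys).abs().max())  # small reconstruction error
```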

Mateusz Papierz, Asel Sagingalieva, Alix Benoit et al.

significant · 🔴 Advanced · Machine Learning · Efficient Inference
cs.CE · cs.LG

HQ-LP-FNO cuts the size and cost of AI models that simulate laser processing by using quantum-inspired mixing, making real-time simulation feasible on standard hardware. This lets manufacturers rapidly test laser parameters without waiting hours for physics simulations.

© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assisted.