Topic
Efficient Inference
Latency, serving, cache efficiency, and practical inference speed.
27 papers · latest 2026-04-22
Weijie Zhao, Mingquan Liu, Bolun Wang et al.
Nexusformer replaces linear attention projections with nonlinear expansions, enabling stable, inheritable Transformer scaling without retraining, aimed at evolving models for large-scale deployment.
SLAM Labs, Oleksiy Ostapenko et al.
Super Apriel enables dynamic, real-time switching between four attention mechanisms in a single checkpoint, reducing LLM deployment cost and latency. Practitioners can serve multiple speed/accuracy presets from one model instead of several.
Jinyu Guo, Zhihan Zhang, Yutong Li et al.
DASH-KV slashes long-context inference costs via asymmetric KV hashing, preserving quality while cutting compute—critical for deploying LLMs in latency-sensitive production systems.
Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina et al.
ETI improves multi-agent coordination by modeling psychological traits of partners, reducing goal drift and errors. Builders should integrate it to create reliable, human-like agent teams for complex collaborative tasks.
Chaitanya Dwivedi, Binxuan Huang, Himanshu Gupta et al.
This paper reduces MoE training costs by upcycling existing experts, enabling compute-efficient LLMs without training new experts from scratch, which is useful for deploying large models on constrained infrastructure.
Zixuan Liu, Zhiyong Chen, Nan Xue et al.
WISV adapts speculative decoding verification to wireless conditions using semantic, not token-level, checks—dramatically improving edge-LLM latency and throughput in real-world mobile deployments.
Libo Sun, Peixiong He, Po-Wei Harn et al.
MoE-nD tailors KV cache compression per layer, boosting accuracy over uniform methods. Practitioners should care because it enables longer context inference with minimal memory overhead without retraining.
Xiao Wang, Zezhong Zhang, Isaac Lyngaas et al.
A linear-complexity global attention mechanism enables exascale generative data assimilation, dramatically improving uncertainty quantification in weather/climate models—critical for real-time extreme event prediction systems.
David Berghaus
EVIL replaces neural networks with evolved interpretable Python code for zero-shot time series inference, enabling deployable, transparent models without retraining—critical for real-time systems needing explainability and low resource use.
Hyeongmeen Baik, Hamed Poursiami, Maryam Parsa et al.
First spiking neural network for sub-mW power converter health monitoring that decouples physics enforcement from temporal processing, enabling real-time edge inference without GPUs—critical for industrial IoT systems needing ultra-low-power reliability.
Yifan Zhao, Yuchen Yang, Matei Budiu et al.
Nautilus automates GPU kernel optimization from high-level tensor algebra, eliminating manual tuning—enabling faster, portable ML system development without expert-level code.
Yukuan Zhang, Mengxin Zheng, Qian Lou
SecureRouter enables efficient encrypted inference by dynamically adapting model structure per query, slashing MPC overhead—making privacy-preserving AI feasible for real-time, high-throughput production systems.
Zixuan Weng, Jinghuai Zhang, Kunlin Cai et al.
FineSteer enables precise, adaptive steering of LLM behavior at inference time without retraining, offering a unified, utility-preserving method to fix hallucinations and safety issues—critical for deploying reliable AI in production.
Joongwon Kim, Wannan Yang, Kelvin Niu et al.
This work on scaling test-time compute for agentic coding introduces trajectory-based evaluation, enabling meaningful refinement of long-horizon code agents, a key step for autonomous dev tools.
Aditi De
This paper proposes running diffusion model inference through thermodynamic equilibration rather than digital computation, potentially cutting energy use by 10,000x, with implications for edge AI deployment and sustainable inference infrastructure.
Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed
Outperforms complex Transformers in solar forecasting using physics-guided CNN-BiLSTM, proving domain knowledge can beat architectural scale—critical for efficient, deployable grid stability systems.
Hongtao Xu, Jianchao Tan, Yuxuan Hu et al.
SparseBalance co-optimizes sequence length and sparsity heterogeneity in long-context training, dramatically improving efficiency and accuracy—essential for scalable LLM training on real-world data without costly over-provisioning.
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.
By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.
Vasilis Kontonis, Yuchen Zeng, Shivam Garg et al.
MEMENTO trains reasoning models to summarize their own working state into reusable memory blocks, cutting KV-cache costs about 2.5x and boosting throughput without giving up math, science, or coding accuracy.
Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev et al.
This study shows popular KV-cache offloading schemes break on context-intensive workloads like structured extraction, then offers a simpler strategy that preserves far more accuracy for long-context production inference.
Roberto Vercellino, Jared Willard, Gustavo Campos et al.
Provides public H100 power traces for training, fine-tuning, and vLLM inference, then links them to whole-facility planning—useful for sizing clusters, power delivery, and microgrid strategies.
Sam Gunn
Introduces a data-deletion scheme that approximates how a model would behave if specific training data were removed, an important building block for unlearning, auditing, and data attribution.
David Picard, Nicolas Dufour, Lucas Degeorge et al.
PoM replaces attention with a linear-time polynomial mixer, maintaining universal approximation while cutting compute, which matters for scaling vision and language models on edge devices.
Guhao Feng, Shengjie Luo, Kai Hua et al.
In-Place Test-Time Training enables LLMs to adapt weights during inference, overcoming static deployment limits—vital for real-time systems needing continuous learning from streaming data without retraining.
Yulin Zou, Yan Chen, Wenyan Chen et al.
CoStream jointly optimizes video codec and multimodal inference to cut computational costs by 40%+—enabling scalable, real-time video analytics without sacrificing accuracy on vision-language models.
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.
This paper cuts memory use for on-device LLMs by dynamically quantizing the KV cache rather than holding it at one fixed precision. For anyone deploying LLMs on phones or edge devices, this could mean 2x longer context or 50% smaller models without accuracy loss.
Mateusz Papierz, Asel Sagingalieva, Alix Benoit et al.
HQ-LP-FNO cuts the size and cost of AI models that simulate laser processing by using quantum-inspired mixing, making real-time simulation feasible on standard hardware. This lets manufacturers rapidly test laser parameters without waiting hours for physics simulations.