Topic

Efficient Inference

Latency, serving, cache efficiency, and practical inference speed.

27 papers · latest 2026-04-22

Most active fields for this topic

Machine Learning (24) · Reasoning & Agents (2) · NLP (1)

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

Weijie Zhao, Mingquan Liu, Bolun Wang et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI

Nexusformer replaces linear attention projections with nonlinear expansions, enabling stable, inheritable Transformer scaling without retraining, which matters for evolving models in large-scale deployment.

Super Apriel: One Checkpoint, Many Speeds

SLAM Labs: Oleksiy Ostapenko et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG

Super Apriel enables dynamic, real-time switching between four attention mechanisms in a single checkpoint, drastically reducing deployment costs and latency for LLMs—practitioners can now serve multiple speed/accuracy presets without multiple models.

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

Jinyu Guo, Zhihan Zhang, Yutong Li et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.CL

DASH-KV slashes long-context inference costs via asymmetric KV hashing, preserving quality while cutting compute—critical for deploying LLMs in latency-sensitive production systems.

Explicit Trait Inference for Multi-Agent Coordination

Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Efficient Inference

cs.AI · cs.MA

Explicit Trait Inference (ETI) improves multi-agent coordination by modeling the psychological traits of partner agents, reducing goal drift and errors. Builders can use it to make agent teams more reliable on complex collaborative tasks.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Chaitanya Dwivedi, Binxuan Huang, Himanshu Gupta et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI

Reduces Mixture-of-Experts (MoE) training costs by upcycling existing experts rather than training new ones from scratch, making compute-efficient LLMs more practical on constrained infrastructure.

WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

Zixuan Liu, Zhiyong Chen, Nan Xue et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.IT · cs.AI

WISV adapts speculative decoding verification to wireless conditions using semantic, not token-level, checks—dramatically improving edge-LLM latency and throughput in real-world mobile deployments.

MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

Libo Sun, Peixiong He, Po-Wei Harn et al.

significant · 🔴 Advanced · Machine Learning · Model Compression · Efficient Inference

cs.LG · cs.CL

MoE-nD tailors KV cache compression per layer, boosting accuracy over uniform methods. Practitioners should care because it enables longer context inference with minimal memory overhead without retraining.

Global Attention with Linear Complexity for Exascale Generative Data Assimilation in Earth System Prediction

Xiao Wang, Zezhong Zhang, Isaac Lyngaas et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI

A linear-complexity global attention mechanism enables exascale generative data assimilation, dramatically improving uncertainty quantification in weather/climate models—critical for real-time extreme event prediction systems.

EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLMs

David Berghaus

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI

EVIL replaces neural networks with evolved interpretable Python code for zero-shot time series inference, enabling deployable, transparent models without retraining—critical for real-time systems needing explainability and low resource use.

Neuromorphic Parameter Estimation for Power Converter Health Monitoring Using Spiking Neural Networks

Hyeongmeen Baik, Hamed Poursiami, Maryam Parsa et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.NE · cs.LG

First spiking neural network for sub-mW power converter health monitoring that decouples physics enforcement from temporal processing, enabling real-time edge inference without GPUs—critical for industrial IoT systems needing ultra-low-power reliability.

Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

Yifan Zhao, Yuchen Yang, Matei Budiu et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.PL · cs.LG

Nautilus automates GPU kernel optimization from high-level tensor algebra, eliminating manual tuning—enabling faster, portable ML system development without expert-level code.

SecureRouter: Encrypted Routing for Efficient Secure Inference

Yukuan Zhang, Mengxin Zheng, Qian Lou

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.CR · cs.AI

SecureRouter enables efficient encrypted inference by dynamically adapting model structure per query, slashing MPC overhead—making privacy-preserving AI feasible for real-time, high-throughput production systems.

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

Zixuan Weng, Jinghuai Zhang, Kunlin Cai et al.

breakthrough · 🔴 Advanced · NLP · LLM Reasoning · Efficient Inference

cs.LG · cs.AI · cs.CL

FineSteer enables precise, adaptive steering of LLM behavior at inference time without retraining, offering a unified, utility-preserving method to fix hallucinations and safety issues—critical for deploying reliable AI in production.

Scaling Test-Time Compute for Agentic Coding

Joongwon Kim, Wannan Yang, Kelvin Niu et al.

breakthrough · 🔴 Advanced · Reasoning & Agents · AI Agents · Efficient Inference

cs.SE · cs.AI · cs.CL

Introduces trajectory-based evaluation for scaling test-time compute in agentic coding, enabling meaningful refinement of long-horizon code agents, a key capability for autonomous dev tools.

Thermodynamic Diffusion Inference with Minimal Digital Conditioning

Aditi De

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI

This paper runs diffusion model inference with minimal digital computation by leveraging thermodynamic equilibration, potentially cutting energy use by a factor of 10,000, with implications for edge AI deployment and sustainable inference infrastructure.

Outperforming Self-Attention Mechanisms in Solar Irradiance Forecasting via Physics-Guided Neural Networks

Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI

Outperforms complex Transformers in solar forecasting using physics-guided CNN-BiLSTM, proving domain knowledge can beat architectural scale—critical for efficient, deployable grid stability systems.

SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

Hongtao Xu, Jianchao Tan, Yuxuan Hu et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI

SparseBalance co-optimizes sequence length and sparsity heterogeneity in long-context training, dramatically improving efficiency and accuracy—essential for scalable LLM training on real-world data without costly over-provisioning.

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis et al.

significant · 🟡 Intermediate · Machine Learning · Efficient Inference · AI Agents

cs.AI

By reusing one small model as summarizer, agent, and isolated code reviewer, this inference-time scaffold roughly doubles AppWorld performance on a single 24GB GPU.

MEMENTO: Teaching LLMs to Manage Their Own Context

Vasilis Kontonis, Yuchen Zeng, Shivam Garg et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.AI · cs.LG

MEMENTO trains reasoning models to summarize their own working state into reusable memory blocks, cutting KV-cache costs about 2.5x and boosting throughput without giving up math, science, or coding accuracy.

KV Cache Offloading for Context-Intensive Tasks

Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev et al.

significant · 🟡 Intermediate · Machine Learning · Efficient Inference

cs.LG · cs.AI · cs.CL

This study shows popular KV-cache offloading schemes break on context-intensive workloads like structured extraction, then offers a simpler strategy that preserves far more accuracy for long-context production inference.

Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

Roberto Vercellino, Jared Willard, Gustavo Campos et al.

significant · 🟡 Intermediate · Machine Learning · Efficient Inference

cs.DC · cs.LG

Provides public H100 power traces for training, fine-tuning, and vLLM inference, then links them to whole-facility planning—useful for sizing clusters, power delivery, and microgrid strategies.

How to sketch a learning algorithm

Sam Gunn

significant · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG

Introduces a data-deletion scheme that approximates how a model would behave if specific training data were removed, an important building block for unlearning, auditing, and data attribution.

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

David Picard, Nicolas Dufour, Lucas Degeorge et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.CV · cs.AI

PoM replaces attention with a linear-time polynomial mixer, maintaining universal approximation while cutting compute, which could help scale vision and language models on edge devices.

In-Place Test-Time Training

Guhao Feng, Shengjie Luo, Kai Hua et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference

cs.LG · cs.AI · cs.CL

In-Place Test-Time Training enables LLMs to adapt weights during inference, overcoming static deployment limits—vital for real-time systems needing continuous learning from streaming data without retraining.

CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

Yulin Zou, Yan Chen, Wenyan Chen et al.

breakthrough · 🟡 Intermediate · Machine Learning · Efficient Inference

cs.DC · cs.CV · cs.LG

CoStream jointly optimizes video codec and multimodal inference to cut computational costs by 40%+—enabling scalable, real-time video analytics without sacrificing accuracy on vision-language models.

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods et al.

breakthrough · 🔴 Advanced · Machine Learning · Efficient Inference · Model Compression

cs.CV

This paper cuts memory use for on-device LLMs by dynamically quantizing the KV cache—no more fixed precision waste. For anyone deploying LLMs on phones or edge devices, this could mean 2x longer context or 50% smaller models without accuracy loss.

Hybrid Fourier Neural Operator for Surrogate Modeling of Laser Processing with a Quantum-Circuit Mixer

Mateusz Papierz, Asel Sagingalieva, Alix Benoit et al.

significant · 🔴 Advanced · Machine Learning · Efficient Inference

cs.CE · cs.LG

HQ-LP-FNO cuts the size and cost of AI models that simulate laser processing by using quantum-inspired mixing, making real-time simulation feasible on standard hardware. This lets manufacturers rapidly test laser parameters without waiting hours for physics simulations.
