← Back to archive day

Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

Yao Chen, Jiawei Sheng, Wenyuan Zhang, Tingwen Liu

35

Recommendation Score

breakthrough🔴 AdvancedNLPLLM ReasoningModel CompressionSystemBest for researchers

Research context

Primary field

NLP

Language understanding, generation, extraction, and evaluation.

Topics

LLM Reasoning, Model Compression

Paper type

System

Best for

Best for researchers

arXiv categories

cs.CLcs.CL

Why It Matters

Proposes stepwise attention distillation to transfer dynamic reasoning focus from large to small models, significantly improving small-model reasoning without requiring larger architectures—key for efficient deployment in resource-constrained systems.

Abstract

The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.

Published April 17, 2026
© 2026 A2A.pub — AI to Action. From papers to practice, daily.
Summaries are AI-assistedPrivacyTerms