Knowledge Distillation 101
source: https://huggingface.co/blog/Kseniase/kd
1. history
- Knowledge Distillation (KD): transfer knowledge from teacher model to a smaller student model
- DeepSeek-R1 recently showed how effective distillation can be in practice
- 2006 - Model Compression - Buciluǎ et al
- the first distillation work
- an ensemble of models could be compressed into a single smaller model
- ensemble: a collection of multiple models working together to solve the same problem; their results are combined into a single output
- 2015 - Distilling the Knowledge in a Neural Network - Geoffrey Hinton, Oriol Vinyals, and Jeff Dean
- coined the term “distillation” for this process
- instead of training the student only on the correct answers, they proposed training it on the probability distribution produced by the teacher model
2. softmax and temperature
- softmax: convert scores (called logits) to probabilities (sum = 1.0)
- temperature (T): control how confident or uncertain a model’s predictions are
- can make the prob distributions more confident (sharp) or more uncertain (soft)
- T = 1 → the default softmax; as T shrinks toward 0, almost all probability collapses onto the top class → hard targets
- T > 1: probs are more spread out or softer → soft targets
- soft targets are useful for distillation. we’ll show why
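A minimal numpy sketch of temperature-scaled softmax (the logit values are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Convert raw logits into probabilities; higher T spreads them out."""
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled -= scaled.max()              # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [8.0, 3.0, 1.0]                                  # hypothetical class scores
print(softmax_with_temperature(logits, T=1.0))            # sharp: ~[0.99, 0.007, 0.001]
print(softmax_with_temperature(logits, T=5.0))            # soft:  ~[0.62, 0.23, 0.15]
```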
3. distillation steps
- teacher model is trained
- teacher model produces logits
- convert logits to soft targets with T > 1
- student is trained on soft targets (along with hard targets as ground truth)
- goal: minimize the difference between the teacher’s and student’s probability distributions
- the student is optimized to mimic the teacher’s behavior (soft targets), not just the final answers (hard targets); a loss sketch follows this list

- alternative: matching logits instead of probs.
- in high-temperature settings, this method becomes mathematically equivalent to standard distillation
- both approaches lead to similar benefits
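A minimal PyTorch sketch of the combined objective (T and alpha are hypothetical hyperparameters; the T² factor keeps the soft-target gradients on the same scale as the hard-target ones):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of a soft-target term (match the teacher) and a hard-target term (match the labels)."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Stand-in tensors instead of real model outputs:
student_logits = torch.randn(8, 10)        # batch of 8 examples, 10 classes
teacher_logits = torch.randn(8, 10)        # in practice: a frozen teacher forward pass
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```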
4. distillation types
what knowledge can be transferred?
- we have seen softmax probs and logits; there are other options
common techniques (survey: https://arxiv.org/abs/2006.05525)
Response-based distillation: uses the teacher’s final output probabilities (like above)
Feature-based distillation: also learns from the teacher’s intermediate layers (feature maps); see the sketch after this list
proposed in FitNets: https://arxiv.org/abs/1412.6550 (2014)

Relation-based distillation: learns relationships btw different parts of the teacher
either between layers or between different data samples
e.g., compare multiple samples and learn their similarities
more complex but can work with multiple teacher models, merging their knowledge
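A minimal sketch of feature-based (FitNets-style) matching; the hidden widths and the linear projector are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student_dim, teacher_dim = 256, 1024              # hypothetical hidden widths
projector = nn.Linear(student_dim, teacher_dim)   # maps student features into the teacher's space

def feature_distillation_loss(student_feats, teacher_feats):
    """MSE between projected student features and (frozen) teacher features."""
    return F.mse_loss(projector(student_feats), teacher_feats.detach())

# Stand-ins for intermediate activations of both models:
loss = feature_distillation_loss(torch.randn(8, student_dim), torch.randn(8, teacher_dim))
print(loss)
```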

5. training types
- Offline distillation: teacher is trained first, then teaches student
- Simple, efficient, and reusable teacher
- The student may struggle to reach teacher-level performance due to a knowledge gap
- Online distillation: teacher and student learn together (see the mutual-learning sketch after this list)
- Adaptive learning, no need for a separate pre-trained teacher
- Computationally expensive and harder to set up
- Self-distillation: student learns from itself
- The model teaches itself using its own deeper layers
- Might not be as effective as learning from a stronger external teacher
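A minimal sketch of online distillation in the mutual-learning style, assuming two classifiers (model_a, model_b) and their optimizers already exist; the hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def online_distillation_step(model_a, model_b, opt_a, opt_b, x, y, T=2.0, alpha=0.5):
    """One training step in which each model fits the labels and mimics its peer's soft predictions."""
    logits_a, logits_b = model_a(x), model_b(x)

    def soft_loss(own_logits, peer_logits):
        # The peer's predictions are detached so each model only updates itself.
        return F.kl_div(F.log_softmax(own_logits / T, dim=-1),
                        F.softmax(peer_logits.detach() / T, dim=-1),
                        reduction="batchmean") * (T ** 2)

    loss_a = (1 - alpha) * F.cross_entropy(logits_a, y) + alpha * soft_loss(logits_a, logits_b)
    loss_b = (1 - alpha) * F.cross_entropy(logits_b, y) + alpha * soft_loss(logits_b, logits_a)

    opt_a.zero_grad(); opt_b.zero_grad()
    (loss_a + loss_b).backward()          # both models are updated together, no pre-trained teacher
    opt_a.step(); opt_b.step()
    return loss_a.item(), loss_b.item()
```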

6. improved algorithms
- Multi-teacher distillation: use multiple teachers
- Cross-modal distillation: transfers knowledge between different modalities (image → text, audio → video, etc.)
- Attention-based distillation: teacher generates attention maps to highlight important areas in the data
- in CV, when identifying a “cat,” the attention map will show bright spots over the ears, eyes, and whiskers (important pixels)
- in NLP, the analogue is the teacher’s attention scores over tokens (see the attention-transfer sketch after this list)
- Non-Target Class-Enhanced KD (NTCE-KD): also exploits the probabilities the teacher assigns to non-target (incorrect) classes, not just the target class
- Adversarial distillation: uses a GAN-style setup; a generator produces extra synthetic training data, and a discriminator checks whether data is fake or tries to tell teacher outputs from student outputs
- Quantized distillation: applies quantization to the student, combining distillation with lower-precision training
- Speculative Knowledge Distillation (SKD): the student generates draft tokens and the teacher selectively replaces low-quality ones → high-quality training data produced on the fly
- Lifelong distillation: Model keeps learning over time
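A minimal sketch in the attention-transfer style for CNNs: collapse each feature map into a spatial attention map and make the student's match the teacher's (the shapes are hypothetical):

```python
import torch
import torch.nn.functional as F

def attention_map(feature_map):
    """Collapse a (B, C, H, W) feature map into an L2-normalized spatial attention map."""
    energy = feature_map.pow(2).mean(dim=1)        # (B, H, W): per-pixel channel energy
    return F.normalize(energy.flatten(1), dim=1)   # (B, H*W), unit norm per sample

def attention_distillation_loss(student_feats, teacher_feats):
    """MSE between student and (frozen) teacher attention maps."""
    return F.mse_loss(attention_map(student_feats), attention_map(teacher_feats.detach()))

# Different channel counts are fine as long as the spatial size matches:
loss = attention_distillation_loss(torch.randn(4, 64, 14, 14), torch.randn(4, 256, 14, 14))
print(loss)
```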
7. distillation scaling laws
We want to predict how effective distillation will be
Apple and the University of Oxford have a good study on this. main findings:
effectiveness depends on the teacher’s size and quality, the student’s size, and the number of training tokens
- power law: performance improves in a predictable way, but only up to a point (an illustrative form is sketched at the end of this section)
- if student is small, you can use a smaller teacher to save compute
- if student is large, you need a better and larger teacher
a good teacher doesn’t always mean a better student.
- capacity gap: if teacher is too strong → student might struggle to learn from it
distillation is most efficient when
- the student is small enough that traditional supervised training would be more expensive
- supervised training relies only on ground-truth labels
- a small model has less capacity, so learning from labels alone is harder and needs massive amounts of training
- a teacher already exists, so its training cost is already paid
supervised training is more efficient when
- both the teacher and student need to be trained, and teacher training is more costly
- enough compute and data are available
sometimes a student model can even outperform its teacher
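Purely as an illustration (this is a generic saturating power law, not the fitted formula from the Apple/Oxford paper), the idea can be written as:

$$ L_{\text{student}}(D) \;\approx\; E + a \, D^{-\alpha} $$

where $D$ is the number of distillation tokens, $a$ and $\alpha$ are fitted constants, and $E$ is a floor the student cannot go below, set by its own capacity and the teacher’s quality; the paper’s actual law also folds in teacher and student sizes, which is why all three factors above matter.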
8. distill DeepSeek-R1
one study finetuned Qwen and Llama using the 800k training examples from DeepSeek-R1
DeepSeek-R1-Distill-Qwen-7B model outperformed QwQ-32B on reasoning benchmarks