Knowledge Distillation 101

source: https://huggingface.co/blog/Kseniase/kd

1. history

  • Knowledge Distillation (KD): transfer knowledge from a large teacher model to a smaller student model
  • DeepSeek-R1 demonstrated how effective distillation can be in practice
  • 2006 - Model Compression - Buciluǎ et al.
    • the first distillation work
    • an ensemble of models could be compressed into a single smaller model
    • ensemble: a collection of multiple models working together to solve the same problem; their results are combined into a single output
  • 2015 - Distilling the Knowledge in a Neural Network - Geoffrey Hinton, Oriol Vinyals, and Jeff Dean
    • introduced the term “distillation” for this idea
    • instead of training only on the correct answers, they proposed training the student on the probability distribution produced by the teacher model

2. softmax and temperature

  • softmax: convert scores (called logits) to probabilities (sum = 1.0)
  • temperature (T): control how confident or uncertain a model’s predictions are
    • can make the prob distributions more confident (sharp) or more uncertain (soft)
    • T = 1 → standard softmax
    • as T → 0, nearly all probability collapses onto the single top class, like a one-hot ground-truth label → hard targets
    • T > 1: probs are more spread out or softer → soft targets
  • soft targets are the useful ones for distillation. we’ll see why; a small temperature sketch follows
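A minimal sketch of how temperature changes a softmax, in plain NumPy (the logits below are made-up numbers for illustration, not from the post):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Convert logits to probabilities; larger T spreads the probability mass out."""
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()              # subtract max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [8.0, 2.0, 0.5]                # hypothetical teacher scores for 3 classes
print(softmax_with_temperature(logits, T=1.0))   # sharp: ~[0.997, 0.002, 0.001]
print(softmax_with_temperature(logits, T=5.0))   # soft: other classes get visible probability
```

With T = 5 the ranking of the classes stays the same, but the small probabilities on the wrong classes (the “dark knowledge”) become large enough for a student to learn from.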

3. distillation steps

  1. teacher model is trained
  2. teacher model produces logits
  3. convert logits to soft targets with T > 1
  4. student is trained on soft targets (along with hard targets as ground truth)
    • goal: minimize the difference between the teacher’s and the student’s probability distributions
    • the student is optimized to mimic the teacher’s behavior (soft targets), not just to reproduce the ground-truth labels (hard targets); a minimal loss sketch follows
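A minimal PyTorch sketch of steps 2–4 (the T = 4 and alpha = 0.5 values are typical hyperparameters chosen for illustration, not from the post; the T² factor keeps the soft-target gradients on the same scale, as noted in Hinton et al.):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target KL term (teacher vs. student at temperature T) + hard-target cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)   # ground-truth labels
    return alpha * soft_loss + (1 - alpha) * hard_loss

# toy usage with random tensors: batch of 8 examples, 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)        # would come from the frozen teacher
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels).item())
```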


  • alternative: matching logits instead of probs
    • in the high-temperature limit (with zero-mean logits), this method becomes mathematically equivalent to standard distillation
    • both approaches lead to similar benefits; a short sketch follows
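A hedged sketch of the logit-matching variant: regress the student’s logits directly onto the teacher’s with MSE, zero-meaning each example’s logits (the assumption under which this matches high-temperature distillation):

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits):
    # Zero-mean the logits per example, then match them directly with MSE.
    s = student_logits - student_logits.mean(dim=-1, keepdim=True)
    t = teacher_logits - teacher_logits.mean(dim=-1, keepdim=True)
    return F.mse_loss(s, t)
```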

4. distillation types

what knowledge can be transferred?

  • we have seen softmax probabilities and logits; there are other options

common techniques

https://arxiv.org/abs/2006.05525

  1. Response-based distillation: uses the teacher’s final output probabilities (like above)

  2. Feature-based distillation: also learns from the teacher’s intermediate layers (feature maps); see the sketch after this list

  3. Relation-based distillation: learns relationships between different parts of the teacher

    • either between layers or between different data samples

    • e.g., compares multiple samples and learns how similar they are to each other

    • more complex but can work with multiple teacher models, merging their knowledge

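A hedged sketch of the feature-based and relation-based ideas (the activation tensors and the projection layer are placeholders; real implementations hook specific intermediate layers of both models):

```python
import torch
import torch.nn.functional as F

# Feature-based: match an intermediate representation.
# A small projection is often needed because student and teacher widths differ.
student_feat = torch.randn(8, 256)        # hypothetical student hidden states (batch of 8)
teacher_feat = torch.randn(8, 512)        # hypothetical teacher hidden states
project = torch.nn.Linear(256, 512)
feature_loss = F.mse_loss(project(student_feat), teacher_feat)

# Relation-based: match pairwise similarities *between samples in the batch*,
# so only the relational structure (an 8x8 matrix) is transferred, not raw features.
def pairwise_similarity(feats):
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.T                # cosine-similarity matrix

relation_loss = F.mse_loss(pairwise_similarity(student_feat),
                           pairwise_similarity(teacher_feat))
print(feature_loss.item(), relation_loss.item())
```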

5. training types

  1. Offline distillation: teacher is trained first, then teaches student
    1. Simple, efficient, and reusable teacher
    2. The student may struggle to reach teacher-level performance due to a knowledge gap
  2. Online distillation: teacher and student learn together (a minimal sketch follows this list)
    1. Adaptive learning, no need for a separate pre-trained teacher
    2. Computationally expensive and harder to set up
  3. Self-distillation: student learns from itself
    1. The model teaches itself using its own deeper layers
    2. Might not be as effective as learning from a stronger external teacher
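A hedged sketch of the online setup in the “deep mutual learning” style, where two models train together and each also fits the other’s softened predictions (the tiny linear models and random data are placeholders):

```python
import torch
import torch.nn.functional as F

model_a = torch.nn.Linear(16, 4)          # stand-ins for the two peers
model_b = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(list(model_a.parameters()) + list(model_b.parameters()), lr=0.1)

x, y, T = torch.randn(32, 16), torch.randint(0, 4, (32,)), 2.0

for _ in range(10):
    logits_a, logits_b = model_a(x), model_b(x)
    # Each model fits the labels AND the other's (detached) soft predictions.
    kl_a = F.kl_div(F.log_softmax(logits_a / T, dim=-1),
                    F.softmax(logits_b.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    kl_b = F.kl_div(F.log_softmax(logits_b / T, dim=-1),
                    F.softmax(logits_a.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    loss = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y) + kl_a + kl_b
    opt.zero_grad()
    loss.backward()
    opt.step()
```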


6. improved algorithms

  • Multi-teacher distillation: use multiple teachers
  • Cross-modal distillation: transfers knowledge between different modalities (image → text, audio → video, etc.)
  • Attention-based distillation: the teacher provides attention maps that highlight the important parts of the input, and the student learns to match them (see the sketch after this list)
    • in CV, when identifying a “cat,” the attention map shows bright spots over the ears, eyes, and whiskers (the important pixels)
    • in NLP, attention scores over tokens play the same role
  • Non-Target Class-Enhanced KD (NTCE-KD): also exploits the probabilities the teacher assigns to the non-target (incorrect) classes
  • Adversarial distillation: uses a GAN-style setup; a generator produces extra synthetic data, and a discriminator checks whether an output comes from the teacher or the student
  • Quantized distillation: applies quantization to the student
  • Speculative Knowledge Distillation (SKD): the student generates draft tokens and the teacher selectively replaces the low-quality ones → high-quality training data on the fly
  • Lifelong distillation: Model keeps learning over time
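A hedged sketch of attention-based distillation in the “attention transfer” style for CNNs: collapse each model’s feature map into a spatial attention map and make the student match the teacher’s (the random activations below are placeholders for real layer outputs):

```python
import torch
import torch.nn.functional as F

def attention_map(feature_map):
    """Collapse channels into a normalized spatial map of where activation energy concentrates."""
    amap = feature_map.pow(2).mean(dim=1)            # (B, H, W)
    return F.normalize(amap.flatten(1), dim=-1)      # (B, H*W), unit norm per sample

teacher_act = torch.randn(8, 512, 14, 14)            # hypothetical teacher layer activations
student_act = torch.randn(8, 128, 14, 14)            # student layer at the same spatial size
attention_loss = F.mse_loss(attention_map(student_act), attention_map(teacher_act))
print(attention_loss.item())
```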

7. distillation scaling laws

We want to predict how effective distillation will be

Apple and the University of Oxford have a good study on this. main findings:

  • effectiveness depends on the teacher’s size and quality, the student’s size, and the number of training tokens

    • power law: performance improves in a predictable way, but only up to a point (a rough illustrative formula appears after this list)
    • if student is small, you can use a smaller teacher to save compute
    • if student is large, you need a better and larger teacher


  • a good teacher doesn’t always mean a better student.

    • capacity gap: if teacher is too strong → student might struggle to learn from it
  • distillation is most efficient when

    • student is small enough that traditional supervised training is more expensive
      • supervised training only relies on ground truth labels
      • a small model has limited capacity, so learning from labels alone is harder and needs massive amounts of training
    • the teacher already exists, so its training cost doesn’t have to be paid again
  • supervised training is more efficient when

    • both the teacher and student need to be trained, and teacher training is more costly
    • enough compute and data are available
  • sometimes a student model can even outperform its teacher
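As a rough illustration of what “power law, but only to a point” means here (a generic Chinchilla-style form for intuition only, not the exact parameterization fitted in the Apple/Oxford paper):

$$
L_{\text{student}} \;\approx\; L_{\text{floor}}(\text{teacher}) \;+\; \frac{A}{N_{\text{student}}^{\alpha}} \;+\; \frac{B}{D_{\text{distill}}^{\beta}}
$$

where $N_{\text{student}}$ is the student’s parameter count, $D_{\text{distill}}$ the number of distillation tokens, $A, B, \alpha, \beta$ fitted constants, and $L_{\text{floor}}$ a teacher-dependent level around which the gains flatten out (though, as noted above, a student can occasionally dip below it).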

8. distill DeepSeek-R1

one study finetuned Qwen and Llama using the 800k training examples from DeepSeek-R1

DeepSeek-R1-Distill-Qwen-7B model outperformed QwQ-32B on reasoning benchmarks
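In this setting the “distillation” is sequence-level: the student is simply fine-tuned with ordinary next-token cross-entropy on text generated by the teacher, with no soft targets involved. A minimal sketch under that assumption (the small placeholder model and the single toy example stand in for a real student and the 800k R1-generated traces; this is not the actual DeepSeek recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"          # placeholder small student model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Stand-in for teacher-generated (question, reasoning trace, answer) examples.
teacher_traces = ["Question: 2 + 2? Let's think step by step: 2 + 2 = 4. Answer: 4"]

model.train()
for text in teacher_traces:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])   # cross-entropy on the teacher's text
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```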

