Jae-Kyung Cho Being unique is better than being perfect

Diary - LLM knowledge distillation

At some point I started using gpt-4o-mini instead of gpt-4o. The biggest difference is speed. gpt-4o’s inference speed isn’t exactly slow, but compared to gpt-4o-mini it feels frustratingly sluggish. And is there a performance difference? Not really.

How did a small model end up catching up to a large model’s performance like this? The reason probably lies in the Knowledge Distillation technique.

While studying Knowledge Distillation, I came across a simple yet excellent piece of writing: “Post-training distillation for LLMs”.

With the consent of the original author, Rishabh Agarwal, I have translated and reorganized this content in my own way.

Introduction

What is Knowledge distillation (KD)?

A method for transferring the knowledge held by a large, expensive teacher model to a small student model.

Image

In fact, this distillation technique is what broke the usual reading of the scaling law. Previously, it was generally true that the larger the model, the higher the performance. As we went from GPT-1 to GPT-2 to GPT-3, the number of parameters grew dramatically, and a model that couldn’t even do single-digit arithmetic became able to handle 4-digit and 5-digit arithmetic (an emergent ability).

However, if you look at the performance vs. model pricing graph released by LMSYS (pricing is directly correlated with model size), that rule is gradually breaking down. Image

Small, cheap models are catching up to large, expensive ones. You can see why this is possible just by looking at Jack Morris’s post below. Image

In the end, you can say that after a powerful large model is released into the world, knowledge distillation techniques are used to train small models, greatly narrowing the performance gap.

Knowledge distillation techniques

So let’s take a look at what kinds of knowledge distillation techniques exist.

Supervised KD (optimizes forward KL)

(Hinton et al., 2015, Distilling the Knowledge in a Neural Network)

When a large model and a small model are trained on the same dataset, the first difference you notice is Classification confidence.

Image

Image

In LLM terms, if you compare the highest-probability token at each position, the teacher model assigns it a higher probability than the student model does.
In other words, if the two models generate the same response, the teacher model has the lower perplexity (PPL). (PPL isn’t directly tied to LLM performance, but a lower loss usually means better classification performance, so that’s the perspective here.)

Image

If the two models’ output distributions (the softmax of the logits, i.e. the probability distribution over tokens) are the same, then the two models will have the same performance. Making this happen is the goal of knowledge distillation.

Making the teacher model’s logit distribution and the student model’s logit distribution the same
Making the distributions the same == minimizing KL divergence

In the end, instead of hard labels (ground truth), you compute and minimize the cross-entropy loss against the soft labels produced by the teacher model. The usual LLM training objective is next-token prediction; this one can be seen as next-token distribution prediction.
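To make the soft-label loss concrete, here is a minimal sketch in plain Python. The function names and the toy logits are my own assumptions for illustration; a real training loop would compute the same quantity on tensors with an autodiff framework.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )

def supervised_kd_loss(teacher_logits_seq, student_logits_seq):
    """Average the per-token forward KL over a sequence."""
    losses = [
        forward_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(losses) / len(losses)

# Toy vocabulary of 4 tokens, 2 positions (made-up numbers).
teacher_logits = [[4.0, 1.0, 0.5, 0.1], [0.2, 3.5, 0.3, 0.1]]
student_logits = [[2.0, 1.5, 1.0, 0.5], [0.5, 1.5, 1.0, 0.8]]

loss = supervised_kd_loss(teacher_logits, student_logits)
```

Cross-entropy against the teacher’s soft labels equals this KL plus the teacher’s entropy, which does not depend on the student, so minimizing either one gives the same student.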

Image

So how does it perform?

In the Gemma 2 paper, they compared distillation from a 7B teacher against from-scratch training when pre-training a 2B model.

Image

In the end, distillation gives both higher performance and lower PPL, so if you have a powerful model there’s no need to bother with from-scratch training.

Distillation using “Synthetic Data”

(Kim et al., 2016, Sequence-Level Knowledge Distillation) → A familiar name to the folks at SKT…!!

Actually, Supervised KD requires the teacher’s logits, and while this looks simple, an LLM’s logits are a vector with at least around 100K entries (the vocabulary size) per token, so memory issues can arise as soon as the sequence gets a bit longer. (There’s also a method that uses only the top-K logits.) So a simpler yet powerful method using synthetic data was proposed.

Image

You take prompt data and extract responses from the teacher model. Then you perform SFT on the student model using that data.
The greatest advantage of this methodology is… distillation is possible with just API access! (You don’t need to know anything about logits, so you can distill from the latest LLMs like gpt-4o to boost performance.)
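The two steps above can be sketched roughly as follows. `teacher_generate` is a hypothetical callable standing in for whatever teacher access you have (for example an API request); the helper names and the record format are my own assumptions, not a specific library’s API.

```python
def build_distillation_dataset(prompts, teacher_generate):
    """Pair each prompt with the teacher's response to form SFT examples."""
    return [
        {"prompt": prompt, "response": teacher_generate(prompt)}
        for prompt in prompts
    ]

def fake_teacher(prompt):
    # Hypothetical stand-in for a real teacher call (e.g. an API request).
    return f"[teacher answer to: {prompt}]"

prompts = ["What is 17 * 3?", "Summarize KD in one sentence."]
sft_data = build_distillation_dataset(prompts, fake_teacher)
```

The resulting records are then fed to an ordinary SFT pipeline for the student; nothing about the teacher’s logits or tokenizer is needed.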

Image

Sam Altman’s recent post taking a shot at the newly released DeepSeek is in a similar vein.

Image

But this synthetic data methodology actually has a solid theoretical foundation.

Image

If you do a Monte-Carlo approximation of the expectation part of this final equation, you end up with exactly the synthetic data distillation methodology. Therefore, using the synthetic data distillation methodology minimizes the KL divergence between the teacher model and the student model, just like Supervised KD.
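Written out, the argument goes roughly like this. The notation is my own assumption and may differ from the figure: $p_T$ is the teacher, $p_S^{\theta}$ the student with parameters $\theta$, $x$ a prompt, and $y$ a response.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(p_T \,\|\, p_S^{\theta}\big)
  &= \mathbb{E}_{x}\,\mathbb{E}_{y \sim p_T(\cdot \mid x)}
     \big[\log p_T(y \mid x) - \log p_S^{\theta}(y \mid x)\big] \\
\arg\min_{\theta} D_{\mathrm{KL}}\big(p_T \,\|\, p_S^{\theta}\big)
  &= \arg\max_{\theta}\; \mathbb{E}_{x}\,\mathbb{E}_{y \sim p_T(\cdot \mid x)}
     \big[\log p_S^{\theta}(y \mid x)\big] \\
  &\approx \arg\max_{\theta}\; \frac{1}{N} \sum_{i=1}^{N}
     \log p_S^{\theta}(y_i \mid x_i), \qquad y_i \sim p_T(\cdot \mid x_i)
\end{aligned}
```

The first step drops $\log p_T$ because it does not depend on $\theta$. The last line is the ordinary SFT log-likelihood, evaluated on responses sampled from the teacher.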

Also, the higher the quality of the answers you obtain, the better the student model’s performance becomes.

A representative method is Rejection sampling (BoN). You generate N answers from the teacher model and train the student using only the best answer among them. Image
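A minimal sketch of Best-of-N selection, under assumptions: `teacher_sample` and `score` are hypothetical callables. How “best” is judged (a reward model, a verifier, exact match against a known answer) is left open here; the toy below uses a known target purely for illustration.

```python
import random

def best_of_n(prompt, teacher_sample, score, n=4):
    """Sample n teacher responses and keep only the highest-scoring one."""
    candidates = [teacher_sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical toy teacher: answers "a+b" prompts, sometimes off by one.
rng = random.Random(0)

def toy_teacher(prompt):
    a, b = (int(t) for t in prompt.split("+"))
    return a + b + rng.choice([-1, 0, 0, 1])

def toy_score(answer):
    # Hypothetical verifier: closer to the known answer 12 is better.
    return -abs(answer - 12)

best = best_of_n("5+7", toy_teacher, toy_score, n=8)
```

Only `best` goes into the student’s SFT data; the other N-1 samples are discarded.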

An extended approach is Compute-matched / Cost-matched sampling. When using BoN, you use a smaller teacher model and increase N (equivalent to using a cheaper model). Image
The funny thing is that even using a teacher model smaller than the student model and increasing N showed a meaningful performance improvement. Image
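As a worked example of the matching itself: if you assume the cost of one sample scales linearly with some per-sample price (parameter count, FLOPs, or API price), the number of samples the cheaper teacher can afford follows directly. The linear-cost assumption, the function name, and the numbers are mine, for illustration only.

```python
import math

def compute_matched_n(n_large, cost_large, cost_small):
    """Samples the cheaper teacher can draw for the budget of n_large large-teacher samples."""
    budget = n_large * cost_large
    return math.floor(budget / cost_small)

# Example: 4 samples from a teacher that costs 27 units per sample
# buy 12 samples from a teacher that costs 9 units per sample.
n_small = compute_matched_n(n_large=4, cost_large=27.0, cost_small=9.0)
```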

This methodology has two huge advantages.

  • As I said earlier, knowledge distillation is possible with just API access. (Before DeepSeek R1, most SOTA models were closed-source.) Even if a model isn’t open-source, there’s no practical way to block API usage.
  • It works even with different tokenizers! (Forward KL requires matching logits token by token, so the tokenizers have to be exactly identical.)

In the end, this means distillation is possible even between models with completely different architectures or foundations.

(There’s no reason not to use this…??)

GKD: Generalized knowledge distillation

(Agarwal et al., 2023, On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes)

The two distillation methodologies introduced above have one fatal problem: train-inference mismatch.

When the training distribution and the inference distribution differ, you frequently fall into OOD (Out-of-distribution) situations. Especially since the student model performs worse than the teacher model, once a wrong token comes out, the answer can veer off into something completely bizarre.

Image

The training distribution is the teacher model’s responses, and the inference distribution is the student model’s responses, so no matter how much you train the two distributions to become similar, the structure remains vulnerable to OOD.

To solve this, the On-policy distillation methodology was proposed. The goal of this methodology is to train while aligning the train and inference distributions.

Image

  • On-policy data: Use the student model’s responses
  • Feedback: Use the teacher model’s logits for the student model’s responses
  • Supervised training: Train so that the teacher–student token-level logit distributions become the same
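The three steps above can be sketched as one loss computation. This is a toy under assumptions: the “models” are hypothetical functions that map a token prefix to a next-token distribution, and no gradient step is shown, only where the data and the feedback come from.

```python
import math
import random

def sample_token(probs, rng):
    """Draw one token id from a next-token distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at a single token position."""
    return sum(
        q * (math.log(q + eps) - math.log(p + eps))
        for q, p in zip(student_probs, teacher_probs)
        if q > 0.0
    )

def on_policy_loss(student, teacher, prompt, length, rng):
    """Sample a response from the student and score every prefix with the teacher."""
    context = list(prompt)
    losses = []
    for _ in range(length):
        q = student(context)   # student's next-token distribution
        p = teacher(context)   # teacher feedback on the same prefix
        losses.append(reverse_kl(q, p))
        context.append(sample_token(q, rng))  # on-policy: the student's own token
    return sum(losses) / len(losses), context[len(prompt):]

# Hypothetical toy models over a 3-token vocabulary.
def toy_teacher(context):
    return [0.7, 0.2, 0.1] if len(context) % 2 == 0 else [0.1, 0.8, 0.1]

def toy_student(context):
    return [0.5, 0.3, 0.2] if len(context) % 2 == 0 else [0.3, 0.5, 0.2]

loss, response = on_policy_loss(
    toy_student, toy_teacher, prompt=[0], length=4, rng=random.Random(0)
)
```

The key difference from the earlier methods is the `context.append(...)` line: the prefixes the student is trained on are the ones it actually produces at inference time.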

If you express this mathematically, you can see it’s Reverse KL.

Image

It’s similar to the KL used in the synthetic data methodology earlier, but you can see the positions of the student and teacher distributions have been swapped. (KL-divergence is asymmetric, so they’re not the same!)

Of course, even though the formulas differ, minimizing either KL divergence has the same purpose: making the student model’s distribution match the teacher model’s. So what is actually different?

The teacher model and the student model fundamentally differ in expressiveness (model capacity, which roughly tracks the number of parameters). The student cannot match the teacher’s distribution exactly, and the two divergences differ in which kind of mismatch they tolerate.

If you use forward KL, the student distribution is trained to be mode-covering: it spreads probability over everything the teacher considers plausible. Image
On the other hand, if you use reverse KL, it’s trained to be mode-seeking: it concentrates on one of the teacher’s modes. Image
(There’s a fantastic blog post about forward KL and reverse KL: “forward KL, Reverse KL explanation”.)
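A small numeric example makes the asymmetry visible. The distributions below are hand-picked toys (my own assumption): a teacher with two modes, and two candidate students that each fail to match it in a different way.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

teacher = [0.45, 0.05, 0.05, 0.45]         # two modes: token 0 and token 3
student_cover = [0.25, 0.25, 0.25, 0.25]   # spreads mass over everything
student_seek = [0.85, 0.05, 0.05, 0.05]    # commits to one teacher mode

candidates = [("cover", student_cover), ("seek", student_seek)]
forward = {name: kl(teacher, q) for name, q in candidates}  # KL(teacher || student)
reverse = {name: kl(q, teacher) for name, q in candidates}  # KL(student || teacher)
```

Forward KL comes out at about 0.37 for the covering student versus about 0.70 for the seeking one, so it prefers covering. Reverse KL comes out at about 0.51 versus about 0.43, so it prefers the student that commits to a single mode.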

For a student model with limited expressiveness, mode-seeking may actually be more suitable, because it’s better to produce the teacher’s second-choice token than a random out-of-place token. But if the teacher’s modes aren’t sharp and spiky and its probability is sufficiently spread out, mode-covering might be better.

Image

So when they ran experiments, mixing forward KL and reverse KL worked best. (0.5 forward KL + 0.5 reverse KL = Jeffreys divergence, up to a constant factor)

Anyway, we’ve covered the theoretical parts, but how exactly can we implement the GKD methodology? The simplest way is to use GKD implemented in TRL. Or you can adapt RLxF (RLHF, RLAIF) code.

Image

RLxF code includes a KL term as a regularizer that keeps the policy from drifting too far from the SFT model during RL. You plug in the teacher model in place of the SFT model, delete the reward-related term, and GKD is complete.
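Schematically, the change looks like this. It is a sketch of the objective only, with hypothetical function names and scalar inputs; actual RLxF codebases structure this differently.

```python
def rlhf_objective(reward, kl_to_reference, beta=0.1):
    """Typical RLHF-style objective: reward minus a KL penalty to a reference policy."""
    return reward - beta * kl_to_reference

def gkd_objective_from_rlhf(kl_to_teacher, beta=1.0):
    """Same code path, with the teacher as the reference and the reward removed."""
    return rlhf_objective(reward=0.0, kl_to_reference=kl_to_teacher, beta=beta)

# With the reward gone, maximizing the objective just minimizes the KL to the teacher.
value = gkd_objective_from_rlhf(kl_to_teacher=0.5)
```

The KL penalty in RLHF is usually computed on the policy’s own samples, KL(policy || reference), which is why swapping in the teacher gives the reverse KL on on-policy data.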

There’s also a claim that if you keep the reward-related term, you can kill two birds with one stone: RL + distillation. (Of course, once RL gets involved, training instability increases.)

Image

Other advantages of distillation

In fact, the advantages of distillation aren’t only in improving the student model’s performance!

Speculative decoding

(Leviathan et al., 2023, Fast Inference from Transformers via Speculative Decoding)

Transformers made it possible to train massive models through training parallelization, but they’re extremely slow at inference. You can compute the logits for the input sequence all at once, but when generating tokens you have to generate them sequentially.

Speculative decoding came out of this problem: it speeds up a large model by pairing it with a small model that answers similarly to the large one.

Image

  • Since the small model can generate quickly, it first drafts several tokens
  • The large model computes the logits for those drafted tokens all at once and finds the first token it disagrees with
  • That token is replaced with the large model’s own token, and the small model resumes drafting from there -> repeat
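The loop above can be sketched as follows. This is a simplified greedy variant under assumptions: both “models” are hypothetical functions that return the next token for a prefix, and a drafted token is accepted only if it equals the large model’s greedy choice. The paper’s actual scheme uses a probabilistic accept/reject rule so that sampling is also supported.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One draft-then-verify round; returns the tokens to append to prefix."""
    # 1) The small model drafts k tokens sequentially (cheap).
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))
    # 2) The large model checks every drafted position. A real implementation
    #    does this in one batched forward pass; here we loop for clarity.
    accepted = []
    for token in drafted:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first wrong token
            return accepted
        accepted.append(token)
    # 3) All drafts accepted: the same pass also yields one extra target token.
    accepted.append(target_next(prefix + accepted))
    return accepted

def generate(prefix, draft_next, target_next, max_new_tokens, k=4):
    """Repeat draft-then-verify rounds until enough new tokens exist."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new_tokens:
        out += speculative_step(out, draft_next, target_next, k)
        rounds += 1
    return out[: len(prefix) + max_new_tokens], rounds

# Hypothetical toy models: the target counts upward mod 10,
# the draft agrees except right after token 2.
def toy_target(ctx):
    return (ctx[-1] + 1) % 10

def toy_draft(ctx):
    return 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10

tokens, rounds = generate([0], toy_draft, toy_target, max_new_tokens=8, k=4)
```

In this toy, 8 new tokens are produced in 2 verification rounds instead of 8 sequential large-model steps, and the output is identical to what the large model would have generated greedily on its own.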

For this to pay off, the large model’s and small model’s answers have to be similar. If every drafted token turns out different, you’ve run the small model for nothing and consumed more resources overall.

DistillSpec

(Zhou et al., 2024, DistillSpec: Improving Speculative Decoding via Knowledge Distillation)

But even if they were trained on the same dataset, there’s no guarantee that the small model’s and large model’s answers will be similar. So a method to maximize speculative decoding performance through distillation was proposed.

Image

With distillation, the student model’s answers become similar to the teacher’s, increasing inference speed by 10–45%. This methodology is actually applied to Google’s search page.

Example of speculative decoding applied to Google

Advanced knowledge distillation

SKD: speculative knowledge distillation

(Xu et al., 2025, Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling)

Most recently, a hybrid methodology that borrows the idea of speculative decoding was also proposed.

GKD performs well, but there’s a question of whether it’s okay to do GKD when the student’s on-policy responses aren’t that good. In such a situation, it might be better to do Supervised KD using the teacher’s high-quality responses, even if they’re off-policy.
The methodology that takes only the advantages of these two is SKD.

Image

First, you extract the student’s on-policy responses. Then you check whether each token is included in the teacher’s top-K tokens. If they’re all included, the on-policy response is in pretty good shape, so you perform GKD. If there are tokens that aren’t included, you replace those tokens with the teacher’s tokens to raise the quality, and use this to perform Supervised KD.
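The token check described above can be sketched like this. It is a simplification under assumptions: the “models” are hypothetical functions, and a rejected token is replaced with the teacher’s single most likely token, whereas an actual implementation may sample the replacement from the teacher.

```python
def top_k_ids(probs, k):
    """Indices of the k most probable tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def skd_sample(prompt, student_next, teacher_probs, length, k=2):
    """Student proposes tokens; proposals outside the teacher's top-k are replaced."""
    context = list(prompt)
    replaced = 0
    for _ in range(length):
        token = student_next(context)
        allowed = top_k_ids(teacher_probs(context), k)
        if token not in allowed:
            token = allowed[0]  # simplification: take the teacher's top token
            replaced += 1
        context.append(token)
    return context[len(prompt):], replaced

# Hypothetical toy models over a 4-token vocabulary.
def toy_teacher_probs(context):
    return [0.6, 0.3, 0.1, 0.0]

def toy_student_next(context):
    return [0, 3, 1, 2][len(context) % 4]

response, replaced = skd_sample([], toy_student_next, toy_teacher_probs, length=4, k=2)
```

When `replaced` is 0 the response is purely on-policy, which corresponds to the GKD case; otherwise the training sequence is a mix of student tokens and teacher corrections.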

When this SKD methodology is applied, they say both performance and speculative decoding speed improved.

Image

This is because from the perspective of the student’s performance, on-policy distillation is better, while from the perspective of the teacher’s speculative decoding, distillation using teacher responses is better.

Conclusion

| Criterion | Online vs. Offline KD (“>” = better on this criterion) |
| --- | --- |
| Compute efficiency | Online < Offline |
| Sample efficiency | Online > Offline |
| Resource waste from a real-time teacher model during training | Online < Offline |
| Time delay from student sampling | Online < Offline |
| Suitability for long horizon tasks | Online > Offline |
| Train-test distribution mismatch | Online > Offline |

  • Advantages of Online KD
    • Doesn’t require much data
    • Suitable for long horizon tasks like Agents because the train-test distribution mismatch is small (doesn’t easily fall into OOD)
  • Advantages of Offline KD
    • Once you generate the teacher model’s responses or store the logits just once, you can keep reusing them
      • If you extract the logits in advance, you can also save the GPU resources needed for the teacher model
    • Training is fast because there’s no student on-policy generation process

The trade-off of knowledge distillation ultimately comes down to resources versus performance. The more a methodology improves performance, the more computing resources / time resources it requires.
So let’s choose an appropriate distillation methodology depending on the situation.
