Bridge-Garden Theory Explains Why Mixing Hard and Soft Labels Improves Knowledge Distillation for LLMs

[Submitted on 25 May 2026]

Summary

This research paper investigates knowledge distillation (KD) for language models, specifically why mixing hard labels (sampled tokens) and soft labels (full token distributions) from a teacher model yields better student performance than using either alone. The authors find the improvement comes from reduced exposure bias, not closer teacher matching. They introduce the Bridge-Garden Decomposition theory, which categorizes generation steps into "Bridges" (where exact tokens are required) and "Gardens" (where flexibility is allowed). Hard-only KD works better for Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens. A hybrid strategy handles both cases, reducing exposure bias. The authors develop Bridge-Garden hybrid supervision methods that adaptively balance hard and soft labels, outperforming baselines across seven teacher-student model pairs (Qwen, Llama, Gemma, DeepSeek) on reasoning and coding benchmarks, while reducing training cost by 9.7x.
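To make the hard/soft distinction concrete, here is a minimal sketch of a per-token loss that mixes the two kinds of supervision with a fixed weight. It is not the authors' method: the function names, the use of forward KL for the soft term, and the `alpha` parameter are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_loss(student_probs, sampled_token):
    """Hard label: cross-entropy against one token sampled from the teacher."""
    return -math.log(student_probs[sampled_token])

def soft_loss(student_probs, teacher_probs):
    """Soft label: forward KL(teacher || student) over the whole vocabulary.

    Assumption: the source does not say which divergence the authors use.
    """
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

def mixed_loss(student_logits, teacher_probs, sampled_token, alpha=0.5):
    """Hypothetical fixed mix: alpha weights the hard term, (1 - alpha) the soft term."""
    s = softmax(student_logits)
    return alpha * hard_loss(s, sampled_token) + (1 - alpha) * soft_loss(s, teacher_probs)

# Toy example with a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
print(mixed_loss([1.0, 0.5, 0.1], teacher, sampled_token=0, alpha=0.5))
```

Setting `alpha=1.0` recovers hard-only KD and `alpha=0.0` recovers soft-only KD, so the two baselines described in the summary are the endpoints of this one knob.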

Key quotes

Despite soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results.

Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias, the mismatch between training and inference distributions.

We introduce the Bridge-Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible.

Hard-only KD excels in Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens.

Our approach outperforms divergence-based and on-policy KD baselines while reducing training cost by 9.7x, enabling efficient model compression.
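The quotes describe Bridges and Gardens but do not say how a training step is assigned to one or the other. As a purely illustrative assumption, the sketch below uses the teacher's normalized entropy as a proxy: a sharply peaked teacher distribution is treated as Bridge-like (lean on the hard label) and a flat one as Garden-like (lean on the soft label). The `hard_weight` function and this entropy criterion are hypothetical and may differ from what the paper actually does.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hard_weight(teacher_probs):
    """Hypothetical per-step weight for the hard-label term.

    Returns a value in [0, 1]:
      close to 1 -> teacher is near one-hot (Bridge-like step)
      close to 0 -> teacher is near uniform (Garden-like step)
    """
    max_entropy = math.log(len(teacher_probs))
    if max_entropy == 0:
        # Single-token vocabulary: there is only one possible next token.
        return 1.0
    return 1.0 - entropy(teacher_probs) / max_entropy

# A step where one token dominates versus a step with many plausible tokens.
print(hard_weight([0.98, 0.01, 0.01]))  # closer to 1
print(hard_weight([0.4, 0.3, 0.3]))     # closer to 0
```

A weight like this could be plugged in wherever a fixed hard/soft mixing coefficient would otherwise be used, so that the balance shifts step by step rather than staying constant.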

Snippet from the RSS feed

Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or the teacher's full next-token distribution (soft label
