DOPD: A Dual On-policy Distillation Method to Address Privilege Illusion in LLM and VLM Training
[Submitted on 29 Jun 2026]
Summary
This paper introduces DOPD (Dual On-policy Distillation), an on-policy distillation method for large language models (LLMs) and vision-language models (VLMs). The authors identify a failure mode they call "privilege illusion": when privileged information is injected into the teacher or the student, training conflates the transferable capability gap the student is meant to close with an information asymmetry gap that can only be mimicked, never replicated. DOPD addresses this by dynamically routing token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. The authors report that DOPD consistently outperforms vanilla OPD and other counterparts across LLM and VLM settings, with additional validation on stability, robustness, continual learning, and out-of-distribution tasks.
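As background for the vanilla OPD baseline, here is a minimal sketch of a dense token-level distillation signal computed on a sequence the student sampled. The summary does not state the paper's objective, so the per-position reverse KL divergence used here is an assumption (it is one common choice for on-policy distillation). The function names and toy logits are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is the distillation divergence; the summary
    does not say which divergence the paper uses.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(
        s * (math.log(s) - math.log(t))
        for s, t in zip(p_s, p_t)
        if s > 0.0
    )


def opd_token_losses(student_logits_seq, teacher_logits_seq):
    """One loss value per position of a student-sampled sequence."""
    return [
        reverse_kl(s, t)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]


# Toy example: two positions, vocabulary of three tokens.
student = [[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]]
teacher = [[1.0, 2.0, 0.5], [3.0, 0.0, 0.0]]
losses = opd_token_losses(student, teacher)
```

In this toy case the first position contributes no loss because the two distributions match, while the second position is penalized. This per-position signal is what makes the supervision dense rather than a single sequence-level reward.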
Key quotes
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals.
This additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated.
This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals.
We propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities.
Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts.
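The quoted description of routing token-level supervision can be illustrated with a toy sketch. The quotes do not give DOPD's actual routing rule, so everything below is hypothetical: the `TokenSignals` fields, the `gap_threshold` parameter, and the decision rule itself are placeholders chosen only to show what choosing a supervision source per token from an advantage gap and relative probabilities could look like.

```python
from dataclasses import dataclass


@dataclass
class TokenSignals:
    """Per-token inputs to the routing decision (all names hypothetical)."""
    teacher_logprob: float       # log-prob of the sampled token under the privileged teacher
    priv_student_logprob: float  # log-prob under the privileged student
    advantage_gap: float         # caller-supplied scalar; not defined in the quotes


def route_token(sig, gap_threshold=0.0):
    """Pick a supervision source for one token.

    Hypothetical rule, NOT the paper's: use the privileged teacher only when
    the advantage gap exceeds a threshold and the teacher assigns the token
    at least as much probability as the privileged student does.
    """
    if (sig.advantage_gap > gap_threshold
            and sig.teacher_logprob >= sig.priv_student_logprob):
        return "teacher"
    return "student"


def route_sequence(signals, gap_threshold=0.0):
    """Apply the per-token rule across a student-sampled sequence."""
    return [route_token(s, gap_threshold) for s in signals]


routes = route_sequence([
    TokenSignals(teacher_logprob=-0.1, priv_student_logprob=-2.0, advantage_gap=0.5),
    TokenSignals(teacher_logprob=-3.0, priv_student_logprob=-0.2, advantage_gap=0.5),
    TokenSignals(teacher_logprob=-0.1, priv_student_logprob=-2.0, advantage_gap=-0.5),
])
```

The point of the sketch is structural: the decision is made per token rather than per sequence, which fits the quoted observation that only a small subset of tokens carries pivotal signals. The paper itself should be consulted for the real criterion.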