DOPD: A Dual On-policy Distillation Method to Address Privilege Illusion in LLM and VLM Training

[Submitted on 29 Jun 2026]

Tags: technology, science, machine learning, natural language processing

Summary

This paper introduces DOPD (Dual On-policy Distillation), an on-policy distillation method for large language models (LLMs) and vision-language models (VLMs). The authors identify a failure mode they call "privilege illusion": when the teacher or student receives privileged information as additional input, the supervision conflates the transferable capability gap, which the student is meant to close, with the information asymmetry gap, which can only be mimicked. DOPD addresses this by dynamically routing token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. The authors report that DOPD consistently outperforms vanilla OPD and other counterparts across LLM and VLM settings, with additional validation on stability, robustness, continual learning, and out-of-distribution tasks.
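To make the idea of token-level routing concrete, here is a minimal Python sketch. It is an illustration only and not the paper's algorithm: the summary does not state the actual routing rule, so the function names, the threshold `gap_threshold`, and the rule itself (use the privileged teacher only when it assigns a token clearly higher log-probability than the privileged student) are all assumptions. The per-token signal, the target log-probability minus the student's, is a common single-sample form used in on-policy distillation and is likewise assumed here.

```python
def route_token(logp_teacher, logp_priv_student, gap_threshold=0.5):
    """Pick a supervision source for one student-sampled token.

    Hypothetical rule (not from the paper): if the privileged teacher
    prefers the token by more than `gap_threshold` nats over the
    privileged student, treat the difference as a capability gap and
    supervise with the teacher; otherwise use the privileged student.
    """
    gap = logp_teacher - logp_priv_student
    return "teacher" if gap > gap_threshold else "priv_student"


def routed_token_rewards(logp_student, logp_teacher, logp_priv_student,
                         gap_threshold=0.5):
    """Return (sources, rewards) for a student-sampled trajectory.

    Each argument is a list of log-probabilities of the sampled tokens
    under one policy. The reward for a token is
    log p_target(token) - log p_student(token), an assumed single-sample
    distillation signal that is positive when the chosen target likes
    the token more than the student does.
    """
    sources, rewards = [], []
    for ls, lt, lp in zip(logp_student, logp_teacher, logp_priv_student):
        source = route_token(lt, lp, gap_threshold)
        target = lt if source == "teacher" else lp
        sources.append(source)
        rewards.append(target - ls)
    return sources, rewards


if __name__ == "__main__":
    # Toy log-probabilities for three sampled tokens (made-up numbers).
    sources, rewards = routed_token_rewards(
        logp_student=[-2.0, -1.0, -3.0],
        logp_teacher=[-0.5, -0.9, -0.2],
        logp_priv_student=[-1.5, -0.8, -0.3],
    )
    for s, r in zip(sources, rewards):
        print(f"{s}: {r:.2f}")
```

In this toy run only the first token is routed to the teacher, because it is the only one where the teacher's log-probability exceeds the privileged student's by more than the threshold. A real implementation would operate on batched tensors and would follow the routing criteria defined in the paper.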

Source

DOPD: A Dual On-policy Distillation Method to Address Privilege Illusion in LLM and VLM Training (arxiv.org)

Key quotes

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals.

This additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated.

This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals.

We propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities.

Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts.

Snippet from the RSS feed

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuiti

