RLCSD: A Contrastive Self-Distillation Method to Fix Style Drift in Reasoning Models
[Submitted on 10 Jun 2026]
Summary
This paper introduces RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), a method that addresses a pathology the authors call "privilege-induced style drift" in on-policy self-distillation (OPSD) for reasoning models. The authors show that OPSD's learning signal concentrates on style tokens rather than task-bearing ones, because the hinted model tends to produce shorter, more direct outputs; this destabilizes training or causes response length to shrink. RLCSD mitigates the drift by contrasting the teacher-student gap under a correct hint against the gap under a wrong hint, which suppresses style shifts and focuses the signal on task-bearing tokens. In experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning, RLCSD consistently outperforms GRPO and prior OPSD methods. The contrastive principle also plugs into existing OPSD methods and extends to cross-model distillation settings.
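The contrast described above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's objective: it assumes the "gap" is a per-token difference of log-probabilities and that "contrasting" means subtracting the wrong-hint gap from the correct-hint gap. The function names (`token_gap`, `contrastive_gap`) and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List


def token_gap(teacher_logprobs: List[float], student_logprobs: List[float]) -> List[float]:
    """Per-token teacher-minus-student log-probability gap (assumed form of the gap)."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student sequences must have the same length")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def contrastive_gap(
    student_logprobs: List[float],
    teacher_correct_hint: List[float],
    teacher_wrong_hint: List[float],
) -> List[float]:
    """Gap under the correct hint minus gap under the wrong hint, token by token.

    Assumption: a plain subtraction stands in for the paper's contrast.
    Shifts that any hint causes (e.g. a more direct style) appear in both
    gaps and cancel; shifts that depend on the hint being correct remain.
    """
    gap_correct = token_gap(teacher_correct_hint, student_logprobs)
    gap_wrong = token_gap(teacher_wrong_hint, student_logprobs)
    return [c - w for c, w in zip(gap_correct, gap_wrong)]


# Made-up log-probabilities for the sampled tokens "So", "x", "=", "4".
tokens = ["So", "x", "=", "4"]
student = [-2.0, -1.0, -0.5, -1.5]
teacher_correct = [-0.5, -1.0, -0.5, -0.25]  # hinted teacher, correct hint
teacher_wrong = [-0.5, -1.0, -0.5, -3.0]     # hinted teacher, wrong hint

plain = token_gap(teacher_correct, student)
contrast = contrastive_gap(student, teacher_correct, teacher_wrong)

for tok, p, c in zip(tokens, plain, contrast):
    print(f"{tok!r}: plain gap = {p:+.2f}, contrastive gap = {c:+.2f}")
```

In this toy case the plain gap puts its largest weight (+1.50) on the style token "So", which both hinted teachers prefer equally. After the contrast that weight drops to 0.00, and only the answer token "4" keeps a nonzero signal (+2.75).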
Key quotes
We show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs.
We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink.
RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation) mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint.
Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods.
We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them.
