RLCSD: A Contrastive Self-Distillation Method to Fix Style Drift in Reasoning Models

[Submitted on 10 Jun 2026]

Summary

This paper introduces RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), a method that addresses a pathology the authors call "privilege-induced style drift" in on-policy self-distillation (OPSD) for reasoning models. OPSD aligns a model's own distribution with the distribution it produces under privileged context, typically a verified solution. The authors show that the resulting learning signal concentrates on style tokens rather than task-bearing ones, because the hinted model tends to produce shorter, more direct outputs; this destabilizes training or causes response length to shrink. RLCSD mitigates the drift by contrasting the teacher-student gap under a correct hint against the gap under a wrong hint, which suppresses style shifts and focuses the signal on task-bearing tokens. In experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning, the authors report that RLCSD consistently outperforms GRPO and prior OPSD methods. They also report that the contrastive principle generalizes to other OPSD methods and to cross-model distillation settings.
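
To make the contrast concrete, here is a minimal toy sketch in plain Python. It is not the paper's objective: the function names, the three-token vocabulary, the logits, and the simple subtraction of the two gaps are all assumptions made for illustration. It assumes that a style shift shows up under both the correct and the wrong hint, while a task-bearing token is favored only under the correct hint.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def token_gap(teacher_logits, student_logits, token_id):
    """Teacher-minus-student log-probability of one sampled token."""
    teacher_lp = log_softmax(teacher_logits)[token_id]
    student_lp = log_softmax(student_logits)[token_id]
    return teacher_lp - student_lp


def plain_gaps(student, teacher, tokens):
    """Per-token gap against a single hinted teacher (hypothetical helper)."""
    return [token_gap(t, s, tok) for s, t, tok in zip(student, teacher, tokens)]


def contrastive_gaps(student, teacher_correct, teacher_wrong, tokens):
    """Gap under the correct hint minus gap under the wrong hint (assumed form)."""
    good = plain_gaps(student, teacher_correct, tokens)
    bad = plain_gaps(student, teacher_wrong, tokens)
    return [g - b for g, b in zip(good, bad)]


# Made-up data: two positions of one sampled response, vocabulary of size 3.
tokens = [0, 1]  # sampled token ids at positions 0 and 1
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Position 0 plays the role of a style token: both hints shift it the same way.
# Position 1 plays the role of a task token: only the correct hint favors it.
teacher_correct = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
teacher_wrong = [[2.0, 0.0, 0.0], [0.0, -3.0, 0.0]]

plain = plain_gaps(student, teacher_correct, tokens)
contrast = contrastive_gaps(student, teacher_correct, teacher_wrong, tokens)

print("plain gap per token:      ", plain)
print("contrastive gap per token:", contrast)
```

In this toy data the plain gap is positive at both positions, so a style token receives signal just like a task token. The contrast is zero at the style position and stays positive at the task position. Note that with this naive subtraction the student term cancels algebraically; the actual RLCSD objective may combine or weight the two gaps differently.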

Key quotes

We show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs.
We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink.
RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation) mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint.
Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods.
We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them.
Snippet from the RSS feed
On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the lea
