RLCSD: A Contrastive Self-Distillation Method to Fix Style Drift in Reasoning Models

[Submitted on 10 Jun 2026]

Summary

This paper introduces RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), a method that addresses a pathology the authors call "privilege-induced style drift" in on-policy self-distillation (OPSD) for reasoning models. The authors show that OPSD's learning signal concentrates on style tokens rather than task-bearing ones, because the hinted model tends to produce shorter, more direct outputs; the result is unstable training or shrinking response length. RLCSD mitigates the drift by contrasting the teacher-student gap under a correct hint against the gap under a wrong hint, which suppresses style shifts and focuses the signal on task-bearing tokens. In experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning, RLCSD consistently outperforms GRPO and prior OPSD methods. The contrastive principle also generalizes: it plugs into existing OPSD methods and extends to cross-model distillation settings.
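To make the contrastive idea concrete, here is a minimal sketch in plain Python. It is not the paper's objective, which this summary does not spell out. It assumes the teacher-student gap is measured as a log-probability difference on each sampled token, and the names `token_gap` and `contrastive_signal` are hypothetical. The toy logits are chosen so that position 0 behaves like a style token (both hints shift it identically) and position 1 like a task-bearing token (only the correct hint favors it).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def token_gap(teacher_logits, student_logits, token_id):
    """Assumed gap: hinted-teacher log-prob minus unhinted-student log-prob
    on the token the student actually sampled."""
    return (log_softmax(teacher_logits)[token_id]
            - log_softmax(student_logits)[token_id])

def contrastive_signal(student, correct_teacher, wrong_teacher, tokens):
    """Per-token gap under the correct hint minus the gap under the wrong hint.
    A shift that both hints induce equally cancels out."""
    signals = []
    for s, c, w, tok in zip(student, correct_teacher, wrong_teacher, tokens):
        signals.append(token_gap(c, s, tok) - token_gap(w, s, tok))
    return signals

# Toy data (illustrative only): vocabulary of 3 tokens, 2 positions.
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
correct = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]   # logits under the correct hint
wrong   = [[2.0, 0.0, 0.0], [0.0, -2.0, 0.0]]  # logits under the wrong hint
tokens  = [0, 1]                                # tokens the student sampled

plain_gap = [token_gap(c, s, t) for c, s, t in zip(correct, student, tokens)]
signals = contrastive_signal(student, correct, wrong, tokens)
print(plain_gap)  # both positions receive a positive signal
print(signals)    # only the task-bearing position keeps a signal
```

In this toy setup the plain correct-hint gap rewards both positions, while the contrast cancels the shift that any hint induces and leaves a signal only where the correct and wrong hints disagree.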

Key quotes

We show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs.

We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink.

RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation) mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint.

Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods.

We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them.

Snippet from the RSS feed

On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the lea
