
Self-Distillation Fine-Tuning (SDFT): A Method for Continual Learning from Demonstrations

By

teleforce

15d ago · 2 min read · en · Insight

Summary

This paper introduces Self-Distillation Fine-Tuning (SDFT), a method for continual learning that enables on-policy learning directly from expert demonstrations without requiring explicit reward functions. SDFT uses a demonstration-conditioned model as its own teacher to generate on-policy training signals, preserving prior capabilities while acquiring new skills. The method consistently outperforms supervised fine-tuning (SFT) across skill learning and knowledge acquisition tasks, achieving higher new-task accuracy while substantially reducing catastrophic forgetting in sequential learning experiments.
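The summary does not spell out the loss or the training loop, so the following is only a toy sketch of the idea, not the paper's implementation. It assumes the training signal is a reverse KL divergence from the student (the model without the demonstration) to the teacher (the same parameters with the demonstration in context), and it fakes in-context learning with a fixed logit bonus. All names here (`VOCAB`, `model_probs`, `icl_strength`, `sdft_step`, `train`) are hypothetical. Because the vocabulary has only four tokens, the expectation over the student's own outputs is computed exactly; a real system would presumably sample sequences from the student instead.

```python
import math

VOCAB = ["A", "B", "C", "D"]  # hypothetical toy vocabulary


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def model_probs(logits, demo=None, icl_strength=2.0):
    """One set of parameters serves as both student and teacher.

    Assumption: seeing a demonstration in context simply adds a fixed bonus
    to the demonstrated token's logit (a stand-in for in-context learning).
    """
    shifted = list(logits)
    if demo is not None:
        shifted[VOCAB.index(demo)] += icl_strength
    return softmax(shifted)


def sdft_step(logits, demo, lr=0.5):
    """One gradient step on KL(student || teacher), teacher held fixed."""
    student = model_probs(logits)             # no demonstration in context
    teacher = model_probs(logits, demo=demo)  # same weights, demo in context
    log_ratio = [math.log(s / t) for s, t in zip(student, teacher)]
    # Expectation under the student's own distribution (the on-policy part).
    kl = sum(s * r for s, r in zip(student, log_ratio))
    # Exact gradient of the reverse KL with respect to the student logits.
    grad = [s * (r - kl) for s, r in zip(student, log_ratio)]
    new_logits = [z - lr * g for z, g in zip(logits, grad)]
    return new_logits, kl


def train(demo="C", steps=200):
    """Repeatedly distill the demo-conditioned model into the plain model."""
    logits = [0.0] * len(VOCAB)
    for _ in range(steps):
        logits, _ = sdft_step(logits, demo)
    return logits
```

Calling `model_probs(train())` returns the student's distribution after distillation, with most of the probability mass on the demonstrated token even though no demonstration is in context. The sketch says nothing about forgetting; it only illustrates how a model can supply its own training target from a demonstration.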

Key quotes

We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations.
SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills.
Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting.
In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression.
SDFT establishes on-policy distillation as a practical path to continual learning from demonstrations.
Snippet from the RSS feed
Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit rewa
