
Supervised Fine-Tuning as Reinforcement Learning: Introducing Importance-Weighted SFT

GabrielBianconi

10mo ago · 2 min read · en · Insight


Summary

The article examines the connection between supervised fine-tuning (SFT) of large language models and reinforcement learning (RL), arguing that SFT on curated data can be viewed as maximizing a lower bound on the RL objective in a sparse reward setting. It introduces an importance-weighted variant of SFT (iw-SFT) that optimizes a tighter bound on the same objective and behaves more like RL training. The resulting variants are reported to be competitive with more advanced RL algorithms, both for large language models and for policies in continuous control tasks.
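The lower-bound claim can be made concrete with a standard argument. The sketch below is one common way to obtain such a bound and is not necessarily the article's exact derivation. The symbols are assumptions introduced here: $\pi_{\mathrm{ref}}$ is the policy that generated the data, $R(\tau) \in \{0, 1\}$ is a sparse success reward (so filtering keeps the trajectories with $R = 1$), and $q$ is an auxiliary distribution with the same support. Both steps use $x \ge 1 + \log x$ for $x > 0$.

```latex
% Plain SFT as a lower bound (constants do not depend on \theta)
\begin{align}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
           = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
               \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\, R(\tau)\right] \\
          &\ge \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
               R(\tau)\left(1 + \log \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\right)\right] \\
          &= \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\big[R(\tau)\,\log \pi_\theta(\tau)\big]
             + \mathrm{const}.
\end{align}

% Importance-weighted variant: insert q before applying the inequality
\begin{align}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
               \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)}\,
               \frac{\pi_\theta(\tau)}{q(\tau)}\, R(\tau)\right] \\
          &\ge \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
               \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)}\, R(\tau)\,\log \pi_\theta(\tau)\right]
             + \mathrm{const}.
\end{align}
```

With $R \in \{0, 1\}$, the first bound is, up to scaling, the log-likelihood of the filtered data, which is the SFT objective. The second bound weights each example by $q(\tau)/\pi_{\mathrm{ref}}(\tau)$; since $x \ge 1 + \log x$ holds with equality at $x = 1$, the bound becomes tighter as $q$ approaches $\pi_\theta$.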

Key quotes


Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models.

SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting.

A small modification to SFT leads to an importance weighted variant that behaves closer to training with RL.

The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks.
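To illustrate the "small modification" the quotes refer to, here is a minimal NumPy sketch that contrasts a plain SFT loss with an importance-weighted one. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of weights as q(tau) / pi_ref(tau), and the clipping range are all hypothetical, and the inputs are assumed to be per-sequence log-probabilities of curated examples.

```python
import numpy as np

def sft_loss(logp_theta):
    """Plain SFT: mean negative log-likelihood of the curated sequences."""
    return -float(np.mean(np.asarray(logp_theta, dtype=float)))

def iw_sft_loss(logp_theta, logp_q, logp_ref, clip=(0.0, 5.0)):
    """Importance-weighted SFT (hypothetical sketch).

    Each sequence is weighted by w = q(tau) / pi_ref(tau), computed from
    log-probabilities and clipped for stability. The weights are treated
    as constants, i.e. no gradient would flow through them.
    """
    logp_theta = np.asarray(logp_theta, dtype=float)
    log_w = np.asarray(logp_q, dtype=float) - np.asarray(logp_ref, dtype=float)
    w = np.clip(np.exp(log_w), clip[0], clip[1])  # assumed clipping range
    return -float(np.mean(w * logp_theta))

# Toy per-sequence log-probabilities (made-up numbers for illustration).
logp_theta = [-1.0, -2.0]
logp_ref = [-1.0, -2.0]
logp_q_same = [-1.0, -2.0]                 # q == pi_ref, so all weights are 1
logp_q_shift = [-1.0 + np.log(2.0), -2.0]  # q puts twice the mass on sequence 0

plain = sft_loss(logp_theta)
same = iw_sft_loss(logp_theta, logp_q_same, logp_ref)
shifted = iw_sft_loss(logp_theta, logp_q_shift, logp_ref)
```

When q equals the data-generating policy, every weight is 1 and the importance-weighted loss reduces to the plain SFT loss; otherwise, sequences that q favors contribute more to the update.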

Snippet from the RSS feed

Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models; as well as for imitation learning of control policies. Here, we draw on a connection between this successful strateg
