Supervised Fine-Tuning as Reinforcement Learning: Introducing Importance-Weighted SFT
By GabrielBianconi
10mo ago · 2 min read
Summary
The article explores the connection between supervised fine-tuning (SFT) of large language models and reinforcement learning (RL), arguing that SFT on curated data can be viewed as maximizing a lower bound on the RL objective in sparse reward settings. It introduces an importance-weighted variant of SFT (iw-SFT) that optimizes a tighter bound on the RL objective and improves performance. The resulting variants are reported to be competitive with more advanced RL algorithms, both for large language models and for continuous control tasks.
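The lower-bound claim can be made concrete with a short derivation. This is a sketch of one standard way to obtain such a bound, not necessarily the article's exact presentation. The notation is assumed for illustration: \tau is a trajectory (for example a sampled completion), R(\tau) is a non-negative sparse reward (1 for kept samples, 0 otherwise), \pi_{ref} is the policy that generated the data, p_\theta is the model being trained, and q is an auxiliary distribution. The only inequality used is x \ge 1 + \log x for x > 0, and \pi_{ref} is assumed to cover every trajectory that earns reward.

```latex
% SFT bound: importance-sample from \pi_{ref}, then apply x >= 1 + log x
% to the ratio x = p_\theta(\tau) / \pi_{ref}(\tau).
\begin{align}
J(\theta)
  &= \mathbb{E}_{\tau \sim p_\theta}\big[ R(\tau) \big]
   = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[ \frac{p_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\, R(\tau) \right] \\
  &\ge \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\big[ R(\tau)\,\big( 1 + \log p_\theta(\tau) - \log \pi_{\mathrm{ref}}(\tau) \big) \big] \\
  &= \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\big[ R(\tau)\, \log p_\theta(\tau) \big] + \mathrm{const}.
\end{align}
% With R(\tau) \in \{0, 1\}, the first term is (up to a positive factor) the
% log-likelihood of the kept samples, i.e. the SFT objective on filtered data.

% Importance-weighted bound: split the ratio through q and apply the same
% inequality to x = p_\theta(\tau) / q(\tau) only.
\begin{align}
J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[ \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)} \cdot \frac{p_\theta(\tau)}{q(\tau)}\, R(\tau) \right] \\
  &\ge \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[ \frac{q(\tau)}{\pi_{\mathrm{ref}}(\tau)}\, R(\tau)\,\big( 1 + \log p_\theta(\tau) - \log q(\tau) \big) \right].
\end{align}
% Setting q = \pi_{ref} recovers the SFT bound; the bound becomes tight as q
% approaches p_\theta, since x = 1 + log x exactly at x = 1.
```

With q held fixed, the only \theta-dependent term in the second bound is a log-likelihood weighted by q(\tau)/\pi_{ref}(\tau), which is why the variant is described as importance-weighted.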
Key quotes
Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models.
SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting.
A small modification to SFT leads to an importance weighted variant that behaves closer to training with RL.
The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks.
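To illustrate what "a small modification to SFT" could look like in practice, here is a minimal NumPy sketch that contrasts a plain SFT loss with an importance-weighted one. It is an assumption-laden illustration, not the authors' implementation: the function names, the use of per-sequence log-probabilities, the weight definition q(tau)/pi_ref(tau), and the clipping range are all hypothetical choices made for this example.

```python
import numpy as np


def sft_loss(logp_theta):
    """Plain SFT: mean negative log-likelihood of the curated sequences."""
    return -np.mean(np.asarray(logp_theta, dtype=float))


def iw_sft_loss(logp_theta, logp_q, logp_ref, clip_range=(0.0, 5.0)):
    """Importance-weighted SFT sketch (hypothetical helper).

    Each curated sequence is weighted by q(tau) / pi_ref(tau), computed from
    per-sequence log-probabilities. The weights are treated as constants with
    respect to the trained model. Clipping is an assumed variance-control
    choice, not something taken from the source.
    """
    logp_theta = np.asarray(logp_theta, dtype=float)
    log_w = np.asarray(logp_q, dtype=float) - np.asarray(logp_ref, dtype=float)
    weights = np.clip(np.exp(log_w), clip_range[0], clip_range[1])
    return -np.mean(weights * logp_theta)


# Toy per-sequence log-probabilities for two curated samples.
logp_theta = [-2.0, -4.0]
logp_ref = [-1.0, -3.0]

# When q equals the reference policy, every weight is 1 and the loss is plain SFT.
same_q = iw_sft_loss(logp_theta, logp_q=logp_ref, logp_ref=logp_ref)

# When q puts twice as much mass on the first sample, that sample counts double.
tilted_q = iw_sft_loss(
    logp_theta, logp_q=[-1.0 + np.log(2.0), -3.0], logp_ref=logp_ref
)
```

In a real training loop the log-probabilities would come from the model and the weights would be excluded from gradient computation; the sketch only shows how the weighting changes the scalar loss.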
Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models; as well as for imitation learning of control policies. Here, we draw on a connection between this successful strateg
