Study Reveals How RL and SFT Differently Teach Transformers Chain-of-Thought Reasoning on Sparse Boolean Functions
[Submitted on 22 Nov 2025 (v1), last revised 25 May 2026 (this version, v2)]
Summary
This paper analyzes how transformers learn Chain-of-Thought (CoT) reasoning through Reinforcement Learning (RL) with process rewards versus Supervised Fine-Tuning (SFT). The authors focus on k-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions, and they analyze the learning dynamics of both methods in a unified framework. They identify sufficient conditions for provable learning and verify them on the k-PARITY, k-AND, and k-OR functions. Key finding: RL learns the entire CoT chain simultaneously, while SFT learns it step by step. The paper thus offers theoretical insight into how RL and SFT differ in triggering CoT capabilities in transformers.
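To make the recursive decomposition concrete, here is a minimal Python sketch of how a k-sparse Boolean function can be evaluated as a chain of 2-input steps, with each intermediate value playing the role of one CoT step. It is an illustration only, not the paper's construction: the left-to-right chain order, the helper name `cot_chain`, and the example input are all assumptions, since the summary does not specify how the decomposition is laid out.

```python
from itertools import accumulate
from operator import and_, or_, xor
from typing import Callable, List, Sequence


def cot_chain(
    bits: Sequence[int],
    relevant: Sequence[int],
    gate: Callable[[int, int], int],
) -> List[int]:
    """Fold a fixed 2-input gate over the k relevant bits.

    Returns the k-1 intermediate values; the last one is the value of the
    k-sparse function. (Hypothetical helper, assumed chain-shaped order.)
    """
    selected = [bits[i] for i in relevant]
    # accumulate yields the first bit, then each running result; drop the first.
    return list(accumulate(selected, gate))[1:]


# Example input: 6 bits, of which only positions 0, 1 and 3 matter (k = 3).
x = [1, 0, 1, 1, 0, 1]
support = [0, 1, 3]

parity_steps = cot_chain(x, support, xor)  # k-PARITY
and_steps = cot_chain(x, support, and_)    # k-AND
or_steps = cot_chain(x, support, or_)      # k-OR

print(parity_steps, and_steps, or_steps)
```

In this picture, the summary's key finding is about the order in which the entries of such a chain are learned: all at once under RL, or one after another under SFT.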
Key quotes
RL learns the whole CoT chain simultaneously, whereas SFT naturally learns the CoT chain step by step.
We first analyze the learning dynamics of RL fine-tuning with process reward and SFT in a unified way.
Our findings provide insights on the mechanisms underlying RL and SFT and how they differ in triggering the CoT capabilities of transformers.
We consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions.
The comparison between RL and SFT may need to consider the reward design and the use of teacher forcing.
