Study Reveals How RL and SFT Differently Teach Transformers Chain-of-Thought Reasoning on Sparse Boolean Functions

[Submitted on 22 Nov 2025 (v1), last revised 25 May 2026 (this version, v2)]

Summary

The paper compares how transformers acquire Chain-of-Thought (CoT) reasoning under reinforcement learning (RL) with process rewards and under supervised fine-tuning (SFT). The authors study k-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions, and analyze the learning dynamics of both methods in a unified framework. They identify sufficient conditions under which learning is provable and verify those conditions for k-PARITY, k-AND, and k-OR. The key finding is that RL learns the whole CoT chain simultaneously, whereas SFT learns it step by step. The authors add that a comparison between RL and SFT may need to account for the reward design and the use of teacher forcing.
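To make the task family concrete, here is a minimal Python sketch of a k-sparse Boolean function evaluated as a chain of 2-input steps. It assumes a left-to-right fold over the relevant bits; the paper's exact decomposition and token format are not given in this summary, and the names `cot_chain`, `evaluate`, `support`, and `gate` are hypothetical, chosen only for illustration.

```python
from operator import and_, or_, xor
from typing import Callable, List, Sequence

Gate = Callable[[int, int], int]


def cot_chain(bits: Sequence[int], support: Sequence[int], gate: Gate) -> List[int]:
    """Fold a fixed 2-input gate over the k relevant bits.

    Returns the k-1 intermediate values; the last one is the function's output.
    Assumption: a left-to-right fold (the source does not specify the order).
    """
    if len(support) < 2:
        raise ValueError("need at least two relevant positions (k >= 2)")
    selected = [bits[i] for i in support]
    chain = []
    acc = selected[0]
    for b in selected[1:]:
        acc = gate(acc, b)   # one 2-sparse step of the chain
        chain.append(acc)
    return chain


def evaluate(bits: Sequence[int], support: Sequence[int], gate: Gate) -> int:
    """Value of the k-sparse function: the final entry of the chain."""
    return cot_chain(bits, support, gate)[-1]


# Hypothetical input: 6 bits, of which k = 4 positions are relevant.
bits = [1, 0, 1, 1, 0, 0]
support = [0, 2, 3, 5]

print(cot_chain(bits, support, xor))   # k-PARITY chain: [0, 1, 1]
print(cot_chain(bits, support, and_))  # k-AND chain:    [1, 1, 0]
print(cot_chain(bits, support, or_))   # k-OR chain:     [1, 1, 1]
```

In this picture, each entry of the chain is one reasoning step. A process reward would score every intermediate value, while the quoted finding concerns whether those steps are learned all at once (RL) or one after another (SFT).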

Key quotes

RL learns the whole CoT chain simultaneously, whereas SFT naturally learns the CoT chain step by step.
We first analyze the learning dynamics of RL fine-tuning with process reward and SFT in a unified way.
Our findings provide insights on the mechanisms underlying RL and SFT and how they differ in triggering the CoT capabilities of transformers.
We consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions.
The comparison between RL and SFT may need to consider the reward design and the use of teacher forcing.
Snippet from the RSS feed
Transformers can acquire Chain-of-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end. In this work, we specifically examine R
