Study Finds Single Transformer Layer Can Match Full-Parameter RL Training in LLMs
[Submitted on 1 Jul 2026]
Summary
This paper challenges the common assumption that reinforcement learning (RL) post-training for large language models (LLMs) requires updating all transformer layers uniformly. Through systematic layer-wise analysis across seven models from the Qwen3 and Qwen2.5 families, three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains, the authors find that training a single transformer layer can recover most of the gains of full-parameter RL training, and sometimes surpass them. They introduce a metric called "layer contribution" to quantify this effect. The results show that RL gains are highly concentrated in a small subset of layers, consistently in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
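As a rough illustration of what "training a single transformer layer" means in practice, the sketch below freezes every parameter except those in one chosen block. It is not the authors' code. The helper names `layer_index` and `freeze_all_but_layer` are hypothetical, and the sketch assumes a PyTorch-style model whose block parameters are named like `model.layers.<i>.…`, which may not match every checkpoint.

```python
import re

# Assumption: transformer blocks appear in parameter names as "layers.<index>.",
# e.g. "model.layers.12.self_attn.q_proj.weight". Adjust the pattern otherwise.
_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layer_index(param_name):
    """Return the block index encoded in a parameter name, or None if absent."""
    match = _LAYER_RE.search(param_name)
    return int(match.group(1)) if match else None


def freeze_all_but_layer(model, target_layer):
    """Leave gradients enabled only for parameters of block `target_layer`.

    `model` is any object exposing `named_parameters()` that yields
    (name, parameter) pairs whose parameters have a `requires_grad` attribute,
    as PyTorch modules do. Returns the names of the trainable parameters.
    """
    trainable = []
    for name, param in model.named_parameters():
        keep = layer_index(name) == target_layer
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    if not trainable:
        raise ValueError(f"no parameters found for layer {target_layer}")
    return trainable
```

With a PyTorch model, one would then hand only the parameters that still have `requires_grad` set to the optimizer, so embeddings, the output head, and all other blocks stay at their pretrained values during RL.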
Key quotes
Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it.
RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers.
High-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less.
The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
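The quotes above do not spell out how "layer contribution" is computed or how rankings are compared, so the following is only one plausible reading, not the paper's definition. It assumes a contribution score equal to the single-layer gain divided by the full-parameter gain (a value above 1 would mean the single layer surpassed full training), and it compares two per-layer score lists with Spearman rank correlation. The function names are hypothetical.

```python
def layer_contribution(base_score, single_layer_score, full_score):
    """Assumed definition: fraction of the full-parameter gain that a
    single-layer run recovers. Not taken from the paper."""
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter training produced no gain")
    return (single_layer_score - base_score) / full_gain


def ranks(values):
    """Rank positions with 1 = largest value. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two equal-length, tie-free score lists."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(scores_a), ranks(scores_b)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Under this reading, a correlation near 1 between the per-layer scores measured on two different datasets or algorithms would correspond to the "strongly correlated" rankings the authors describe.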
You might also wanna read
Study Reveals How RL and SFT Differently Teach Transformers Chain-of-Thought Reasoning on Sparse Boolean Functions
This research paper analyzes how transformers learn Chain-of-Thought (CoT) reasoning capabilities through Reinforcement Learning (RL) with p
ConSPO: A Contrastive Approach to Improving Reinforcement Learning with Verifiable Rewards for LLMs
This paper analyzes Group Relative Policy Optimization (GRPO), a widely used RLVR algorithm for post-training large language models on reaso
Study Finds Larger Language Models Delay But Don't Prevent Plasticity Loss During Training
This research paper investigates whether loss of plasticity (the inability of a neural network to learn new information after training on ol
AgentGym-RL: A Reinforcement Learning Framework for Training LLM Agents in Multi-Turn Decision Making
This paper introduces AgentGym-RL, a unified reinforcement learning framework for training LLM agents to perform multi-turn interactive deci
New Framework Formalizes Learning from Language Feedback with Provable Performance Guarantees
This paper formalizes the Learning from Language Feedback (LLF) problem, providing a principled framework for interactive learning using lan
SPIRAL: A Reinforcement Learning Framework for Multi-Primitive Language Model Reasoning
This paper introduces SPIRAL (Sequential-Parallel-Aggregative Reinforcement Learning), a framework that trains language models to use three
