Study Finds Single Transformer Layer Can Match Full-Parameter RL Training in LLMs

[Submitted on 1 Jul 2026]

Summary

The paper challenges the common assumption that reinforcement learning (RL) post-training for large language models (LLMs) requires updating all transformer layers uniformly. Through a systematic layer-wise analysis across seven models from the Qwen3 and Qwen2.5 families, three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains, the authors find that training a single transformer layer can recover most of the gains of full-parameter RL training, and in some cases surpass them. They introduce a metric called "layer contribution" to quantify how much of the gain each layer accounts for. The results show that RL gains are highly concentrated in a small subset of layers, typically in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
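To make the single-layer setup concrete, here is a minimal sketch in plain Python. It is not the authors' code, and the summary does not give the paper's exact definition of "layer contribution". As an assumption, the sketch measures contribution as the fraction of the full-parameter gain that single-layer training recovers. The helper names `trainable_names` and `layer_contribution`, the parameter-name pattern `.layers.<index>.`, and the scores in the usage note are all hypothetical.

```python
import re


def trainable_names(param_names, layer_idx, pattern=r"\.layers\.(\d+)\."):
    """Return the parameter names that belong to transformer block `layer_idx`.

    Assumes (hypothetically) that block parameters are named like
    'model.layers.12.mlp.up_proj.weight'. Everything else (embeddings,
    output head, other blocks) is left out, i.e. would stay frozen.
    """
    rx = re.compile(pattern)
    selected = []
    for name in param_names:
        match = rx.search(name)
        if match and int(match.group(1)) == layer_idx:
            selected.append(name)
    return selected


def layer_contribution(base_score, single_layer_score, full_score):
    """Fraction of the full-parameter RL gain recovered by training one layer.

    This ratio is an assumed stand-in for the paper's metric.
    A value near 1.0 means the single layer recovers almost all of the gain;
    a value above 1.0 means it surpasses full-parameter training.
    """
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter training produced no gain to compare against")
    return (single_layer_score - base_score) / full_gain
```

With made-up scores, `layer_contribution(0.40, 0.58, 0.60)` is about 0.9, meaning the single layer recovers roughly 90% of the full-parameter gain. In a PyTorch training script, the names returned by `trainable_names` could be used to call `requires_grad_(False)` on every other parameter before building the optimizer.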

Source

Hacker News: Study Finds Single Transformer Layer Can Match Full-Parameter RL Training in LLMs (arxiv.org)

Key quotes

Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it.

RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers.

High-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less.

The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
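One common way to check whether two layer rankings are "strongly correlated" is Spearman rank correlation. The quotes do not say which statistic the authors use, so the sketch below is an assumption, written in plain Python. The function names are hypothetical, and the inputs are per-layer contribution scores from two settings (for example, two datasets).

```python
def ranks(values):
    """Assign 1-based ranks to values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(scores_a, scores_b):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    return pearson(ranks(scores_a), ranks(scores_b))
```

A value close to 1 would indicate that the same layers rank highest in both settings, while a value near 0 would indicate unrelated rankings.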

Snippet from the RSS feed

Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters
