Study Finds Larger Language Models Delay But Don't Prevent Plasticity Loss During Training
[Submitted on 23 Jun 2026]
Summary
This research paper investigates whether loss of plasticity (the inability of a neural network to learn new information after training on older data) remains a problem in modern Transformer-based large language models. The authors train GPT-style Transformers (5M to 314M parameters) on a multilingual continual learning problem and find evidence of plasticity loss at every model size, measured as deteriorating performance on a held-out Vietnamese probing task. The onset of plasticity loss follows a predictable scaling law and grows sublinearly with model size, suggesting that larger models delay but do not prevent the phenomenon. Plasticity loss also appeared under stationary (non-continual) training, challenging the view that it is exclusive to continual learning scenarios.
Key quotes
The loss of plasticity - the ability of a network to learn new information after having already learned older information - is a fundamental challenge in creating artificial neural networks capable of continual learning.
These results suggest that larger models may delay the measurable effects of plasticity loss, but that increasing parameter count alone is likely to be insufficient to completely prevent it.
We also find evidence of plasticity loss under stationary multilingual training, challenging the view that the phenomenon is exclusive to continual learning with abrupt task changes.
Overall, our results suggest that even large Transformer language models trained on natural-language will eventually lose the ability to efficiently adapt to new data after sufficiently long training, in both continual and stationary settings.