Study Shows Weight Decay During Pretraining Improves Language Model Adaptability After Fine-Tuning

[Submitted on 11 Feb 2026 (v1), last revised 28 May 2026 (this version, v2)]

Summary

This research paper investigates how weight decay during pretraining of large language models affects their downstream adaptability (plasticity). Through systematic experiments, the authors demonstrate that larger weight decay increases model plasticity, leading to larger performance gains after fine-tuning, even when the base model has worse pretraining loss. The result is a counterintuitive trade-off: a base model that performs worse after pretraining can perform better after further training. The mechanistic analysis indicates that weight decay encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. The findings challenge the use of cross-entropy loss as the sole metric for hyperparameter optimization and highlight the importance of considering downstream adaptability during pretraining.
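For readers unfamiliar with the hyperparameter under study, the following is a minimal sketch of what weight decay does in a single optimizer step. It assumes plain SGD with decoupled weight decay; the summary does not say which optimizer or decay formulation the authors used, and the helper names `sgd_step_with_decay` and `l2_norm` are hypothetical, introduced only for illustration.

```python
def sgd_step_with_decay(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay (hypothetical helper).

    Assumption: decay is applied directly to the weights, separately
    from the gradient term. Each weight shrinks toward zero by
    lr * weight_decay * w, then moves along the negative gradient.
    """
    return [w - lr * weight_decay * w - lr * g for w, g in zip(weights, grads)]


def l2_norm(weights):
    """Euclidean norm of a flat list of weights."""
    return sum(w * w for w in weights) ** 0.5


# Toy illustration: with zero gradients, only the decay term acts,
# so any change in the weights comes from weight decay alone.
weights = [1.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]

no_decay = weights
with_decay = weights
for _ in range(100):
    no_decay = sgd_step_with_decay(no_decay, zero_grads, lr=0.1, weight_decay=0.0)
    with_decay = sgd_step_with_decay(with_decay, zero_grads, lr=0.1, weight_decay=0.1)

print(l2_norm(no_decay), l2_norm(with_decay))
```

In this toy setting each decayed step multiplies every weight by `1 - lr * weight_decay`, so a larger decay coefficient yields a smaller weight norm. The sketch only shows the mechanics of the update; it does not reproduce the plasticity effect reported in the paper.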

Key quotes

Weight decay increases the plasticity of the pretrained model, resulting in greater performance gains downstream after fine-tuning.

This effect can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after further training.

Weight decay encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data.

These findings highlight the importance of pretrained model plasticity, the limits of using cross-entropy loss as the sole metric for hyperparameter optimization.

Snippet from the RSS feed

Large language models are typically trained in two broad phases: pretraining to produce a base model, followed by further training to improve downstream performance. However, hyperparameter optimization and scaling laws are studied primarily from the pers
