Dispersion loss counteracts embedding condensation to improve small language model generalization
By
Chen Liu*,
Summary
This paper introduces an observation-driven improvement for language model training. The authors identify a geometric phenomenon called "embedding condensation," where token embeddings collapse into a narrow cone-like subspace in smaller language models. To counteract this, they design a training technique called "dispersion loss" (LM-Dispersion), which improves generalization in small language models. The work was presented at ICML 2026.
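The "narrow cone" picture can be made concrete with a small numeric sketch. The summary does not give the paper's actual formulation, so everything below is an assumption for illustration: `pairwise_cosines` measures how condensed a set of embeddings is, and `dispersion_penalty` is a generic log-mean-exp term over pairwise cosine similarities (with a hypothetical temperature `tau`), not the authors' LM-Dispersion loss.

```python
import math
import random


def pairwise_cosines(embeddings):
    """Cosine similarity for every distinct pair of embedding vectors."""
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])
    return [
        sum(a * b for a, b in zip(unit[i], unit[j]))
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
    ]


def mean_cosine(embeddings):
    """Diagnostic: values near 1 mean the vectors sit in a narrow cone."""
    sims = pairwise_cosines(embeddings)
    return sum(sims) / len(sims)


def dispersion_penalty(embeddings, tau=0.5):
    """Illustrative regularizer (an assumption, not the paper's loss):
    log of the mean of exp(cosine / tau). It grows as vectors align."""
    sims = pairwise_cosines(embeddings)
    return math.log(sum(math.exp(s / tau) for s in sims) / len(sims))


random.seed(0)
dim, count = 16, 32

# Spread-out embeddings: independent Gaussian directions.
spread = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

# "Condensed" embeddings: the same noise plus a large shared offset,
# which pushes every vector toward one common direction.
condensed = [[x + 5.0 for x in vec] for vec in spread]

print("mean cosine (spread):   ", mean_cosine(spread))
print("mean cosine (condensed):", mean_cosine(condensed))
print("penalty (spread):   ", dispersion_penalty(spread))
print("penalty (condensed):", dispersion_penalty(condensed))
```

In a training setup, a term like this would presumably be added to the language modeling loss with a small weight, so that gradient descent pushes embeddings apart. How the paper actually defines and weights its loss is not stated in this summary.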
Source
Hacker News: Dispersion loss counteracts embedding condensation to improve small language model generalization (chenliu-1996.github.io)
Key quotes
This paper presents an observation-driven improvement on language model training.
We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models.
Dispersion loss counteracts embedding condensation and improves generalization in small language models (ICML 2026).
