Dispersion loss counteracts embedding condensation to improve small language model generalization

Chen Liu*,

technology, science, machine learning, natural language processing

Summary

This paper introduces an observation-driven improvement for language model training. The authors identify a geometric phenomenon called "embedding condensation," where token embeddings collapse into a narrow cone-like subspace in smaller language models. To counteract this, they design a training technique called "dispersion loss" (LM-Dispersion), which improves generalization in small language models. The work was presented at ICML 2026.
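
To make the geometry concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the summary above does not give the LM-Dispersion formula, so `dispersion_penalty` is a hypothetical, generic repulsion term (log of the mean exponentiated pairwise cosine similarity), and `mean_pairwise_cosine`, the `temperature` parameter, and the toy data are likewise assumptions chosen for illustration. The sketch only shows how a "narrow cone" can be measured and why a penalty on pairwise similarity would push against it.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pairwise_cosines(embeddings):
    """Cosine similarity for every distinct pair of embedding vectors."""
    unit = [normalize(v) for v in embeddings]
    return [
        sum(a * b for a, b in zip(unit[i], unit[j]))
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
    ]


def mean_pairwise_cosine(embeddings):
    """Near 1: vectors share a direction (a narrow cone). Near 0: spread out."""
    cosines = pairwise_cosines(embeddings)
    return sum(cosines) / len(cosines)


def dispersion_penalty(embeddings, temperature=0.5):
    """Hypothetical repulsion term (an assumption, not the paper's loss).

    Larger when vectors are crowded together, so adding it to a training
    objective would reward spreading the embeddings apart.
    """
    terms = [math.exp(c / temperature) for c in pairwise_cosines(embeddings)]
    return math.log(sum(terms) / len(terms))


# Toy data: a condensed set (shared direction plus small noise) and a
# spread-out set (independent Gaussian directions).
rng = random.Random(0)
dim, count = 16, 32
condensed = [[1.0 + rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(count)]
spread = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]

for name, vectors in [("condensed", condensed), ("spread", spread)]:
    print(
        name,
        "mean cosine = %.3f" % mean_pairwise_cosine(vectors),
        "penalty = %.3f" % dispersion_penalty(vectors),
    )
```

In this toy setup the condensed set has a mean pairwise cosine close to 1 and a noticeably higher penalty than the spread-out set, which is the direction of pressure a dispersion-style regularizer is meant to apply. How the authors actually define and weight their loss should be taken from the paper itself.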

Source

Hacker News: Dispersion loss counteracts embedding condensation to improve small language model generalization (chenliu-1996.github.io)

Key quotes

This paper presents an observation-driven improvement on language model training.

We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models.

Dispersion loss counteracts embedding condensation and improves generalization in small language models (ICML 2026).

You might also want to read

LK Losses: A New Training Objective to Optimize Acceptance Rate in Speculative Decoding for LLMs

This paper introduces LK losses, a novel training objective for speculative decoding in large language models (LLMs). Speculative decoding a

arxiv.org·1mo ago

Bridge-Garden Theory Explains Why Mixing Hard and Soft Labels Improves Knowledge Distillation for LLMs

This research paper investigates knowledge distillation (KD) for language models, specifically why mixing hard labels (sampled tokens) and s

arxiv.org·1mo ago

Study Finds Larger Language Models Delay But Don't Prevent Plasticity Loss During Training

This research paper investigates whether loss of plasticity (the inability of a neural network to learn new information after training on ol

arxiv.org·9d ago

SemDLM+: Improving Diffusion Language Models by Balancing Bias and Variance in Transition Kernel Design

This paper analyzes sensitivity in Diffusion Language Models (DLMs) through generalization error analysis, identifying three critical factor

arxiv.org·18d ago

Verbalized Sampling: A Training-Free Method to Mitigate Mode Collapse and Improve LLM Output Diversity

This paper identifies a fundamental data-level cause of mode collapse in LLM post-training alignment: typicality bias in preference data, wh

arxiv.org·8d ago

Three training-time interventions improve diffusion-based speculative decoding by 21-76%

This paper presents an empirical analysis of three training-time interventions to improve speculative decoding with diffusion language model

arxiv.org·9d ago
