Dispersion loss counteracts embedding condensation to improve small language model generalization

By

Chen Liu*,

1d ago · 5 min read

Summary

This paper introduces an observation-driven improvement to language model training. The authors identify a geometric phenomenon they call "embedding condensation," in which token embeddings collapse into a narrow, cone-like subspace in smaller language models. To counteract it, they design a training technique called "dispersion loss" (LM-Dispersion), which improves generalization in small language models. The paper is listed for ICML 2026.
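The summary does not give the formula for LM-Dispersion, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that condensation can be approximated by the mean pairwise cosine similarity of a set of embedding vectors, and it uses a hypothetical penalty (the log of the mean of exp(cos / temperature) over pairs) as a stand-in for a dispersion-style regularizer. The function names, the `temperature` parameter, and the toy data are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs.

    Values near 1 mean the vectors point in nearly the same direction
    (a narrow cone); values near 0 mean they are spread out.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(embeddings[i], embeddings[j])
    return total / (n * (n - 1) / 2)


def dispersion_penalty(embeddings, temperature=0.5):
    """Hypothetical regularizer: log of the mean exp(cos / temperature) over pairs.

    It grows as embeddings align, so minimizing it would push them apart.
    This is an illustrative stand-in, not the paper's LM-Dispersion loss.
    """
    n = len(embeddings)
    terms = [
        math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return math.log(sum(terms) / len(terms))


rng = random.Random(0)
dim, count = 16, 32

# Spread-out toy vectors: independent Gaussian coordinates.
dispersed = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]

# Condensed toy vectors: one shared direction plus small noise.
shared = [rng.gauss(0, 1) for _ in range(dim)]
condensed = [[s + 0.1 * rng.gauss(0, 1) for s in shared] for _ in range(count)]

print("dispersed:", mean_pairwise_cosine(dispersed), dispersion_penalty(dispersed))
print("condensed:", mean_pairwise_cosine(condensed), dispersion_penalty(condensed))
```

On this toy data the condensed set scores close to 1 on mean pairwise cosine and receives the larger penalty, which is the behavior a dispersion-style term would need in order to discourage collapse. In an actual training setup such a term would presumably be added to the language modeling loss with a weighting coefficient, but the summary does not specify how.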

Source

Hacker News: Dispersion loss counteracts embedding condensation to improve small language model generalization (chenliu-1996.github.io)

Key quotes

This paper presents an observation-driven improvement on language model training.
We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models.
Dispersion loss counteracts embedding condensation and improves generalization in small language models (ICML 2026).
