SemDLM+: Improving Diffusion Language Models by Balancing Bias and Variance in Transition Kernel Design

[Submitted on 13 Jun 2026]

Summary

This paper analyzes sensitivity in Diffusion Language Models (DLMs) through generalization error analysis, identifying three critical factors: asymptotic bias, exposure bias, and optimization variance. It compares masking diffusion (sparse, easier posterior approximation) with uniform diffusion (stronger sampling repair, harder approximation). The authors revisit Semantic DLM (SemDLM) as a middle ground that reduces posterior approximation difficulty while retaining repair ability, but identify a "semantic basin problem" causing low-diversity text. They propose SemDLM+, which adds a global transition and semantic-frequency penalty during sampling, achieving competitive language modeling and generation quality with satisfactory diversity on LM1B and OpenWebText benchmarks.
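To make the masking-versus-uniform contrast concrete, here is a minimal sketch of the two one-step transition kernels as row-stochastic matrices, in the standard discrete-diffusion form. The function names, the noise rate `beta`, and the choice of placing the [MASK] state at the last index are assumptions for illustration; the paper's exact parameterization is not given in this summary.

```python
import numpy as np


def masking_kernel(vocab_size: int, beta: float) -> np.ndarray:
    """One-step absorbing (masking) kernel.

    Assumption: the [MASK] state is appended as the last index, so the
    matrix has shape (vocab_size + 1, vocab_size + 1).
    Each token stays put with probability 1 - beta and jumps to [MASK]
    with probability beta; [MASK] never leaves itself.
    """
    k = vocab_size + 1
    q = (1.0 - beta) * np.eye(k)
    q[:, -1] += beta  # send beta mass from every state to [MASK]
    return q


def uniform_kernel(vocab_size: int, beta: float) -> np.ndarray:
    """One-step uniform kernel.

    Each token stays put with probability 1 - beta, and with probability
    beta is resampled uniformly over the whole vocabulary.
    """
    k = vocab_size
    return (1.0 - beta) * np.eye(k) + (beta / k) * np.ones((k, k))


if __name__ == "__main__":
    q_mask = masking_kernel(vocab_size=4, beta=0.3)
    q_unif = uniform_kernel(vocab_size=4, beta=0.3)
    # The masking kernel is sparse: each token row has two nonzero entries.
    print(np.count_nonzero(q_mask, axis=1))
    # The uniform kernel is dense: every token can reach every other token.
    print(np.count_nonzero(q_unif, axis=1))
```

The sparsity difference is what the summary alludes to: under masking a corrupted position can only be "unknown", which simplifies the posterior, while under uniform noise any token may have been swapped for any other, which makes the posterior harder to approximate but lets the sampler revise tokens it has already committed to.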

Key quotes

Diffusion Language Models (DLMs) have demonstrated strong scaling capacity as alternatives to autoregressive language models.

Our theory suggests that SemDLM can serve as a plausible middle ground by reducing the posterior approximation difficulty of uniform diffusion while retaining repair ability.

We find that SemDLM suffers from a semantic basin problem, where sampling repeatedly stays within a semantic region and produces low-diversity text.

To address this, we propose SemDLM+, which adds a global transition and a semantic-frequency penalty during sampling.

Experiments on LM1B and OpenWebText show that SemDLM+ improves training dynamics and achieves competitive language modeling and generation quality with satisfactory diversity.
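The quotes above name two ingredients of SemDLM+, a global transition and a semantic-frequency penalty, without giving their formulas. The sketch below is one plausible reading, not the authors' method: the global transition is modeled as a convex mix of a semantic kernel with a uniform kernel, and the penalty as a logit reduction proportional to how often each token's semantic cluster has already been sampled. The names `mix_with_global`, `penalized_probs`, `lam`, `alpha`, and the cluster bookkeeping are all hypothetical.

```python
import numpy as np


def mix_with_global(q_semantic: np.ndarray, lam: float) -> np.ndarray:
    """Hypothetical global transition: blend a semantic kernel with a
    uniform kernel so every token has some probability of leaving its
    semantic neighborhood. `lam` is an assumed mixing weight in [0, 1].
    """
    k = q_semantic.shape[0]
    q_global = np.full((k, k), 1.0 / k)
    return (1.0 - lam) * q_semantic + lam * q_global


def penalized_probs(
    logits: np.ndarray,
    cluster_of_token: np.ndarray,
    cluster_counts: np.ndarray,
    alpha: float,
) -> np.ndarray:
    """Hypothetical semantic-frequency penalty: subtract
    alpha * (times the token's cluster was already sampled) from each
    logit, then apply a numerically stable softmax.
    """
    penalized = logits - alpha * cluster_counts[cluster_of_token]
    penalized = penalized - penalized.max()
    probs = np.exp(penalized)
    return probs / probs.sum()


if __name__ == "__main__":
    # Toy semantic kernel: 4 tokens in 2 clusters, transitions stay in-cluster.
    q_sem = np.array(
        [
            [0.5, 0.5, 0.0, 0.0],
            [0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5],
            [0.0, 0.0, 0.5, 0.5],
        ]
    )
    print(mix_with_global(q_sem, lam=0.2))

    clusters = np.array([0, 0, 1, 1])
    counts = np.array([3, 0])  # cluster 0 has already been sampled 3 times
    print(penalized_probs(np.zeros(4), clusters, counts, alpha=1.0))
```

In this toy setup the unmixed kernel can never cross between clusters, which mirrors the semantic basin problem; the mix gives every cross-cluster transition nonzero probability, and the penalty shifts sampling mass toward the cluster that has been used less.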

Snippet from the RSS feed

Diffusion Language Models (DLMs) have demonstrated strong scaling capacity as alternatives to autoregressive language models. However, their performance is highly sensitive to the choice of transition kernels, and poorly designed kernels can lead to issue

You might also wanna read

Fast-dLLM: Training-Free Acceleration Method for Diffusion Language Models Using KV Cache and Parallel Decoding

Researchers introduce Fast-dLLM, a training-free acceleration method for diffusion-based large language models that addresses their slower i

arxiv.org·7mo ago

Consistency Diffusion Language Models Achieve 14x Faster Inference Through KV Caching and Step Reduction

Consistency Diffusion Language Models (CDLM) represent a breakthrough in language model architecture that addresses key limitations of stand

together.ai·3mo ago

Google's DiffusionGemma achieves 4x faster text generation using diffusion-based parallel token generation

DiffusionGemma is a new text generation model from Google that achieves up to 4x faster inference speeds compared to traditional autoregress

blog.google·5d ago

MMaDA-Parallel: Multimodal Diffusion Language Models for Thinking-Aware Generation and Editing

This article presents MMaDA-Parallel, a multimodal large diffusion language model for thinking-aware editing and generation. The research id

github.com·6mo ago

Zebra-Llama: Efficient Hybrid Language Models Combining SSMs and Attention Layers

Researchers propose Zebra-Llama, a family of hybrid language models (1B, 3B, 8B) that combine State Space Models (SSMs) and Multi-head Laten

arxiv.org·6mo ago

Exploring the Connection Between Text Diffusion Models and BERT's Masked Language Modeling

This article explores the connection between diffusion models for text generation and traditional masked language modeling (MLM) used in BER

nathan.rs·7mo ago