Three training-time interventions improve diffusion-based speculative decoding by 21-76%

[Submitted on 10 Jun 2026 (v1), last revised 22 Jun 2026 (this version, v2)]

Summary

This paper presents an empirical analysis of three training-time interventions for speculative decoding with diffusion language models as drafters. The core problem is a mismatch: block-diffusion drafters generate tokens bidirectionally within a block, while the autoregressive target model verifies tokens strictly left to right, so the symmetric training objective does not match the asymmetric verification reward. The three interventions are: (1) token positional weighting, (2) a first-error focal loss targeting the position that breaks the accepted prefix, and (3) a chain loss term that substitutes a differentiable surrogate for expected accepted length. They act along orthogonal axes and compose additively. Across four target models and six benchmarks (reasoning, code, dialogue), they raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, with no additional forward passes and no changes to the inference pipeline.
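The summary names the three loss terms without giving their formulas, so the following is only a rough sketch of what such terms could look like, not the paper's definitions. It assumes `p_correct[i]` is the probability the drafter assigns to the target model's token at block position `i`. The function names, the geometric `decay` weighting, the focal exponent `focus`, and the independence assumption behind the product form are all hypothetical choices made for illustration.

```python
import math

def position_weighted_ce(p_correct, decay=0.8):
    """Cross-entropy with geometrically decaying position weights.

    Assumed form: earlier positions count more because a miss there
    discards everything after it. Requires every p > 0.
    """
    weights = [decay ** i for i in range(len(p_correct))]
    total = sum(w * -math.log(p) for w, p in zip(weights, p_correct))
    return total / sum(weights)

def first_error_index(draft_tokens, target_tokens):
    """Index of the first draft token that differs from the target, or None."""
    for i, (d, t) in enumerate(zip(draft_tokens, target_tokens)):
        if d != t:
            return i
    return None

def first_error_focal(p_correct, draft_tokens, target_tokens, focus=2.0):
    """Focal-style penalty applied only at the position that breaks the prefix."""
    idx = first_error_index(draft_tokens, target_tokens)
    if idx is None:
        return 0.0  # whole block matches, nothing to penalize
    p = p_correct[idx]
    return -((1.0 - p) ** focus) * math.log(p)

def expected_accepted_length(p_correct):
    """Sum over k of the probability that the first k tokens are all accepted.

    Treats per-position acceptance as independent (an assumption), so the
    prefix probability is a running product.
    """
    total, running = 0.0, 1.0
    for p in p_correct:
        running *= p
        total += running
    return total

def chain_loss(p_correct):
    """Negative surrogate: minimizing it raises expected accepted length."""
    return -expected_accepted_length(p_correct)
```

Under these assumptions, `expected_accepted_length([0.9, 0.9, 0.9])` is roughly 2.44, while moving a single weak position to the front, as in `[0.1, 0.9, 0.9]`, drops it to well under 1. That illustrates the left-to-right asymmetry the summary describes: a low probability early in the block scales down every later term.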

Source

Three training-time interventions improve diffusion-based speculative decoding by 21-76% (arxiv.org)

Key quotes

Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model.
A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward.
The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined.
Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
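
The quotes above refer to left-to-right verification and a rejection-sampling exactness contract. As a minimal sketch of what that usually means, the code below applies the standard speculative-sampling accept/reject rule to one drafted block: accept each token with probability min(1, p/q), and on the first rejection resample from the normalized residual max(p - q, 0). This is the commonly described rule, not code from the paper. `verify_block` and its argument layout are hypothetical, and the bonus token normally sampled from the target when every draft token is accepted is omitted for brevity.

```python
import random

def verify_block(draft_tokens, q_probs, p_probs, rng):
    """Verify one drafted block left to right.

    draft_tokens: token ids proposed by the drafter
    q_probs[i]:   drafter distribution at position i (list over the vocabulary)
    p_probs[i]:   target distribution at position i (list over the vocabulary)
    Returns (output_tokens, n_accepted). Everything after the first
    rejection is discarded, however good the later draft tokens were.
    """
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
            continue
        # Rejected: resample from the residual distribution max(p - q, 0).
        residual = [max(pv - qv, 0.0) for pv, qv in zip(p, q)]
        replacement = rng.choices(range(len(p)), weights=residual)[0]
        return accepted + [replacement], i
    return accepted, len(draft_tokens)

# Usage: the target puts no mass on the second draft token, so the block
# is cut after one accepted token and position 1 is resampled.
q = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
p = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
tokens, n_accepted = verify_block([1, 0], q, p, random.Random(1))
```

Because the returned count stops at the first rejection, the quantity being rewarded is the length of the accepted prefix, which is why a drafter trained with a position-uniform objective can be misaligned with it.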

Snippet from the RSS feed

Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottl
