Three training-time interventions improve diffusion-based speculative decoding by 21-76%
[Submitted on 10 Jun 2026 (v1), last revised 22 Jun 2026 (this version, v2)]
Summary
This paper presents an empirical analysis of three training-time interventions to improve speculative decoding with diffusion language models as drafters. The key problem is that block-diffusion drafters generate tokens bidirectionally within a block, while autoregressive target models verify tokens left-to-right, creating a gap between training objective and verification reward. The three proposed interventions are: (1) token positional weighting, (2) a first-error focal loss targeting the position that breaks the accepted prefix, and (3) a chain loss term substituting a differentiable surrogate for expected accepted length. These interventions act orthogonally and compose additively. Across four target models and six benchmarks (reasoning, code, dialogue), they raise accepted draft length by 21-76% per benchmark over a position-uniform baseline without additional forward passes or changes to the inference pipeline.
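As a rough illustration of how the three terms could fit together, here is a minimal NumPy sketch. The summary does not give the paper's formulas, so every functional form here is an assumption: the geometric `decay` weighting, the focal exponent `gamma`, the mixing coefficients `lam_focal` and `lam_chain`, and the use of the drafter's probability of the target token as a stand-in for the per-position acceptance probability. The chain term relies on the standard identity that the expected accepted prefix length is the sum over k of the probability that the first k tokens are all accepted, which becomes a sum of cumulative products if positions are treated as independent.

```python
import numpy as np

def expected_accepted_length(p_accept):
    """Sum over k of prod_{i<=k} p_accept[i]: expected accepted prefix length,
    assuming positions are accepted independently (an illustrative assumption)."""
    return float(np.cumprod(np.asarray(p_accept, dtype=float)).sum())

def position_weights(block_len, decay=0.8):
    """Hypothetical positional weighting: earlier positions weigh more,
    normalised so the weights average to 1."""
    w = decay ** np.arange(block_len)
    return w / w.mean()

def first_error_index(draft_tokens, target_tokens):
    """Index of the first draft token that disagrees with the target, or None."""
    mismatch = np.asarray(draft_tokens) != np.asarray(target_tokens)
    return int(np.argmax(mismatch)) if mismatch.any() else None

def combined_loss(p_correct, draft_tokens, target_tokens,
                  decay=0.8, gamma=2.0, lam_focal=1.0, lam_chain=1.0):
    """Sketch of a combined objective; not the paper's actual loss.

    p_correct[i] is the drafter's probability of the target token at position i.
    """
    p = np.clip(np.asarray(p_correct, dtype=float), 1e-8, 1.0)
    nll = -np.log(p)

    # (1) Position-weighted negative log-likelihood.
    weighted_nll = float((position_weights(len(p), decay) * nll).mean())

    # (2) Focal term applied only at the position that breaks the accepted prefix.
    k = first_error_index(draft_tokens, target_tokens)
    focal = 0.0 if k is None else float((1.0 - p[k]) ** gamma * nll[k])

    # (3) Chain term: negative normalised surrogate for expected accepted length.
    chain = -expected_accepted_length(p) / len(p)

    return weighted_nll + lam_focal * focal + lam_chain * chain
```

In a real training loop these quantities would be computed on tensors in an autodiff framework so that gradients flow through the cumulative product; NumPy is used here only to keep the arithmetic easy to inspect.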
Key quotes
Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model.
A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward.
The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined.
Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
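The draft-then-verify step and the "accepted draft length" metric mentioned in the quotes above can be made concrete with a small sketch. It is a simplification under stated assumptions: it uses greedy exact-match verification rather than the rejection-sampling scheme the paper refers to, `target_next` is a hypothetical callable standing in for the target model, and the target is queried sequentially for readability, whereas a real system scores all draft positions in one parallel forward pass.

```python
from typing import Callable, List, Sequence, Tuple

def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Accept the longest draft prefix that matches the target's greedy choices.

    Returns (tokens to append, number of draft tokens accepted).
    """
    out: List[int] = []
    for tok in draft:
        expected = target_next(list(prefix) + out)
        if tok != expected:
            # First error: everything after this position is discarded,
            # and the target's own token replaces the rejected one.
            out.append(expected)
            return out, len(out) - 1
        out.append(tok)
    # Every draft token matched; the target contributes one extra token.
    out.append(target_next(list(prefix) + out))
    return out, len(draft)
```

Because verification is strictly left to right, a single early mismatch wastes every later draft token regardless of its quality, which is the asymmetry the training-time interventions are meant to address.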
You might also wanna read
Fast-dLLM: Training-Free Acceleration Method for Diffusion Language Models Using KV Cache and Parallel Decoding
Researchers introduce Fast-dLLM, a training-free acceleration method for diffusion-based large language models that addresses their slower i
Speculative Speculative Decoding: Parallelizing LLM Inference for Faster Performance
Researchers introduce speculative speculative decoding (SSD), a novel technique to accelerate large language model inference by parallelizin
Google's DiffusionGemma achieves 4x faster text generation using diffusion-based parallel token generation
DiffusionGemma is a new text generation model from Google that achieves up to 4x faster inference speeds compared to traditional autoregress
iLLaDA: An 8B Masked Diffusion Language Model Trained with Bidirectional Attention
The paper introduces iLLaDA, an 8-billion parameter masked diffusion language model trained from scratch with fully bidirectional attention,
Roofline Model for Estimating Speculative Decoding Speedup in LLM Inference
This article presents a roofline model for estimating speedup ratios from speculative decoding in large language model (LLM) inference. It a
MMaDA-Parallel: Multimodal Diffusion Language Models for Thinking-Aware Generation and Editing
This article presents MMaDA-Parallel, a multimodal large diffusion language model for thinking-aware editing and generation. The research id
