Three training-time interventions improve diffusion-based speculative decoding by 21-76%

[Submitted on 10 Jun 2026 (v1), last revised 22 Jun 2026 (this version, v2)]


technology, science, machine learning, natural language processing

Summary

This paper presents an empirical analysis of three training-time interventions that improve speculative decoding when a diffusion language model serves as the drafter. The core problem is a mismatch: block-diffusion drafters generate tokens bidirectionally within a block, while the autoregressive target model verifies them strictly left to right, which leaves a gap between the symmetric training objective and the asymmetric verification reward. The three interventions are: (1) token positional weighting, (2) a first-error focal loss that targets the position that breaks the accepted prefix, and (3) a chain loss term that uses a differentiable surrogate for expected accepted length. They act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively. Across four target models and six benchmarks covering reasoning, code, and dialogue, they raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without additional forward passes and without changes to the inference pipeline.
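The summary does not give the paper's exact loss definitions, so the following is only a rough sketch of how three such terms could be computed for a single draft block. Everything here is an assumption made for illustration: the function name `draft_block_loss`, the geometric `decay` weights, the focal exponent, the mixing coefficients, and the use of the drafter's probability on the target token as a proxy for per-position acceptance. It is written as a plain forward computation; a real training loop would express the same quantities in an autodiff framework.

```python
import math

def draft_block_loss(probs, targets, decay=0.8, focal_gamma=2.0,
                     lam_focal=1.0, lam_chain=0.1):
    """Illustrative (hypothetical) combination of three loss terms for one block.

    probs[i][v] is the drafter's probability of token v at block position i;
    targets[i] is the token the target model would accept there.
    Assumes every probs[i][targets[i]] is strictly positive.
    """
    k = len(targets)
    p = [probs[i][targets[i]] for i in range(k)]   # prob. of each target token
    nll = [-math.log(x) for x in p]                # per-position cross-entropy

    # (1) Positional weighting: assumed geometric decay, so early positions
    # (which gate everything after them) count more.
    raw = [decay ** i for i in range(k)]
    z = sum(raw)
    weighted_ce = sum(w / z * l for w, l in zip(raw, nll))

    # (2) First-error focal term: find the first position whose greedy draft
    # token differs from the target, and up-weight the loss there.
    first_error = None
    for i in range(k):
        greedy = max(range(len(probs[i])), key=probs[i].__getitem__)
        if greedy != targets[i]:
            first_error = i
            break
    focal = 0.0
    if first_error is not None:
        focal = (1.0 - p[first_error]) ** focal_gamma * nll[first_error]

    # (3) Chain term: surrogate for expected accepted length,
    # sum over k of the product p_1 * ... * p_k (a prefix must be fully accepted).
    expected_len, prefix = 0.0, 1.0
    for x in p:
        prefix *= x
        expected_len += prefix
    chain = -expected_len / k                      # maximise length => minimise negative

    total = weighted_ce + lam_focal * focal + lam_chain * chain
    return {"total": total, "first_error": first_error,
            "expected_accepted_length": expected_len}
```

For a three-position block with target-token probabilities 0.9, 0.4, and 0.8, the chain surrogate is 0.9 + 0.9·0.4 + 0.9·0.4·0.8 = 1.548, which shows why a single weak early position caps the expected accepted length regardless of how confident later positions are.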

Source

Three training-time interventions improve diffusion-based speculative decoding by 21-76% (arxiv.org)

Key quotes


Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model.

A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward.

The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined.

Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
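The left-to-right verification and the "rejection-sampling exactness contract" mentioned in these quotes can be illustrated with the standard speculative sampling acceptance rule. This is a minimal sketch of that generic rule, not code from the paper; the function name `verify_block` and its argument layout are hypothetical, and the bonus token usually sampled after a fully accepted block is omitted for brevity.

```python
import random

def verify_block(draft_tokens, q_dists, p_dists, rng):
    """Verify a drafted block left to right with standard speculative sampling.

    draft_tokens[i] was sampled from the drafter distribution q_dists[i];
    p_dists[i] is the target model's distribution at the same position.
    Returns (accepted_tokens, correction_token_or_None).
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and stop.
        # Sampling proportional to the residual keeps the output distributed
        # exactly as the target model's own samples would be.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        correction = rng.choices(range(len(p)), weights=residual)[0]
        return accepted, correction
    return accepted, None
```

Because the loop stops at the first rejection, only the unbroken prefix of the draft counts. This is the asymmetry that a drafter trained with a position-uniform objective does not see.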

Snippet from the RSS feed

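Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model.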

You might also wanna read

Fast-dLLM: Training-Free Acceleration Method for Diffusion Language Models Using KV Cache and Parallel Decoding

Researchers introduce Fast-dLLM, a training-free acceleration method for diffusion-based large language models that addresses their slower i

arxiv.org·8mo ago

Speculative Speculative Decoding: Parallelizing LLM Inference for Faster Performance

Researchers introduce speculative speculative decoding (SSD), a novel technique to accelerate large language model inference by parallelizin

arxiv.org·3mo ago

Google's DiffusionGemma achieves 4x faster text generation using diffusion-based parallel token generation

DiffusionGemma is a new text generation model from Google that achieves up to 4x faster inference speeds compared to traditional autoregress

blog.google·15d ago

iLLaDA: An 8B Masked Diffusion Language Model Trained with Bidirectional Attention

The paper introduces iLLaDA, an 8-billion parameter masked diffusion language model trained from scratch with fully bidirectional attention,

arxiv.org·5h ago

Roofline Model for Estimating Speculative Decoding Speedup in LLM Inference

This article presents a roofline model for estimating speedup ratios from speculative decoding in large language model (LLM) inference. It a

modal.com·3d ago

MMaDA-Parallel: Multimodal Diffusion Language Models for Thinking-Aware Generation and Editing

This article presents MMaDA-Parallel, a multimodal large diffusion language model for thinking-aware editing and generation. The research id

github.com·7mo ago
