Fast-dLLM: Training-Free Acceleration Method for Diffusion Language Models Using KV Cache and Parallel Decoding
By nathan-barry
Summary
Researchers introduce Fast-dLLM, a training-free acceleration method for diffusion-based large language models that addresses their slower inference speeds compared to autoregressive models. The approach includes a novel block-wise approximate KV Cache mechanism for bidirectional diffusion models and a confidence-aware parallel decoding strategy to maintain generation quality while enabling parallel token decoding. Experimental results show up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models.
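To make the confidence-aware parallel decoding idea concrete, here is a minimal Python sketch. It assumes a threshold rule: at each step, every masked position whose predicted confidence clears a threshold is committed at once, and if none does, the single most confident position is committed so decoding always progresses. The summary above does not spell out the exact rule, so the threshold, the `predict` interface, and the toy model are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical marker for a position that is still masked

Prediction = Tuple[int, float]  # (token id, confidence in [0, 1])


def parallel_decode_block(
    block: List[Optional[int]],
    predict: Callable[[List[Optional[int]]], List[Prediction]],
    threshold: float = 0.9,
) -> Tuple[List[Optional[int]], int]:
    """Fill every masked slot; return the finished block and the step count."""
    block = list(block)
    steps = 0
    while any(tok is MASK for tok in block):
        preds = predict(block)  # one model call per step (assumed interface)
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit, in parallel, every masked position that clears the threshold.
        accept = [i for i in masked if preds[i][1] >= threshold]
        if not accept:
            # Guarantee progress: fall back to the single most confident slot.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            block[i] = preds[i][0]
        steps += 1
    return block, steps


def toy_predict(block: List[Optional[int]]) -> List[Prediction]:
    """Toy stand-in for a diffusion LM: confidence rises as context fills in."""
    known = sum(tok is not MASK for tok in block)
    return [
        (100 + i, min(1.0, 0.25 * (known + 1) + 0.25 * (i % 2)))
        for i in range(len(block))
    ]


tokens, steps = parallel_decode_block([MASK] * 4, toy_predict, threshold=0.75)
print(tokens, steps)
```

With this toy model, the four-token block finishes in three steps instead of four, because the last two positions clear the threshold together once enough context is known. Raising the threshold trades speed for caution: a threshold that is never reached degrades to one token per step.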
Key quotes
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.
We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.
We identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption.
Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss.
Closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
