Understanding Speculative Sampling: Using Draft Distributions to Match Target Sampling Results
By
teleforce
Summary
Speculative sampling draws candidate tokens from a draft distribution q(x) yet produces samples that follow a target distribution p(x). Sampling from q(x) alone would give the wrong distribution, so the method adds a rejection step that corrects the mismatch: tokens the draft over-samples relative to the target are down-sampled, and tokens it under-samples are up-sampled, until the results match those of sampling from the target directly.
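As a rough illustration of the rejection step, here is a minimal single-token sketch in Python. The summary does not spell out the acceptance rule, so this assumes the commonly used one: accept a draft token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p(x) - q(x)). The function name `speculative_sample` and the list-based inputs are hypothetical, chosen only for this example.

```python
import random


def speculative_sample(p, q, rng=random):
    """Return one token index distributed according to p, drafted from q.

    p, q: lists of probabilities over the same vocabulary (illustrative inputs).
    rng:  the `random` module or a `random.Random` instance.
    """
    vocab = range(len(p))

    # Draft step: propose a candidate token from the draft distribution q.
    x = rng.choices(vocab, weights=q, k=1)[0]

    # Accept with probability min(1, p(x) / q(x)).
    # Tokens that q over-samples (q(x) > p(x)) are sometimes rejected.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x

    # Rejection step: resample from the residual max(0, p - q).
    # This puts the rejected mass on tokens that q under-samples.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0]


if __name__ == "__main__":
    # Assumed toy distributions over a three-token vocabulary.
    p = [0.6, 0.3, 0.1]
    q = [0.2, 0.5, 0.3]
    rng = random.Random(0)
    n = 20000
    counts = [0] * len(p)
    for _ in range(n):
        counts[speculative_sample(p, q, rng)] += 1
    # The empirical frequencies should be close to p, not q.
    print([round(c / n, 3) for c in counts])
```

In a real decoder, p and q would come from the target and draft models at each position; this sketch only covers the accept-or-resample decision for one token.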
Key quotes
The idea of speculative sampling is to use a draft sampling to achieve the same sampling result as the target sampling.
We have a target sampling distribution $p(x)$ and a draft sampling distribution $q(x)$.
The core trick of speculative sampling is to design a smart rejection method to down-sample the over-sampled tokens and up-sample the under-sampled tokens.
If we directly sample from $q(x)$, we will get a sample $x$ that is not from the target distribution $p(x)$.
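A short derivation may help show why such a rejection method recovers the target distribution. The quotes above do not state the rule itself, so this assumes the commonly used one: accept $x \sim q$ with probability $\min(1, p(x)/q(x))$, otherwise resample from the normalized residual $\max(0, p(x) - q(x))$. The symbol $\beta$ for the overall acceptance probability is introduced here for illustration.

$$
\begin{aligned}
\beta &= \sum_{y} q(y)\,\min\!\left(1, \frac{p(y)}{q(y)}\right) = \sum_{y} \min\big(p(y), q(y)\big), \\
\sum_{y} \max\big(0,\, p(y) - q(y)\big) &= \sum_{y} \Big(p(y) - \min\big(p(y), q(y)\big)\Big) = 1 - \beta, \\
P(\text{output} = x) &= \underbrace{\min\big(p(x), q(x)\big)}_{\text{accepted draft}} + \underbrace{(1 - \beta)\,\frac{\max\big(0,\, p(x) - q(x)\big)}{1 - \beta}}_{\text{rejected, then resampled}} \\
&= \min\big(p(x), q(x)\big) + \max\big(0,\, p(x) - q(x)\big) = p(x).
\end{aligned}
$$

The first term caps the mass of over-sampled tokens at $p(x)$, and the second adds back exactly the missing mass for under-sampled tokens.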
