Speculative Speculative Decoding: Parallelizing LLM Inference for Faster Performance
By E-Reverance
Summary
Researchers introduce speculative speculative decoding (SSD), a technique that accelerates large language model inference by running speculation and verification in parallel rather than one after the other. Standard speculative decoding uses a fast draft model to predict upcoming tokens and then verifies them with a single forward pass of a slower target model, so drafting and verification alternate. In SSD, while a verification is in progress, the draft model predicts likely verification outcomes and prepares a speculation for each in advance. When the actual outcome falls in the predicted set, the matching speculation is returned immediately and the drafting overhead is eliminated. The paper presents Saguaro, an optimized SSD algorithm; the implementation is on average 30% faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
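To make the baseline concrete, here is a minimal sketch of standard speculative decoding with greedy verification. It is not the paper's code: `target_model` and `draft_model` are hypothetical toy functions over integer tokens, the speculation length `k=3` is arbitrary, and the "single target forward pass" is simulated with a plain loop.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a context to its greedy next token


def target_model(ctx: List[int]) -> int:
    # Hypothetical stand-in for the slow target model.
    return (sum(ctx) * 7 + len(ctx)) % 10


def draft_model(ctx: List[int]) -> int:
    # Hypothetical stand-in for the fast draft model: it agrees with the
    # target except when the context length is a multiple of 4.
    tok = target_model(ctx)
    return (tok + 1) % 10 if len(ctx) % 4 == 0 else tok


def autoregressive(model: Model, prompt: List[int], n: int) -> List[int]:
    # Plain decoding: one model call per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, k: int = 3) -> Tuple[List[int], int]:
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        # 1) Draft k tokens with the fast model.
        spec: List[int] = []
        for _ in range(k):
            spec.append(draft(out + spec))
        # 2) Verify. A real system scores all k+1 positions in one batched
        #    target pass; this loop only simulates that result.
        accepted: List[int] = []
        for i in range(k + 1):
            t = target(out + spec[:i])
            accepted.append(t)  # a confirmed draft token, a correction, or the bonus token
            if i == k or spec[i] != t:
                break
        out.extend(accepted)
        rounds += 1
    return out[len(prompt):len(prompt) + n], rounds


tokens, rounds = speculative_decode(target_model, draft_model, [1, 2, 3], 12)
print(tokens == autoregressive(target_model, [1, 2, 3], 12), rounds)
```

In this toy setup the output matches plain greedy decoding with the target model, but it is produced in fewer verification rounds than tokens. Note that drafting and verification still alternate here, which is the sequential dependency SSD is described as removing.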
Key quotes
Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass.
We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations pre-emptively for them.
If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely.
Our implementation is on average 30% faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
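The idea in these quotes can be sketched as a cache of pre-built speculations keyed by verification outcome. The code below is an illustration under stated assumptions, not Saguaro itself: `target_next` and `draft_next` are hypothetical toy models, an outcome is represented as the pair (number of accepted draft tokens, target's next token), the predicted set holds only one outcome (full acceptance plus the draft's own guess at the bonus token), and the two overlapping steps are run one after another instead of in parallel.

```python
from typing import Dict, List, Tuple

K = 3  # speculation length (arbitrary choice for this sketch)
Outcome = Tuple[int, int]  # (accepted draft tokens, target's next token)


def target_next(ctx: List[int]) -> int:
    # Hypothetical slow target model.
    return (sum(ctx) * 7 + len(ctx)) % 10


def draft_next(ctx: List[int]) -> int:
    # Hypothetical fast draft model: wrong when len(ctx) is a multiple of 5.
    tok = target_next(ctx)
    return (tok + 1) % 10 if len(ctx) % 5 == 0 else tok


def draft_k(ctx: List[int]) -> List[int]:
    spec: List[int] = []
    for _ in range(K):
        spec.append(draft_next(ctx + spec))
    return spec


def verify(ctx: List[int], spec: List[int]) -> Outcome:
    # Stands in for one target forward pass over the whole speculation.
    for i, tok in enumerate(spec):
        t = target_next(ctx + spec[:i])
        if t != tok:
            return i, t  # rejected at position i, t is the correction
    return len(spec), target_next(ctx + spec)  # all accepted, plus bonus token


def predict_outcomes(ctx: List[int], spec: List[int]) -> Dict[Outcome, List[int]]:
    # Assumed toy heuristic: guess that everything is accepted and that the
    # bonus token is the draft's own next token, then draft ahead from there.
    bonus = draft_next(ctx + spec)
    return {(len(spec), bonus): draft_k(ctx + spec + [bonus])}


def ssd_decode(prompt: List[int], n: int) -> Tuple[List[int], int, int]:
    out = list(prompt)
    spec = draft_k(out)  # the first speculation has no verification to hide behind
    hits = misses = 0
    while len(out) - len(prompt) < n:
        # In SSD these two steps would overlap in time.
        cache = predict_outcomes(out, spec)
        outcome = verify(out, spec)
        accepted, next_tok = outcome
        out = out + spec[:accepted] + [next_tok]
        if outcome in cache:
            spec = cache[outcome]  # ready immediately, no drafting wait
            hits += 1
        else:
            spec = draft_k(out)    # fallback: draft after verification as usual
            misses += 1
    return out[len(prompt):len(prompt) + n], hits, misses


tokens, hits, misses = ssd_decode([1, 2, 3], 12)
print(tokens, hits, misses)
```

On a hit, the cached speculation is exactly what the draft model would have produced after verification, so the generated tokens are unchanged and only the waiting time differs. A larger predicted set should raise the hit rate at the cost of more draft work per round; how the paper's algorithm makes that choice is not covered by the quotes above.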
