Roofline Model for Estimating Speculative Decoding Speedup in LLM Inference
This article presents a roofline model for estimating speedup ratios from speculative decoding in large language model (LLM) inference. It analyzes how different draft lengths, acceptance probabilities, block sizes, and relative costs per token/block affect the speedup achieved a