LK Losses: A New Training Objective to Optimize Acceptance Rate in Speculative Decoding for LLMs
[Submitted on 27 Feb 2026 (v1), last revised 1 Jun 2026 (this version, v2)]
Summary
This paper introduces LK losses, a training objective for the draft models used in speculative decoding for large language models (LLMs). Speculative decoding accelerates LLM inference by using a lightweight draft model to propose tokens that the target model then verifies in parallel. Standard training minimizes KL divergence as a proxy for acceptance rate. The two objectives share the same global optimum, but small draft models have limited capacity and often converge to suboptimal solutions where minimizing KL does not maximize acceptance rate. LK losses target acceptance rate directly. Experiments across four draft architectures and six target models (8B to 685B parameters) show consistent improvements in acceptance metrics over standard KL-based training, with gains of up to 8-10% in average acceptance length across general, coding, and math domains. The approach is easy to implement, introduces no computational overhead, and integrates into existing speculator training frameworks.
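To make the gap between the KL proxy and acceptance rate concrete, here is a minimal sketch on a toy three-token vocabulary. It is not the paper's LK loss, whose exact form this summary does not give. It assumes the standard speculative sampling rule, under which the per-token acceptance probability is the sum over tokens of min(p(x), q(x)), and it assumes the KL term is the forward direction KL(target || draft). The distributions and function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Forward KL(p || q) for two discrete distributions (assumed direction)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def acceptance_rate(p, q):
    """Per-token acceptance probability under standard speculative sampling:
    sum over the vocabulary of min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical target distribution and two hypothetical draft distributions.
target = [0.5, 0.3, 0.2]
draft_a = [0.6, 0.39, 0.01]  # nearly drops the third token
draft_b = [0.8, 0.1, 0.1]    # over-concentrates on the first token

kl_a, acc_a = kl_divergence(target, draft_a), acceptance_rate(target, draft_a)
kl_b, acc_b = kl_divergence(target, draft_b), acceptance_rate(target, draft_b)

print(f"draft_a: KL={kl_a:.3f}  acceptance={acc_a:.2f}")
print(f"draft_b: KL={kl_b:.3f}  acceptance={acc_b:.2f}")
```

In this toy case draft_b has the lower KL divergence but also the lower acceptance rate, so ranking imperfect drafts by KL and ranking them by acceptance can disagree even though both objectives are optimal when the draft matches the target exactly.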
Key quotes
While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate.
We propose LK losses, special training objectives that directly target acceptance rate.
Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training.
We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length.
LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.
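For intuition on how per-token acceptance rate relates to acceptance length, the sketch below uses a common simplification that is an assumption here, not something the quotes state: every drafted token is accepted independently with the same probability alpha, and the target model contributes one extra token per verification step. Under that assumption, with k drafted tokens the expected number of tokens produced per step is (1 - alpha^(k+1)) / (1 - alpha). The paper's own definition of average acceptance length may differ, and the function name is hypothetical.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per verification step, assuming each of the
    k drafted tokens is accepted independently with probability alpha and
    the target model always contributes one additional token."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the extra target token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Illustrative values only: compare two hypothetical acceptance rates.
for alpha in (0.70, 0.80):
    print(alpha, round(expected_tokens_per_step(alpha, k=4), 3))
```

Because this quantity grows faster than linearly in alpha, a modest improvement in per-token acceptance can translate into a larger relative gain in tokens per step.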
