LK Losses: A New Training Objective to Optimize Acceptance Rate in Speculative Decoding for LLMs

[Submitted on 27 Feb 2026 (v1), last revised 1 Jun 2026 (this version, v2)]

Summary

This paper introduces LK losses, a set of training objectives for the draft models used in speculative decoding for large language models (LLMs). Speculative decoding accelerates LLM inference by using a lightweight draft model to propose tokens that the target model then verifies in parallel. Standard draft training minimizes KL divergence as a proxy for acceptance rate. The two objectives share the same global optimum, but small draft models with limited capacity typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. LK losses target acceptance rate directly. Experiments across four draft architectures and six target models (8B to 685B parameters) show consistent improvements in acceptance metrics over standard KL-based training, with gains of up to 8-10% in average acceptance length on general, coding, and math domains. The losses are easy to implement, introduce no computational overhead, and can be integrated directly into existing speculator training frameworks.
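To make the gap between KL divergence and acceptance rate concrete, here is a minimal sketch in plain Python. It is not the paper's LK loss, whose exact form is not given in this summary. It assumes the standard speculative sampling rule, under which a drafted token is accepted with probability equal to the sum over tokens of min(p(x), q(x)) for target distribution p and draft distribution q, and it assumes the KL direction KL(p || q). The three-token distributions and the function names are hypothetical and chosen only for illustration.

```python
import math

def acceptance_rate(target, draft):
    # Probability that one drafted token is accepted under standard
    # speculative sampling: sum over tokens of min(p(x), q(x)).
    return sum(min(p, q) for p, q in zip(target, draft))

def forward_kl(target, draft):
    # KL(target || draft) in nats (direction is an assumption);
    # tokens with zero target mass contribute nothing.
    return sum(p * math.log(p / q) for p, q in zip(target, draft) if p > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
target = [0.9, 0.1, 0.0]
draft_a = [0.7, 0.1, 0.2]      # spreads mass onto a token the target never emits
draft_b = [0.9, 0.001, 0.099]  # matches the top token, nearly drops the second

for name, draft in [("draft_a", draft_a), ("draft_b", draft_b)]:
    kl = forward_kl(target, draft)
    acc = acceptance_rate(target, draft)
    print(f"{name}: KL={kl:.3f}  acceptance={acc:.3f}")
```

In this toy case draft_a has the lower KL (about 0.23 versus 0.46 nats) yet also the lower acceptance rate (0.80 versus about 0.90), so a draft model that cannot match the target exactly may be ranked differently by the two criteria.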

Key quotes

While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate.

We propose LK losses, special training objectives that directly target acceptance rate.

Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training.

We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length.

LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.
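As a rough guide to why acceptance rate drives the reported acceptance length, the sketch below uses the simplified model from the original speculative decoding analysis: each of `gamma` drafted tokens is accepted independently with the same probability `alpha`, and the target model always contributes one token per verification pass. Both the independence assumption and the function name are illustrative; the paper may define and measure average acceptance length differently.

```python
def expected_tokens_per_step(alpha, gamma):
    # Expected tokens produced per target-model verification pass when
    # gamma tokens are drafted and each is accepted independently with
    # probability alpha (a simplifying assumption). Counts the one token
    # the target model always contributes.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Hypothetical acceptance rates with 4 drafted tokens per step.
for alpha in (0.70, 0.80):
    print(f"alpha={alpha:.2f}: {expected_tokens_per_step(alpha, 4):.3f} tokens/step")
```

Under this model, small gains in per-token acceptance rate compound over the drafted sequence, which is one reason a training objective aimed at acceptance rate can translate into longer accepted runs.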

Snippet from the RSS feed

Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acce
