All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

Sequential KV Cache Compression Using Probabilistic Language Tries and Predictive Delta Coding

By

EGreg

1mo ago· 2 min readenInsight

Summary

This research paper introduces a novel two-layer architecture for compressing transformer key-value (KV) caches as sequences rather than individual vectors. The approach leverages the fact that tokens in KV caches are samples from the formal language the model was trained on. The first layer uses probabilistic prefix deduplication to identify semantically equivalent shared prefixes across sessions using Probabilistic Language Tries. The second layer employs predictive delta coding to store only residuals from the model's own predictions. The method achieves theoretical compression ratios up to 914,000x over existing approaches like TurboQuant, with compression improving as context length grows.

Key quotes

· 5 pulled
The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language.
We introduce sequential KV compression, a two-layer architecture that exploits this structure.
The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit.
Even at 1000x above the entropy floor -- a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders -- the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows.
The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
Snippet from the RSS feed
Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actuall

You might also wanna read