Sequential KV Cache Compression Using Probabilistic Language Tries and Predictive Delta Coding
By
EGreg
Plain bagel done well. Pleasantly substantive.
Summary
This research paper introduces a novel two-layer architecture for compressing transformer key-value (KV) caches as sequences rather than individual vectors. The approach leverages the fact that tokens in KV caches are samples from the formal language the model was trained on. The first layer uses probabilistic prefix deduplication to identify semantically equivalent shared prefixes across sessions using Probabilistic Language Tries. The second layer employs predictive delta coding to store only residuals from the model's own predictions. The method achieves theoretical compression ratios up to 914,000x over existing approaches like TurboQuant, with compression improving as context length grows.
Key quotes
· 5 pulledThe tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language.
We introduce sequential KV compression, a two-layer architecture that exploits this structure.
The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit.
Even at 1000x above the entropy floor -- a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders -- the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows.
The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
