VideoMLA: Low-Rank Latent KV Cache Reduces Memory by 92.7% for Minute-Scale Video Diffusion

[Submitted on 28 May 2026]

Summary

This paper introduces VideoMLA, the first application of Multi-Head Latent Attention (MLA) to video diffusion models. It replaces per-head key and value caches with a shared low-rank content latent and a shared, decoupled 3D-RoPE positional key, cutting per-token KV memory by 92.7% at every cached layer. The authors also examine why this works: pretrained video attention is not low-rank, so the spectral assumption often used to motivate MLA in language models does not hold here. Instead, the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: spectral and random initializations both occupy nearly the full rank budget from the start, and training preserves that budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming baselines, achieves the best overall long-horizon score among the evaluated methods, and improves throughput by 1.23x on a single B200 GPU.
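To make the memory claim concrete, here is a rough sketch of how a per-token reduction of this kind is computed when per-head keys and values are swapped for one shared latent plus one shared positional key. The function names and all four dimensions are hypothetical: they were picked so the arithmetic lands near the reported 92.7% and are not taken from the paper.

```python
def per_head_kv_size(n_heads: int, d_head: int) -> int:
    """Values cached per token per layer when every head stores its own K and V."""
    return 2 * n_heads * d_head


def latent_kv_size(d_latent: int, d_rope: int) -> int:
    """Values cached per token per layer with one shared content latent
    and one shared positional (RoPE) key."""
    return d_latent + d_rope


def kv_memory_reduction(n_heads: int, d_head: int, d_latent: int, d_rope: int) -> float:
    """Fraction of per-token KV memory saved by the latent layout."""
    baseline = per_head_kv_size(n_heads, d_head)
    compressed = latent_kv_size(d_latent, d_rope)
    return 1.0 - compressed / baseline


if __name__ == "__main__":
    # Hypothetical dimensions, NOT from the paper: 12 heads of width 128,
    # a 160-dim content latent and a 64-dim shared positional key.
    saved = kv_memory_reduction(n_heads=12, d_head=128, d_latent=160, d_rope=64)
    print(f"per-token KV memory saved: {saved:.1%}")
```

Because the saving is a ratio of per-token sizes, it applies identically at every cached layer and does not depend on the window length.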

Key quotes

VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer.

We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank.

The MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it.

On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.

Snippet from the RSS feed

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a d
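The fixed-size sliding-window cache mentioned in the snippet can be pictured with a small sketch. This is an illustration only, assuming the simplest eviction policy (drop the oldest token when the window is full); the class name `SlidingWindowCache` and its methods are hypothetical, and the paper's actual cache contents and eviction rules may differ.

```python
from collections import deque
from typing import Deque, List, Tuple


class SlidingWindowCache:
    """Keeps cache entries for only the most recent `window` tokens."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A deque with maxlen discards the oldest item on overflow.
        self._entries: Deque[Tuple[int, List[float]]] = deque(maxlen=window)

    def append(self, token_index: int, entry: List[float]) -> None:
        """Store one token's cached vector (per-head K/V or a shared latent)."""
        self._entries.append((token_index, entry))

    def token_indices(self) -> List[int]:
        """Indices of the tokens currently held, oldest first."""
        return [index for index, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


if __name__ == "__main__":
    cache = SlidingWindowCache(window=3)
    for t in range(5):
        cache.append(t, [float(t)])
    print(cache.token_indices())  # only the three most recent tokens remain
```

With the window length fixed, total cache memory scales with the size of each stored entry, which is the quantity a latent layout shrinks.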
