BIFE: A New Framework for Stable Minute-Long Video Generation Using Semantic Sparse KV Cache and Block Forcing

[Submitted on 28 Nov 2025 (v1), last revised 22 Jun 2026 (this version, v2)]


Summary

This paper introduces BIFE (Better Interaction, Fewer Errors), a framework for minute-long video generation. It targets two problems in autoregressive diffusion models: long-range interactions are lost because a sliding-window KV cache discards older context, and errors accumulate over time and progressively degrade quality. BIFE proposes a semantic sparse KV cache for retrieval-based long-range conditioning and a Block Forcing training strategy to enforce cross-block consistency. The authors also introduce InterVBench, a minute-long video benchmark with fine-grained block-level annotations and Video Drift Error (VDE) metrics. Experiments on InterVBench and VBench-Long show that BIFE achieves state-of-the-art performance, including a 22.2% improvement on VDE-Subject and a 19.4% improvement on VDE-Clarity over baselines.
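To make the contrast between a sliding-window cache and retrieval-based conditioning concrete, here is a minimal sketch. It is not the authors' method: the abstract does not say how BIFE scores or selects cached blocks, so the cosine-similarity ranking, the per-block summary vectors, and the names `select_cache_blocks`, `window`, and `top_k` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_cache_blocks(query, block_keys, window=2, top_k=1):
    """Hypothetical sparse cache selection (an assumption, not BIFE's actual rule).

    Keeps the `window` most recent blocks, as a sliding window would, and
    additionally retrieves the `top_k` older blocks whose summary vector is
    most similar to the current query. Returns block indices in temporal order.
    """
    n = len(block_keys)
    recent_start = max(0, n - window)
    recent = list(range(recent_start, n))
    # Rank the blocks that a pure sliding window would have discarded.
    older = range(recent_start)
    ranked = sorted(older, key=lambda i: cosine(query, block_keys[i]), reverse=True)
    retrieved = sorted(ranked[:top_k])
    return retrieved + recent


# Toy example: one summary vector per already-generated video block.
block_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.1]  # the current block resembles block 0, far in the past
kept = select_cache_blocks(query, block_keys, window=2, top_k=1)
print(kept)  # [0, 3, 4]
```

With `window=2`, a plain sliding window would keep only blocks 3 and 4 and lose block 0, even though it is the most relevant to the current query. The retrieval step brings it back while the cache stays small.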

Source

BIFE: A New Framework for Stable Minute-Long Video Generation Using Semantic Sparse KV Cache and Block Forcing (arxiv.org)

Key quotes

Long video generation is a critical step toward building realistic world models, requiring both high visual fidelity and long-range interaction consistency.
Recent autoregressive diffusion models enable long-horizon generation through KV cache reuse, yet suffer from two fundamental challenges: failure to preserve long-range interactions due to sliding-window KV cache and error accumulation that progressively degrades generation quality over time.
To address these issues, we propose BIFE, a framework that introduces a semantic sparse KV cache for retrieval-based long-range conditioning and a Block Forcing training strategy to enforce cross-block consistency.
Together, these designs preserve historical interactions while mitigating drift, enabling stable and coherent minute-long video generation.
Extensive experiments on InterVBench and VBench-Long demonstrate that BIFE achieves state-of-the-art performance, including a 22.2% improvement on VDE-Subject and a 19.4% improvement on VDE-Clarity over baselines.
