Sleep-Like Consolidation Mechanism Improves Long-Context Performance in Transformer Language Models
By juxtapose
Summary
This paper proposes a sleep-like consolidation mechanism for transformer-based large language models to address the poor scaling of attention mechanisms with long context lengths. The model periodically converts recent context into persistent fast weights, clears its key-value cache, and performs offline recurrent passes over accumulated context during a 'sleep' phase. This shifts extra computation to sleep while preserving wake-time prediction latency. The method is tested on synthetic tasks (cellular automata, multi-hop graph retrieval) and a realistic math reasoning task, showing that increasing sleep duration improves performance, especially on examples requiring deeper reasoning.
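The wake/sleep cycle described above can be illustrated with a toy sketch. This is not the paper's implementation: the class name `ToyConsolidatingModel`, the parameters `window`, `sleep_passes`, and `lr`, and the delta-rule update used to write fast weights are all assumptions chosen for illustration. It shows only the control flow: cache context while awake, replay it for N passes during sleep, then clear the cache.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]


def matvec(weights: List[Vector], x: Vector) -> Vector:
    """Multiply a square weight matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def outer_add(weights: List[Vector], key: Vector, delta: Vector, lr: float) -> None:
    """In-place update: weights += lr * outer(delta, key)."""
    for i, d in enumerate(delta):
        for j, k in enumerate(key):
            weights[i][j] += lr * d * k


@dataclass
class ToyConsolidatingModel:
    dim: int
    window: int          # hypothetical: wake steps allowed before a sleep
    sleep_passes: int    # stands in for the sleep duration N
    lr: float = 0.5      # hypothetical step size for the fast-weight write
    kv_cache: List[Tuple[Vector, Vector]] = field(default_factory=list)
    fast_weights: List[Vector] = field(init=False)
    sleeps: int = 0

    def __post_init__(self) -> None:
        self.fast_weights = [[0.0] * self.dim for _ in range(self.dim)]

    def wake_step(self, key: Vector, value: Vector) -> None:
        """Wake phase: only append to the cache, so per-step cost stays small."""
        self.kv_cache.append((key, value))
        if len(self.kv_cache) >= self.window:
            self.sleep()

    def sleep(self) -> None:
        """Sleep phase: replay the cached context N times, then drop the cache."""
        for _ in range(self.sleep_passes):
            for key, value in self.kv_cache:
                # Assumed delta rule: nudge the readout for `key` toward `value`.
                pred = matvec(self.fast_weights, key)
                error = [v - p for v, p in zip(value, pred)]
                outer_add(self.fast_weights, key, error, self.lr)
        self.kv_cache.clear()
        self.sleeps += 1

    def recall(self, key: Vector) -> Vector:
        """Read from fast weights alone; the cleared cache is not consulted."""
        return matvec(self.fast_weights, key)


def run(sleep_passes: int) -> ToyConsolidatingModel:
    model = ToyConsolidatingModel(dim=2, window=2, sleep_passes=sleep_passes)
    model.wake_step([1.0, 0.0], [0.0, 1.0])
    model.wake_step([0.0, 1.0], [1.0, 0.0])  # reaches the window, triggers sleep
    return model


short_sleep = run(sleep_passes=1)
long_sleep = run(sleep_passes=4)
```

In this toy, `long_sleep.recall([1.0, 0.0])` lands closer to the stored value `[0.0, 1.0]` than `short_sleep.recall([1.0, 0.0])` does, because each extra pass shrinks the remaining error. That loosely mirrors the reported trend that longer sleep helps, but it says nothing about how the paper's models actually consolidate context.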
Key quotes
To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache.
During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction.
We then show that increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning.
