How New Open-Weight LLMs Are Reducing Long-Context Costs: KV Sharing, Attention Budgeting, and Compressed Attention
By gmays
Summary
The article analyzes recent developments in open-weight LLM architectures, focusing on how newer models like Gemma 4 and DeepSeek V4 are implementing techniques to improve long-context efficiency. Key innovations discussed include KV sharing, per-layer embeddings, layer-wise attention budgeting, and compressed attention mechanisms. These architectural tricks aim to reduce KV-cache size, memory traffic, and attention costs as reasoning models and agent workflows keep more tokens around for longer periods.
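To make the KV-cache pressure concrete, here is a rough back-of-the-envelope sketch. It assumes a plain decoder with grouped-query attention and models only one simple form of KV sharing, in which groups of adjacent layers reuse a single cache. The function name, the `kv_share_group` parameter, and every number below are hypothetical and chosen for illustration; they are not the configuration of Gemma 4, DeepSeek V4, or any other model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, kv_share_group=1):
    """Estimate the KV-cache size in bytes for a single sequence.

    kv_share_group is a hypothetical knob: the number of consecutive
    layers that reuse one stored K/V cache (1 means no sharing).
    """
    # Only one layer per sharing group stores its own keys and values.
    caching_layers = -(-num_layers // kv_share_group)  # ceiling division
    # The factor 2 accounts for storing both keys and values.
    return (2 * caching_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value)


# Illustrative (made-up) model shape: 32 layers, 8 KV heads, head_dim 128,
# 16-bit cache values, 128k tokens of context.
baseline = kv_cache_bytes(32, 8, 128, 128_000)
shared = kv_cache_bytes(32, 8, 128, 128_000, kv_share_group=2)

print(f"no sharing:      {baseline / 2**30:.3f} GiB")
print(f"pairs of layers: {shared / 2**30:.3f} GiB")
```

Under these assumptions the cache grows linearly with context length, and sharing one cache across each pair of layers halves it. The same linear growth is why memory traffic, and not only capacity, becomes a constraint as more tokens are kept around.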
Key quotes
As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints
LLM developers are adding a growing number of architecture tricks to reduce those costs
The thing that stood out to me is how much newer architectures are focused on long-context efficiency
