How New Open-Weight LLMs Are Reducing Long-Context Costs: KV Sharing, Attention Budgeting, and Compressed Attention

gmays

12d ago· 29 min readenInsight

100/100

Golden Brown

Bagelometer↗

Crisp on the outside, thoughtful on the inside. A keeper.

Score100TypeanalysisSentimentpositive

Summary

The article analyzes recent developments in open-weight LLM architectures, focusing on how newer models like Gemma 4 and DeepSeek V4 are implementing techniques to improve long-context efficiency. Key innovations discussed include KV sharing, per-layer embeddings, layer-wise attention budgeting, and compressed attention mechanisms. These architectural tricks aim to reduce KV-cache size, memory traffic, and attention costs as reasoning models and agent workflows keep more tokens around for longer periods.

Key quotes

· 3 pulled

As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints

LLM developers are adding a growing number of architecture tricks to reduce those costs

The thing that stood out to me is how much newer architectures are focused on long-context efficiency

Snippet from the RSS feed

From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs

You might also wanna read

DeepSeek-V4: Hybrid Sparse-Attention Architecture Enables Efficient Million-Token Context Inference

DeepSeek-V4 introduces a hybrid sparse-attention architecture combined with on-policy distillation across domain specialists, enabling 1M-to

artgor.medium.com·1h ago

Parametric Memory Law: A Quantitative Framework for Understanding LoRA Memory Capacity in LLMs

This research paper introduces the Parametric Memory Law, a quantitative framework for understanding how Low-Rank Adaptation (LoRA) enables

arxiv.org·1d ago

PromptEmbedder: A Dual-LLM Framework for Efficient, Architecture-Agnostic Text Embedding

The article presents PromptEmbedder, a novel dual-LLM framework for efficient and transferable text embedding. It addresses the bottleneck o

arxiv.org·3d ago