How New Open-Weight LLMs Are Reducing Long-Context Costs: KV Sharing, Attention Budgeting, and Compressed Attention
The article analyzes recent open-weight LLM architectures, focusing on how newer models such as Gemma 4 and DeepSeek V4 improve long-context efficiency. Key innovations discussed include KV sharing, per-layer embeddings, layer-wise atte