Technology

Art

Why LLMs Are So Expensive: The Quadratic Cost of Dense Attention — and How Subquadratic Claims to Fix It

Shashi Bellamkonda

3h ago· 8 min readenInsight

technology programming enterprise computing ai architecture

Summary

This article explains the fundamental computational bottleneck behind large language models: dense attention, which requires every token to compare against every other token, creating quadratic compute scaling (O(n²)). As context windows grow, this becomes exponentially more expensive. The piece introduces Subquadratic's SubQ model as a potential breakthrough that breaks this quadratic constraint through independent benchmarks. It explores the architectural implications for enterprise teams deploying AI at scale, including cost modeling, inference efficiency, and the trade-offs between attention mechanisms.

Source

bskyWhy LLMs Are So Expensive: The Quadratic Cost of Dense Attention — and How Subquadratic Claims to Fix Itshashi.co

Key quotes

· 3 pulled

Each dot is a token, roughly a word or part of a word. Each line is a computation the model runs to figure out how that token relates to every other token in the document.

The technical term is dense attention. The practical result is that every LLM compares every word to every other word.

Subquadratic's SubQ model posts independent benchmarks suggesting that constraint is breakable.

Snippet from the RSS feed

Dense attention's quadratic compute scaling has been the hidden cost driver behind enterprise AI since 2017. Subquadratic's SubQ model posts independent benchmarks suggesting that constraint is breakable. Here is what the architecture actually means for e

You might also wanna read

New Method Enables Constant-Cost Self-Attention Computation for Transformers

Researchers present a novel mathematical approach to compute self-attention in Transformer AI models with constant cost per token, rather th

arxiv.org·4mo ago

Subquadratic launches AI architecture with 12-million-token context window, outperforming GPT-5.5 on retrieval benchmarks

Subquadratic has launched a new AI architecture featuring a 12-million-token context window, shattering the current million-token standard s

The New Stack·1mo ago

How New Open-Weight LLMs Are Reducing Long-Context Costs: KV Sharing, Attention Budgeting, and Compressed Attention

The article analyzes recent developments in open-weight LLM architectures, focusing on how newer models like Gemma 4 and DeepSeek V4 are imp

magazine.sebastianraschka.com·1mo ago

Understanding Continuous Batching in Large Language Models: From Attention Mechanisms to Throughput Optimization

This technical blog post explains continuous batching in large language models (LLMs) by starting from first principles of attention mechani

huggingface.co·4mo ago

Subquadratic Launches SubQ 1.1 Small LLM with Competitive Benchmark Performance

Subquadratic introduces SubQ 1.1 Small, a new LLM that balances long-context optimization with general reasoning. The model achieves strong

subq.ai·12d ago

Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons

arxiv.org·1y ago

Comments

No comments yet. Be the first.