CODA: A GPU Kernel Abstraction for Efficient Transformer Training via GEMM-Epilogue Programming
By matt_d
Summary
CODA is a GPU kernel abstraction that expresses memory-bound Transformer operators (normalization, activations, residual updates, reductions) as GEMM-plus-epilogue programs. It reparameterizes these operators so that they execute while a GEMM output tile is still on chip, before the tile is written to memory, which reduces the data-movement bottleneck in Transformer training. The constrained interface preserves the performance structure of expert-written GEMMs while covering nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Both human- and LLM-authored CODA kernels achieve high performance on representative workloads, which suggests that GEMM-plus-epilogue programming is a practical path toward combining framework-level productivity with hardware-level efficiency.
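To make the GEMM-plus-epilogue idea concrete, here is a minimal pure-Python sketch. It is not CODA's API: the names `gemm_with_epilogue`, `unfused`, and `tile` are hypothetical, and nested lists stand in for GPU memory. It only illustrates the general pattern of applying an activation and a residual add to each output tile before that tile is written out, instead of running them as separate passes over the full output.

```python
import math

def gelu(v):
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def matmul(a, b):
    # Reference GEMM on nested lists.
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def unfused(x, w, residual):
    # Three separate passes: each one reads and writes the full output matrix.
    y = matmul(x, w)
    y = [[gelu(v) for v in row] for row in y]
    return [[v + r for v, r in zip(y_row, r_row)] for y_row, r_row in zip(y, residual)]

def gemm_with_epilogue(x, w, epilogue, tile=2):
    # Hypothetical illustration: compute one output tile at a time, apply the
    # epilogue while the tile is still "on chip" (the local dict), then write once.
    m, k, n = len(x), len(w), len(w[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            acc = {(i, j): sum(x[i][p] * w[p][j] for p in range(k))
                   for i in rows for j in cols}
            for (i, j), v in acc.items():
                out[i][j] = epilogue(i, j, v)  # single write of the final value
    return out

x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [0.0, 1.0, -2.0]]
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
residual = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

fused = gemm_with_epilogue(x, w, lambda i, j, v: gelu(v) + residual[i][j])
reference = unfused(x, w, residual)
max_err = max(abs(a - b) for f_row, r_row in zip(fused, reference)
              for a, b in zip(f_row, r_row))
```

Both paths compute the same values; the difference a real kernel would care about is that the fused path touches each output element in memory once rather than three times.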
Key quotes
We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs.
CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory.
This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block.
Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
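The quotes do not spell out which algebraic reparameterizations CODA uses. As one assumed example of the kind of identity involved, RMSNorm followed by a linear layer can be rewritten so that the per-row normalization becomes a row scaling of the GEMM output: the gain is folded into the weight rows, and the factor 1/rms(x) is applied afterward. The function names below are hypothetical, and how an actual kernel obtains the row statistic is not stated in this summary.

```python
import math

def matmul(a, b):
    # Reference GEMM on nested lists.
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inv_rms(row, eps=1e-6):
    # Reciprocal root-mean-square of one input row.
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)

def rmsnorm_then_linear(x, gain, w):
    # Baseline: normalize the input in a separate pass, then run the GEMM.
    normed = [[v * inv_rms(row) * g for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)

def linear_with_rmsnorm_epilogue(x, gain, w):
    # Reparameterized: fold the gain into the weight rows once, run the GEMM on
    # the raw input, then scale each output row by 1/rms(x_row) as an epilogue.
    w_scaled = [[g * v for v in w_row] for g, w_row in zip(gain, w)]
    y = matmul(x, w_scaled)
    return [[inv_rms(x_row) * v for v in y_row] for x_row, y_row in zip(x, y)]

x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
gain = [1.5, 0.5, 2.0]
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]

baseline = rmsnorm_then_linear(x, gain, w)
reparam = linear_with_rmsnorm_epilogue(x, gain, w)
max_diff = max(abs(a - b) for b_row, r_row in zip(baseline, reparam)
               for a, b in zip(b_row, r_row))
```

The two results agree up to floating-point rounding because scaling a row of the input by a scalar commutes with the matrix product.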