CODA: A GPU Kernel Abstraction for Efficient Transformer Training via GEMM-Epilogue Programming

By matt_d

10d ago · 2 min read · en · Insight

Summary

CODA is a GPU kernel abstraction that expresses memory-bound Transformer operators (normalization, activations, residual updates, reductions) as GEMM-plus-epilogue programs. By reparameterizing these operators so that they execute while a GEMM output tile is still on chip, before it is written to memory, CODA reduces the data-movement bottleneck in Transformer training. The abstraction preserves the performance structure of expert-written GEMMs while covering nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Both human- and LLM-authored CODA kernels achieve high performance, suggesting a practical path toward combining framework-level productivity with hardware-level efficiency.
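To make the GEMM-plus-epilogue idea concrete, here is a minimal pure-Python sketch. It is not CODA's interface, which this summary does not show. The names `gemm_with_epilogue`, `unfused_block`, and the `epilogue(i, j, acc)` callback signature are hypothetical, and the local accumulator merely stands in for a tile held in registers or shared memory.

```python
import math


def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matmul(A, B):
    """Plain reference GEMM on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def unfused_block(A, B, bias, residual):
    """Framework-style execution: each operator is a separate pass over the
    full output, which on a GPU would mean a round trip through memory."""
    Y = matmul(A, B)                                             # pass 1: GEMM
    Y = [[y + bias[j] for j, y in enumerate(row)] for row in Y]  # pass 2: bias
    Y = [[gelu(y) for y in row] for row in Y]                    # pass 3: activation
    return [[y + r for y, r in zip(yr, rr)]                      # pass 4: residual
            for yr, rr in zip(Y, residual)]


def gemm_with_epilogue(A, B, epilogue, tile=2):
    """Tiled GEMM (hypothetical sketch): every output element is transformed
    by `epilogue` while it is still a local accumulator, then stored once."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):              # loop over output tiles
        for j0 in range(0, N, tile):
            for i in range(i0, min(i0 + tile, M)):
                for j in range(j0, min(j0 + tile, N)):
                    acc = sum(A[i][k] * B[k][j] for k in range(K))
                    C[i][j] = epilogue(i, j, acc)  # single write per element
    return C
```

Passing `lambda i, j, acc: gelu(acc + bias[j]) + residual[i][j]` as the epilogue reproduces the output of `unfused_block` with one store per output element instead of four full passes, which is the kind of data movement the summary says CODA avoids.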

Key quotes

We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs.
CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory.
This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block.
Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
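One way to picture the "algebraically reparameterized" step in the quotes above is RMSNorm followed by a linear layer. Row scaling commutes with matrix multiplication, so the per-row factor 1/rms(x) can be applied to the GEMM output, and the per-feature gain can be folded into the weights. Whether CODA uses exactly this rewrite is not stated here. The sketch below is an illustrative assumption, and all function names in it are hypothetical.

```python
import math


def matmul(A, B):
    """Plain reference GEMM on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inv_rms(row, eps=1e-6):
    """Per-row statistic 1 / sqrt(mean(x^2) + eps)."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def rmsnorm_then_gemm(X, g, W, eps=1e-6):
    """Baseline: normalize X in a separate pass, then multiply by W."""
    normed = [[v * inv_rms(row, eps) * gk for v, gk in zip(row, g)] for row in X]
    return matmul(normed, W)


def gemm_then_row_scale(X, g, W, eps=1e-6):
    """Reparameterized form: fold the gain into W, run the GEMM on raw X,
    and apply the row scale as an epilogue on the output."""
    Wg = [[gk * w for w in wrow] for gk, wrow in zip(g, W)]  # diag(g) @ W
    scales = [inv_rms(row, eps) for row in X]                # one scalar per row
    Z = matmul(X, Wg)
    return [[s * z for z in zrow] for s, zrow in zip(scales, Z)]
```

The two functions agree up to floating-point rounding, because (diag(s) X diag(g)) W = diag(s) (X (diag(g) W)). In the second form the normalized activations are never materialized as a separate tensor.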
Snippet from the RSS feed
Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly
