Implementing Flash Attention for NVIDIA 5090 GPUs with CUDA C++
By dsr12
Summary
A technical tutorial on implementing Flash Attention for NVIDIA 5090 GPUs in CUDA C++. The author frames the post as a learning exercise in writing attention in CUDA C++ rather than Triton, because some features, such as MXFP8 / NVFP4 MMA for sm120, are not available in Triton. The post is aimed at readers already familiar with CUDA C++ and Tensor Cores, and it fills a gap the author sees in existing material: there are many blog posts on fast matmul kernels but none on attention.
Key quotes
The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.
I also feel this is a natural next step after learning about matmul kernels.
There are many excellent blogposts on writing fast matmul kernels, but there is none for attention.
Readers are highly recommended to be familiar with CUDA C++ and how to use Tensor cores on NVIDIA.
