
Implementing Flash Attention for NVIDIA 5090 GPUs with CUDA C++

By dsr12 · 9mo ago · 29 min read · en

Summary

A technical tutorial on implementing Flash Attention for NVIDIA 5090 GPUs in CUDA C++. The author walks through how they learned to write attention in CUDA C++ rather than Triton, because CUDA C++ exposes features that Triton does not, such as MXFP8 / NVFP4 MMA for sm120. The post assumes familiarity with CUDA C++ and Tensor cores, and is framed as a natural next step after matmul kernels: the author notes that there are many excellent blog posts on fast matmul kernels but none on attention.

Key quotes · 4 pulled
The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.
I also feel this is a natural next step after learning about matmul kernels.
There are many excellent blogposts on writing fast matmul kernels, but there is none for attention.
Readers are highly recommended to be familiar with CUDA C++ and how to use Tensor cores on NVIDIA.
Snippet from the RSS feed
In this post, I will walk through how I learned to implement Flash Attention for 5090 in CUDA C++. The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120. I also fe
