Implementing Flash Attention for NVIDIA 5090 GPUs with CUDA C++

dsr12

9mo ago · 29 min read

Score: 100 · Type: how-to · Sentiment: neutral

Summary

A technical tutorial on implementing Flash Attention for the NVIDIA 5090 in CUDA C++. The author walks through how they learned to write attention kernels in CUDA C++ rather than Triton, because Triton does not expose some features, such as MXFP8 / NVFP4 MMA for sm120. The post is aimed at readers already familiar with CUDA C++ and Tensor Cores, and is framed as a natural next step after matmul kernels: many blog posts cover fast matmul kernels, but the author found none for attention.

Key quotes

The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.

I also feel this is a natural next step after learning about matmul kernels.

There are many excellent blogposts on writing fast matmul kernels, but there is none for attention.

Readers are highly recommended to be familiar with CUDA C++ and how to use Tensor cores on NVIDIA.

Snippet from the RSS feed

In this post, I will walk through how I learned to implement Flash Attention for 5090 in CUDA C++. The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120. I also fe
