KernelBench-Mega: Open Benchmark for Agentic GPU Whole-Block Megakernel Performance

By

Elliot Arledge

2d ago · 2 min read · en · News

Summary

KernelBench-Mega is an open benchmark for agentic GPU kernel generation. It tests whole-block megakernels: rather than being graded on a single isolated op, the agent fuses an entire model block into one kernel. Results are reported on GPUs such as the RTX PRO 6000 Blackwell, H100, and B200. The headline metric is decode speedup over an optimized-PyTorch baseline (e.g., 19.35x), with tokens per second also reported. One featured task, Problem 02_kimi_linear_decode, is a Kimi-Linear W4A16 hybrid decode (4-bit weights, bf16 activations).
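To make the two metrics concrete, here is a minimal sketch of how a decode speedup and a tokens-per-second figure could be derived from measured latencies. The function names and the latency values are hypothetical, chosen only so the result matches the 19.35x example; they are not taken from the benchmark's harness or results.

```python
def decode_speedup(baseline_ms: float, kernel_ms: float) -> float:
    """Ratio of baseline latency to kernel latency (higher is better)."""
    if baseline_ms <= 0 or kernel_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / kernel_ms


def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Decode throughput for `tokens` generated in `elapsed_ms` milliseconds."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens / (elapsed_ms / 1000.0)


# Hypothetical per-token decode latencies, not measured numbers.
baseline_ms = 3.87   # assumed optimized-PyTorch reference
kernel_ms = 0.20     # assumed fused megakernel

speedup = decode_speedup(baseline_ms, kernel_ms)
throughput = tokens_per_second(1, kernel_ms)
print(f"speedup: {speedup:.2f}x, throughput: {throughput:.0f} tok/s")
```

Because the speedup is a ratio against a reference implementation, it is unbounded above, unlike a roofline fraction, which is capped at 1.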

Source

Twitter / X · KernelBench-Mega: Open Benchmark for Agentic GPU Whole-Block Megakernel Performance · kernelbench.com

Key quotes (3 pulled)

KernelBench-Mega tests whole-block megakernels: instead of grading a single isolated op, the agent fuses an entire model block into one kernel.
The headline metric is the decode speedup over an optimized-PyTorch baseline (e.g. 19.35x = 19x faster than the reference), not a 0-1 roofline fraction.
Problem 02_kimi_linear_decode is a Kimi-Linear W4A16 hybrid decode (4-bit weights, bf16 activations).
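For readers unfamiliar with the W4A16 label, the sketch below illustrates the general idea in plain Python: weights are stored as 4-bit integers packed two per byte, then dequantized with a scale before being multiplied by higher-precision activations. The packing order (low nibble first), the single per-tensor scale, and the zero point of 8 are all assumptions made for illustration; the source does not describe the layout the benchmark problem uses, and Python floats stand in for bf16 here.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (0 <= lo <= 15 and 0 <= hi <= 15):
            raise ValueError("values must fit in 4 bits")
        out.append(lo | (hi << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: split every byte into two 4-bit values."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)  # low nibble
        values.append(byte >> 4)    # high nibble
    return values


def dequantize(packed, scale, zero_point=8):
    """Map 4-bit codes back to real weights: (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in unpack_int4(packed)]


def dot_w4a16(packed_weights, scale, activations):
    """Dot product of dequantized 4-bit weights with float activations."""
    weights = dequantize(packed_weights, scale)
    if len(weights) != len(activations):
        raise ValueError("length mismatch")
    return sum(w * a for w, a in zip(weights, activations))


packed = pack_int4([0, 15, 8, 9])
print(dequantize(packed, scale=0.5))
print(dot_w4a16(packed, 0.5, [1.0, 2.0, 3.0, 4.0]))
```

In a fused decode kernel the unpack and dequantize steps would typically happen on the fly inside the matrix multiply rather than as separate passes; the version above separates them only for readability.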
Snippet from the RSS feed
Open agentic GPU kernel benchmark results, repositories, transcripts, and datasets.
