KernelBench-Mega: Open Benchmark for Agentic GPU Whole-Block Megakernel Performance
KernelBench-Mega is an open benchmark for agentic GPU kernel generation. It tests whole-block megakernels, each of which fuses an entire model block into a single kernel. Performance is evaluated on GPUs such as the RTX PRO 6000 Blackwell, H100, and B200, using metrics such as decode speedup over an optimized PyTorch baseline (e.g., 19.35x) and tokens per second. The article specifically highlights Problem 02_kimi_linear_decode, a Kimi-Linear W4A16 hybrid decode task.
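The two metrics named above can be illustrated with a minimal sketch. It assumes that decode speedup is the ratio of baseline latency to megakernel latency and that tokens per second is the number of decoded tokens divided by wall-clock time. The benchmark's exact definitions are not given here, so the function names and the timings below are hypothetical placeholders, not measured results.

```python
def decode_speedup(baseline_s: float, kernel_s: float) -> float:
    """Speedup of a candidate kernel over a baseline.

    Assumed definition: baseline latency divided by kernel latency,
    so values above 1.0 mean the candidate kernel is faster.
    """
    if baseline_s <= 0 or kernel_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / kernel_s


def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: tokens generated per second of wall-clock time."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


# Hypothetical timings, for illustration only (not benchmark data).
baseline_latency_s = 0.010   # optimized PyTorch baseline, per decode step
kernel_latency_s = 0.002     # fused megakernel, per decode step

speedup = decode_speedup(baseline_latency_s, kernel_latency_s)
throughput = tokens_per_second(num_tokens=128, elapsed_s=0.256)

print(f"decode speedup: {speedup:.2f}x")
print(f"throughput: {throughput:.1f} tokens/s")
```

Under these assumptions a reported figure such as 19.35x would mean that the megakernel completes a decode step in roughly one nineteenth of the baseline's time.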