cuTile Rust: Extending Rust's Ownership Model to Safe GPU Kernel Programming

[Submitted on 14 Jun 2026]

Summary

This paper presents cuTile Rust, a tile-based system that extends Rust's ownership and memory-safety guarantees to GPU kernel programming. Kernels stay safe and idiomatic because mutable outputs are split into disjoint pieces and kernel launches preserve the host-side ownership contract; programmers can opt out locally when they need lower-level control. The host execution model is composable, spanning synchronous launches, asynchronous pipelines, and CUDA graph replay. On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise. Grout, an inference engine built on cuTile Rust, reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200 in batch-1 decode, competitive with vLLM and SGLang.
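
The idea of splitting a mutable output into disjoint pieces can be illustrated on the CPU with nothing but the Rust standard library. The sketch below is an analogy only: it does not use the cuTile Rust API, and the names `scale_in_disjoint_tiles`, `tile_len`, and `demo` are hypothetical. Each worker thread stands in for a GPU tile program and receives its own non-overlapping `&mut` slice, so the borrow checker rules out two workers writing the same element.

```rust
use std::thread;

/// Scale `out` in place, handing each worker thread its own disjoint tile.
/// CPU-only analogy: this function and `tile_len` are hypothetical names,
/// not part of the cuTile Rust API.
pub fn scale_in_disjoint_tiles(out: &mut [f32], tile_len: usize, factor: f32) {
    assert!(tile_len > 0, "tile_len must be non-zero");
    // `chunks_mut` yields non-overlapping `&mut [f32]` tiles, so the borrow
    // checker can prove that no two workers write the same element.
    thread::scope(|s| {
        for tile in out.chunks_mut(tile_len) {
            s.spawn(move || {
                for x in tile.iter_mut() {
                    *x *= factor;
                }
            });
        }
    });
    // The scope joins every worker before returning, so the caller's
    // exclusive borrow of `out` is whole again at this point.
}

/// Small driver: doubles the values 1..=8 using tiles of length 3.
pub fn demo() -> Vec<f32> {
    let mut data: Vec<f32> = (1..=8).map(|i| i as f32).collect();
    scale_in_disjoint_tiles(&mut data, 3, 2.0);
    data
}
```

Calling `demo()` from a `main` function returns the doubled vector. The scoped-thread join loosely mirrors the second half of the summary: the caller lends the buffer for the duration of the "launch" and gets exclusive access back only once all workers have finished.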

Source

cuTile Rust: Extending Rust's Ownership Model to Safe GPU Kernel Programming · arxiv.org

Key quotes

Rust has made safe systems programming practical on the CPU, but writing custom GPU kernels in Rust still forces programmers outside the language's ownership guarantees.

cuTile Rust extends Rust's ownership discipline to tile-based GPU kernels: mutable outputs are split into disjoint pieces, kernel launches preserve the host-side ownership contract, and programmers can opt out locally when they need lower-level control.

On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise.

In batch-1 decode, Grout reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200, competitive with vLLM and SGLang.
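
The throughput figures quoted above can be turned into two derived quantities with simple arithmetic: the cuBLAS GEMM rate implied by "96% of cuBLAS", and the average time per generated token implied by a decode rate. The helper names below are hypothetical, and because the quoted inputs are rounded, the results are only approximate: roughly 2.08 PFlop/s for cuBLAS, about 5.8 ms per token for Qwen3-4B, and about 12.2 ms per token for Qwen3-32B.

```rust
/// Reference throughput implied by an achieved figure and the fraction of
/// the reference it represents (e.g. 2 PFlop/s at 96% of cuBLAS).
/// Hypothetical helper for illustration; inputs in the text are rounded.
pub fn implied_reference(achieved: f64, fraction_of_reference: f64) -> f64 {
    achieved / fraction_of_reference
}

/// Average milliseconds per generated token at a given decode rate.
pub fn ms_per_token(tokens_per_second: f64) -> f64 {
    1000.0 / tokens_per_second
}
```

For example, `implied_reference(2.0, 0.96)` gives the cuBLAS estimate in PFlop/s, while `ms_per_token(171.0)` and `ms_per_token(82.0)` give the per-token latencies for the two Grout results.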

Snippet from the RSS feed

Rust has made safe systems programming practical on the CPU, but writing custom GPU kernels in Rust still forces programmers outside the language's ownership guarantees. We present cuTile Rust, a tile-based system for safe, idiomatic GPU kernel authoring

You might also wanna read

cuTile Rust: Extending Rust's Ownership Model to Safe GPU Kernel Programming

This paper (arXiv:2606.15991) presents cuTile Rust, a system that extends Rust's ownership and borrowing guarantees to GPU kernel authoring.

hgpu.org·12d ago

cuTile Rust: A Safe, Tile-Based GPU Kernel Programming DSL for Rust

cuTile Rust (cutile-rs) is a new tile-based GPU kernel programming DSL for Rust that extends Rust's ownership and borrowing rules across the

github.com·12d ago

Cuq Framework: Formal Verification of Rust GPU Kernels Targeting PTX Architecture

Cuq is a research framework that provides the first formal semantics and verified translation for Rust GPU kernels targeting NVIDIA's PTX ar

github.com·8mo ago

VectorWare Enables Rust Async/Await Programming on GPUs

VectorWare announces a breakthrough in GPU programming by enabling Rust's async/await and Future trait on GPUs. This represents a significan

vectorware.com·4mo ago

RustGPT: Complete Transformer-Based LLM Implementation in Pure Rust

RustGPT is a complete Large Language Model implementation built entirely in Rust without external ML frameworks. The project demonstrates bu

github.com·9mo ago

GPU-Optimized Datalog Evaluation: GPULOG System Analysis from ASPLOS'25 Paper

This article analyzes the ASPLOS'25 paper 'Optimizing Datalog for the GPU,' which presents GPULOG, a system that optimizes Datalog evaluatio

danglingpointers.substack.com·7mo ago
