cuTile Rust: Extending Rust's Ownership Model to Safe GPU Kernel Programming
[Submitted on 14 Jun 2026]
Summary
This article presents cuTile Rust, a system that extends Rust's ownership and memory-safety guarantees to tile-based GPU kernel programming. Kernels stay within Rust's ownership discipline: mutable outputs are split into disjoint pieces, kernel launches preserve the host-side ownership contract, and programmers can opt out locally when they need lower-level control. The host execution model is composable, spanning synchronous launches, asynchronous pipelines, and CUDA graph replay. On the NVIDIA B200 GPU, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise. Grout, an inference engine built on cuTile Rust, reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200 in batch-1 decode, which is competitive with vLLM and SGLang.
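The idea of splitting a mutable output into disjoint pieces can be illustrated on the CPU with standard Rust alone. The sketch below is an analogy, not cuTile Rust code: the function name `add_tiled` and the use of scoped threads as stand-ins for GPU tile blocks are assumptions made for illustration, and nothing here reflects the actual cuTile Rust API.

```rust
use std::thread;

/// Hypothetical CPU stand-in for a tiled element-wise kernel (not the cuTile Rust API).
/// `out` is split into non-overlapping tiles with `chunks_mut`, so each worker
/// holds the only mutable reference to its own piece of the output.
fn add_tiled(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be positive");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    // Scoped threads must finish before `scope` returns, so the borrows of
    // `a`, `b`, and `out` end there and the caller regains full ownership.
    thread::scope(|s| {
        let tiles = out
            .chunks_mut(tile)
            .zip(a.chunks(tile))
            .zip(b.chunks(tile));
        for ((o, x), y) in tiles {
            // Each closure takes one disjoint `&mut [f32]` tile plus shared inputs.
            s.spawn(move || {
                for i in 0..o.len() {
                    o[i] = x[i] + y[i];
                }
            });
        }
    });
}
```

Calling `add_tiled(&a, &b, &mut out, 4)` on three equal-length slices fills `out` with the element-wise sum. Because the borrow checker can see that the tiles never overlap, no `unsafe` is needed, and `out` is readable again as soon as the call returns. This is the CPU counterpart of a launch that preserves the host-side ownership contract.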
Key quotes
Rust has made safe systems programming practical on the CPU, but writing custom GPU kernels in Rust still forces programmers outside the language's ownership guarantees.
cuTile Rust extends Rust's ownership discipline to tile-based GPU kernels: mutable outputs are split into disjoint pieces, kernel launches preserve the host-side ownership contract, and programmers can opt out locally when they need lower-level control.
On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise.
In batch-1 decode, Grout reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200, competitive with vLLM and SGLang.