cuTile Rust: Extending Rust's Ownership Model to Safe GPU Kernel Programming

[Submitted on 14 Jun 2026]

Summary

The paper presents cuTile Rust, a tile-based system that extends Rust's ownership and memory-safety guarantees to GPU kernel programming. Kernels stay safe and idiomatic because mutable outputs are split into disjoint pieces and kernel launches preserve the host-side ownership contract; programmers can opt out locally when they need lower-level control. The host execution model is composable, spanning synchronous launches, asynchronous pipelines, and CUDA graph replay. On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise. Grout, an inference engine built on cuTile Rust, reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200 in batch-1 decode, competitive with vLLM and SGLang.
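The idea of splitting a mutable output into disjoint pieces has a direct analogue in standard CPU Rust, which may help make the summary concrete. The sketch below is not the cuTile Rust API; it uses only the standard library (`chunks_mut` and `std::thread::scope`), and the names `add_tile` and `tiled_add` are hypothetical. Each thread stands in for one tile of a kernel launch, and the borrow checker accepts the code only because the output tiles cannot overlap.

```rust
use std::thread;

/// Hypothetical CPU stand-in for a tile kernel: writes `a + b` into one output tile.
fn add_tile(out: &mut [f32], a: &[f32], b: &[f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Splits `out` into disjoint tiles of `tile` elements and fills each on its own thread.
fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` pieces, so the compiler
        // can prove that no two workers ever write the same element.
        for ((o, x), y) in out.chunks_mut(tile).zip(a.chunks(tile)).zip(b.chunks(tile)) {
            s.spawn(move || add_tile(o, x, y));
        }
    });
    // The scope joins every worker before returning, so the caller's exclusive
    // borrow of `out` is intact again here.
}

fn main() {
    let a = vec![1.0_f32; 10];
    let b = vec![2.0_f32; 10];
    let mut out = vec![0.0_f32; 10];
    tiled_add(&a, &b, &mut out, 4);
    assert!(out.iter().all(|&v| v == 3.0));
    println!("{:?}", out);
}
```

The scoped-thread join plays a role loosely comparable to a launch that preserves the host-side ownership contract: the caller lends `out` mutably for the duration of the work and regains full ownership once it completes.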

Source

cuTile Rust: Extending Rust's Ownership Model to Safe GPU Kernel Programming (arxiv.org)

Key quotes

Rust has made safe systems programming practical on the CPU, but writing custom GPU kernels in Rust still forces programmers outside the language's ownership guarantees.
cuTile Rust extends Rust's ownership discipline to tile-based GPU kernels: mutable outputs are split into disjoint pieces, kernel launches preserve the host-side ownership contract, and programmers can opt out locally when they need lower-level control.
On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS), matching cuTile Python within measurement noise.
In batch-1 decode, Grout reaches 171 generated tokens/s for Qwen3-4B on the NVIDIA GeForce RTX 5090 and 82 generated tokens/s for Qwen3-32B on the B200, competitive with vLLM and SGLang.