TileMaxSim: IO-Aware GPU Kernels Achieve 80% HBM Bandwidth for Multi-Vector Retrieval Scoring
[Submitted on 24 Jun 2026]
Summary
This paper presents TileMaxSim, a family of IO-aware GPU kernels that accelerate MaxSim scoring in multi-vector retrieval models such as ColBERT. The authors observe that naive GPU implementations reach only 5-18% of peak HBM bandwidth because they materialize the full Nq x Nd similarity matrix, which is read once and then discarded. TileMaxSim closes this gap with three techniques: multi-query SRAM tiling, dimension tiling for embeddings larger than 128 dimensions, and fused product-quantization scoring via shared-memory lookup tables, which cuts HBM I/O by up to ~31x. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring and 6.5x over fused PyTorch. Rankings match reference MaxSim on MS MARCO and three BEIR benchmarks, so retrieval quality is preserved exactly. As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms, which the authors report as 98% lower end-to-end latency.
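To make the memory-traffic argument concrete, here is a minimal CPU sketch in plain Python. It is not the paper's kernel. It only contrasts a MaxSim score that materializes the whole Nq x Nd similarity matrix with one that streams document tokens in tiles and keeps a single running maximum per query token. The function names, the tile size, and the random test data are all assumptions made for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(Q, D):
    """MaxSim with the full Nq x Nd similarity matrix held in memory."""
    S = [[dot(q, d) for d in D] for q in Q]  # Nq x Nd intermediate
    return sum(max(row) for row in S)


def maxsim_tiled(Q, D, tile=4):
    """MaxSim that visits document tokens tile by tile.

    Only one running maximum per query token is kept, so the
    Nq x Nd matrix is never stored.
    """
    running = [float("-inf")] * len(Q)
    for start in range(0, len(D), tile):
        block = D[start:start + tile]  # stands in for a tile staged in fast memory
        for i, q in enumerate(Q):
            for d in block:
                s = dot(q, d)
                if s > running[i]:
                    running[i] = s
    return sum(running)


# Hypothetical toy data: 5 query tokens, 13 document tokens, 8 dimensions.
random.seed(0)
Q = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
D = [[random.gauss(0, 1) for _ in range(8)] for _ in range(13)]

naive_score = maxsim_naive(Q, D)
tiled_score = maxsim_tiled(Q, D, tile=4)
```

Both functions return the same score. The difference is the size of the intermediate state: Nq x Nd values in the first case, Nq values in the second. On a GPU, that is the distinction between writing the matrix to HBM and keeping the reduction on chip.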
Key quotes
naive implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, wasting memory traffic on data that is consumed once and discarded
TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring
As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency)
TileMaxSim preserves exact retrieval quality: on MS MARCO and three BEIR benchmarks, rankings match reference MaxSim
fused product-quantization scoring via shared-memory lookup tables, cutting HBM I/O by up to ~31x
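The last quote mentions product-quantization scoring through lookup tables. The sketch below shows the general lookup-table idea in plain Python, under stated assumptions; it is not the paper's fused kernel. Each document token is stored as a short list of codebook indices, a table of sub-vector inner products is built once per query token, and each token score becomes a sum of table lookups. The codebook sizes, the names `build_lut`, `pq_score`, and `maxsim_pq`, and the random data are all hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_lut(q, codebooks):
    """lut[m][k] = <m-th sub-vector of q, k-th centroid of sub-space m>."""
    sub = len(codebooks[0][0])  # dimensions per sub-space
    return [
        [dot(q[m * sub:(m + 1) * sub], centroid) for centroid in codebook]
        for m, codebook in enumerate(codebooks)
    ]


def pq_score(lut, codes):
    """Approximate inner product: one table lookup per sub-space."""
    return sum(lut[m][k] for m, k in enumerate(codes))


def decode(codes, codebooks):
    """Reconstruct the quantized vector (used here only for checking)."""
    out = []
    for m, k in enumerate(codes):
        out.extend(codebooks[m][k])
    return out


def maxsim_pq(Q, doc_codes, codebooks):
    """MaxSim over PQ-encoded document tokens, never decoding them."""
    total = 0.0
    for q in Q:
        lut = build_lut(q, codebooks)  # small table, reused for every token
        total += max(pq_score(lut, codes) for codes in doc_codes)
    return total


# Hypothetical setup: 8 dimensions split into 4 sub-spaces of 2 dims,
# 16 centroids per sub-space, 3 query tokens, 10 document tokens.
random.seed(1)
M, K, SUB = 4, 16, 2
codebooks = [[[random.gauss(0, 1) for _ in range(SUB)] for _ in range(K)]
             for _ in range(M)]
Q = [[random.gauss(0, 1) for _ in range(M * SUB)] for _ in range(3)]
doc_codes = [[random.randrange(K) for _ in range(M)] for _ in range(10)]

lut_score = maxsim_pq(Q, doc_codes, codebooks)
decoded_score = sum(
    max(dot(q, decode(codes, codebooks)) for codes in doc_codes) for q in Q
)
```

The lookup-based score equals the score computed on the decoded vectors, but the scoring loop reads only the integer codes. Codes are much smaller than the full embeddings, which is the general reason this style of scoring reduces memory traffic. The sketch does not attempt to reproduce the ~31x figure.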