TileMaxSim: IO-Aware GPU Kernels Achieve 80% HBM Bandwidth for Multi-Vector Retrieval Scoring

[Submitted on 24 Jun 2026]

Summary

This paper presents TileMaxSim, a family of IO-aware GPU kernels that accelerate MaxSim scoring in multi-vector retrieval models such as ColBERT. The authors find that naive GPU implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, which is consumed once and then discarded. TileMaxSim closes this gap with three techniques: multi-query SRAM tiling, dimension tiling for embeddings exceeding 128 dimensions, and fused product-quantization scoring via shared-memory lookup tables. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring and 6.5x over fused PyTorch. Its rankings match reference MaxSim on MS MARCO and three BEIR benchmarks, and as a drop-in replacement in ColBERTv2/PLAID it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency).
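To make the materialization issue concrete, here is a minimal pure-Python sketch of MaxSim computed two ways: once by building the full Nq x Nd similarity matrix, and once by streaming document tokens in tiles while keeping only a running maximum per query token. This is not the paper's kernel; the function names, the `tile` parameter, and the list-of-lists layout are assumptions made for illustration, and the sketch shows only the streaming reduction, not SRAM placement or multi-query batching.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_reference(query, doc):
    """Build the full Nq x Nd similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in doc] for q in query]
    # For each query token take the best-matching document token, then sum.
    return sum(max(row) for row in sim)


def maxsim_tiled(query, doc, tile=4):
    """Stream document tokens tile by tile (hypothetical tile size).

    Only one running maximum per query token is kept, so the Nq x Nd
    matrix is never stored.
    """
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > running_max[i]:
                    running_max[i] = s
    return sum(running_max)


if __name__ == "__main__":
    random.seed(0)
    query = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
    doc = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]
    print(maxsim_reference(query, doc), maxsim_tiled(query, doc, tile=4))
```

Because the maximum is an exact reduction, both functions return the same score for any tile size. That property is what allows a tiled kernel to leave rankings unchanged.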

Source

TileMaxSim: IO-Aware GPU Kernels Achieve 80% HBM Bandwidth for Multi-Vector Retrieval Scoring (arxiv.org)

Key quotes

naive implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, wasting memory traffic on data that is consumed once and discarded

TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring

As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency)

TileMaxSim preserves exact retrieval quality: on MS MARCO and three BEIR benchmarks, rankings match reference MaxSim

fused product-quantization scoring via shared-memory lookup tables, cutting HBM I/O by up to ~31x
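The lookup-table idea in the quote above can be sketched with standard product-quantization scoring. For each query token, one table per subspace is precomputed, holding the inner product of the query sub-vector with every centroid. A compressed document token is then scored by summing one table entry per subspace instead of reading its full embedding. The code below is a pure-Python illustration under assumed names (`build_lut`, `maxsim_pq`, the `codebooks` layout); it does not reproduce the paper's kernel or its shared-memory layout.

```python
def build_lut(q, codebooks):
    """table[m][k] = inner product of query sub-vector m with centroid k.

    `codebooks[m]` is the list of centroids for subspace m; all subspaces
    are assumed to have the same width.
    """
    sub = len(q) // len(codebooks)
    return [
        [sum(a * b for a, b in zip(q[m * sub:(m + 1) * sub], c)) for c in cb]
        for m, cb in enumerate(codebooks)
    ]


def maxsim_pq(query, doc_codes, codebooks):
    """MaxSim over PQ-encoded document tokens.

    Each entry of `doc_codes` is a tuple of centroid indices, one per subspace.
    """
    total = 0.0
    for q in query:
        lut = build_lut(q, codebooks)
        # Score each compressed token with one table lookup per subspace.
        total += max(
            sum(lut[m][k] for m, k in enumerate(codes)) for codes in doc_codes
        )
    return total


if __name__ == "__main__":
    codebooks = [
        [[1, 0], [0, 1]],  # subspace 0 centroids
        [[1, 1], [2, 0]],  # subspace 1 centroids
    ]
    query = [[1, 2, 3, 4], [0, 1, 0, 1]]
    doc_codes = [(0, 1), (1, 0)]  # decodes to [1,0,2,0] and [0,1,1,1]
    print(maxsim_pq(query, doc_codes, codebooks))
```

In this sketch the tables depend only on the query, so they are built once per query token and reused for every candidate document. Only the compact codes need to be read for each document.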

Snippet from the RSS feed

Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs
