TileMaxSim: IO-Aware GPU Kernels Achieve 80% HBM Bandwidth for Multi-Vector Retrieval Scoring

[Submitted on 24 Jun 2026]

Summary

This paper presents TileMaxSim, a family of IO-aware GPU kernels that accelerate MaxSim scoring in multi-vector retrieval models such as ColBERT. The authors find that naive GPU implementations reach only 5-18% of peak HBM bandwidth because they materialize the full Nq x Nd similarity matrix, which is consumed once and then discarded. TileMaxSim closes this gap with three techniques: multi-query SRAM tiling, dimension tiling for embeddings wider than 128 dimensions, and fused product-quantization scoring via shared-memory lookup tables, which cuts HBM I/O by up to ~31x. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring and 6.5x over fused PyTorch. Rankings match reference MaxSim on MS MARCO and three BEIR benchmarks, so retrieval quality is preserved exactly. As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms, which the authors report as 98% lower end-to-end latency.
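
To make the core idea concrete, here is a minimal CPU sketch in plain Python of MaxSim (for each query token, take the maximum dot product over document tokens, then sum over query tokens). It compares a version that materializes the full Nq x Nd similarity matrix with a version that streams document tokens in tiles and keeps only a running maximum per query token. The function names, the `tile` parameter, and the toy vectors are illustrative assumptions; this is not the authors' CUDA kernel, only the reduction pattern that tiling relies on.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    """Materialize the full Nq x Nd similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query, doc, tile=2):
    """Stream document tokens in tiles, keeping one running max per query token.

    Only `tile` document tokens are live at a time, so the Nq x Nd matrix
    is never stored. `tile` is a hypothetical knob for illustration.
    """
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s
    return sum(best)


# Toy data: 2 query tokens and 3 document tokens, each of dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

naive_score = maxsim_naive(query, doc)
tiled_score = maxsim_tiled(query, doc, tile=2)
print(naive_score, tiled_score)
```

Because the maximum is associative, the tiled version returns the same score as the materialized one for any tile size, which is consistent with the summary's claim that rankings match reference MaxSim.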

Source

TileMaxSim: IO-Aware GPU Kernels Achieve 80% HBM Bandwidth for Multi-Vector Retrieval Scoring (arxiv.org)

Key quotes

naive implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, wasting memory traffic on data that is consumed once and discarded
TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring
As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency)
TileMaxSim preserves exact retrieval quality: on MS MARCO and three BEIR benchmarks, rankings match reference MaxSim
fused product-quantization scoring via shared-memory lookup tables, cutting HBM I/O by up to ~31x
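
The last quote refers to scoring product-quantized documents through lookup tables. As a rough sketch of that general technique (not the paper's fused kernel), the plain-Python example below precomputes, for each query token, a table of dot products between its sub-vectors and every codebook centroid, then scores each document token by summing table entries selected by its codes. The codebook layout, function names, and toy values are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_lut(q, codebooks):
    """lut[m][k] = dot(sub-vector m of q, centroid k of subspace m)."""
    m_count = len(codebooks)
    sub = len(q) // m_count  # assumes len(q) is divisible by the subspace count
    return [
        [dot(q[m * sub:(m + 1) * sub], c) for c in codebooks[m]]
        for m in range(m_count)
    ]


def maxsim_pq(query, doc_codes, codebooks):
    """MaxSim over PQ-coded document tokens using per-query lookup tables."""
    total = 0.0
    for q in query:
        lut = build_lut(q, codebooks)
        # A document token's score is a sum of table lookups, one per subspace.
        total += max(
            sum(lut[m][k] for m, k in enumerate(codes)) for codes in doc_codes
        )
    return total


# Hypothetical toy setup: 2 subspaces of dimension 1, 2 centroids each.
codebooks = [[[0.0], [1.0]], [[0.0], [2.0]]]
doc_codes = [(1, 0), (0, 1)]  # decodes to [1.0, 0.0] and [0.0, 2.0]
query = [[1.0, 0.0], [0.0, 1.0]]

pq_score = maxsim_pq(query, doc_codes, codebooks)
print(pq_score)
```

In this scheme the document side is read as small integer codes rather than full-precision vectors, which is the general reason a lookup-table approach can reduce memory traffic; the ~31x figure is the paper's own measurement and is not reproduced by this sketch.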

Snippet from the RSS feed

Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs
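
As a back-of-envelope illustration of why materializing the similarity matrix costs memory traffic, the sketch below counts bytes moved for a single query-document pair with and without writing and re-reading the Nq x Nd matrix. The sizes (32 query tokens, 128 document tokens, 128 dimensions, 2-byte embeddings, 4-byte similarities) are assumed values chosen for illustration and are not taken from the paper's roofline analysis.

```python
def traffic_bytes(nq, nd, dim, elem_bytes=2, sim_bytes=4, materialize=True):
    """Rough count of bytes moved to and from memory for one MaxSim score.

    All sizes are hypothetical inputs; caching effects are ignored.
    """
    # Compulsory reads: query and document token embeddings.
    base = (nq + nd) * dim * elem_bytes
    # Materializing writes the Nq x Nd similarity matrix, then reads it back.
    extra = 2 * nq * nd * sim_bytes if materialize else 0
    return base + extra


naive_bytes = traffic_bytes(32, 128, 128, materialize=True)
fused_bytes = traffic_bytes(32, 128, 128, materialize=False)
print(naive_bytes, fused_bytes, naive_bytes / fused_bytes)
```

Under these assumed sizes the materialized path moves noticeably more data than the fused one for the same arithmetic, which is the kind of gap a roofline analysis makes visible for a bandwidth-bound kernel.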
