
Building high-performance expert-parallel dispatch and combine kernels for MoE LLM inference

By kkm

2d ago · 48 min read · en · Insight

Summary

This article is a technical deep dive into the architecture and implementation of high-performance Expert Parallelism (EP) kernels for Mixture-of-Experts (MoE) large language models. It explains the challenges of GPU-to-GPU communication in distributed LLM inference, focusing on the dispatch and combine kernels used in expert-parallel systems. It builds both the high-throughput and the low-latency kernel designs from scratch, covering GPU memory management, communication patterns, and optimization techniques for scaling MoE inference across multiple GPUs.

Key quotes

Large language models are large. Because they're large, we need lots of GPUs to run them.
To use lots of GPUs on LLM inference, we need to get those GPUs talking to one another.
All have their place. But for MoE models, in the MoE layers, when you want to serve at large scale, 'wide Expert'
Snippet from the RSS feed
How expert-parallel dispatch and combine kernels work, built up from scratch: the high-throughput shape and the low-latency one.
