MegaTrain: System for Training 100B+ Parameter LLMs on Single GPU Using CPU Memory
By
chrsw
A bagel you'd recommend to a friend without hedging.
Summary
MegaTrain is a memory-centric system that enables training of 100B+ parameter large language models at full precision on a single GPU by storing parameters and optimizer states in host (CPU) memory and treating GPUs as transient compute engines. The system uses pipelined double-buffered execution to overlap parameter prefetching, computation, and gradient offloading, and replaces persistent autograd graphs with stateless layer templates. On a single H200 GPU with 1.5TB host memory, it can train models up to 120B parameters and achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading for 14B models.
Key quotes
· 4 pulledMegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines
To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations: 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams
On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters
MegaTrain also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models
You might also wanna read
Guide to Calculating GPU Memory for Self-Hosted LLM Inference
The article provides a guide on calculating GPU memory requirements and managing concurrent requests for self-hosted large language model (L
RTP-LLM: Alibaba's High-Performance Inference Engine for Large Language Model Deployment
This paper presents RTP-LLM, a high-performance inference engine developed by Alibaba for industrial-scale deployment of Large Language Mode
Monostate: All-in-One AI Training Platform for Fine-Tuning LLMs
Monostate is an all-in-one AI training platform that enables users to fine-tune large language models (LLMs) with their own data using vario
