MegaTrain: System for Training 100B+ Parameter LLMs on Single GPU Using CPU Memory

By chrsw · 1mo ago · 2 min read

Summary

MegaTrain is a memory-centric system that enables training of 100B+ parameter large language models at full precision on a single GPU by storing parameters and optimizer states in host (CPU) memory and treating GPUs as transient compute engines. The system uses pipelined double-buffered execution to overlap parameter prefetching, computation, and gradient offloading, and replaces persistent autograd graphs with stateless layer templates. On a single H200 GPU with 1.5TB host memory, it can train models up to 120B parameters and achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading for 14B models.
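The double-buffering idea in the summary can be illustrated with a toy sketch. This is not MegaTrain's code: plain Python lists stand in for host and GPU memory, a worker thread stands in for a CUDA copy stream, and the names `prefetch` and `forward_double_buffered` are hypothetical. It covers only the overlap of prefetching with computation, not gradient offloading or the stateless layer templates.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch(host_layers, index, buffers):
    """Copy layer `index` from 'host' storage into buffer slot index % 2."""
    # Stands in for an asynchronous host-to-device copy (assumption).
    buffers[index % 2] = list(host_layers[index])
    return index


def forward_double_buffered(host_layers, x):
    """Run a toy forward pass while the next layer loads in the background."""
    buffers = [None, None]  # two 'device' slots, reused in alternation
    schedule = []           # records (layer index, slot used)
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(prefetch, host_layers, 0, buffers)
        for i in range(len(host_layers)):
            pending.result()  # wait until layer i is resident
            if i + 1 < len(host_layers):
                # Start loading layer i+1 into the other slot
                # before computing layer i.
                pending = copier.submit(prefetch, host_layers, i + 1, buffers)
            weights = buffers[i % 2]
            x = x + sum(weights)  # toy 'compute' for layer i
            schedule.append((i, i % 2))
    return x, schedule


if __name__ == "__main__":
    layers = [[1, 2], [3], [4, 5]]
    print(forward_double_buffered(layers, 0))
```

Only two layer-sized slots are ever resident, no matter how many layers the model has. A slot is overwritten only after the layer that used it has finished computing, because the copy into it is submitted one iteration later.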

Key quotes (4 pulled)

MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines
To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations: 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams
On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters
MegaTrain also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models

Snippet from the RSS feed

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU mem
