MegaTrain: System for Training 100B+ Parameter LLMs on Single GPU Using CPU Memory

chrsw

1mo ago· 2 min readenInsight

75/100

Toasty

Bagelometer↗

A bagel you'd recommend to a friend without hedging.

Score75TypeanalysisSentimentpositive

Summary

MegaTrain is a memory-centric system that enables training of 100B+ parameter large language models at full precision on a single GPU by storing parameters and optimizer states in host (CPU) memory and treating GPUs as transient compute engines. The system uses pipelined double-buffered execution to overlap parameter prefetching, computation, and gradient offloading, and replaces persistent autograd graphs with stateless layer templates. On a single H200 GPU with 1.5TB host memory, it can train models up to 120B parameters and achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading for 14B models.

Key quotes

· 4 pulled

MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines

To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations: 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams

On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters

MegaTrain also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models

Snippet from the RSS feed

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU mem

You might also wanna read

Guide to Calculating GPU Memory for Self-Hosted LLM Inference

The article provides a guide on calculating GPU memory requirements and managing concurrent requests for self-hosted large language model (L

Product Hunt·9mo ago

RTP-LLM: Alibaba's High-Performance Inference Engine for Large Language Model Deployment

This paper presents RTP-LLM, a high-performance inference engine developed by Alibaba for industrial-scale deployment of Large Language Mode

arxiv.org·1d ago

Monostate: All-in-One AI Training Platform for Fine-Tuning LLMs

Monostate is an all-in-one AI training platform that enables users to fine-tune large language models (LLMs) with their own data using vario

Product Hunt·2mo ago