All Topics
All Topics
Technology
Technology
Design
Design
Programming
Programming
Science
Science
News
News
Gaming
Gaming
Entertainment
Entertainment
Business
Business
Finance
Finance
Sports
Sports
Health
Health
Food
Food
Travel
Travel
Art
Art
Music
Music
Books
Books
Education
Education
Politics
Politics
Personal
Personal
Bluesky
Twitter
No algorithm. No AI slop. No ads. Just RSS. Pro-human. Indie writers. Real journalism. Open web. Chronological. Hand toasted.

Azure achieves record MLPerf training time for Llama 405B using 8,192-GPU cluster

By

azinheidarshenas

11h ago· 5 min readenInsight

Summary

Azure achieved the most performant MLPerf Training v6.0 result for Llama 3.1 405B, training the model in just over seven minutes using a massive cluster of 2,048 NVIDIA GB200 NVL72 nodes (8,192 GPUs) across 128 racks. The article details how Azure's Fairwater AI supercomputing infrastructure, combining NVIDIA NVLink for intra-rack communication and Azure's MRC scale-out networking fabric for inter-rack communication, enabled near-perfect weak scaling efficiency of 99.8% with minimal step-time variance. The key architectural ingredients include high operational efficiency of NVLink scale-up domains, resilient MRC scale-out networking, and topology-aware workload mapping that aligns parallelism strategies with network structure.

Key quotes

· 5 pulled
Azure achieved the most performant MLPerf Training v6.0 result to date for Llama 3.1 405B, with a time-to-train of just over seven minutes according to MLCommons.
The key insight is that not all communication is equally latency-sensitive. Some communication must be completed before computation can continue, while other communication can overlap with compute.
Step time remains nearly identical at scale, measuring 1.2734 seconds at 112 GB200 NVL72 racks (7,168 GPUs) and 1.2712 seconds at 128 racks (8,192 GPUs) — a difference of just 2 ms.
This corresponds to a near-perfect 99.8% weak scaling efficiency as we expanded the cluster by an additional 1,024 GPUs.
Any network instability, congestion, or synchronization jitter would reduce this overlap and expose communication on the critical path, directly increasing overall step time.
Snippet from the RSS feed
Azure achieved the most performant MLPerf Training v6.0 result to date for Llama 3.1 405B, with a time-to-train of just over seven minutes according to MLCommons. This loadbearing benchmark measures how communication overhead and system stability dominate

You might also wanna read