Virtualizing NVIDIA HGX B200 GPUs with Open-Source Software
By ben_s
Summary
This blog post details how to virtualize NVIDIA HGX B200 GPUs using open-source software. It covers a hardware overview, Fabric Manager configuration for Shared NVSwitch Multitenancy Mode, VFIO passthrough, QEMU PCI topology fixes, and fixes for boot stalls caused by large BARs, with practical guidance on setting up GPU partitions for virtualized environments.
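The VFIO passthrough step in the summary above generally comes down to detaching each GPU from its host driver and handing it to `vfio-pci` through sysfs. The sketch below is a minimal illustration, not the post's own procedure: it only builds the list of sysfs writes for one device, using the standard Linux `driver_override`, `unbind`, and `drivers_probe` files. The function name `vfio_bind_plan` and the example PCI address `0000:17:00.0` are hypothetical, and the sketch assumes the `vfio-pci` module is already loaded.

```python
from pathlib import PurePosixPath

SYSFS_PCI = PurePosixPath("/sys/bus/pci")


def vfio_bind_plan(bdf: str, currently_bound: bool = True) -> list[tuple[str, str]]:
    """Return the (sysfs path, value) writes that rebind one PCI device to vfio-pci.

    Nothing is written here; the caller decides whether to apply the plan as root.
    `bdf` is the device address in domain:bus:device.function form.
    """
    dev = SYSFS_PCI / "devices" / bdf
    # 1. Tell the kernel that only vfio-pci may claim this device.
    plan = [(str(dev / "driver_override"), "vfio-pci")]
    # 2. Detach the device from whatever driver currently owns it, if any.
    if currently_bound:
        plan.append((str(dev / "driver" / "unbind"), bdf))
    # 3. Ask the PCI core to re-probe, which now selects vfio-pci.
    plan.append((str(SYSFS_PCI / "drivers_probe"), bdf))
    return plan


if __name__ == "__main__":
    # Hypothetical address; print the plan as shell-style writes for review.
    for path, value in vfio_bind_plan("0000:17:00.0"):
        print(f"echo {value} > {path}")
```

Returning a plan instead of writing to sysfs directly keeps the sketch safe to run on any machine and makes the sequence easy to review before applying it.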
Key quotes
FABRIC_MODE=1 - Shared NVSwitch multitenancy
With FABRIC_MODE=1, Fabric Manager starts in Shared NVSwitch Multitenancy Mode and exposes an API for activating and deactivating GPU partitions.
This blog post covers how we virtualized NVIDIA HGX B200 GPUs using open-source software.
It talks about VFIO passthrough, QEMU PCI topology fixes, large BAR boot stalls, and Fabric Manager partitions.
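As a rough illustration of the `FABRIC_MODE=1` setting quoted above, the sketch below rewrites that key in the text of a Fabric Manager configuration file. It assumes the file uses plain `KEY=VALUE` lines; the helper name `set_fabric_mode` and the sample text are hypothetical, and the partition activation API itself is not shown.

```python
import re

# Matches an uncommented FABRIC_MODE line; [ \t]* avoids swallowing blank lines.
_FABRIC_MODE_RE = re.compile(r"^[ \t]*FABRIC_MODE[ \t]*=.*$", re.MULTILINE)


def set_fabric_mode(cfg_text: str, mode: int = 1) -> str:
    """Return cfg_text with FABRIC_MODE set to `mode`.

    An existing uncommented FABRIC_MODE line is replaced in place;
    otherwise the setting is appended as a new final line.
    """
    line = f"FABRIC_MODE={mode}"
    if _FABRIC_MODE_RE.search(cfg_text):
        return _FABRIC_MODE_RE.sub(line, cfg_text, count=1)
    # Add a separating newline only if the text does not already end with one.
    sep = "" if (not cfg_text or cfg_text.endswith("\n")) else "\n"
    return f"{cfg_text}{sep}{line}\n"


if __name__ == "__main__":
    sample = "LOG_LEVEL=4\nFABRIC_MODE=0\n"  # illustrative content only
    print(set_fabric_mode(sample), end="")
```

Working on the file's text instead of a fixed path keeps the example independent of where a given installation stores its configuration. Fabric Manager would need to be restarted for a changed mode to take effect.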
