
ntransformer: C++/CUDA LLM Inference Engine Enables Running Llama 70B on RTX 3090

By xaskasdf

3mo ago · 8 min read · en · Code

Summary

ntransformer is a C++/CUDA LLM inference engine that runs large language models such as Llama 70B on a single consumer GPU, the RTX 3090 (24GB VRAM). Rather than requiring the whole model to fit in VRAM, it streams model layers through GPU memory over PCIe, with optional NVMe direct I/O that bypasses the CPU entirely. In the reported results, Llama 3.1 8B Q8_0 runs at 48.9 tokens/second with 10.0 GB in use and all layers resident in VRAM. Llama 3.1 70B Q6_K runs at 0.2 tokens/second with 23.1 GB in use in the automatic tiered mode, which in that run placed 26 layers in VRAM, 54 in RAM, and none on NVMe.
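To make the tiered idea concrete, here is a minimal sketch of how layers could be assigned to VRAM, RAM, and NVMe. It is not taken from ntransformer: the names `Tier`, `TierPlan`, and `plan_tiers` are hypothetical, the greedy fill order is an assumption, and every layer is assumed to have the same size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tier labels for illustration; not ntransformer's actual types.
enum class Tier { VRAM, RAM, NVMe };

struct TierPlan {
    std::vector<Tier> layer_tier;  // one entry per transformer layer
    std::size_t in_vram = 0;
    std::size_t in_ram = 0;
    std::size_t on_nvme = 0;
};

// Greedy placement (an assumed policy): fill VRAM first, then host RAM,
// and leave whatever does not fit on NVMe. All layers are assumed to be
// the same size, which is a simplification.
TierPlan plan_tiers(std::size_t n_layers, std::uint64_t layer_bytes,
                    std::uint64_t vram_budget, std::uint64_t ram_budget) {
    TierPlan plan;
    plan.layer_tier.reserve(n_layers);
    for (std::size_t i = 0; i < n_layers; ++i) {
        if (vram_budget >= layer_bytes) {
            vram_budget -= layer_bytes;
            plan.layer_tier.push_back(Tier::VRAM);
            ++plan.in_vram;
        } else if (ram_budget >= layer_bytes) {
            ram_budget -= layer_bytes;
            plan.layer_tier.push_back(Tier::RAM);
            ++plan.in_ram;
        } else {
            plan.layer_tier.push_back(Tier::NVMe);
            ++plan.on_nvme;
        }
    }
    return plan;
}
```

With made-up sizes, `plan_tiers(80, 10, 265, 540)` yields a 26 / 54 / 0 split, the same shape as the 70B tiered result quoted below. The byte counts there are illustrative only and say nothing about the real layer sizes.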

Key quotes

High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
Llama 3.1 8B Q8_0 Resident: 48.9 tok/s, 10.0 GB - All layers in VRAM
Llama 3.1 70B Q6_K Tiered (auto): 0.2 tok/s, 23.1 GB - 26 VRAM + 54 RAM + 0 NVMe
High-efficiency LLM inference engine in C++/CUDA. Run Llama 70B on RTX 3090.
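The gap between the resident and tiered figures is what you would expect if streamed layers dominate the time per token. As a rough sketch, assume each generated token needs every non-resident layer to cross the PCIe link once, with no overlap or caching. That assumption is not confirmed by the source, and `max_tokens_per_second` is a hypothetical helper:

```cpp
#include <cstddef>
#include <limits>

// Upper bound on decode throughput when streamed layers must cross the
// link once per token. Assumes equal-sized layers and no overlap between
// transfer and compute; both are simplifications, not measured behaviour.
double max_tokens_per_second(std::size_t streamed_layers,
                             double layer_gib,
                             double link_gib_per_s) {
    const double gib_per_token =
        static_cast<double>(streamed_layers) * layer_gib;
    if (gib_per_token <= 0.0) {
        // Nothing is streamed, so the link is not the bottleneck.
        return std::numeric_limits<double>::infinity();
    }
    return link_gib_per_s / gib_per_token;
}
```

For example, with 54 streamed layers and placeholder values of 0.5 GiB per layer and 13.5 GiB/s of effective link bandwidth, the bound is 0.5 tokens/second. The layer size and bandwidth here are invented for the arithmetic, not measurements from ntransformer.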
Snippet from the RSS feed
High-efficiency LLM inference engine in C++/CUDA. Run Llama 70B on RTX 3090. - xaskasdf/ntransformer
