
Efficient Performance of 120B Model on Minimal Hardware

By zigzag312 · 3 min read · en · News

Summary

The article describes running a 120B mixture-of-experts model on modest hardware by keeping the expert layers on the CPU and the attention layers on the GPU, which requires only 5 to 8 GB of VRAM. Because no large MLP weights sit on the GPU, memory use stays low while the system remains snappy. A GPU with BF16 support (RTX 3000 series or newer) is recommended, since all layers except the MXFP4 expert layers are BF16.

Key quotes

The expert layers run amazing on CPU (25T/s [revised from ~17T/s] on a 14900K) and you can force that with this new llama-cpp option: --cpu-moe.
No giant MLP weights are resident on the GPU, so memory use stays low.
This yields an amazing snappy system for a 120B model! Even something like a 3060Ti would be amazing!
GPU with BF16 support would be best (RTX3000+) because all layers except the MOE layers (which are mxfp4) are BF16.
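
The quotes above name two weight formats: MXFP4 for the expert layers kept on the CPU and BF16 for everything else on the GPU. The following is a rough sketch of the arithmetic behind "memory use stays low". BF16 is 16 bits per weight, and MXFP4 stores 4-bit elements in blocks of 32 sharing one 8-bit scale, or 4.25 bits per weight. The 98% expert share is a placeholder chosen for illustration, not a figure from the post, and the estimate covers weights only (no KV cache or activation buffers).

```python
# Bits per weight for the two formats named in the quotes.
BF16_BITS = 16.0
# MXFP4: 4-bit elements in blocks of 32 that share one 8-bit scale.
MXFP4_BITS = 4.0 + 8.0 / 32.0  # 4.25 bits per weight

GIB = 1024 ** 3


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Size in GiB of n_params weights stored at bits_per_weight."""
    return n_params * bits_per_weight / 8.0 / GIB


def split_footprint(total_params: float, expert_fraction: float) -> dict:
    """Split weights into CPU-resident experts (MXFP4) and GPU-resident rest (BF16)."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return {
        "cpu_expert_gib": weight_gib(expert_params, MXFP4_BITS),
        "gpu_other_gib": weight_gib(other_params, BF16_BITS),
    }


if __name__ == "__main__":
    # ASSUMPTION: expert_fraction=0.98 is a hypothetical placeholder,
    # not a number taken from the post. Weights only; no KV cache.
    est = split_footprint(total_params=120e9, expert_fraction=0.98)
    for name, gib in est.items():
        print(f"{name}: {gib:.1f} GiB")
```

Changing `expert_fraction` shows how strongly the GPU-side footprint depends on how much of the model sits in the expert layers, which is why moving only those layers to system RAM frees so much VRAM.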
Snippet from the RSS feed
Here is the thing, the expert layers run amazing on CPU (25T/s [revised from ~17T/s] on a 14900K) and you can force that with this new llama-cpp option:...
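
A minimal sketch of how the `--cpu-moe` option mentioned above might be passed to a llama.cpp server. Only `--cpu-moe` comes from the post. The `llama-server` binary name and the `-m` and `--n-gpu-layers` flags are taken from llama.cpp's command-line interface as commonly documented, and the model path is a hypothetical placeholder. The script only builds and prints the command; it does not launch anything.

```python
import shlex


def build_server_command(model_path: str, gpu_layers: int = 99) -> list[str]:
    """Build an argv list for llama-server with expert weights kept on the CPU."""
    return [
        "llama-server",
        "-m", model_path,                   # hypothetical GGUF path from the caller
        "--n-gpu-layers", str(gpu_layers),  # offload as many layers as possible
        "--cpu-moe",                        # keep MoE expert weights in system RAM
    ]


if __name__ == "__main__":
    # ASSUMPTION: the path below is a placeholder, not a file named in the post.
    cmd = build_server_command("models/model-120b.gguf")
    print(shlex.join(cmd))
```

Requesting a high GPU layer count together with `--cpu-moe` reflects the split the post describes: attention and other non-expert layers go to the GPU while the expert weights stay on the CPU.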
