Flash-MoE: Running a 397B-Parameter AI Model on a MacBook Pro with 48GB RAM
By mft_
2mo ago · 5 min read · en · Code
Score: 100 · Type: news · Sentiment: positive
Summary
Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB of RAM. It reaches 4.4+ tokens/second with production-quality output, including tool calling, while streaming the entire 209GB model from SSD through a custom Metal compute pipeline. The project was built in 24 hours by a human working with an AI, using no Python or frameworks: just C, Objective-C, and hand-tuned Metal shaders.
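To put the quoted figures in proportion, here is a rough back-of-envelope sketch in C. It is not taken from the Flash-MoE code. It assumes that 1 GB means 10^9 bytes, that the "A17B" suffix denotes roughly 17 billion parameters active per token, and that storage cost is spread evenly across all parameters. The function names are hypothetical.

```c
#include <stdio.h>

/* Figures quoted in the summary (1 GB taken as 1e9 bytes: an assumption). */
#define MODEL_BYTES    209e9  /* model size on SSD */
#define TOTAL_PARAMS   397e9  /* total parameter count */
#define ACTIVE_PARAMS  17e9   /* assumption: "A17B" = ~17B active per token */
#define RAM_BYTES      48e9   /* laptop RAM */
#define TOKENS_PER_SEC 4.4    /* quoted lower bound on throughput */

/* Average storage cost per parameter, in bits. */
double bits_per_param(double model_bytes, double total_params) {
    return model_bytes * 8.0 / total_params;
}

/* Bytes of weights touched per token, assuming uniform cost per parameter. */
double active_bytes_per_token(double model_bytes, double total_params,
                              double active_params) {
    return model_bytes * (active_params / total_params);
}

/* Upper bound on the share of the model that can sit in RAM at once
   (ignores the OS, activations, and any caches). */
double ram_fraction(double ram_bytes, double model_bytes) {
    return ram_bytes / model_bytes;
}

/* Print all three estimates plus the implied weight traffic per second. */
void print_estimates(void) {
    double per_token = active_bytes_per_token(MODEL_BYTES, TOTAL_PARAMS,
                                              ACTIVE_PARAMS);
    printf("bits per parameter:     %.2f\n",
           bits_per_param(MODEL_BYTES, TOTAL_PARAMS));
    printf("active GB per token:    %.2f\n", per_token / 1e9);
    printf("weight GB per second:   %.2f\n", per_token * TOKENS_PER_SEC / 1e9);
    printf("max model share in RAM: %.1f%%\n",
           100.0 * ram_fraction(RAM_BYTES, MODEL_BYTES));
}
```

Calling print_estimates() from a main function gives about 4.2 bits per parameter, roughly 9 GB of weights touched per token, and at most about 23% of the model resident in RAM at any moment. Under these assumptions the per-second figure is an upper bound on weight traffic; the summary does not say how much of it is served from memory rather than SSD.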
Key quotes · 4 pulled
Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.
The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.
Running a big model on a small laptop.
The story of how an AI and a human built this in 24 hours.
Source: danveloper/flash-moe on GitHub.
