Flash-MoE: Running 397B Parameter AI Model on MacBook Pro with 48GB RAM

mft_

2mo ago · 5 min read · en · Code


Summary

Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB of RAM. It reaches 4.4+ tokens/second with production-quality output, including tool calling, by streaming the entire 209GB model from SSD through a custom Metal compute pipeline. According to the project, an AI and a human built it together in 24 hours, using no Python and no frameworks: just C, Objective-C, and hand-tuned Metal shaders.
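
The figures in this summary allow a rough back-of-the-envelope check on storage density. The sketch below is not taken from the project. It assumes 1GB means 10^9 bytes, that the whole 209GB file is weights, and that the "A17B" suffix means roughly 17 billion parameters are active per token (the usual reading of that naming scheme, but an assumption here). The function names are illustrative.

```c
#include <assert.h>

/* Average storage cost per parameter, in bits, for a model file of
   file_bytes bytes that holds n_params parameters. */
double bits_per_param(double file_bytes, double n_params) {
    assert(n_params > 0.0);
    return file_bytes * 8.0 / n_params;
}

/* Bytes of weights that take part in one token, if active_params
   parameters are used and each costs `bits` bits on average.
   ASSUMPTION: active parameters are stored at the file-wide average. */
double active_bytes_per_token(double active_params, double bits) {
    assert(bits > 0.0);
    return active_params * bits / 8.0;
}
```

Under those assumptions, `bits_per_param(209e9, 397e9)` comes to about 4.2 bits per parameter, and `active_bytes_per_token(17e9, 4.2)` to roughly 9GB of weights per token. That second number is only an upper bound on per-token SSD traffic: it says nothing about how much the engine keeps resident in the 48GB of RAM.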

Key quotes


Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

Running a big model on a small laptop.

The story of how an AI and a human built this in 24 hours.
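
To make the quoted idea of streaming a Mixture-of-Experts model from SSD more concrete, here is a minimal C sketch of loading only the experts selected for a token. It is not Flash-MoE's code and does not touch Metal. The flat file layout, the `ExpertFile` struct, and the function names are all hypothetical, chosen only to show the positional-read pattern with POSIX `pread`.

```c
#define _XOPEN_SOURCE 700   /* expose pread() under strict C modes */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL layout: the file is a flat array of experts, each exactly
   expert_bytes long, so expert i starts at byte offset i * expert_bytes. */
typedef struct {
    int    fd;            /* open, readable file descriptor */
    size_t expert_bytes;  /* size of one expert's weights   */
    size_t n_experts;     /* number of experts in the file  */
} ExpertFile;

/* Read one expert's weights into dst, which must hold expert_bytes bytes.
   Returns 0 on success, -1 on a bad index, read error, or short file.
   A sketch: it treats EINTR as an error instead of retrying. */
int load_expert(const ExpertFile *f, size_t idx, void *dst) {
    assert(f != NULL && dst != NULL);
    if (idx >= f->n_experts) return -1;
    off_t  base = (off_t)idx * (off_t)f->expert_bytes;
    size_t done = 0;
    while (done < f->expert_bytes) {   /* pread may return fewer bytes */
        ssize_t n = pread(f->fd, (char *)dst + done,
                          f->expert_bytes - done, base + (off_t)done);
        if (n <= 0) return -1;         /* error or unexpected end of file */
        done += (size_t)n;
    }
    return 0;
}

/* Load the k experts chosen for this token into k consecutive slots,
   so slots must hold k * expert_bytes bytes. */
int load_selected(const ExpertFile *f, const size_t *picked, size_t k,
                  void *slots) {
    for (size_t i = 0; i < k; i++) {
        char *slot = (char *)slots + i * f->expert_bytes;
        if (load_expert(f, picked[i], slot) != 0) return -1;
    }
    return 0;
}
```

Because `pread` takes an explicit offset, the reads do not share a file position, so a real engine could issue them from several threads at once. How Flash-MoE actually schedules its reads, caches experts, or hands buffers to the GPU is not described in this summary.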

Snippet from the RSS feed

Running a big model on a small laptop. Contribute to danveloper/flash-moe development by creating an account on GitHub.
