
danveloper/flash-moe

Flash-MoE runs Qwen3.5-397B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB of RAM at 4.4 tokens per second. Not through quantization tricks that destroy output quality, and not through a cloud API pretending to be local: it is a pure C/Metal inference engine that streams 209GB of model weights directly from SSD, loading only the 17 billion active parameters needed for each token.

The key insight is brutally simple. MoE models have 512 experts per layer, but only 4 are active for any given token, so why load all 397 billion parameters into memory when you only need 17 billion at a time? Flash-MoE exploits this directly: RAM usage during inference sits at roughly 5.5GB, while the entire 209GB model lives on your SSD and the engine streams expert weights on demand through a custom Metal compute pipeline.

The architecture is 60 transformer layers: 45 GatedDeltaNet layers using linear attention and 15 standard full-attention layers. Each MoE layer has 512 experts with K=4 activated per token, plus one shared expert. The engine fuses Metal kernels for attention, normalization, and MoE routing, and an FMA-optimized dequantization kernel reduces GPU compute by 12%.

The entire codebase was written by Dan Woods (VP of AI at CVS Health) in roughly 24 hours using Claude Code. He fed Apple's "LLM in a Flash" research paper to Claude and ran 90 experiments to produce optimized Objective-C and Metal shader code. The repo documents 58 failed experiments, including LZ4 compression (-13% performance), speculative decoding (neutral), and custom expert caching (slower than OS page-cache defaults). That transparency about what didn't work is as valuable as the working code.

Dual quantization is supported: 4-bit (209GB) delivers 4.36 tok/s with production-quality output including tool calling, while 2-bit (120GB) peaks at 7.05 tok/s but breaks JSON formatting.
The serial GPU-SSD pipeline averages 4.28ms per layer. Counterintuitively, sequential execution beats concurrent DMA+GPU, because Apple's unified memory controller becomes a bottleneck under parallel access (a 73% performance drop). Across 60 layers that works out to roughly 257ms per token, in the right ballpark for the measured throughput. The result demonstrates that frontier-class MoE models can run on consumer hardware without cloud dependencies: if you have a MacBook Pro M3 Max with 48GB and a fast SSD, you can run a model that competes with GPT-4-class outputs entirely offline.
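The finding that custom expert caching lost to OS page-cache defaults suggests a streaming design where the weight file is simply memory-mapped and the kernel decides what stays resident. Here is a hedged sketch of that idea; the file layout, struct, and function names are invented for illustration and are not Flash-MoE's actual API.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical on-demand expert streaming: mmap the whole weight file
 * read-only and rely on the OS page cache for residency. Assumes a flat
 * layout of fixed-size quantized expert blocks, ordered by layer. */
typedef struct {
    const unsigned char *base; /* start of mapped weight file */
    size_t expert_bytes;       /* size of one quantized expert block */
    size_t file_size;
} WeightMap;

int weights_open(WeightMap *wm, const char *path, size_t expert_bytes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    wm->file_size = (size_t)st.st_size;
    wm->expert_bytes = expert_bytes;
    wm->base = mmap(NULL, wm->file_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close */
    return wm->base == MAP_FAILED ? -1 : 0;
}

/* Pointer to one expert's weights. The first touch faults pages in from
 * SSD; repeated touches of hot experts hit the page cache for free. */
const unsigned char *expert_weights(const WeightMap *wm,
                                    size_t layer, size_t expert,
                                    size_t experts_per_layer) {
    size_t off = (layer * experts_per_layer + expert) * wm->expert_bytes;
    return off + wm->expert_bytes <= wm->file_size ? wm->base + off : NULL;
}
```

The appeal of this design is that "caching" costs zero code: evicting cold experts and keeping hot ones resident is exactly what the kernel's page cache already does, which would explain why a hand-rolled cache only added overhead.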

ai engineering
Objective-C

Why It Matters

The assumption that 400B+ parameter models require datacenter hardware just got disproven on a laptop. Flash-MoE demonstrates that the Mixture-of-Experts architecture creates a natural opportunity for consumer inference -- you only need memory for the active parameters, not the total parameter count. For developers building local-first AI applications, this changes the calculus entirely. A 397B model running at 4.4 tok/s on a $3,500 laptop means you can build production applications with frontier-quality outputs and zero API costs. No rate limits, no data privacy concerns, no monthly bills. The 2.1K stars in under two weeks and 393 points on Hacker News signal that the developer community sees this as a breakthrough, not a toy. The codebase itself is a masterclass in Apple Silicon optimization -- the documented failed experiments alone are worth studying for anyone doing Metal shader programming.

