Flash-MoE: Running 397B Parameters on a MacBook at 4.4+ Tokens/Second
Imagine running a 397-billion-parameter Mixture-of-Experts (MoE) model on your MacBook Pro with production-quality output at 4.4+ tokens/second. That's exactly what Flash-MoE achieves: no Python, no frameworks, just pure C/Objective-C and hand-tuned Metal shaders.
The Beast: Qwen3.5-397B-A17B
This isn't theoretical. The project streams a 209GB 4-bit quantized model from SSD while delivering:
| Configuration | Speed | Quality | Disk |
|---|---|---|---|
| 4-bit FMA kernel | 4.36 t/s | Excellent | 209GB |
| 4-bit baseline | 3.90 t/s | Excellent | 209GB |
| 2-bit experts | 5.74 t/s | Good* | 120GB |
*2-bit quantization breaks JSON/tool-calling reliability
Hardware Reality Check
MacBook Pro M3 Max: 48GB unified memory, 40-core GPU, 17.5GB/s SSD. No data center required.
Breakthrough Techniques
1. SSD Expert Streaming + "Trust the OS"
- Only load the K=4 active experts per layer (~6.75MB each)
- The OS page cache handles LRU eviction (71% hit rate naturally)
- Parallel pread() via GCD dispatch groups
- No custom cache needed
Lesson: Custom Metal LRU, LZ4 compression, and mmap all performed worse.
2. FMA-Optimized Dequant Kernel (+12% speed)
// Before: (nibble * scale + bias) * x
// After: fma(nibble, scale*x, bias*x)
3. Deferred GPU Pipeline
CMD3 (expert fwd) → [DEFERRED]
CMD1: attention projections
CPU: routing + pread experts
CMD2: combine + norm
→ Next layer
4. Hand-Written Metal Kernels
- 4-bit/2-bit dequantized matvec (tiled, SIMD, shared cache)
- Fused SwiGLU, RMS norm, MoE combine
- Batched GPU attention (Q@Kᵀ, softmax, scores@V)
- GPU RoPE + deinterleave
5. BLAS-Accelerated Linear Attention
GatedDeltaNet uses cblas_sgemv/cblas_sger and runs 64% faster than the scalar code path.
What They Discarded (58 Experiments)
| Failed Approach | Impact | Why |
|---|---|---|
| LZ4 compression | -13% | Decompress > cache savings |
| GPU LUT dequant | -2% | Register serialization |
| Expert prediction | -18% | Cache pollution |
| mmap experts | -5x | Page fault overhead |
Production Ready
cd metal_infer && make
./infer --prompt "Explain quantum computing" --tokens 100
./chat # Interactive TUI + tool calling
Safety: 6GB fixed memory footprint leaves 42GB for OS + page cache. No OOM risk on primary dev machine.
The Paper
The full write-up covers 90+ experiments and the 24-hour human+AI development story. This is state-of-the-art Apple Silicon inference: open source, battle-tested, and pushing hardware limits.
Stars: 3.2k | Forks: 371 - The community knows real innovation when they see it.