Flash-MoE: Running 397B Parameters on a MacBook at 4.4+ Tokens/Second

Imagine running a 397-billion-parameter Mixture-of-Experts (MoE) model on your MacBook Pro with production-quality output at 4.4+ tokens per second. That's exactly what Flash-MoE achieves: no Python, no frameworks, just pure C/Objective-C and hand-tuned Metal shaders.

The Beast: Qwen3.5-397B-A17B

This isn't theoretical. The project streams a 209GB 4-bit quantized model from SSD while delivering:

Configuration       Speed      Quality     Disk
4-bit FMA kernel    4.36 t/s   Excellent   209GB
4-bit baseline      3.90 t/s   Excellent   209GB
2-bit experts       5.74 t/s   Good*       120GB

*2-bit quantization breaks JSON/tool-calling reliability

Hardware Reality Check

MacBook Pro M3 Max: 48GB unified memory, 40-core GPU, 17.5GB/s SSD. No data center required.

Breakthrough Techniques

1. SSD Expert Streaming + "Trust the OS"

  • Only the K=4 active experts per layer are loaded (~6.75MB each)
  • The OS page cache handles LRU eviction (71% hit rate with no tuning)
  • Parallel pread() calls via GCD dispatch groups
  • No custom cache layer needed

Lesson: Custom Metal LRU, LZ4 compression, and mmap all performed worse.
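The load step above can be sketched in portable C. The project uses GCD dispatch groups on macOS; plain pthreads stand in for them here so the idea stays visible: K independent pread() calls into a staging buffer, then one barrier before dispatch. The 64KB expert size and flat file layout are illustrative only, not the real on-disk format.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NUM_EXPERTS  4            /* K=4 active experts per layer */
#define EXPERT_BYTES (64 * 1024)  /* illustrative size, not the real ~6.75MB */

typedef struct {
    int fd;          /* weights file */
    off_t offset;    /* where this expert's block starts on disk */
    uint8_t *dst;    /* destination slice in the staging buffer */
} ExpertLoad;

/* One worker: a single positioned read, no seek, no shared state. */
static void *load_expert(void *arg) {
    ExpertLoad *job = arg;
    if (pread(job->fd, job->dst, EXPERT_BYTES, job->offset) != EXPERT_BYTES)
        perror("pread");
    return NULL;
}

/* Fan out K reads in parallel, then join: the page cache, not a custom
   LRU, decides which of these reads actually touch the SSD. */
static void load_active_experts(int fd, const off_t offsets[NUM_EXPERTS],
                                uint8_t *staging) {
    pthread_t tids[NUM_EXPERTS];
    ExpertLoad jobs[NUM_EXPERTS];
    for (int i = 0; i < NUM_EXPERTS; i++) {
        jobs[i] = (ExpertLoad){ fd, offsets[i], staging + i * EXPERT_BYTES };
        pthread_create(&tids[i], NULL, load_expert, &jobs[i]);
    }
    for (int i = 0; i < NUM_EXPERTS; i++)  /* barrier: all experts resident */
        pthread_join(tids[i], NULL);
}
```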

2. FMA-Optimized Dequant Kernel (+12% speed)

// Before: (nibble * scale + bias) * x
// After:  fma(nibble, scale*x, bias*x)
The GPU's fused multiply-add unit performs the dequantize-and-multiply in a single instruction.

3. Deferred GPU Pipeline

CMD3(expert fwd) β†’ [DEFERRED]
CMD1: attention projections
CPU: routing + pread experts
CMD2: combine + norm
→ Next layer

4.28ms per layer on average at 4-bit.

4. Hand-Written Metal Kernels

  • 4-bit/2-bit dequantized matvec (tiled, SIMD, shared cache)
  • Fused SwiGLU, RMS norm, MoE combine
  • Batched GPU attention (Q@Kα΅€, softmax, scores@V)
  • GPU RoPE + deinterleave

5. BLAS-Accelerated Linear Attention

GatedDeltaNet uses cblas_sgemv/cblas_sger, which proved 64% faster than the scalar code.
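For readers without a BLAS header handy, these plain-C references (hypothetical names, row-major layout assumed) mirror what the two calls compute: cblas_sgemv is the matrix-vector product y = alpha*A*x + beta*y, and cblas_sger is the rank-1 update A += alpha*x*yᵀ, the shape of update a linear-attention state matrix needs as each new key/value pair arrives.

```c
#include <assert.h>

/* y = alpha * A * x + beta * y, A is m x n, row-major. */
static void sgemv_ref(int m, int n, float alpha, const float *A,
                      const float *x, float beta, float *y) {
    for (int i = 0; i < m; i++) {
        float dot = 0.0f;
        for (int j = 0; j < n; j++) dot += A[i * n + j] * x[j];
        y[i] = alpha * dot + beta * y[i];
    }
}

/* A += alpha * x * y^T, the rank-1 update behind the state matrix. */
static void sger_ref(int m, int n, float alpha, const float *x,
                     const float *y, float *A) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] += alpha * x[i] * y[j];
}
```

The BLAS versions win because they vectorize and block these loops; on macOS they come for free from the Accelerate framework.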

What They Discarded (58 Experiments)

Failed approach      Impact   Why
LZ4 compression      -13%     Decompression cost exceeded the cache savings
GPU LUT dequant      -2%      Register serialization
Expert prediction    -18%     Cache pollution
mmap experts         -5x      Page-fault overhead

Production Ready

cd metal_infer && make
./infer --prompt "Explain quantum computing" --tokens 100
./chat  # Interactive TUI + tool calling

Safety: a fixed 6GB memory footprint leaves 42GB for the OS and page cache. No OOM risk on the primary dev machine.

The Paper

The full technical write-up covers 90+ experiments and the 24-hour human+AI development story. This is state-of-the-art Apple Silicon inference: open source, battle-tested, and pushing the hardware's limits.

Stars: 3.2k | Forks: 371. The community knows real innovation when it sees it.