Flash-MoE: Running 397B Parameters on a MacBook at 4.4+ Tokens/Second

Imagine running a 397-billion-parameter Mixture-of-Experts (MoE) model on your MacBook Pro with production-quality output at 4.4+ tokens per second. That's exactly what Flash-MoE achieves: no Python, no frameworks, just pure C/Objective-C and hand-tuned Metal shaders.

The Beast: Qwen3.5-397B-A17B

This isn't theoretical. The project streams a 209GB 4-bit quantized model from SSD while delivering:

Configuration       Speed      Quality     Disk
4-bit FMA kernel    4.36 t/s   Excellent   209GB
4-bit baseline      3.90 t/s   Excellent   209GB
2-bit experts       5.74 t/s   Good*       120GB

*2-bit quantization breaks JSON/tool-calling reliability

Hardware Reality Check

MacBook Pro M3 Max: 48GB unified memory, 40-core GPU, 17.5GB/s SSD. No data center required.

Breakthrough Techniques

1. SSD Expert Streaming + "Trust the OS"

  • Only the K=4 active experts per layer are loaded (~6.75MB each)
  • The OS page cache handles LRU eviction (71% hit rate with no tuning)
  • Parallel pread() calls via GCD dispatch groups
  • No custom cache layer needed

Lesson: Custom Metal LRU, LZ4 compression, and mmap all performed worse.
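The load step above can be sketched in portable C. The project uses GCD dispatch groups on macOS; plain pthreads stand in for them here so the idea stays visible: K independent pread() calls into a staging buffer, then one barrier before dispatch. The 64KB expert size and flat file layout are illustrative only, not the real on-disk format.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NUM_EXPERTS  4            /* K=4 active experts per layer */
#define EXPERT_BYTES (64 * 1024)  /* illustrative size, not the real ~6.75MB */

typedef struct {
    int fd;          /* weights file */
    off_t offset;    /* where this expert's block starts on disk */
    uint8_t *dst;    /* destination slice in the staging buffer */
} ExpertLoad;

/* One worker: a single positioned read, no seek, no shared state. */
static void *load_expert(void *arg) {
    ExpertLoad *job = arg;
    if (pread(job->fd, job->dst, EXPERT_BYTES, job->offset) != EXPERT_BYTES)
        perror("pread");
    return NULL;
}

/* Fan out K reads in parallel, then join: the page cache, not a custom
   LRU, decides which of these reads actually touch the SSD. */
static void load_active_experts(int fd, const off_t offsets[NUM_EXPERTS],
                                uint8_t *staging) {
    pthread_t tids[NUM_EXPERTS];
    ExpertLoad jobs[NUM_EXPERTS];
    for (int i = 0; i < NUM_EXPERTS; i++) {
        jobs[i] = (ExpertLoad){ fd, offsets[i], staging + i * EXPERT_BYTES };
        pthread_create(&tids[i], NULL, load_expert, &jobs[i]);
    }
    for (int i = 0; i < NUM_EXPERTS; i++)  /* barrier: all experts resident */
        pthread_join(tids[i], NULL);
}
```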

2. FMA-Optimized Dequant Kernel (+12% speed)

// Before: (nibble * scale + bias) * x
// After:  fma(nibble, scale*x, bias*x)
The GPU's fused multiply-add unit performs the dequantize-and-multiply in a single instruction.

3. Deferred GPU Pipeline

CMD3(expert fwd) β†’ [DEFERRED]
CMD1: attention projections
CPU: routing + pread experts
CMD2: combine + norm
→ Next layer

4.28ms per layer on average at 4-bit.

4. Hand-Written Metal Kernels

  • 4-bit/2-bit dequantized matvec (tiled, SIMD, shared cache)
  • Fused SwiGLU, RMS norm, MoE combine
  • Batched GPU attention (Q@Kα΅€, softmax, scores@V)
  • GPU RoPE + deinterleave

5. BLAS-Accelerated Linear Attention

GatedDeltaNet uses cblas_sgemv/cblas_sger, which proved 64% faster than the scalar code.
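For readers without a BLAS header handy, these plain-C references (hypothetical names, row-major layout assumed) mirror what the two calls compute: cblas_sgemv is the matrix-vector product y = alpha*A*x + beta*y, and cblas_sger is the rank-1 update A += alpha*x*yᵀ, the shape of update a linear-attention state matrix needs as each new key/value pair arrives.

```c
#include <assert.h>

/* y = alpha * A * x + beta * y, A is m x n, row-major. */
static void sgemv_ref(int m, int n, float alpha, const float *A,
                      const float *x, float beta, float *y) {
    for (int i = 0; i < m; i++) {
        float dot = 0.0f;
        for (int j = 0; j < n; j++) dot += A[i * n + j] * x[j];
        y[i] = alpha * dot + beta * y[i];
    }
}

/* A += alpha * x * y^T, the rank-1 update behind the state matrix. */
static void sger_ref(int m, int n, float alpha, const float *x,
                     const float *y, float *A) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] += alpha * x[i] * y[j];
}
```

The BLAS versions win because they vectorize and block these loops; on macOS they come for free from the Accelerate framework.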

What They Discarded (58 Experiments)

Failed approach      Impact   Why
LZ4 compression      -13%     Decompression cost exceeded the cache savings
GPU LUT dequant      -2%      Register serialization
Expert prediction    -18%     Cache pollution
mmap experts         -5x      Page-fault overhead

Production Ready

cd metal_infer && make
./infer --prompt "Explain quantum computing" --tokens 100
./chat  # Interactive TUI + tool calling

Safety: a fixed 6GB memory footprint leaves 42GB for the OS and page cache. No OOM risk on the primary dev machine.

The Paper

The full technical write-up covers 90+ experiments and the 24-hour human+AI development story. This is state-of-the-art Apple Silicon inference: open source, battle-tested, and pushing the hardware's limits.

Stars: 3.2k | Forks: 371. The community knows real innovation when it sees it.