Posts tagged with: Metal Compute

397B MoE on MacBook: 4.4 t/s Flash-MoE Engine

April 03, 2026

Tags: Apple Silicon, LLM inference, Mixture of Experts, Metal Compute, Model Quantization

Flash-MoE runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture of Experts model, on a MacBook Pro M3 Max with 48GB of RAM at 4.4+ tokens/second. The pure C/Metal inference engine streams the 209GB model from SSD and produces production-quality output, including tool calling. Key innovations: FMA-optimized dequantization kernels (+12% speed), expert streaming through the OS page cache, deferred GPU compute, and hand-tuned Metal shaders. The 58 experiments are documented alongside a full technical paper.
