Posts tagged with: LLM inference


397B MoE on MacBook: 4.4 t/s Flash-MoE Engine

April 03, 2026

Flash-MoE runs Qwen3.5-397B-A17B (397 billion parameters) on a MacBook Pro M3 Max with 48GB RAM at 4.4+ tokens/second. Pure C/Metal inference streams the 209GB model from SSD, with production-quality output including tool calling. Key innovations: FMA-optimized dequant kernels (+12% speed), expert streaming through the OS page cache, deferred GPU compute, and hand-tuned Metal shaders. 58 experiments are documented, along with a full technical paper.
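Flash-MoE's actual kernels are Metal shaders and aren't reproduced here, but the arithmetic an FMA-optimized dequant kernel fuses is easy to sketch: each 4-bit code is turned back into a weight with a single multiply-add, w ≈ d·q + m. The Python below is a minimal, hypothetical illustration of that block-wise pattern (block size 32 and the function names are assumptions, not Flash-MoE's real layout):

```python
def quantize_block(weights):
    """Quantize one block of floats to 4-bit codes plus (scale, min).

    Hypothetical Q4-style format: each weight is later reconstructed
    as w ~= d * q + m, a single fused multiply-add per weight.
    """
    lo, hi = min(weights), max(weights)
    d = (hi - lo) / 15 or 1.0   # scale; 15 = max 4-bit code
    m = lo                      # block minimum (the FMA addend)
    codes = [round((w - m) / d) for w in weights]
    return d, m, codes

def dequantize_block(d, m, codes):
    """Reconstruct weights with one multiply-add per code --
    the operation an FMA-optimized kernel evaluates in hardware."""
    return [d * q + m for q in codes]

block = [0.1 * i - 1.0 for i in range(32)]
d, m, codes = quantize_block(block)
restored = dequantize_block(d, m, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

The round-trip error is bounded by half the scale, which is why per-block scales (rather than one scale for the whole tensor) keep 4-bit quantization usable.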

TurboQuant+: 6.4x KV Cache Compression for LLMs

March 29, 2026

TurboQuant+ implements the KV cache compression method from ICLR 2026, achieving 4.6-6.4x compression at near-q8_0 quality and speed. Features turbo2/turbo3/turbo4 formats, attention-gated Sparse V decoding (+22.8% decode speed), and full llama.cpp Metal integration. Run Qwen 3.5 35B-A3B on an M5 Max with 93.9% NIAH retrieval and 1.02x q8_0 prefill speed. Includes a complete Python prototype, 511+ tests, and community validation across Apple Silicon, NVIDIA, and AMD hardware.
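The internal layout of the turbo2/turbo3/turbo4 formats isn't spelled out above, but the headline ratios are easy to sanity-check with group-quantization arithmetic. As one plausible (assumed, not confirmed) accounting: k-bit codes over 64-value groups, with an fp16 scale and fp16 min per group, land exactly on the quoted range against an fp16 baseline:

```python
def group_compression_ratio(bits_per_code, group_size,
                            scale_bits=16, min_bits=16,
                            baseline_bits=16):
    """Compression ratio of k-bit group quantization vs. an fp16
    KV cache, counting per-group scale/min metadata.
    Illustrative arithmetic only; the real turbo formats may lay
    out their metadata differently."""
    packed = group_size * bits_per_code + scale_bits + min_bits
    return group_size * baseline_bits / packed

# One way to land on the advertised 4.6-6.4x range:
print(round(group_compression_ratio(2, 64), 2))  # 2-bit codes -> 6.4x
print(round(group_compression_ratio(3, 64), 2))  # 3-bit codes -> ~4.57x
```

With 2-bit codes the group packs 64·2 + 32 = 160 bits versus 1024 bits of fp16, i.e. 6.4x; 3-bit codes give roughly 4.6x, matching the low end of the quoted range.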

Run TinyLlama on a $10 Board with PicoLM – A Complete Tutorial

February 27, 2026

Discover how PicoLM turns a $10 Raspberry Pi or LicheeRV board into a capable local LLM host. This tutorial walks you through downloading the TinyLlama 1.1B model, compiling the C-only engine, configuring PicoClaw for offline chat, and benchmarking performance on cheap hardware. Learn how the zero-dependency design, flash attention, and JSON grammar constraints let you generate structured output on a tiny device. Great for developers who want a cost-effective, privacy-preserving LLM at the edge.
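The grammar-constraint idea mentioned above boils down to masking the sampler: at each step, only tokens the grammar permits may be chosen, so output is structurally valid by construction. A minimal, hypothetical Python sketch (PicoLM's engine is C; the toy vocabulary and the fact that `allowed_ids` is supplied directly rather than derived from a grammar state are simplifying assumptions):

```python
import math

def constrained_argmax(logits, allowed_ids):
    """Pick the highest-logit token among those the grammar allows.
    A real engine (e.g. llama.cpp-style GBNF sampling) recomputes
    allowed_ids from the grammar state after every emitted token;
    here the set is supplied directly to keep the sketch tiny."""
    best_id, best = None, -math.inf
    for tid in allowed_ids:
        if logits[tid] > best:
            best_id, best = tid, logits[tid]
    return best_id

# Toy vocabulary and a state where JSON permits only '{' or '"' next.
vocab = ['{', '}', '"', ':', 'hello']
logits = [0.1, 2.0, 1.5, 0.3, 3.0]   # 'hello' scores highest...
allowed = {0, 2}                      # ...but only '{' and '"' are legal
tok = constrained_argmax(logits, allowed)
print(vocab[tok])
```

Because illegal tokens are never even candidates, the model cannot emit malformed JSON, which is what makes structured output reliable even from a 1.1B model on a $10 board.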