Posts tagged with: LLM inference


397B MoE on MacBook: 4.4 t/s Flash-MoE Engine

April 03, 2026

Flash-MoE runs Qwen3.5-397B-A17B (397 billion parameters) on a MacBook Pro M3 Max with 48GB RAM at 4.4+ tokens/second. Pure C/Metal inference streams the 209GB model from SSD, with production-quality output including tool calling. Key innovations: FMA-optimized dequant kernels (+12% speed), expert streaming through the OS page cache, deferred GPU compute, and hand-tuned Metal shaders. 58 experiments are documented, along with a full technical paper.
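Flash-MoE's actual kernels are Metal shaders and aren't reproduced here, but the arithmetic an FMA-optimized dequant kernel fuses is easy to sketch: each 4-bit code is turned back into a weight with a single multiply-add, w ≈ d·q + m. The Python below is a minimal, hypothetical illustration of that block-wise pattern (block size 32 and the function names are assumptions, not Flash-MoE's real layout):

```python
def quantize_block(weights):
    """Quantize one block of floats to 4-bit codes plus (scale, min).

    Hypothetical Q4-style format: each weight is later reconstructed
    as w ~= d * q + m, a single fused multiply-add per weight.
    """
    lo, hi = min(weights), max(weights)
    d = (hi - lo) / 15 or 1.0   # scale; 15 = max 4-bit code
    m = lo                      # block minimum (the FMA addend)
    codes = [round((w - m) / d) for w in weights]
    return d, m, codes

def dequantize_block(d, m, codes):
    """Reconstruct weights with one multiply-add per code --
    the operation an FMA-optimized kernel evaluates in hardware."""
    return [d * q + m for q in codes]

block = [0.1 * i - 1.0 for i in range(32)]
d, m, codes = quantize_block(block)
restored = dequantize_block(d, m, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

The round-trip error is bounded by half the scale, which is why per-block scales (rather than one scale for the whole tensor) keep 4-bit quantization usable.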

TurboQuant+: 6.4x KV Cache Compression for LLMs

March 29, 2026

TurboQuant+ implements the KV cache compression method from ICLR 2026, achieving 4.6-6.4x compression at near-q8_0 quality and speed. Features turbo2/turbo3/turbo4 formats, attention-gated Sparse V decoding (+22.8% decode speed), and full llama.cpp Metal integration. Run Qwen 3.5 35B-A3B on an M5 Max with 93.9% NIAH retrieval and 1.02x q8_0 prefill speed. Includes a complete Python prototype, 511+ tests, and community validation across Apple Silicon, NVIDIA, and AMD hardware.
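The internal layout of the turbo2/turbo3/turbo4 formats isn't spelled out above, but the headline ratios are easy to sanity-check with group-quantization arithmetic. As one plausible (assumed, not confirmed) accounting: k-bit codes over 64-value groups, with an fp16 scale and fp16 min per group, land exactly on the quoted range against an fp16 baseline:

```python
def group_compression_ratio(bits_per_code, group_size,
                            scale_bits=16, min_bits=16,
                            baseline_bits=16):
    """Compression ratio of k-bit group quantization vs. an fp16
    KV cache, counting per-group scale/min metadata.
    Illustrative arithmetic only; the real turbo formats may lay
    out their metadata differently."""
    packed = group_size * bits_per_code + scale_bits + min_bits
    return group_size * baseline_bits / packed

# One way to land on the advertised 4.6-6.4x range:
print(round(group_compression_ratio(2, 64), 2))  # 2-bit codes -> 6.4x
print(round(group_compression_ratio(3, 64), 2))  # 3-bit codes -> ~4.57x
```

With 2-bit codes the group packs 64·2 + 32 = 160 bits versus 1024 bits of fp16, i.e. 6.4x; 3-bit codes give roughly 4.6x, matching the low end of the quoted range.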

Run TinyLlama on a $10 Board with PicoLM – A Complete Tutorial

February 27, 2026

Discover how PicoLM turns a $10 Raspberry Pi or LicheeRV board into a capable local LLM host. This tutorial walks you through downloading the TinyLlama 1.1B model, compiling the C-only engine, configuring PicoClaw for offline chat, and benchmarking performance on cheap hardware. Learn how the zero-dependency design, flash attention, and JSON grammar constraints let you generate structured output on a tiny device. Great for developers who want a cost-effective, privacy-preserving LLM at the edge.
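The grammar-constraint idea mentioned above boils down to masking the sampler: at each step, only tokens the grammar permits may be chosen, so output is structurally valid by construction. A minimal, hypothetical Python sketch (PicoLM's engine is C; the toy vocabulary and the fact that `allowed_ids` is supplied directly rather than derived from a grammar state are simplifying assumptions):

```python
import math

def constrained_argmax(logits, allowed_ids):
    """Pick the highest-logit token among those the grammar allows.
    A real engine (e.g. llama.cpp-style GBNF sampling) recomputes
    allowed_ids from the grammar state after every emitted token;
    here the set is supplied directly to keep the sketch tiny."""
    best_id, best = None, -math.inf
    for tid in allowed_ids:
        if logits[tid] > best:
            best_id, best = tid, logits[tid]
    return best_id

# Toy vocabulary and a state where JSON permits only '{' or '"' next.
vocab = ['{', '}', '"', ':', 'hello']
logits = [0.1, 2.0, 1.5, 0.3, 3.0]   # 'hello' scores highest...
allowed = {0, 2}                      # ...but only '{' and '"' are legal
tok = constrained_argmax(logits, allowed)
print(vocab[tok])
```

Because illegal tokens are never even candidates, the model cannot emit malformed JSON, which is what makes structured output reliable even from a 1.1B model on a $10 board.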