Posts tagged with: LLM inference
397B MoE on a MacBook: the 4.4 t/s Flash-MoE Engine
Flash-MoE runs Qwen3.5-397B-A17B (397 billion parameters) on a MacBook Pro M3 Max with 48GB RAM at 4.4+ tokens/second. A pure C/Metal inference engine streams the 209GB model from SSD while producing production-quality output, including tool calling. Key innovations: FMA-optimized dequant kernels (+12% speed), expert streaming through the OS page cache, deferred GPU compute, and hand-tuned Metal shaders. 58 experiments are documented, with a full technical paper.
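The expert-streaming idea above can be sketched in plain C. This is a minimal illustration, not Flash-MoE's actual code: it assumes a hypothetical weight file where experts are stored contiguously, maps it read-only with `mmap`, and lets the OS page cache fault pages in from SSD on first touch and keep hot experts resident.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical layout: all experts stored back-to-back in one file. */
typedef struct {
    const uint8_t *base; /* mmap'd weight file */
    size_t expert_size;  /* bytes per expert */
    size_t n_experts;
} expert_store;

/* Map the weight file read-only. Pages are loaded from SSD on first
 * access and cached by the kernel afterwards; no explicit I/O. */
int expert_store_open(expert_store *s, const char *path,
                      size_t expert_size, size_t n_experts) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    void *p = mmap(NULL, expert_size * n_experts, PROT_READ,
                   MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (p == MAP_FAILED) return -1;
    s->base = (const uint8_t *)p;
    s->expert_size = expert_size;
    s->n_experts = n_experts;
    return 0;
}

/* Pointer to one expert's weights; the kernel faults pages in as a
 * dequant kernel reads through them. */
const uint8_t *expert_weights(const expert_store *s, size_t idx) {
    return s->base + idx * s->expert_size;
}
```

The key design point is that the engine never issues reads itself: the page cache decides which experts stay in RAM, which is why only the active experts of a 397B MoE need to be resident at once.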
TurboQuant+: 6.4x KV Cache Compression for LLMs
TurboQuant+ implements the KV cache compression scheme from an ICLR 2026 paper, achieving 4.6-6.4x compression at near-q8_0 quality and speed. It features turbo2/turbo3/turbo4 formats, attention-gated Sparse V decoding (+22.8% decode speed), and full llama.cpp Metal integration. Run Qwen 3.5 35B-A3B on an M5 Max with 93.9% NIAH retrieval and 1.02x q8_0 prefill speed. Includes a complete Python prototype with 511+ tests and community validation across Apple Silicon, NVIDIA, and AMD.
Run TinyLlama on a $10 Board with PicoLM – A Complete Tutorial
Discover how PicoLM turns a $10 Raspberry Pi or LicheeRV board into a capable local LLM host. This tutorial walks you through downloading the TinyLlama 1.1B model, compiling the C-only engine, configuring PicoClaw for offline chat, and benchmarking performance on cheap hardware. Learn about zero-dependency design, flash attention, and JSON grammar constraints that let you generate structured output on a tiny device. Great for developers wanting a cost-effective, privacy-preserving LLM on edge hardware.