Posts tagged with: TurboQuant

Content related to TurboQuant

TurboQuant+: 6.4x KV Cache Compression for LLMs

March 29, 2026

Tags:

Apple Silicon, Llama.cpp, LLM inference, KV cache compression, TurboQuant

TurboQuant+ implements the KV cache compression method presented at ICLR 2026, achieving 4.6-6.4x compression with quality and speed close to q8_0. It provides the turbo2, turbo3, and turbo4 formats, attention-gated Sparse V decoding (+22.8% decode speed), and full llama.cpp Metal integration. Qwen 3.5 35B-A3B runs on an M5 Max with 93.9% needle-in-a-haystack (NIAH) retrieval and 1.02x the prefill speed of q8_0. The project includes a complete Python prototype with 511+ tests and community validation across Apple Silicon, NVIDIA, and AMD.
