TurboQuant+: 6.4x KV Cache Compression for LLMs
Local LLM inference just got dramatically more efficient. TurboQuant+ implements the ICLR 2026 TurboQuant paper with production-ready llama.cpp integration, delivering 4.6-6.4x KV cache compression while maintaining near-q8_0 quality and speed.
Key Breakthroughs
1. Multi-Bit Format Family
| Format | Bits/Val | Compression | PPL vs q8_0 | Use Case |
|---|---|---|---|---|
| turbo4 | 4.25 | 3.8x | +0.23% | Best quality/speed |
| turbo3 | 3.5 | 4.6x | +1.06% | Maximum compression |
| turbo2 | 2.5 | 6.4x | +6.48% | Extreme memory pressure |
turbo4 outperforms q4_0 in both quality (closer to q8_0) and compression ratio.
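The fractional bits-per-value figures fall out of block-quantization overhead: each block of values carries a shared scale on top of the per-value codes. A minimal sketch of that arithmetic (the block sizes and scale widths below are illustrative assumptions, not the actual turbo format layouts):

```python
def bits_per_value(block_size: int, code_bits: int, scale_bits: int) -> float:
    """Effective storage cost per value for a quantization block that
    stores `block_size` codes of `code_bits` each plus one shared scale."""
    return (block_size * code_bits + scale_bits) / block_size

# Assumed layouts that reproduce the table's bits/val column:
print(bits_per_value(32, 4, 8))   # 4.25 (turbo4-like: 4-bit codes + 8-bit scale)
print(bits_per_value(32, 3, 16))  # 3.5  (turbo3-like: 3-bit codes + 16-bit scale)
print(bits_per_value(32, 2, 16))  # 2.5  (turbo2-like: 2-bit codes + 16-bit scale)
```

Any block layout with the same total bits per block gives the same effective rate, so the table's bits/val column pins down the overhead budget without fixing a unique layout.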
2. Sparse V: Attention-Gated Decoding
The killer feature: skip dequantization of low-attention V positions.
- +22.8% decode speed at 32K context
- Zero PPL degradation (validated on 50 wikitext-103 chunks, CI ±0.021)
- Works across ALL formats (q8_0, q4_0, turbo3/4)
- NIAH improvement: turbo4 = 31/33 vs q8_0's 30/33
"Most attention weights are negligible at long context. Skip them." β Core insight
3. Speed Parity Achieved
M5 Max, Qwen 3.5 35B-A3B:
- Prefill (32K context): turbo3 = 1.10x q8_0 (1204 vs 1098 tok/s)
- Decode: turbo4 = 0.93x q8_0 with Sparse V
- Real-world 24K PDF: turbo4 decode = 63.7 tok/s (0.93x q8_0)
Getting Started (5 Minutes)
Python Prototype
```shell
git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python3 -m pytest tests/ -v   # 141 tests pass
python3 benchmarks/demo.py    # Quick demo
```
llama.cpp Production
```shell
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Run with turbo cache
./build/bin/llama-server -m your-model.gguf --cache-type-k turbo3 --cache-type-v turbo3 -c 262144
```
Battle-Tested Validation
- 511+ Python tests, 100% coverage
- Qwen 3.5 35B-A3B (MoE) on M5 Max
- 9/9 NIAH single-needle with Sparse V (vs 7/9 baseline)
- 100% multi-key retrieval through 32K
- Community tested: M1-M5 Macs, RTX 3090/4090/5090, AMD 6800XT
Why This Matters
- Memory: a 6.4x smaller KV cache means running bigger models at longer contexts
- Speed: matches q8_0 prefill, 0.93x decode at 32K
- Quality: turbo4 beats q4_0, at only +0.23% PPL vs q8_0
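To make the memory claim concrete, here is a rough KV-cache sizing calculation. The model dimensions below (layers, KV heads, head size) are hypothetical round numbers for illustration, not the dims of any specific model:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 ctx: int, bits_per_val: float) -> float:
    """Approximate KV cache size in GiB: K and V tensors (hence the 2x),
    one entry per layer, KV head, head dimension, and context position."""
    bytes_total = 2 * layers * kv_heads * head_dim * ctx * bits_per_val / 8
    return bytes_total / 2**30

# Hypothetical dense model: 64 layers, 8 KV heads, head_dim 128, 128K context
fp16   = kv_cache_gib(64, 8, 128, 131072, 16)    # 32.0 GiB
turbo2 = kv_cache_gib(64, 8, 128, 131072, 2.5)   #  5.0 GiB
print(f"fp16: {fp16:.1f} GiB, turbo2: {turbo2:.1f} GiB "
      f"({fp16 / turbo2:.1f}x smaller)")
```

For these assumed dims, turbo2 turns a 32 GiB fp16 cache into 5 GiB, the 6.4x ratio from the format table; against a q8_0 baseline (8 bits/val) the saving is 3.2x.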
Future: TurboQuant+
- Adaptive bit allocation per layer
- Temporal decay compression (30-34% memory savings)
- MoE-aware compression
- CUDA backend (NVIDIA support)
Status: v1 complete and production-ready. Extensions coming post-PR to upstream llama.cpp.
Star the repo and try turbo3 on your next long-context run. Your RAM (and electric bill) will thank you.