TurboQuant+: Revolutionizing LLM Inference with 6.4x KV Cache Compression

Local LLM inference just got dramatically more efficient. TurboQuant+ implements the ICLR 2026 TurboQuant paper with production-ready llama.cpp integration, delivering 3.8-6.4x KV cache compression depending on format, while maintaining near-q8_0 quality and speed.

🚀 Key Breakthroughs

1. Multi-Bit Format Family

| Format | Bits/Val | Compression | PPL vs q8_0 | Use Case                |
|--------|----------|-------------|-------------|-------------------------|
| turbo4 | 4.25     | 3.8x        | +0.23%      | Best quality/speed      |
| turbo3 | 3.5      | 4.6x        | +1.06%      | Maximum compression     |
| turbo2 | 2.5      | 6.4x        | +6.48%      | Extreme memory pressure |

turbo4 outperforms q4_0 in both quality (closer to q8_0) and compression ratio.
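The compression ratios in the table follow directly from bits per value, assuming the baseline is an uncompressed fp16 KV cache (16 bits/value); a quick sanity check:

```python
# Sanity-check the table's compression ratios from bits/value.
# Assumption: baseline is an uncompressed fp16 KV cache (16 bits/value).
FP16_BITS = 16.0

FORMATS = {"turbo4": 4.25, "turbo3": 3.5, "turbo2": 2.5}

def compression_ratio(bits_per_val: float, baseline: float = FP16_BITS) -> float:
    """How many times smaller the cache is vs. the fp16 baseline."""
    return baseline / bits_per_val

for name, bits in FORMATS.items():
    print(f"{name}: {compression_ratio(bits):.1f}x")
# turbo4: 3.8x, turbo3: 4.6x, turbo2: 6.4x -- matching the table
```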

2. Sparse V: Attention-Gated Decoding

The killer feature: skip dequantization of low-attention V positions.

  • +22.8% decode speed at 32K context
  • Zero PPL degradation (validated on 50 chunks of WikiText-103, CI ±0.021)
  • Works across ALL formats (q8_0, q4_0, turbo3/4)
  • NIAH improvement: turbo4 = 31/33 vs q8_0's 30/33

"Most attention weights are negligible at long context. Skip them." – Core insight
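The core idea can be sketched in a few lines. This is an illustrative NumPy mock, not the repo's Metal kernel: `sparse_v_decode`, `dequantize`, and the `threshold` value are all stand-ins chosen for the example.

```python
import numpy as np

def sparse_v_decode(attn, v_quant, dequantize, threshold=1e-3):
    """Attention-gated V read (illustrative sketch, not the repo's kernel).
    Only rows of V whose attention weight exceeds `threshold` are
    dequantized; low-weight rows contribute almost nothing to the
    weighted sum, so skipping their decode is nearly free."""
    out = np.zeros(v_quant.shape[1], dtype=np.float32)
    for i in np.flatnonzero(attn > threshold):   # skip negligible positions
        out += np.float32(attn[i]) * dequantize(v_quant[i])
    return out

# Toy check with identity "dequantization": the two tiny weights are
# skipped, and the result barely differs from the exact dense sum.
attn = np.array([0.7, 0.2995, 0.0003, 0.0002])
v = np.arange(8, dtype=np.float32).reshape(4, 2)
dense = (attn @ v).astype(np.float32)
sparse = sparse_v_decode(attn, v, lambda row: row)
print(dense, sparse)   # nearly identical
```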

3. Speed Parity Achieved

M5 Max, Qwen 3.5 35B-A3B:

  • Prefill (32K context): turbo3 = 1.10x q8_0 (1204 vs 1098 tok/s)
  • Decode: turbo4 = 0.93x q8_0 with Sparse V

Real-world 24K PDF: turbo4 decode = 63.7 tok/s (0.93x q8_0)

🔥 Getting Started (5 Minutes)

Python Prototype

git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python3 -m pytest tests/ -v  # 141 tests pass
python3 benchmarks/demo.py   # Quick demo

llama.cpp Production

git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run with turbo cache
./build/bin/llama-server -m your-model.gguf --cache-type-k turbo3 --cache-type-v turbo3 -c 262144

📊 Battle-Tested Validation

✅ 511+ Python tests, 100% coverage
✅ Qwen 3.5 35B-A3B (MoE) on M5 Max
✅ 9/9 NIAH single needle with Sparse V (vs 7/9 baseline)
✅ 100% multi-key retrieval through 32K
✅ Community tested: M1-M5 Macs, RTX 3090/4090/5090, AMD 6800XT

🎯 Why This Matters

  • Memory: a 6.4x smaller KV cache means bigger models and longer contexts in the same RAM
  • Speed: matches q8_0 prefill, 0.93x decode at 32K
  • Quality: turbo4 beats q4_0, with only +0.23% PPL vs q8_0
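To put the memory claim in concrete terms, here is a back-of-the-envelope KV cache sizing. The model dimensions below are illustrative, not those of any specific model:

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bits_per_val):
    """Total KV cache size: K and V each hold one value per
    (position, layer, kv-head, head-dim) slot."""
    n_values = 2 * ctx_len * n_layers * n_kv_heads * head_dim
    return n_values * bits_per_val / 8

# Illustrative dims (not a specific model): 48 layers, 8 KV heads, head dim 128
fp16   = kv_cache_bytes(32768, 48, 8, 128, 16)    # uncompressed baseline
turbo2 = kv_cache_bytes(32768, 48, 8, 128, 2.5)   # 2.5 bits/value
print(f"fp16 @ 32K:   {fp16 / 2**30:.2f} GiB")    # 6.00 GiB
print(f"turbo2 @ 32K: {turbo2 / 2**30:.2f} GiB")  # 0.94 GiB, 6.4x smaller
```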

🚀 Future: TurboQuant+

  • Adaptive bit allocation per layer
  • Temporal decay compression (30-34% memory savings)
  • MoE-aware compression
  • CUDA backend (NVIDIA support)

Status: v1 is complete and production-ready. The extensions above will follow once the PR to upstream llama.cpp lands.


⭐ Star the repo and try turbo3 on your next long-context run. Your RAM (and electric bill) will thank you.

GitHub: turboquant_plus
