TurboQuant+: Revolutionizing LLM Inference with 6.4x KV Cache Compression

Local LLM inference just got dramatically more efficient. TurboQuant+ implements the ICLR 2026 TurboQuant paper with production-ready llama.cpp integration, delivering 3.8-6.4x KV cache compression depending on format, while maintaining near-q8_0 quality and speed.

🚀 Key Breakthroughs

1. Multi-Bit Format Family

| Format | Bits/Val | Compression | PPL vs q8_0 | Use Case                |
|--------|----------|-------------|-------------|-------------------------|
| turbo4 | 4.25     | 3.8x        | +0.23%      | Best quality/speed      |
| turbo3 | 3.5      | 4.6x        | +1.06%      | Maximum compression     |
| turbo2 | 2.5      | 6.4x        | +6.48%      | Extreme memory pressure |

turbo4 outperforms q4_0 in both quality (closer to q8_0) and compression ratio.
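The compression ratios in the table follow directly from bits per value, assuming the baseline is an uncompressed fp16 KV cache (16 bits/value); a quick sanity check:

```python
# Sanity-check the table's compression ratios from bits/value.
# Assumption: baseline is an uncompressed fp16 KV cache (16 bits/value).
FP16_BITS = 16.0

FORMATS = {"turbo4": 4.25, "turbo3": 3.5, "turbo2": 2.5}

def compression_ratio(bits_per_val: float, baseline: float = FP16_BITS) -> float:
    """How many times smaller the cache is vs. the fp16 baseline."""
    return baseline / bits_per_val

for name, bits in FORMATS.items():
    print(f"{name}: {compression_ratio(bits):.1f}x")
# turbo4: 3.8x, turbo3: 4.6x, turbo2: 6.4x -- matching the table
```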

2. Sparse V: Attention-Gated Decoding

The killer feature: skip dequantization of low-attention V positions.

  • +22.8% decode speed at 32K context
  • Zero PPL degradation (validated on 50 chunks of WikiText-103, CI ±0.021)
  • Works across ALL formats (q8_0, q4_0, turbo3/4)
  • NIAH improvement: turbo4 = 31/33 vs q8_0's 30/33

"Most attention weights are negligible at long context. Skip them." – Core insight
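The core idea can be sketched in a few lines. This is an illustrative NumPy mock, not the repo's Metal kernel: `sparse_v_decode`, `dequantize`, and the `threshold` value are all stand-ins chosen for the example.

```python
import numpy as np

def sparse_v_decode(attn, v_quant, dequantize, threshold=1e-3):
    """Attention-gated V read (illustrative sketch, not the repo's kernel).
    Only rows of V whose attention weight exceeds `threshold` are
    dequantized; low-weight rows contribute almost nothing to the
    weighted sum, so skipping their decode is nearly free."""
    out = np.zeros(v_quant.shape[1], dtype=np.float32)
    for i in np.flatnonzero(attn > threshold):   # skip negligible positions
        out += np.float32(attn[i]) * dequantize(v_quant[i])
    return out

# Toy check with identity "dequantization": the two tiny weights are
# skipped, and the result barely differs from the exact dense sum.
attn = np.array([0.7, 0.2995, 0.0003, 0.0002])
v = np.arange(8, dtype=np.float32).reshape(4, 2)
dense = (attn @ v).astype(np.float32)
sparse = sparse_v_decode(attn, v, lambda row: row)
print(dense, sparse)   # nearly identical
```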

3. Speed Parity Achieved

M5 Max, Qwen 3.5 35B-A3B:

  • Prefill (32K context): turbo3 = 1.10x q8_0 (1204 vs 1098 tok/s)
  • Decode: turbo4 = 0.93x q8_0 with Sparse V

Real-world 24K PDF: turbo4 decode = 63.7 tok/s (0.93x q8_0)

🔥 Getting Started (5 Minutes)

Python Prototype

git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python3 -m pytest tests/ -v  # 141 tests pass
python3 benchmarks/demo.py   # Quick demo

llama.cpp Production

git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run with turbo cache
./build/bin/llama-server -m your-model.gguf --cache-type-k turbo3 --cache-type-v turbo3 -c 262144

📊 Battle-Tested Validation

✅ 511+ Python tests, 100% coverage
✅ Qwen 3.5 35B-A3B (MoE) on M5 Max
✅ 9/9 NIAH single needle with Sparse V (vs 7/9 baseline)
✅ 100% multi-key retrieval through 32K
✅ Community tested: M1-M5 Macs, RTX 3090/4090/5090, AMD 6800XT

🎯 Why This Matters

  • Memory: a 6.4x smaller KV cache means bigger models and longer contexts in the same RAM
  • Speed: matches q8_0 prefill, 0.93x decode at 32K
  • Quality: turbo4 beats q4_0, with only +0.23% PPL vs q8_0
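To put the memory claim in concrete terms, here is a back-of-the-envelope KV cache sizing. The model dimensions below are illustrative, not those of any specific model:

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bits_per_val):
    """Total KV cache size: K and V each hold one value per
    (position, layer, kv-head, head-dim) slot."""
    n_values = 2 * ctx_len * n_layers * n_kv_heads * head_dim
    return n_values * bits_per_val / 8

# Illustrative dims (not a specific model): 48 layers, 8 KV heads, head dim 128
fp16   = kv_cache_bytes(32768, 48, 8, 128, 16)    # uncompressed baseline
turbo2 = kv_cache_bytes(32768, 48, 8, 128, 2.5)   # 2.5 bits/value
print(f"fp16 @ 32K:   {fp16 / 2**30:.2f} GiB")    # 6.00 GiB
print(f"turbo2 @ 32K: {turbo2 / 2**30:.2f} GiB")  # 0.94 GiB, 6.4x smaller
```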

🚀 Future: TurboQuant+

  • Adaptive bit allocation per layer
  • Temporal decay compression (30-34% memory savings)
  • MoE-aware compression
  • CUDA backend (NVIDIA support)

Status: v1 is complete and production-ready. The extensions above will follow once the PR to upstream llama.cpp lands.


⭐ Star the repo and try turbo3 on your next long-context run. Your RAM (and electric bill) will thank you.

GitHub: turboquant_plus
