git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python3 -m pytest tests/ -v  # 141 个测试通过
python3 benchmarks/demo.py   # 快速演示

llama.cpp 生产环境

git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# 使用 turbo 缓存运行
./build/bin/llama-server -m your-model.gguf --cache-type-k turbo3 --cache-type-v turbo3 -c 262144

📊 实战验证

✅ 511+ Python 测试，100% 覆盖率 ✅ Qwen 3.5 35B-A3B (MoE) 在 M5 Max 上 ✅ 9/9 NIAH 单针 使用 Sparse V（vs 基线 7/9） ✅ 32K 下 100% 多键检索 ✅ 社区测试：M1-M5 Mac、RTX 3090/4090/5090、AMD 6800XT

🎯 为什么重要

内存：6.4 倍更小的 KV 缓存 = 运行更大模型更长上下文速度：匹配 q8_0 预填充，32K 下 0.9 倍解码质量：turbo4 超越 q4_0，仅 +0.23% PPL vs q8_0

🚀 未来：TurboQuant+

每层自适应位分配
时序衰减压缩（30-34% 内存节省）
MoE 感知压缩
CUDA 后端（NVIDIA 支持）

状态：v1 完成并生产就绪。扩展功能将在上游 llama.cpp PR 后推出。

⭐ 给仓库点星，在下次长上下文运行中试试 turbo3。你的 RAM（和电费账单）会感谢你。

GitHub: turboquant_plus

原创文章: 查看原文

分享本文