Tagged: KV cache compression

Content related to KV cache compression

TurboQuant+: 6.4x KV Cache Compression for LLMs

March 29, 2026

Tags:

Apple Silicon, Llama.cpp, LLM inference, KV cache compression, TurboQuant

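TurboQuant+ implements the breakthrough KV cache compression method from ICLR 2026, reaching 4.6-6.4x compression at close to q8_0 quality and speed. It supports the turbo2/turbo3/turbo4 formats, attention-gated Sparse V decoding (+22.8% decode speed), and full llama.cpp Metal integration. Running Qwen 3.5 35B-A3B on an M5 Max, it achieves 93.9% NIAH retrieval and 1.02x the prefill speed of q8_0. The project includes a complete Python prototype with 511+ tests and has been community-validated on Apple Silicon, NVIDIA, and AMD.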
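To put the quoted 4.6-6.4x range in concrete terms, here is a minimal arithmetic sketch. It is not taken from the TurboQuant+ code: it assumes the ratios are measured against a 16-bit (fp16) KV cache, which the summary does not state, and the model shape is hypothetical. The helper names `kv_cache_bytes` and `effective_bits` are made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value=16.0):
    """Bytes needed to store K and V for every layer at a given bit width."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = one K and one V tensor
    return values * bits_per_value / 8


def effective_bits(compression_ratio, baseline_bits=16.0):
    """Average bits per stored value implied by a compression ratio.

    Assumption: the ratio is relative to `baseline_bits` (fp16 by default).
    """
    return baseline_bits / compression_ratio


# Hypothetical model shape and context length, chosen only for illustration.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32_768)

for ratio in (4.6, 6.4):  # the range quoted in the summary
    compressed = baseline / ratio
    print(
        f"{ratio}x: about {effective_bits(ratio):.2f} bits/value, "
        f"{baseline / 2**30:.2f} GiB -> {compressed / 2**30:.2f} GiB"
    )
```

Under these assumptions, a 6.4x ratio corresponds to roughly 2.5 bits per cached value on average. The actual bit layouts of the turbo2/turbo3/turbo4 formats are not described in the summary and are not modeled here.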
