Posts tagged with: Llama.cpp

Content related to Llama.cpp

TurboQuant+: 6.4x KV Cache Compression for LLMs

March 29, 2026

TurboQuant+ implements the KV cache compression method from ICLR 2026, achieving 4.6-6.4x compression at near-q8_0 quality and speed. It features turbo2/turbo3/turbo4 formats, attention-gated Sparse V decoding (+22.8% decode speed), and full llama.cpp Metal integration. Run Qwen 3.5 35B-A3B on an M5 Max with 93.9% NIAH retrieval and 1.02x q8_0 prefill speed. Includes a complete Python prototype with 511+ tests and community validation across Apple Silicon, NVIDIA, and AMD.
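As a rough illustration of where such ratios come from (this is not the TurboQuant+ implementation, whose details are in the post itself), the sketch below assumes the turbo2/turbo3/turbo4 names denote 2-, 3-, and 4-bit per-block formats with one fp16 scale per 32-value block; under that assumption, 2-bit packing works out to exactly 6.4x over an fp16 cache and 3-bit to about 4.6x, matching the quoted range. The helper names and the block size of 32 are hypothetical.

```python
import numpy as np

def quantize_block(x: np.ndarray, bits: int):
    """Symmetric per-block quantization: floats -> signed ints + fp scale."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 1 for 2-bit, 7 for 4-bit
    scale = float(np.abs(x).max()) / qmax or 1.0    # avoid zero scale on all-zero blocks
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def compression_ratio(bits: int, block_size: int = 32) -> float:
    """fp16 baseline vs. low-bit values plus one fp16 scale per block."""
    packed_bits = bits * block_size + 16            # quantized values + scale
    return (16 * block_size) / packed_bits

block = np.random.randn(32).astype(np.float32)      # one block of K or V values
q, s = quantize_block(block, bits=2)
err = np.abs(dequantize_block(q, s) - block).mean()
print(f"mean abs error at 2-bit: {err:.3f}")
for bits in (2, 3, 4):
    print(f"turbo{bits} (assumed {bits}-bit): {compression_ratio(bits):.1f}x vs fp16")
```

Production formats (like llama.cpp's own K-quants) layer finer structure on top of this, such as super-blocks and quantized scales, which shifts the effective bits per value and hence the exact ratio.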

Run AI Locally: RunAnywhere SDKs for iOS & Android

November 12, 2025

Discover RunAnywhere SDKs, an open-source toolkit enabling privacy-first, on-device AI for iOS and Android applications. This guide covers high-performance text generation, voice AI pipelines, structured outputs, and seamless model management, and shows how to run LLMs directly in your mobile apps via engines like llama.cpp for better privacy and user experience. Whether you're building a chat application or a voice assistant, RunAnywhere offers the tools and flexibility needed to deploy models on user devices, optimize performance, and keep data private. Get started with quick examples and explore the roadmap for future enhancements.