Posts tagged with: Llama.cpp
Content related to Llama.cpp
TurboQuant+: 6.4x KV Cache Compression for LLMs
TurboQuant+ implements a KV cache compression technique published at ICLR 2026, achieving 4.6-6.4x compression at near-q8_0 quality and speed. It offers turbo2/turbo3/turbo4 formats, attention-gated Sparse V decoding (+22.8% decode speed), and full llama.cpp Metal integration. Run Qwen 3.5 35B-A3B on an M5 Max with 93.9% NIAH retrieval and 1.02x q8_0 prefill speed. Includes a complete Python prototype with 511+ tests and community validation across Apple Silicon, NVIDIA, and AMD hardware.
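As a rough illustration of what compression ratios like these mean for memory, here is a back-of-the-envelope KV cache size calculation. The model shape and bit-widths below are generic assumptions for illustration, not TurboQuant+'s actual formats or Qwen 3.5 35B-A3B's real configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size: K and V each store
    n_layers * n_kv_heads * head_dim values per token."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * seq_len * bits_per_value / 8

# Hypothetical model shape (illustrative only)
layers, kv_heads, dim, ctx = 48, 8, 128, 32768

fp16 = kv_cache_bytes(layers, kv_heads, dim, ctx, 16)    # uncompressed baseline
low  = kv_cache_bytes(layers, kv_heads, dim, ctx, 2.5)   # a ~2.5-bit format

print(f"fp16 baseline: {fp16 / 2**30:.1f} GiB")          # 6.0 GiB
print(f"compression vs fp16: {fp16 / low:.1f}x")         # 6.4x
```

At these assumed sizes, a 32K-token cache drops from about 6 GiB in fp16 to under 1 GiB at ~2.5 bits per value, which is where ratios in the 6x range come from: 16 bits / 2.5 bits = 6.4.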
Run AI Locally: RunAnywhere SDKs for iOS & Android
Discover RunAnywhere SDKs, an open-source toolkit for privacy-first, on-device AI in iOS and Android applications. This guide covers high-performance text generation, voice AI pipelines, structured outputs, and seamless model management. Learn how to run LLMs directly in your mobile apps via inference engines such as Llama.cpp for stronger privacy and a better user experience. Whether you're building a chat application or a voice assistant, RunAnywhere provides the tools and flexibility to deploy models on user devices, optimize performance, and keep data private. Get started with quick examples and explore the roadmap for future enhancements.