April 3, 2026
Flash-MoE runs Qwen3.5-397B-A17B (397 billion parameters) on a MacBook Pro M3 Max with 48GB RAM at 4.4+ tokens/second. Its pure C/Metal inference engine streams the 209GB model from SSD and delivers production-quality output, including tool calling. Key innovations: FMA-optimized dequant kernels (+12% speed), expert streaming through the OS page cache, deferred GPU compute, and hand-tuned Metal shaders. The project documents 58 experiments and includes a full technical paper.
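The Flash-MoE figures above imply two pieces of arithmetic worth making explicit: how many bits each parameter occupies on disk, and why a dequantization step can map onto a single fused multiply-add (FMA). The sketch below is illustrative only. It assumes GB means 10^9 bytes, and it assumes a common affine quantization scheme in which `(q - zero) * scale` is rewritten as `q * scale + bias` with a precomputed `bias`. The summary does not say how Flash-MoE's kernels actually use FMA, so the function names and the scheme are hypothetical.

```python
# Figures quoted in the summary above; GB is assumed to mean 10**9 bytes.
TOTAL_PARAMS = 397e9
MODEL_BYTES = 209e9
RAM_BYTES = 48e9


def bits_per_param(model_bytes, params):
    """Average storage cost of one parameter, in bits."""
    return model_bytes * 8 / params


def dequant_sub_mul(q, scale, zero):
    """Textbook affine dequantization: one subtract, then one multiply."""
    return (q - zero) * scale


def fold_bias(scale, zero):
    """Precompute the constant term once per quantization group (assumed layout)."""
    return -zero * scale


def dequant_fma(q, scale, bias):
    """Same result in multiply-add form, the shape an FMA instruction computes."""
    return q * scale + bias


if __name__ == "__main__":
    print(f"bits per parameter: {bits_per_param(MODEL_BYTES, TOTAL_PARAMS):.2f}")
    print(f"model size / RAM:   {MODEL_BYTES / RAM_BYTES:.2f}x")
    bias = fold_bias(0.5, 8)
    print(dequant_sub_mul(3, 0.5, 8), dequant_fma(3, 0.5, bias))
```

Under these assumptions the model averages roughly 4.2 bits per parameter and is more than four times larger than RAM, which is why the weights must be streamed from SSD rather than held resident.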
Fine-tuning large language models can be complex and resource-intensive. LLaMA-Factory offers a unified, efficient platform for fine-tuning more than 100 Large Language Models (LLMs) and Vision Language Models (VLMs). This open-source project, recognized at ACL 2024, simplifies AI development workflows with a zero-code command-line interface and an intuitive Web UI. Trusted by companies such as Amazon and NVIDIA, LLaMA-Factory helps developers and researchers improve model performance on tasks ranging from multi-turn dialogue to multimodal understanding, using techniques such as QLoRA and FlashAttention-2. Explore how it can accelerate your AI projects.
Discover Unsloth, an open-source library for faster Large Language Model (LLM) fine-tuning. Achieve up to 2x faster training and reduce GPU VRAM consumption by up to 80% compared to standard methods. Unsloth supports a wide range of models, including Llama, Qwen, Gemma, and Mistral, along with Text-to-Speech and Vision models. Beginner-friendly notebooks let you fine-tune for free, making efficient training possible even on limited hardware.
Discover MergeKit, an open-source toolkit for merging pre-trained large language models (LLMs). It lets users combine the strengths of different models without extensive training or high computational overhead. With support for various merge methods, CPU/GPU execution, and low memory usage, MergeKit is well suited to creating custom LLMs. Learn how to install, configure, and use the toolkit in your AI projects, including multi-stage merging and LoRA extraction. Whether you're a researcher or a developer, MergeKit simplifies model integration and makes advanced LLM capabilities more accessible.
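To give a feel for what "merging" means at the weight level, here is a minimal sketch of a linear merge, that is, a weighted average of matching parameters from several models. It is not MergeKit's API: `linear_merge` and the toy parameter dictionaries are hypothetical, and plain Python lists stand in for real tensors.

```python
def linear_merge(state_dicts, weights):
    """Weighted average of matching parameters across models.

    state_dicts: list of {parameter_name: list of floats}, all with the same keys.
    weights: one non-negative weight per model; normalized internally.
    """
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")

    names = state_dicts[0].keys()
    for sd in state_dicts:
        if sd.keys() != names:
            raise ValueError("models must share the same parameter names")

    merged = {}
    for name in names:
        length = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights)) / total
            for i in range(length)
        ]
    return merged


if __name__ == "__main__":
    model_a = {"layer.weight": [1.0, 2.0]}
    model_b = {"layer.weight": [3.0, 6.0]}
    print(linear_merge([model_a, model_b], [1.0, 1.0]))
```

Because the operation is a per-parameter average with no gradient computation, it needs no training pass, which is consistent with the low overhead the summary describes.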