ComfyUI‑GGUF: Run Low‑Bit Models on Your GPU

The recent surge of low‑bit model formats such as GGUF has made it possible to run large diffusion networks on machines with limited VRAM. ComfyUI‑GGUF is a lightweight, open‑source extension that plugs directly into the ComfyUI ecosystem, letting you load quantized GGUF files for UNet/DiT diffusion models and even the T5 text encoder. This guide walks through the concepts, installation steps, and real‑world usage so you can start generating high‑quality images without investing in a high‑end GPU.

Why GGUF Matters

  • Size and Speed: GGUF stores model weights in a compact, block‑quantized format that can drop the effective bit‑width to around 4 or even 3 bits per weight with only a modest loss in output quality.
  • On‑the‑fly Dequantization: The extension dequantizes weights at runtime, keeping CPU/GPU memory usage low (see the short sketch after this list for the idea). This is especially useful for transformer/DiT architectures like Flux.
  • Cross‑Platform: Whether you’re on Windows, macOS, or Linux, the repository includes platform‑specific installation guidelines.
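
For intuition, here is a minimal Python sketch of what Q4_0‑style block dequantization looks like. It is an illustration only (the simplified abs‑max scaling is an assumption), not the code the extension actually runs:

import numpy as np

# Illustration of Q4_0-style block dequantization (not the extension's own code).
# A Q4_0 block stores 32 weights as one fp16 scale plus 32 unsigned 4-bit
# integers; each weight is recovered as scale * (q - 8).
def dequantize_q4_0_block(scale, nibbles):
    """nibbles: uint8 array of 32 values in [0, 15]."""
    return np.float32(scale) * (nibbles.astype(np.float32) - 8.0)

# Toy round trip: quantize one block, then recover an approximation of it.
weights = np.random.randn(32).astype(np.float32)
scale = np.float16(np.abs(weights).max() / 7.0)   # simplified abs-max scaling (assumption)
q = np.clip(np.round(weights / np.float32(scale)) + 8, 0, 15).astype(np.uint8)
print("max abs error:", np.abs(weights - dequantize_q4_0_block(scale, q)).max())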

Supported Models at a Glance

Model                              Quantization   GGUF Variant
Flux 1‑Dev                         Q4_0           flux1-dev.gguf
Flux Schnell                       Q4_0           flux1-schnell.gguf
Stable Diffusion 3.5‑Large         Q4_0           stable-diffusion-3.5-large.gguf
Stable Diffusion 3.5‑Large‑Turbo   Q4_0           stable-diffusion-3.5-large-turbo.gguf
T5‑v1.1‑XXL                        Q4_0           t5_v1.1-xxl.gguf

Diffusion models are dropped into the ComfyUI/models/unet folder (the T5 encoder goes into the CLIP folder instead), where they are discovered by the new GGUF loader nodes.
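
If you want to confirm the files landed where the loaders look, a quick listing helps; the paths below assume a default ComfyUI folder layout, so adjust them to your install:

from pathlib import Path

# Typical default locations in a ComfyUI install; adjust to your setup.
comfy_root = Path("ComfyUI")
for sub in ("models/unet", "models/clip"):
    folder = comfy_root / sub
    found = sorted(p.name for p in folder.glob("*.gguf")) if folder.exists() else []
    print(f"{sub}: {found or 'no .gguf files found'}")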

1️⃣ Installation Prerequisites

  1. ComfyUI – Ensure you’re running a recent ComfyUI version (post‑October 2024) that supports custom ops.
  2. Python 3.9+ – The extension relies on the gguf package.
  3. Git – Clone the repo locally.

⚠️ For macOS, use torch 2.4.1. Torch 2.6.* nightly releases trigger an “M1 buffer is not large enough” error.

2️⃣ Clone the Repository

# From your ComfyUI installation root
git clone https://github.com/city96/ComfyUI-GGUF custom_nodes/ComfyUI-GGUF

After cloning, install the sole inference dependency:

pip install --upgrade gguf

If you run the stand‑alone ComfyUI portable build, run those commands from within the ComfyUI_windows_portable folder and invoke pip through the embedded interpreter (typically python_embeded\python.exe) rather than a system‑wide Python.

3️⃣ Replace the Standard Loader

Open your ComfyUI workflow editor and replace the standard Load Diffusion Model node with the new Unet Loader (GGUF) node. The node lives under the bootleg category.

💡 The node auto‑scans the unet folder for .gguf files; simply drop in the quantized .gguf file and you're ready to go.

4️⃣ Optional: Quantize Your Own Models

If you have a full‑precision checkpoint, you can quantize it yourself with the scripts in the tools folder.

  1. Place the original checkpoint (.ckpt, .safetensors, or .bin) in tools.
  2. Run the provided quantizer script (uses the gguf CLI under the hood). Example:
python tools/quantize.py --input sd3-large.ckpt --output sd3-large.gguf --bits 4

This will produce a sd3-large.gguf that you can place in your unet folder.
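
As a sanity check, you can open the result with the same gguf Python package installed earlier; the snippet below is a sketch that assumes the current reader API (GGUFReader) and the example file name from above:

from gguf import GGUFReader  # same package installed with `pip install gguf`

# Sketch: list which quantization type each tensor ended up with.
reader = GGUFReader("sd3-large.gguf")
for tensor in reader.tensors[:10]:   # first few tensors only
    print(tensor.name, list(tensor.shape), tensor.tensor_type.name)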

5️⃣ Experimental LoRA Support

LoRA support is currently experimental, but the standard built‑in LoRA nodes have been reported to work. Simply load your LoRA file (.safetensors or .ckpt) alongside the GGUF UNet; ComfyUI applies the LoRA patches to the dequantized weights at runtime.

6️⃣ Platform‑Specific Tips

  • Windows (portable build): open a command prompt inside ComfyUI_windows_portable and run the pip install -r requirements.txt step for the extension through the embedded interpreter rather than a system Python.
  • macOS (Sequoia): Use torch==2.4.1 to avoid the "buffer is not large enough" error mentioned above.
  • Linux: Standard pip install works; ensure you have a recent CUDA toolkit if you plan to use GPU acceleration.

🚀 Running Low‑Bit Inference

After setting up, launch ComfyUI and use a simple workflow:

  1. Add Unet Loader (GGUF).
  2. Add the GGUF CLIP/T5 loader node if you need a quantized text encoder.
  3. Insert the usual prompt (CLIP Text Encode) and sampler (KSampler) nodes.
  4. Queue the prompt to generate (a scripted alternative follows below).
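
Once the graph works interactively, the same run can be scripted against ComfyUI's HTTP API; the sketch below assumes a local server on the default port 8188 and a workflow exported in API format under the placeholder name workflow_api.json:

import json
import urllib.request

# Load a workflow previously exported from ComfyUI in API format.
with open("workflow_api.json") as f:
    prompt = json.load(f)

# Queue it on the local ComfyUI server (default address and port assumed).
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # the response includes the queued prompt id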

You’ll notice GPU memory usage drop from ~10 GB (full precision) to ~4 GB or less, depending on the bit‑width.
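
The exact figures depend on the model and on how much ComfyUI offloads, but a back‑of‑envelope estimate shows where the savings come from (the parameter count and per‑weight scale overhead below are assumptions, not measurements):

# Rough weight-storage estimate only -- real VRAM use also depends on
# activations, the text encoder, the VAE, and offloading.
params = 8e9  # assumed round figure, roughly an SD3.5-Large-sized transformer

for label, bits, overhead in [("FP16", 16, 0.0), ("Q8_0", 8, 0.5), ("Q4_0", 4, 0.5)]:
    gib = params * (bits + overhead) / 8 / 2**30   # overhead = per-block scales
    print(f"{label}: ~{gib:.1f} GiB of weights")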

📌 Takeaways

  • ComfyUI‑GGUF brings low‑bit inference to the forefront of creative AI tools.
  • It’s a clean, open‑source solution that cuts VRAM requirements with only a minor impact on visual fidelity.
  • With a few git clone commands and a pip install, you can start running Flux 1‑Dev or Stable Diffusion 3.5 on an NVIDIA RTX 4060 or even an integrated GPU.
  • Experiment with quantization levels – the library supports Q4_0 and Q4_1 as well as other GGUF types such as Q5_0, Q8_0, and the low‑bit K‑quants (e.g. Q3_K).

Happy generating, and let the low‑bit dream become a reality on your desktop!
