rag‑chunk: CLI Tool to Benchmark and Optimize RAG Chunking

Retrieval‑Augmented Generation (RAG) is becoming a cornerstone of modern NLP pipelines, but the quality of a RAG system heavily depends on how well the source text is split into manageable chunks. Too many tiny fragments and you explode your index; too large and you lose contextual fidelity.

rag‑chunk solves this pain point with a simple command‑line interface that lets you test, benchmark, and compare multiple chunking strategies side‑by‑side. It is written in Python, released under the MIT license, and is available on PyPI so you can drop it into any container or CI workflow with minimal friction.


Core Features

  • Multiple strategies – fixed-size (word- or token-based), sliding-window (context-preserving), paragraph (semantic boundaries), and recursive-character (LangChain integration).
  • Token-accurate splitting – optional tiktoken support for GPT-3.5 and GPT-4 token limits; choose the model with --tiktoken-model.
  • Recall evaluation – supply a JSON test file (examples/questions.json) to measure how many relevant phrases appear in the top-k retrieved chunks.
  • Rich CLI output – clear, readable tables powered by Rich.
  • Export – save results in table, JSON, or CSV format; chunks can be dumped into a .chunks/ folder for inspection.
  • Extensible – add custom chunking logic in src/chunker.py and register it in the STRATEGIES dictionary.
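To get a feel for the recall evaluation, here is one way such a test file could be built. The schema shown (a list of question/expected-phrase pairs) is an assumption for illustration; check examples/questions.json in the repo for the actual format:

```python
import json

# Hypothetical schema: each entry pairs a question with the phrases a
# relevant chunk should contain. This is an assumed layout, NOT the
# documented format -- see examples/questions.json in the repo.
questions = [
    {"question": "What license is rag-chunk released under?",
     "expected_phrases": ["MIT license"]},
    {"question": "Where are generated chunks saved?",
     "expected_phrases": [".chunks/"]},
]

with open("questions.json", "w") as f:
    json.dump(questions, f, indent=2)
```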

Quick Start

Installation

# From PyPI
pip install rag-chunk          # basic
pip install "rag-chunk[tiktoken]"  # with optional tiktoken support (quotes keep zsh happy)

Tip – if you are working inside a virtual environment, install the tiktoken extra only when you need token-exact splitting.

Simple Chunk Generation

rag-chunk analyze examples/ --strategy paragraph
You’ll get a table showing the number of chunks, the average recall (0 when no test file is supplied), and the directory where the fragments live.

Benchmark All Strategies

rag-chunk analyze examples/ \
  --strategy all \
  --chunk-size 100 \
  --overlap 20 \
  --output table
The CLI will run four strategies (fixed‑size, sliding‑window, paragraph, and recursive‑character) and report a concise comparison.

Validate with a Test File

rag-chunk analyze examples/ \
  --strategy all \
  --chunk-size 150 \
  --overlap 30 \
  --test-file examples/questions.json \
  --top-k 3 \
  --output json > results.json
The resulting JSON will contain overall recall per strategy and detailed per‑question metrics.
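Conceptually, the per-question recall behind these numbers can be sketched as follows. This is a simplified stand-in (the first top-k chunks play the role of the retrieved set), not rag-chunk's actual retrieval code:

```python
from typing import List

def phrase_recall(chunks: List[str], phrases: List[str], top_k: int) -> float:
    """Fraction of expected phrases found in the top_k retrieved chunks.

    Simplified sketch: a real ranker would score chunks against the
    question (embeddings, BM25, ...); here the first top_k chunks
    simply stand in for the retrieved set.
    """
    if not phrases:
        return 0.0
    retrieved = " ".join(chunks[:top_k]).lower()
    hits = sum(1 for p in phrases if p.lower() in retrieved)
    return hits / len(phrases)

chunks = ["rag-chunk is MIT licensed.", "Chunks are saved to .chunks/."]
print(phrase_recall(chunks, ["MIT licensed", ".chunks/"], top_k=2))  # → 1.0
```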


Choosing the Right Strategy

  • Fixed-size – uniform latency, baseline comparison; 150–250 words (or tokens with --use-tiktoken).
  • Sliding-window – long paragraphs where context bleed matters; 120–200 words with 20–30% overlap.
  • Paragraph – Markdown or prose with clear sections; variable size, following natural paragraph boundaries.
  • Recursive-character – semantically rich texts and LangChain integration; LangChain defaults, overridable with --chunk-size.

If the avg_recall for a strategy falls below 0.70, consider tweaking the chunk size, switching strategies, or increasing the overlap.
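To see how chunk size and overlap interact, here is a minimal word-based sliding-window splitter. It is a sketch of the general technique, not rag-chunk's own implementation:

```python
from typing import List

def sliding_window(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into word chunks of `chunk_size` words, each sharing
    `overlap` words with its predecessor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

text = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window(text, chunk_size=4, overlap=1):
    print(chunk)
```

With 10 words, a window of 4, and an overlap of 1, the window advances 3 words at a time, so consecutive chunks share exactly one word of context.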


Extending rag‑chunk

If you have a proprietary splitting algorithm, you can plug it in:

# src/chunker.py
from typing import Dict, List

def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks: List[Dict] = []
    # Your logic here – e.g., split by specific markdown headings
    return chunks

# Register it alongside the built-in strategies
STRATEGIES = {
    # ... existing strategies ...
    "custom": my_custom_chunks,
}
Run it via the CLI:
rag-chunk analyze docs/ --strategy custom --chunk-size 180
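As a concrete filling-in of that skeleton, the sketch below splits on level-2 markdown headings. The chunk dict shape ("id"/"text") and the unused chunk_size/overlap parameters are assumptions to match the strategy signature above, not rag-chunk's documented interface:

```python
import re
from typing import Dict, List

def heading_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    """Illustrative custom strategy: one chunk per '## ' markdown section.

    chunk_size and overlap are accepted only to match the assumed
    strategy signature; this splitter ignores them.
    """
    # Zero-width split just before each line starting with '## '
    sections = re.split(r"(?m)^(?=## )", text)
    return [{"id": i, "text": s.strip()}
            for i, s in enumerate(sections) if s.strip()]

doc = "intro\n## Setup\npip install\n## Usage\nrun it"
for chunk in heading_chunks(doc, 0, 0):
    print(chunk["id"], chunk["text"].splitlines()[0])
```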


Real‑World Use Cases

  1. RAG Model Prototyping – Quickly measure how well your embeddings capture meaningful content.
  2. Production Index Tuning – Reduce the number of chunks to cut down storage while maintaining recall.
  3. Model‑Specific Token Boundary – For GPT‑4 with a 32k token context, generate exactly 512‑token chunks that fit.
  4. Automated CI Checks – Add rag‑chunk as a step in your CI pipeline to flag regressions in chunk quality.
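For the CI use case, a small gate script can fail the build when recall drops. This sketch assumes a results.json mapping each strategy name to an object with an avg_recall field; the tool's actual output schema may differ:

```python
import json
import sys

THRESHOLD = 0.70  # the tuning guideline suggested above

def failing_strategies(results: dict, threshold: float = THRESHOLD) -> list:
    """Return the strategies whose avg_recall falls below threshold.

    Assumes results maps strategy name -> {"avg_recall": float};
    adapt the lookup to rag-chunk's real JSON layout as needed.
    """
    return [name for name, r in results.items()
            if r.get("avg_recall", 0.0) < threshold]

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        bad = failing_strategies(json.load(f))
    if bad:
        print("Recall regression in:", ", ".join(bad))
        sys.exit(1)  # non-zero exit fails the CI step
```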

Getting Help & Contributing

  • Source Code – https://github.com/messkan/rag‑chunk
  • Documentation – Read the full README at the repo or use rag‑chunk --help.
  • Issues/PRs – The repo is open for pull requests; feel free to propose new strategies or improve docs.
  • Community – Reach out on the issues page if you hit a bug or have a feature request.

TL;DR

  • rag‑chunk is an MIT-licensed Python CLI that lets you benchmark RAG chunking strategies.
  • Install it with pip install "rag-chunk[tiktoken]".
  • Run a quick benchmark with rag-chunk analyze <folder> --strategy all --chunk-size 150.
  • Export results as tables, JSON, or CSV; enable token-accurate splitting with --use-tiktoken.

Take the guesswork out of chunk selection, gain actionable metrics, and accelerate your RAG pipeline development today!
