# rag-chunk: CLI Tool to Benchmark and Optimize RAG Chunking
Retrieval-Augmented Generation (RAG) is becoming a cornerstone of modern NLP pipelines, but the quality of a RAG system depends heavily on how well the source text is split into manageable chunks. Chunks that are too small bloat your index and fragment context; chunks that are too large dilute retrieval precision.
rag-chunk addresses this pain point with a simple command-line interface that lets you test, benchmark, and compare multiple chunking strategies side by side. It is written in Python, released under the MIT license, and available on PyPI, so you can drop it into any container or CI workflow with minimal friction.
## Core Features

| Feature | Description |
|---|---|
| Multiple strategies | Fixed-size (word- or token-based), sliding-window (context-preserving), paragraph (semantic boundaries), and recursive-character (LangChain integration). |
| Token-accurate splitting | Optional tiktoken support for GPT-3.5 and GPT-4 token limits; choose the model with `--tiktoken-model`. |
| Recall evaluation | Supply a JSON test file (`examples/questions.json`) to calculate how many relevant phrases appear in the top-k retrieved chunks. |
| Rich CLI output | Readable tables powered by Rich – clear and exportable. |
| Export | Save results in JSON, CSV, or table format; chunks can be dumped into a `.chunks/` folder for inspection. |
| Extensible | Add custom chunking logic in `src/chunker.py` and register it in the `STRATEGIES` dictionary. |
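To make the strategy names above concrete, here is a minimal sketch of what a fixed-size, word-based splitter with overlap typically does. This is an illustration, not rag-chunk's actual implementation (which lives in `src/chunker.py`); the dict shape is an assumption.

```python
from typing import Dict, List


def fixed_size_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> List[Dict]:
    """Split text into windows of chunk_size words, advancing by
    chunk_size - overlap words so consecutive chunks share context."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks: List[Dict] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append({"id": len(chunks), "text": " ".join(window)})
        if start + chunk_size >= len(words):
            break
    return chunks


# 250 words with chunk_size=100 and overlap=20 yields three chunks.
chunks = fixed_size_chunks("word " * 250, chunk_size=100, overlap=20)
```

The overlap means the last 20 words of each chunk reappear at the start of the next, which is what preserves context across chunk boundaries.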
## Quick Start

### Installation

```shell
# From PyPI
pip install rag-chunk              # basic
pip install "rag-chunk[tiktoken]"  # with optional tiktoken support
```
Tip – If you are working inside a virtual environment, install the `tiktoken` extra only when you need token-exact splitting.
### Simple Chunk Generation

```shell
rag-chunk analyze examples/ --strategy paragraph
```
### Benchmark All Strategies

```shell
rag-chunk analyze examples/ \
  --strategy all \
  --chunk-size 100 \
  --overlap 20 \
  --output table
```
### Validate with a Test File

```shell
rag-chunk analyze examples/ \
  --strategy all \
  --chunk-size 150 \
  --overlap 30 \
  --test-file examples/questions.json \
  --top-k 3 \
  --output json > results.json
```
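The recall metric behind `--test-file` can be sketched as follows: for each question, retrieve the top-k chunks and check whether the expected phrase appears in any of them. This sketch uses naive word-overlap ranking and assumed field names (`question`, `phrase`); rag-chunk's real retriever and test-file schema may differ, so check `examples/questions.json` for the actual format.

```python
from typing import Dict, List


def top_k_recall(questions: List[Dict], chunks: List[str], k: int = 3) -> float:
    """Fraction of questions whose expected phrase appears in the
    top-k chunks, ranked by word overlap with the question."""
    if not questions:
        return 0.0
    hits = 0
    for q in questions:
        q_words = set(q["question"].lower().split())
        ranked = sorted(
            chunks,
            key=lambda c: len(q_words & set(c.lower().split())),
            reverse=True,
        )
        if any(q["phrase"].lower() in c.lower() for c in ranked[:k]):
            hits += 1
    return hits / len(questions)


questions = [{"question": "where does the cat sleep", "phrase": "on the mat"}]
chunks = ["the cat sleeps on the mat", "dogs bark loudly", "birds fly south"]
recall = top_k_recall(questions, chunks, k=1)  # the matching chunk ranks first
```

A perfect score of 1.0 means every expected phrase was retrievable within the top-k chunks.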
## Choosing the Right Strategy

| Strategy | When to Use | Chunk Size Recommendation |
|---|---|---|
| Fixed-size | Uniform latency, baseline comparison | 150–250 words (or tokens with `--use-tiktoken`) |
| Sliding-window | Long paragraphs where context bleed matters | 120–200 words, 20–30% overlap |
| Paragraph | Markdown or prose with clear sections | Variable – natural paragraph boundaries |
| Recursive-character | Semantically dense texts, LangChain integration | LangChain defaults, overridable with `--chunk-size` |
If a strategy's `avg_recall` falls below 0.70, consider tweaking the chunk size, switching strategies, or increasing the overlap.
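The paragraph strategy recommended above for markdown and prose can be illustrated with a few lines: split on blank lines so that each chunk follows a natural semantic boundary. This is a simplified sketch, not rag-chunk's implementation.

```python
import re
from typing import List


def paragraph_chunks(text: str) -> List[str]:
    """Split on one or more blank lines and drop empty fragments,
    so each chunk is a natural paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


doc = "First paragraph.\n\nSecond paragraph,\nstill the same one.\n\n\nThird."
paras = paragraph_chunks(doc)  # three chunks, one per paragraph
```

Note that chunk sizes become variable here, which is exactly why the table recommends it only for texts with clear sectioning.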
## Extending rag-chunk
If you have a proprietary splitting algorithm, you can plug it in:
```python
# src/chunker.py
from typing import Dict, List


def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks = []
    # Your logic here – e.g., split by specific markdown headings
    return chunks


# Register in the global strategies
STRATEGIES = {
    "custom": my_custom_chunks,
    ...
}
```
```shell
rag-chunk analyze docs/ --strategy custom --chunk-size 180
```
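As a concrete example of the kind of logic you might plug in, here is a hypothetical custom splitter that starts a new chunk at every top-level markdown heading. Only the `STRATEGIES` registration pattern comes from the project; this particular splitter, its signature defaults, and the returned dict shape are illustrative assumptions.

```python
from typing import Dict, List


def heading_chunks(text: str, chunk_size: int = 0, overlap: int = 0) -> List[Dict]:
    """Start a new chunk at each line beginning with '# '.
    chunk_size and overlap are accepted for interface compatibility
    but unused by this strategy."""
    chunks: List[Dict] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("# ") and current:
            chunks.append({"id": len(chunks), "text": "\n".join(current)})
            current = []
        current.append(line)
    if current:
        chunks.append({"id": len(chunks), "text": "\n".join(current)})
    return chunks


doc = "# Intro\nhello\n# Usage\nrun it"
parts = heading_chunks(doc)  # two chunks, one per heading section
```

Because the function matches the `(text, chunk_size, overlap)` signature shown above, registering it under a name in `STRATEGIES` is all that is needed to expose it via `--strategy`.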
## Real-World Use Cases

- RAG Model Prototyping – Quickly measure how well your embeddings capture meaningful content.
- Production Index Tuning – Reduce the number of chunks to cut storage costs while maintaining recall.
- Model-Specific Token Boundaries – For GPT-4 with a 32k-token context window, generate 512-token chunks that fit exactly.
- Automated CI Checks – Add `rag-chunk` as a step in your CI pipeline to flag regressions in chunk quality.
## Getting Help & Contributing

- Source Code – https://github.com/messkan/rag-chunk
- Documentation – Read the full README at the repo or use `rag-chunk --help`.
- Issues/PRs – The repo is open to pull requests; feel free to propose new strategies or improve the docs.
- Community – Reach out on the issues page if you hit a bug or have a feature request.
## TL;DR

- rag-chunk is an MIT-licensed Python CLI that lets you benchmark RAG chunking strategies.
- Install via `pip install rag-chunk[tiktoken]`.
- Run quick benchmarks with `rag-chunk analyze <folder> --strategy all --chunk-size 150`.
- Export results as tables, JSON, or CSV; enable token-accurate splitting with `--use-tiktoken`.

Take the guesswork out of chunk selection, gain actionable metrics, and accelerate your RAG pipeline development today!