rag-chunk: CLI Tool to Benchmark and Optimize RAG Chunking

January 16, 2026

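Retrieval-Augmented Generation (RAG) is becoming a cornerstone of modern NLP pipelines, but the quality of a RAG system depends heavily on how well the source text is split into manageable chunks. Chunks that are too small bloat the index; chunks that are too large lose contextual fidelity.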

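rag-chunk provides a simple command-line interface for testing, benchmarking, and comparing multiple chunking strategies side by side. It is written in Python, released under the MIT license, and available from PyPI, so it can be added to containers or CI workflows with minimal friction.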

Core Features

Feature	Description
Multiple Strategies	Fixed-size (word or token based), Sliding-Window (context preserving), Paragraph (semantic boundaries), Recursive Character (LangChain integration).
Token-accurate Splitting	Optional tiktoken support for GPT-3.5 and GPT-4 token limits; choose the model with `--tiktoken-model`.
Recall Evaluation	Supply a JSON test file (`examples/questions.json`) to calculate how many relevant phrases appear in the top‑k retrieved chunks.
Rich CLI Output	Beautiful tables powered by Rich – clear, readable, and exportable.
Export	Save results to JSON, CSV, or table format; chunks can be dumped into a `.chunks/` folder for inspection.
Extensible	Add custom chunking logic in `src/chunker.py` and register it in the `STRATEGIES` dictionary.
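
To make the first two strategies in the table concrete, here is a minimal sketch of word-based fixed-size and sliding-window splitting. It is not rag-chunk's own implementation; the function names `fixed_size_chunks` and `sliding_window_chunks` are hypothetical, and whitespace tokenization is an assumption made for illustration.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive, non-overlapping chunks of chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()  # assumption: words are whitespace-separated
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunks of chunk_size words; neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    print(fixed_size_chunks(sample, 4))
    print(sliding_window_chunks(sample, 4, 2))
```

The sliding-window version produces more chunks for the same text, which is the index-size cost paid for keeping context across boundaries.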

Quick Start

Installation

# From PyPI
pip install rag-chunk          # basic
pip install "rag-chunk[tiktoken]" # with optional tiktoken support (quoted so shells such as zsh do not expand the brackets)

Tip: If you are working inside a virtual environment, install tiktoken only when you actually need token-accurate splitting.

Simple Chunk Generation

rag-chunk analyze examples/ --strategy paragraph

The table shows the number of chunks, the average recall (0 when no evaluation is run), and the directory where the chunks are written.
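
As a rough illustration of what a paragraph strategy does, the sketch below splits on blank lines and reports a chunk count. It assumes paragraphs are separated by one or more blank lines; `paragraph_chunks` and `summarize` are hypothetical names, not functions taken from rag-chunk.

```python
import re
from typing import Dict, List


def paragraph_chunks(text: str) -> List[str]:
    """Split text on blank lines and drop empty paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if p.strip()]


def summarize(chunks: List[str]) -> Dict[str, float]:
    """Return the chunk count and the average chunk length in words."""
    counts = [len(c.split()) for c in chunks]
    avg = sum(counts) / len(counts) if counts else 0.0
    return {"chunks": len(chunks), "avg_words": avg}


if __name__ == "__main__":
    doc = "# Title\n\nFirst paragraph here.\n\n\nSecond one.\n"
    chunks = paragraph_chunks(doc)
    print(chunks)
    print(summarize(chunks))
```

Because chunk boundaries follow the document's own structure, chunk sizes vary, so the average length is worth checking before indexing.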

Benchmark All Strategies

rag-chunk analyze examples/ \
  --strategy all \
  --chunk-size 100 \
  --overlap 20 \
  --output table

The CLI runs four strategies (fixed-size, sliding-window, paragraph, and recursive character splitting) and reports a concise comparison.

Validate with a Test File

rag-chunk analyze examples/ \
  --strategy all \
  --chunk-size 150 \
  --overlap 30 \
  --test-file examples/questions.json \
  --top-k 3 \
  --output json > results.json

The resulting JSON contains the overall recall for each strategy along with detailed per-question metrics.
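
The recall metric can be sketched as follows: for each question, retrieve the top-k chunks and measure what fraction of the relevant phrases appear in them. Everything here is an assumption made for illustration: the naive word-overlap ranking, the `question` and `relevant` keys, and the function names are hypothetical and may differ from rag-chunk's retrieval logic and from the actual schema of `examples/questions.json`.

```python
from typing import Dict, List


def top_k_chunks(question: str, chunks: List[str], k: int) -> List[str]:
    """Rank chunks by how many distinct question words they contain (naive lexical score)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def phrase_recall(relevant: List[str], retrieved: List[str]) -> float:
    """Fraction of relevant phrases found in at least one retrieved chunk."""
    if not relevant:
        return 0.0
    lowered = [c.lower() for c in retrieved]
    hits = sum(any(p.lower() in c for c in lowered) for p in relevant)
    return hits / len(relevant)


def average_recall(questions: List[Dict], chunks: List[str], k: int) -> float:
    """Mean phrase recall over all questions (assumed keys: 'question', 'relevant')."""
    if not questions:
        return 0.0
    scores = [
        phrase_recall(q["relevant"], top_k_chunks(q["question"], chunks, k))
        for q in questions
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    chunks = [
        "the cat sat on the mat",
        "dogs bark loudly at night",
        "python is a programming language",
    ]
    questions = [{"question": "what is python", "relevant": ["programming language", "snake"]}]
    print(average_recall(questions, chunks, k=1))
```

Under this definition, a strategy scores well when its chunks keep the relevant phrases together with the words a question is likely to match.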

Choosing the Right Strategy

Strategy	When to Use	Chunk Size Recommendation
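Fixed-size	Predictable latency, baseline comparisons	150–250 words (or tokens with `--use-tiktoken`)
Sliding-window	Long paragraphs where context is lost across chunk boundaries	120–200 words, 20–30% overlap
Paragraph	Markdown or other clearly sectioned text	Variable; follows natural paragraph boundaries
Recursive character	Semantically rich text, LangChain integration	Follow the LangChain defaults; override with `--chunk-size` as needed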

If avg_recall is below 0.70, consider adjusting the chunk size, switching strategies, or adding more overlap.
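
The overlap percentages in the table can be turned into concrete `--overlap` values with a little arithmetic. The sketch below converts a ratio into a word count and estimates how many windows a document of a given length would produce. Both helper names are hypothetical, and the estimate assumes a plain sliding window that advances by `chunk_size - overlap` words.

```python
import math


def overlap_words(chunk_size: int, overlap_ratio: float) -> int:
    """Convert an overlap ratio (e.g. 0.2 for 20%) into a number of words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Cap at chunk_size - 1 so the window always advances by at least one word.
    return min(round(chunk_size * overlap_ratio), chunk_size - 1)


def estimated_chunks(total_words: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover total_words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_words <= 0:
        return 0
    if total_words <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((total_words - chunk_size) / step)


if __name__ == "__main__":
    size = 150
    overlap = overlap_words(size, 0.2)
    print(overlap, estimated_chunks(1000, size, overlap))
```

Raising the overlap increases the chunk count, so recall gains should be weighed against the larger index.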

Extending rag‑chunk

If you have a proprietary splitting algorithm, you can plug it in as follows.

# src/chunker.py
from typing import Dict, List


def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks = []
    # Implement your logic here, e.g. split on Markdown headers
    return chunks


# Register it with the global strategies
STRATEGIES = {
    "custom": my_custom_chunks,
    # ... (existing strategies)
}

Run it from the CLI:

rag-chunk analyze docs/ --strategy custom --chunk-size 180
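
One way the custom function might be filled in is to start a new chunk at every Markdown header, as the comment in the stub suggests. This is only a sketch: the name `markdown_header_chunks` and the `{"id", "text"}` dictionary shape are assumptions, since the article does not specify which keys rag-chunk expects, and `overlap` is accepted only to match the signature.

```python
import re
from typing import Dict, List

# Matches the start of an ATX-style Markdown header line ("# ", "## ", ...).
HEADER = re.compile(r"^#{1,6}\s", re.MULTILINE)


def markdown_header_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[Dict]:
    """Start a new chunk at each Markdown header; cap chunks at chunk_size words.

    `overlap` is ignored in this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Offsets where a section begins: the start of the text plus every header line.
    starts = sorted({0, *(m.start() for m in HEADER.finditer(text))})
    bounds = starts + [len(text)]
    chunks: List[Dict] = []
    for begin, end in zip(bounds, bounds[1:]):
        words = text[begin:end].split()
        # Split oversized sections into consecutive pieces of chunk_size words.
        for i in range(0, len(words), chunk_size):
            piece = " ".join(words[i:i + chunk_size])
            chunks.append({"id": len(chunks), "text": piece})  # assumed dict shape
    return chunks


if __name__ == "__main__":
    doc = "# A\nalpha beta\n## B\ngamma delta epsilon\n"
    for chunk in markdown_header_chunks(doc, chunk_size=10):
        print(chunk)
```

Joining words with single spaces discards the original line breaks; a production version would likely preserve them.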
