PageIndex: The Open-Source Reasoning-Based RAG Framework

When you think of Retrieval-Augmented Generation (RAG), the first image that comes to mind is often a large vector database, a pile of embeddings, and a search that relies on cosine similarity. In practice, this approach can struggle with the kind of long, structured documents that financial analysts, legal teams, and academic researchers routinely encounter.

Why PageIndex?

PageIndex was built to answer a simple question: is a vector database really necessary for effective RAG? Its answer is no.

  • Vectorless – Instead of converting every page or chunk into an embedding, PageIndex builds a hierarchical tree that mirrors a document’s natural sections (think table of contents). Each node can optionally contain a summary.
  • No chunking – Traditional pipelines break documents into artificial chunks that often split context. PageIndex keeps whole sections intact, preserving narrative flow.
  • Human‑like retrieval – By allowing the LLM to browse the tree, PageIndex simulates how domain experts read and reason. The model can traverse the hierarchy, ask clarification questions, and backtrack—behaving more like a human analyst.
  • Explainability & traceability – Every answer can be traced back to a specific node and page, giving developers a clear audit trail.

Core Concepts

  1. Tree Index – A JSON structure where each node contains metadata: title, start/end indices, summary, and child nodes. The tree effectively becomes a “table of contents” tailored for LLM consumption.
  2. LLM Reasoning – Instead of a nearest‑neighbor lookup, the LLM reasons over the tree, performing a tree search. It queries which branch to explore next, then selects relevant sections to pass downstream.
  3. Optional Configurations – Users can control model, maximum tokens per node, depth, and whether to include node IDs or summaries.
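The two ideas above can be sketched in a few lines of Python. The node fields mirror those described in the concepts list (title, page range, summary, children); the `choose` callback is a hypothetical stand-in for the LLM call that PageIndex makes at each step of the tree search, so the matching logic here is illustrative, not the framework's actual API.

```python
# Minimal sketch of a PageIndex-style tree index plus a reasoning-driven
# traversal. `choose` stands in for the LLM that picks which branch to
# explore next; everything here is a toy illustration.

from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: list["Node"] = field(default_factory=list)

def search(node: Node, choose) -> Node:
    """Descend the tree, asking `choose` (an LLM in PageIndex) which
    child to explore next; stop when a leaf node is reached."""
    while node.children:
        node = choose(node.children)
    return node

# Toy document: an SEC-filing-like outline.
doc = Node("0", "10-K Filing", 1, 220, children=[
    Node("1", "Business Overview", 1, 40, "Company description"),
    Node("2", "Risk Factors", 41, 90,
         "Material risks, including interest-rate exposure", children=[
        Node("2.1", "Market Risk", 41, 60, "Interest-rate exposure"),
        Node("2.2", "Credit Risk", 61, 90, "Counterparty defaults"),
    ]),
])

# Stand-in for the LLM: pick the child whose summary mentions the query.
query = "interest-rate"
hit = search(doc, lambda kids: next(
    (k for k in kids if query in k.summary.lower()), kids[0]))
print(hit.title, hit.start_page, hit.end_page)  # Market Risk 41 60
```

Because every hop in the search is an explicit choice over named sections, the final answer carries its own audit trail: the path of nodes visited and the page range of the section returned.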

Quick Install & Run

# 1. Clone the repo
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex

# 2. Install dependencies
pip install -U -r requirements.txt

# 3. Set your OpenAI API key
# Create a `.env` in the repo root
# echo "CHATGPT_API_KEY=sk-…" > .env

# 4. Generate a tree for a PDF
python run_pageindex.py --pdf_path /path/to/your/document.pdf

Optional flags let you fine-tune the process:

  • --model gpt-4o – choose the model
  • --toc-check-pages 20 – pages to scan for a table of contents
  • --max-pages-per-node 10 – how many pages each node can hold
  • --if-add-node-summary yes – include a short summary for each node

For Markdown files that follow header conventions (#, ##, etc.), use --md_path instead.
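Once the run finishes, the tree index is plain JSON you can post-process directly. The field names (`"title"`, `"start_index"`, `"end_index"`, `"nodes"`) and the idea of loading a results file are assumptions for illustration; check the repo's actual output for the real schema. The point is that building a section-to-page audit trail takes only a few lines:

```python
# Sketch: flatten a PageIndex-style JSON tree into (section path, pages)
# pairs. The `tree` dict below is a toy stand-in for the JSON a real run
# would produce (e.g. loaded with json.load from the output file).

def flatten(node, path=()):
    """Yield a (section path, page range) pair for every node."""
    here = path + (node["title"],)
    yield " > ".join(here), (node.get("start_index"), node.get("end_index"))
    for child in node.get("nodes", []):
        yield from flatten(child, here)

tree = {"title": "Report", "start_index": 1, "end_index": 50, "nodes": [
    {"title": "Overview", "start_index": 1, "end_index": 10, "nodes": []},
    {"title": "Results", "start_index": 11, "end_index": 50, "nodes": [
        {"title": "Tables", "start_index": 20, "end_index": 30, "nodes": []},
    ]},
]}

for section, pages in flatten(tree):
    print(f"{section}: pages {pages[0]}-{pages[1]}")
```

This kind of listing is exactly the traceability story from earlier: every retrieved section maps back to a named node and a concrete page range.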

Use Cases

  • Finance – SEC filings run 200+ pages with deeply nested sections, and embeddings often miss nuance. PageIndex builds a tree that lets an LLM zoom into the exact section, improving FinanceBench accuracy to 98.7%.
  • Legal – Case law and contracts require exact paragraph citations. The tree preserves page ranges, so answers include precise location references.
  • Academia – Research papers have many subsections, and topic search over embeddings is unreliable. Node summaries guide the LLM to the relevant section, yielding more accurate citations.
  • Technical manuals – Firmware documents contain tables and diagrams. PageIndex can index images via OCR-free vision RAG, providing context directly from page images.

Benchmark Highlight

Mafin 2.5, a RAG system built on PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark—outperforming several vector-based RAG systems by a wide margin. The combination of a clean tree index and reasoning-driven search avoids many of the pitfalls of similarity-only retrieval.

Integration Options

  • Self‑hosted – Run the Python repo locally. Works on a laptop or server.
  • Chat Platform – VectifyAI hosts a ChatGPT‑style interface that you can try instantly.
  • MCP/API – Expose the functionality with minimal code and integrate into your own pipeline.

Future Directions

  1. Multi‑Modal Retrieval – Combine text and image nodes to support PDF images without OCR.
  2. Fine‑Grained Summaries – Leverage more advanced summarization models for better node explanations.
  3. Collaboration Features – Allow multiple users to annotate node paths and share retrieval logic.

Final Thoughts

PageIndex demonstrates that a well‑structured index, coupled with LLM reasoning, can replace the traditional vector‑based approach for many real‑world tasks. For developers looking to build reliable, explainable RAG systems on long documents, the framework offers a compelling, low‑code solution that keeps the user in the loop—much like an actual human expert.
