Open-source agentic harness for long documents. Self-validating trees. Entity graphs. Karpathy-inspired LLM wikis. Cited answers down to the pixel.
| Benchmark | Accuracy |
|---|---|
| FinanceBench (84 SEC filings, avg 143 pages) | 95% |
| DocBench Legal (51 court filings, avg 54 pages) | 96% |
If NanoIndex is useful, a ⭐ helps others find it.
Most RAG systems chop documents into chunks and turn them into embeddings. Two things break.
Structure is lost. A 200-page filing has a table of contents, numbered sections, tables with rows and columns. Chunking throws all of that away. Section 3.2 is no longer inside Section 3. A balance sheet table gets split across two chunks. The hierarchy the author wrote is gone.
Multi-hop questions fail. Many real questions need data from multiple sections. Computing a ratio requires the income statement and the balance sheet. Checking a legal clause means reading the clause, its definitions, and its exceptions. A chunk retriever finds one section, not the three you need, because the question doesn't match all of them equally in embedding space.
The result: wrong answers with citations that say "chunk_47" instead of a page and location an auditor can verify.
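
To make the multi-hop failure concrete, here is a minimal sketch of top-k retrieval over three toy chunks. It is an illustration only: bag-of-words cosine similarity stands in for a real embedding model, and the chunk names and texts are made up for the example. The question needs the income statement and the balance sheet, but the chunk that merely repeats the question's wording ranks first.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: dict[str, str], k: int) -> list[str]:
    """Return the names of the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(
        chunks,
        key=lambda name: cosine(q, bag_of_words(chunks[name])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical chunks from a filing (texts invented for illustration).
chunks = {
    "income_statement": "net income for the year was 5 billion",
    "balance_sheet": "total assets at year end were 36 billion",
    "risk_factors": "return on assets may decline if demand weakens",
}
question = "what is the return on assets"
needed = {"income_statement", "balance_sheet"}  # ROA = net income / total assets

retrieved = top_k(question, chunks, k=2)
print("retrieved:", retrieved)
print("missing:", sorted(needed - set(retrieved)))
```

Even with k=2, at least one of the two sections required for the calculation is missing, because the risk-factors chunk shares the most words with the question.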
- Developers building RAG over long, structured documents (10-Ks, contracts, medical records)
- Teams where citation accuracy is a compliance or audit requirement
- Anyone hitting the limits of chunk-and-embed on multi-section documents
Not the right fit if: you're querying short documents (<10 pages) or need sub-second latency.
NanoIndex preserves document structure instead of destroying it. Nanonets OCR-3 extracts the table of contents, section hierarchy, and heading structure. NanoIndex builds a tree from these.
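
As a rough sketch of what "building a tree from the section hierarchy" can look like, the snippet below nests numbered headings so that Section 3.2 stays inside Section 3. This is not NanoIndex's implementation: the `Node` class, the `build_tree` helper, and the `(number, title, page)` input format are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    number: str  # e.g. "3.2"; empty string for the document root
    title: str
    page: int
    children: list[Node] = field(default_factory=list)


def build_tree(headings: list[tuple[str, str, int]]) -> Node:
    """Nest headings by their section number ("3.2" goes under "3")."""
    root = Node("", "Document", 0)
    by_number = {"": root}
    for number, title, page in headings:
        parent_number = number.rpartition(".")[0]  # "3.2" -> "3", "3" -> ""
        parent = by_number.get(parent_number, root)  # unknown parent: attach to root
        node = Node(number, title, page)
        parent.children.append(node)
        by_number[number] = node
    return root


# Hypothetical outline, as an OCR step might report it.
headings = [
    ("1", "Business", 3),
    ("3", "Financial Statements", 40),
    ("3.1", "Income Statement", 41),
    ("3.2", "Balance Sheet", 45),
]
root = build_tree(headings)
for section in root.children:
    print(section.number, section.title, [c.number for c in section.children])
```

Headings must arrive in document order for this sketch to work, since a parent has to be registered before its children.
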
| Document type | Examples | How NanoIndex navigates |
|---|---|---|
| Structured | 10-K filings, contracts, research papers | Uses the table of contents. Agent reads the outline, goes straight to the right section. |
| Semi-structured | Earnings releases, quarterly reports | Disambiguates repetitive headings ("Reconciliation" x8 becomes "Reconciliation: Q2 2023 Segment Data"). |
| Unstructured | Transcripts, scans, flat reports | Splits by page, extracts entities (people, companies, dates, amounts). The entity graph becomes the map. |
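
The heading disambiguation in the semi-structured row can be sketched as follows. How NanoIndex derives the extra context is not described here, so the `hint` field (for example, the first line of the section body) is an assumption, as is the `disambiguate` helper itself.

```python
from collections import Counter


def disambiguate(sections: list[tuple[str, str]]) -> list[str]:
    """Append a context hint to any heading that occurs more than once.

    Each section is (heading, hint); the hint is a hypothetical piece of
    context taken from the section itself.
    """
    counts = Counter(heading for heading, _ in sections)
    labels = []
    for heading, hint in sections:
        if counts[heading] > 1 and hint:
            labels.append(f"{heading}: {hint}")
        else:
            labels.append(heading)  # unique headings are left untouched
    return labels


sections = [
    ("Reconciliation", "Q2 2023 Segment Data"),
    ("Outlook", "Full Year 2023"),
    ("Reconciliation", "Q2 2023 Free Cash Flow"),
]
labels = disambiguate(sections)
print(labels)
```

Only repeated headings are rewritten, so a navigating agent sees distinct labels without the outline becoming noisier than it needs to be.
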
When you ask a question, an LLM agent navigates this tree across multiple rounds. It reads page images directly. It verifies its calculations. It cites every answer with the exact page and pixel coordinates.
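
The multi-round navigation can be pictured with a small loop like the one below. It is a sketch under heavy assumptions: a keyword-overlap function stands in for the LLM's choice at each round, the tree is a plain dict, and reading page images, verifying calculations, and producing pixel-level citations are left out entirely.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def subtree_words(node: dict) -> set[str]:
    """All title words in a node and its descendants."""
    found = words(node["title"])
    for child in node.get("children", []):
        found |= subtree_words(child)
    return found


def choose_by_keywords(question: str, children: list[dict]) -> dict:
    """Stand-in for the LLM: pick the child whose subtree best matches."""
    q = words(question)
    return max(children, key=lambda child: len(q & subtree_words(child)))


def navigate(tree: dict, question: str, choose=choose_by_keywords, max_rounds: int = 5):
    """Descend one level per round until a leaf or the round limit is reached."""
    node, path = tree, [tree["title"]]
    for _ in range(max_rounds):
        children = node.get("children", [])
        if not children:
            break
        node = choose(question, children)
        path.append(node["title"])
    return node, path


# Hypothetical tree for a filing.
tree = {"title": "Document", "children": [
    {"title": "Business", "page": 3},
    {"title": "Financial Statements", "children": [
        {"title": "Income Statement", "page": 41},
        {"title": "Balance Sheet", "page": 45},
    ]},
]}
leaf, path = navigate(tree, "What were total assets on the balance sheet?")
print(" > ".join(path), "| page", leaf["page"])
```

Passing `choose` as a parameter keeps the loop independent of how the decision is made, so the keyword heuristic could be swapped for a model call without changing the traversal.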
Install the package and set your API keys:

    pip install nanoindex
    export NANONETS_API_KEY=your_key   # free at docstrange.nanonets.com (10K pages)
    export ANTHROPIC_API_KEY=your_key  # or OPENAI_API_KEY, GOOGLE_API_KEY

Then index a document and ask a question:

    from nanoindex import NanoIndex

    # Pick your LLM
    ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
    # ni = NanoIndex(llm="openai:gpt-5.4")
    # ni = NanoIndex(llm="gemini:gemini-2.5-flash")
    # ni = NanoIndex(llm="ollama:llama3")  # fully local

    # Index a document
    tree = ni.index("10k_filing.pdf")

    answer = ni.ask("What was the free cash flow?", tree)
    print(answer.content)                      # computed answer with reasoning
    print(answer.citations[0].pages)           # [52]
    print(answer.citations[0].bounding_boxes)  # exact coordinates on the page

By default, index() builds only the tree. To also extract entities and relationships:

    ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", build_graph=True)
    tree = ni.index("10k_filing.pdf")  # tree + entity graph
    graph = ni.get_graph(tree)         # 921 entities, 103 relationships

The entity graph enables the fast_vision and agentic_graph_vision modes. Without it, agentic_vision (the default) works fine using tree navigation alone.
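
To illustrate why an entity graph can shortcut the search, here is a minimal sketch of an entity-to-page index used to seed candidate pages for a question. The data layout and both helper functions are assumptions for illustration; they do not reflect the structure returned by `get_graph`.

```python
from collections import defaultdict


def build_entity_index(mentions: list[tuple[str, int]]) -> dict[str, set[int]]:
    """Map each entity name (lowercased) to the pages that mention it."""
    index: dict[str, set[int]] = defaultdict(set)
    for entity, page in mentions:
        index[entity.lower()].add(page)
    return index


def seed_pages(question: str, index: dict[str, set[int]]) -> list[int]:
    """Pages mentioning any entity named in the question (substring match)."""
    q = question.lower()
    pages: set[int] = set()
    for entity, entity_pages in index.items():
        if entity in q:
            pages |= entity_pages
    return sorted(pages)


# Hypothetical (entity, page) mentions extracted from a filing.
mentions = [
    ("3M", 1),
    ("3M", 52),
    ("Free cash flow", 52),
    ("Michael Roman", 7),
]
index = build_entity_index(mentions)
print(seed_pages("What was 3M's free cash flow?", index))
```

Substring matching is deliberately naive here; a real system would need entity linking to handle aliases and partial names.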
Index once, query many times. Trees and graphs are JSON files you can save and load:

    from nanoindex.utils.tree_ops import save_tree, load_tree, load_graph

    # Save after indexing
    save_tree(tree, "3M_2018_10K.json")

    # Load later - no re-indexing needed
    tree = load_tree("3M_2018_10K.json")
    graph = load_graph("3M_2018_10K_graph.json")

    answer = ni.ask("What was the operating margin?", tree)

| Mode | LLM calls | Best for |
|---|---|---|
| agentic_vision (default) | 5-8 | Highest accuracy. Agent navigates tree, reads page images. |
| agentic_graph_vision | 4-6 | Entity graph seeds the search, agent reasons from there. |
| fast_vision | 2 | Simple fact lookups. Cheapest. |

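
One way to reason about the trade-off between modes is a small helper like the one below. The selection policy is an assumption, not a NanoIndex feature: it only encodes the constraints stated above (the two graph modes need an entity graph) together with the LLM call ranges from the table. How a mode is actually passed to a query is not shown here, so the helper only returns the mode name.

```python
# LLM calls per question, as (low, high), taken from the table above.
LLM_CALLS = {
    "agentic_vision": (5, 8),
    "agentic_graph_vision": (4, 6),
    "fast_vision": (2, 2),
}


def choose_mode(has_graph: bool, simple_lookup: bool) -> str:
    """Hypothetical policy: prefer cheaper modes when a graph is available."""
    if not has_graph:
        return "agentic_vision"  # the only mode that works without a graph
    if simple_lookup:
        return "fast_vision"
    return "agentic_graph_vision"


def estimate_calls(mode: str, questions: int) -> tuple[int, int]:
    """Rough (low, high) bound on LLM calls for a batch of questions."""
    low, high = LLM_CALLS[mode]
    return low * questions, high * questions


mode = choose_mode(has_graph=True, simple_lookup=False)
print(mode, estimate_calls(mode, questions=10))
```

If accuracy matters more than cost, the default agentic_vision remains a reasonable choice even when a graph is available.
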
The harder problem is synthesis across documents: "How has 3M's revenue changed over 5 years?" or "Which company in my portfolio has the highest ROA?"
Inspired by Karpathy's LLM wiki pattern, NanoIndex compiles documents into a persistent, interlinked wiki that gets richer with every source you add and every question you ask.

    from nanoindex.kb import KnowledgeBase

    kb = KnowledgeBase("./sec-filings")
    kb.add("3M_2018_10K.pdf")  # extracts entities, builds concept pages
    kb.add("3M_2019_10K.pdf")  # updates existing concepts, flags changes
    kb.add("3M_2020_10K.pdf")  # cross-references across all three years

    answer = kb.ask("How has 3M's revenue changed from 2018 to 2020?")

    kb.lint()  # find contradictions, stale claims, orphan pages

Add pre-built trees and graphs directly:

    from nanoindex.utils.tree_ops import load_tree, load_graph

    tree = load_tree("3M_2018_10K.json")
    graph = load_graph("3M_2018_10K_graph.json")
    kb.add_tree(tree, graph)

The wiki is a directory of markdown files. Open it in Obsidian and browse concept pages with [[backlinks]], entity graphs, and an activity log.
Three layers:
- Raw sources - your PDFs, immutable, never modified
- The wiki - markdown pages with cross-references. The LLM writes and maintains all of it.
- The schema - how the wiki is structured, what entity types to track, domain conventions
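
Because the wiki layer is plain markdown with [[backlinks]], simple checks can be written with the standard library. The sketch below finds orphan pages, meaning pages that no other page links to. It is not how `kb.lint()` works internally; the `find_orphans` helper, the in-memory `pages` dict, and the page names are all assumptions for illustration.

```python
import re

# Captures the target of [[Page]], [[Page|alias]] or [[Page#section]].
LINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(pages: dict[str, str], roots: tuple[str, ...] = ("index",)) -> list[str]:
    """Return pages that no other page links to, ignoring entry-point pages."""
    linked: set[str] = set()
    for name, text in pages.items():
        for target in LINK.findall(text):
            target = target.strip()
            if target != name:  # self-links do not count
                linked.add(target)
    return sorted(n for n in pages if n not in linked and n not in roots)


# Hypothetical wiki contents: page name -> markdown body.
pages = {
    "index": "See [[3M]] and [[Revenue]].",
    "3M": "Reported [[Revenue|net sales]] in its 10-K.",
    "Revenue": "Top-line sales.",
    "Scratch": "Unlinked note about [[3M]].",
}
orphans = find_orphans(pages)
print(orphans)
```

To run the same check against a wiki directory, the `pages` dict could be built with `pathlib.Path.glob("*.md")`, using each file's stem as the page name.
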
| | Chunk + Embed | Microsoft GraphRAG | PageIndex | NanoIndex |
|---|---|---|---|---|
| Indexing | Chunk text, embed | LLM per chunk | LLM per page | 1 OCR API call |
| Structure | Lost | Lost | Tree | Tree + entity graph |
| Navigation | Similarity search | Map-reduce | LLM tree walk | Multi-round agent |
| Multi-document | Vector DB | No | No | Wiki with [[backlinks]] |
| Citations | Chunk ID | None | Page number | Pixel coordinates |
| Vision | No | No | No | Page images to LLM |
| Cost per doc | Low | High | High | Low |
- Agentic extraction - self-correcting structured extraction for tables and forms (invoice line items, insurance loss runs, bank statement reconciliation)
- Real-world long document benchmarks - bank statement reconciliation, insurance loss run extraction, multi-document contract analysis
- Streaming tree building - real-time tree construction as pages are parsed
- Multi-agent wiki - multiple agents maintaining different sections of the wiki concurrently

NanoIndex also ships a command-line interface:

    nanoindex index report.pdf -o tree.json
    nanoindex ask report.pdf "What was the revenue?"
    nanoindex viz tree.json

To work on NanoIndex itself, clone the repository and run the tests:

    git clone https://github.com/nanonets/nanoindex.git && cd nanoindex
    uv sync --extra dev && uv run pytest  # or: pip install -e ".[dev]" && pytest

Entity extraction: pip install nanoindex[gliner] (CPU) or pip install nanoindex[gliner-gpu] (GPU).

Apache 2.0. Built on Nanonets OCR-3.


