NanoIndex

Open-source agentic harness for long documents. Self-validating trees. Entity graphs. Karpathy-inspired LLM wikis. Cited answers down to the pixel.

| Benchmark | Documents | Accuracy |
|---|---|---|
| FinanceBench | 84 SEC filings, avg 143 pages | 95% |
| DocBench Legal | 51 court filings, avg 54 pages | 96% |

If NanoIndex is useful, a ⭐ helps others find it.


The problem

Most RAG systems chop documents into chunks and turn them into embeddings. Two things break.

Structure is lost. A 200-page filing has a table of contents, numbered sections, tables with rows and columns. Chunking throws all of that away. Section 3.2 is no longer inside Section 3. A balance sheet table gets split across two chunks. The hierarchy the author wrote is gone.

Multi-hop questions fail. Many real questions need data from multiple sections. Computing a ratio requires the income statement and the balance sheet. Checking a legal clause means reading the clause, its definitions, and its exceptions. A chunk retriever finds one section, not the three you need, because the question doesn't match all of them equally in embedding space.

The result: wrong answers with citations that say "chunk_47" instead of a page and location an auditor can verify.


Who is this for?

  • Developers building RAG over long, structured documents (10-Ks, contracts, medical records)
  • Teams where citation accuracy is a compliance or audit requirement
  • Anyone hitting the limits of chunk-and-embed on multi-section documents

Not the right fit if: you're querying short documents (<10 pages) or need sub-second latency.


Part 1: Querying within a single long document

NanoIndex preserves document structure instead of destroying it. Nanonets OCR-3 extracts the table of contents, section hierarchy, and heading structure. NanoIndex builds a tree from these.
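As an illustration of what "building a tree" buys you, a section hierarchy of this kind can be modeled as nested nodes with a depth-first lookup. This is a minimal sketch with hypothetical names (`SectionNode`, `find`), not NanoIndex's internal format:

```python
from dataclasses import dataclass, field

@dataclass
class SectionNode:
    """Hypothetical tree node: a heading, its pages, and its subsections."""
    title: str
    pages: list[int] = field(default_factory=list)
    children: list["SectionNode"] = field(default_factory=list)

    def find(self, title: str) -> "SectionNode | None":
        """Depth-first lookup by section title."""
        if self.title == title:
            return self
        for child in self.children:
            hit = child.find(title)
            if hit is not None:
                return hit
        return None

# A toy 10-K outline: unlike chunking, Section 3.2 stays inside Item 3.
root = SectionNode("10-K", children=[
    SectionNode("Item 3. Legal Proceedings", pages=[41], children=[
        SectionNode("3.2 Environmental Matters", pages=[43]),
    ]),
])
print(root.find("3.2 Environmental Matters").pages)  # [43]
```

Because the hierarchy survives, an agent can read the outline and jump to a section's pages instead of hoping a chunk happens to match the query.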

NanoIndex Pipeline

| Document type | Examples | How NanoIndex navigates |
|---|---|---|
| Structured | 10-K filings, contracts, research papers | Uses the table of contents. Agent reads the outline, goes straight to the right section. |
| Semi-structured | Earnings releases, quarterly reports | Disambiguates repetitive headings ("Reconciliation" x8 becomes "Reconciliation: Q2 2023 Segment Data"). |
| Unstructured | Transcripts, scans, flat reports | Splits by page, extracts entities (people, companies, dates, amounts). The entity graph becomes the map. |
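The heading disambiguation in the semi-structured row can be sketched in a few lines. This is an illustrative stand-in (the `disambiguate` function and its input shape are assumptions); NanoIndex's actual disambiguation is LLM-driven:

```python
from collections import Counter

def disambiguate(headings: list[tuple[str, str]]) -> list[str]:
    """Append nearby context to headings whose text repeats.

    `headings` is a list of (heading, context) pairs, where context is
    something pulled from the same page (a table title, a period label).
    Unique headings are left untouched.
    """
    counts = Counter(h for h, _ in headings)
    return [f"{h}: {ctx}" if counts[h] > 1 else h for h, ctx in headings]

labeled = disambiguate([
    ("Reconciliation", "Q2 2023 Segment Data"),
    ("Reconciliation", "Q2 2023 Free Cash Flow"),
    ("Outlook", "FY 2023"),
])
print(labeled)
# ['Reconciliation: Q2 2023 Segment Data', 'Reconciliation: Q2 2023 Free Cash Flow', 'Outlook']
```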

When you ask a question, an LLM agent navigates this tree across multiple rounds. It reads page images directly. It verifies its calculations. It cites every answer with the exact page and pixel coordinates.

Quick start

pip install nanoindex
export NANONETS_API_KEY=your_key    # free at docstrange.nanonets.com (10K pages)
export ANTHROPIC_API_KEY=your_key   # or OPENAI_API_KEY, GOOGLE_API_KEY
from nanoindex import NanoIndex

# Pick your LLM
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
# ni = NanoIndex(llm="openai:gpt-5.4")
# ni = NanoIndex(llm="gemini:gemini-2.5-flash")
# ni = NanoIndex(llm="ollama:llama3")  # fully local

# Index a document
tree = ni.index("10k_filing.pdf")
answer = ni.ask("What was the free cash flow?", tree)

print(answer.content)                     # computed answer with reasoning
print(answer.citations[0].pages)          # [52]
print(answer.citations[0].bounding_boxes) # exact coordinates on the page

Build entity graph (optional)

By default, index() builds only the tree. To also extract entities and relationships:

ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", build_graph=True)
tree = ni.index("10k_filing.pdf")  # tree + entity graph
graph = ni.get_graph(tree)         # 921 entities, 103 relationships

The entity graph enables fast_vision and agentic_graph_vision modes. Without it, agentic_vision (the default) works fine using tree navigation alone.

Save and reload trees

Index once, query many times. Trees and graphs are JSON files you can save and load:

from nanoindex.utils.tree_ops import save_tree, load_tree, load_graph

# Save after indexing
save_tree(tree, "3M_2018_10K.json")

# Load later - no re-indexing needed
tree = load_tree("3M_2018_10K.json")
graph = load_graph("3M_2018_10K_graph.json")
answer = ni.ask("What was the operating margin?", tree)

Query modes

| Mode | LLM calls | Best for |
|---|---|---|
| agentic_vision (default) | 5-8 | Highest accuracy. Agent navigates tree, reads page images. |
| agentic_graph_vision | 4-6 | Entity graph seeds the search, agent reasons from there. |
| fast_vision | 2 | Simple fact lookups. Cheapest. |
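One way to think about the tradeoff is a dispatch heuristic like the sketch below. This is a toy illustration, not part of the NanoIndex API; the word-count threshold and the rule that `fast_vision` needs a graph are assumptions for the example:

```python
def choose_mode(question: str, has_graph: bool) -> str:
    """Toy heuristic: cheap lookups for short factual questions,
    graph-seeded search when a graph exists, full agentic otherwise."""
    simple = len(question.split()) <= 8 and "?" in question
    if simple and has_graph:
        return "fast_vision"
    if has_graph:
        return "agentic_graph_vision"
    return "agentic_vision"

print(choose_mode("What was the revenue?", has_graph=True))   # fast_vision
print(choose_mode("What was the revenue?", has_graph=False))  # agentic_vision
```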

Part 2: Querying across multiple documents (Karpathy-inspired wiki)

The harder problem is synthesis across documents: "How has 3M's revenue changed over 5 years?" or "Which company in my portfolio has the highest ROA?"

Inspired by Karpathy's LLM wiki pattern, NanoIndex compiles documents into a persistent, interlinked wiki that gets richer with every source you add and every question you ask.

from nanoindex.kb import KnowledgeBase

kb = KnowledgeBase("./sec-filings")
kb.add("3M_2018_10K.pdf")     # extracts entities, builds concept pages
kb.add("3M_2019_10K.pdf")     # updates existing concepts, flags changes
kb.add("3M_2020_10K.pdf")     # cross-references across all three years

answer = kb.ask("How has 3M's revenue changed from 2018 to 2020?")
kb.lint()  # find contradictions, stale claims, orphan pages

Add pre-built trees and graphs directly:

from nanoindex.utils.tree_ops import load_tree, load_graph

tree = load_tree("3M_2018_10K.json")
graph = load_graph("3M_2018_10K_graph.json")
kb.add_tree(tree, graph)

The wiki is a directory of markdown files. Open it in Obsidian and browse concept pages with [[backlinks]], entity graphs, and an activity log.
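Because the pages are plain Obsidian-style markdown, you can mine them with stdlib Python alone. A small sketch (the `outgoing_links` helper is illustrative, not a NanoIndex API):

```python
import re

# Matches [[Target]] and [[Target|display alias]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")

def outgoing_links(page_markdown: str) -> list[str]:
    """Return the page names referenced by [[...]] links, in order."""
    return WIKILINK.findall(page_markdown)

page = "Revenue fell in [[3M 2020 10-K]] versus [[3M 2019 10-K|the prior year]]."
print(outgoing_links(page))  # ['3M 2020 10-K', '3M 2019 10-K']
```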

Three layers:

  • Raw sources - your PDFs, immutable, never modified
  • The wiki - markdown pages with cross-references. The LLM writes and maintains all of it.
  • The schema - how the wiki is structured, what entity types to track, domain conventions
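The three layers above might look like the directory sketch below. Names (`sources/`, `wiki/`, `schema.md`) are illustrative assumptions, not the exact layout NanoIndex produces:

```python
import tempfile
from pathlib import Path

# Hypothetical on-disk layout mirroring the three layers.
root = Path(tempfile.mkdtemp()) / "sec-filings"
(root / "sources").mkdir(parents=True)   # raw PDFs, immutable
(root / "wiki").mkdir()                  # LLM-maintained markdown pages
(root / "wiki" / "3M Revenue.md").write_text(
    "Revenue declined in 2020; see [[3M 2020 10-K]].\n"
)
(root / "schema.md").write_text("Entity types: Company, Metric, Period\n")

print(sorted(p.name for p in root.iterdir()))  # ['schema.md', 'sources', 'wiki']
```

Only the wiki layer is ever rewritten; the sources stay untouched, and the schema tells the LLM what kinds of pages to maintain.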

How it compares

| | Chunk + Embed | Microsoft GraphRAG | PageIndex | NanoIndex |
|---|---|---|---|---|
| Indexing | Chunk text, embed | LLM per chunk | LLM per page | 1 OCR API call |
| Structure | Lost | Lost | Tree | Tree + entity graph |
| Navigation | Similarity search | Map-reduce | LLM tree walk | Multi-round agent |
| Multi-document | Vector DB | No | No | Wiki with [[backlinks]] |
| Citations | Chunk ID | None | Page number | Pixel coordinates |
| Vision | No | No | No | Page images to LLM |
| Cost per doc | Low | High | High | Low |

Roadmap

  • Agentic extraction - self-correcting structured extraction for tables and forms (invoice line items, insurance loss runs, bank statement reconciliation)
  • Real-world long document benchmarks - bank statement reconciliation, insurance loss run extraction, multi-document contract analysis
  • Streaming tree building - real-time tree construction as pages are parsed
  • Multi-agent wiki - multiple agents maintaining different sections of the wiki concurrently

CLI

nanoindex index report.pdf -o tree.json
nanoindex ask report.pdf "What was the revenue?"
nanoindex viz tree.json

Development

git clone https://github.com/nanonets/nanoindex.git && cd nanoindex
uv sync --extra dev && uv run pytest    # or: pip install -e ".[dev]" && pytest

Entity extraction: pip install nanoindex[gliner] (CPU) or pip install nanoindex[gliner-gpu] (GPU).


Apache 2.0. Built on Nanonets OCR-3.