Expose local PDFs via MCP or the standalone pdf-reader CLI with deterministic chunking, sandboxed access, and semantic search.

patriciomartinns/pdf-toolbox

PDF Toolbox

Expose local PDFs to MCP-compatible agents or run the standalone pdf-reader CLI with deterministic chunking, semantic search, and configurable defaults.


Highlights

  • FastMCP/STDIO server ready for Cursor, VS Code, Claude, and other MCP clients.
  • Typer/Click/Rich CLI (pdf-reader) prints JSON for easy piping.
  • read_pdf – extracts ordered text with page-window controls for quick inspection.
  • search_pdf – runs semantic similarity search over cached embeddings with custom top_k, score threshold, and chunk parameters.
  • describe_pdf_sections – emits deterministic chunks for classic RAG flows or, with --mode tables, returns structured tables (bbox, headers, cells) detected straight from the pages.
  • configure_pdf_defaults – adjusts chunk size/overlap, page windows, and the default embedding model at runtime.
  • Strict .pdf validation, sandboxed base path, and aggressive caching.
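The sandboxing and .pdf validation mentioned above can be pictured with a minimal sketch. This is not the project's actual code: the function name resolve_pdf and its error messages are hypothetical, and it only assumes that "sandboxed base path" means every requested path must resolve to a location inside one base directory.

```python
from pathlib import Path


def resolve_pdf(base_dir: str, user_path: str) -> Path:
    """Resolve user_path inside base_dir, rejecting escapes and non-PDF files.

    Hypothetical helper for illustration only; pdf-toolbox may differ.
    """
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, so the containment
    # check below also rejects absolute paths outside the sandbox.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # requires Python 3.9+
        raise ValueError(f"path escapes sandbox: {user_path}")
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"not a .pdf file: {user_path}")
    return candidate
```

Resolving before checking containment means that ".." segments and symlinks are normalized first, so a path such as ../secret.pdf is rejected rather than silently followed.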

Documentation

Quick install (uv)

# Run the MCP server directly
uvx --from git+https://github.com/patriciomartinns/pdf-toolbox -- pdf-toolbox --quiet

# Install/run the CLI
uv tool install --from git+https://github.com/patriciomartinns/pdf-toolbox pdf-reader
pdf-reader --help

Note: If you had the old mcp-pdf-reader CLI installed via uv tool install, run uv tool uninstall mcp-pdf-reader before installing pdf-reader to avoid conflicts.
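Because the CLI prints JSON, it can also be driven from a script. The sketch below is an assumption-heavy illustration: the helper names read_pdf_command and run_json are hypothetical, the read-pdf flags are the ones shown in the CLI quick tour, and it assumes stdout contains only the JSON document. The shape of the returned JSON is not documented here, so the code does not rely on any particular keys.

```python
import json
import subprocess


def read_pdf_command(pdf_path: str, start_page: int, end_page: int) -> list[str]:
    """Build the argv for `pdf-reader read-pdf` with a bounded page range."""
    return [
        "pdf-reader", "read-pdf", pdf_path,
        "--start-page", str(start_page),
        "--end-page", str(end_page),
    ]


def run_json(command: list[str]):
    """Run a command and parse its stdout as JSON.

    Raises subprocess.CalledProcessError on a non-zero exit code and
    FileNotFoundError if the executable is not installed.
    """
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

With pdf-reader installed, run_json(read_pdf_command("reports/Q1.pdf", 3, 5)) would return whatever structure the CLI emits for that page range.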

CLI quick tour

| Command | Purpose | Example |
| --- | --- | --- |
| pdf-reader read-pdf | Extract ordered text for a bounded page range. | pdf-reader read-pdf reports/Q1.pdf --start-page 3 --end-page 5 |
| pdf-reader search-pdf | Run semantic similarity search over cached embeddings. | pdf-reader search-pdf reports/Q1.pdf "rate limiting" --top-k 8 |
| pdf-reader describe-pdf-sections | List deterministic chunks with offsets for RAG pipelines. | pdf-reader describe-pdf-sections reports/Q1.pdf --max-chunks 5 |
| pdf-reader configure-pdf-defaults | Update runtime defaults for chunk size/overlap/page window/model. | pdf-reader configure-pdf-defaults --chunk-size 600 --chunk-overlap 120 --max-pages 10 |
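To make the chunk size and overlap settings concrete, here is a minimal sketch of deterministic chunking with offsets. It assumes character-based windows, which may not match how pdf-toolbox measures chunks; the function name chunk_text and the dictionary keys are hypothetical, and the 600/120 defaults simply mirror the configure-pdf-defaults example above.

```python
def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 120) -> list[dict]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Illustrative only: the same input and parameters always give the same
    chunks, which is what makes the chunking deterministic.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(
            {"index": len(chunks), "start": start, "end": end, "text": text[start:end]}
        )
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks
```

With the defaults, each window starts 480 characters after the previous one, so consecutive chunks share 120 characters of context.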

Tip: the first search-pdf invocation on a new document downloads the SentenceTransformers model and builds embeddings, so it is slower the first time for each model/PDF combination. Subsequent searches reuse the cache.
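One way to picture the "once per model/PDF combo" behavior is a cache key derived from the document and the embedding settings. The sketch below is an assumption, not the project's cache layout: the function name embedding_cache_key is hypothetical, and including the chunk parameters in the key is a design guess.

```python
import hashlib


def embedding_cache_key(
    pdf_bytes: bytes, model_name: str, chunk_size: int, chunk_overlap: int
) -> str:
    """Return a stable hex key for one (PDF content, model, chunking) combination.

    Hypothetical scheme: any change to the file bytes, the model name, or
    the chunk parameters yields a different key and therefore a cache miss.
    """
    digest = hashlib.sha256()
    digest.update(pdf_bytes)
    digest.update(f"|{model_name}|{chunk_size}|{chunk_overlap}".encode("utf-8"))
    return digest.hexdigest()
```

Hashing the file contents rather than the path means a renamed PDF still hits the cache, while an edited PDF does not.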

See the docs/ folder for full recipes covering both CLI commands and MCP client configuration. Questions or ideas? Open an issue on github.com/patriciomartinns/pdf-toolbox.
