Merged
4 changes: 3 additions & 1 deletion .gitignore
@@ -199,4 +199,6 @@ ENV/
.vscode/
.playwright-mcp/

examples/
examples/

venv1/
14 changes: 2 additions & 12 deletions CLAUDE.md
@@ -6,9 +6,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

DocStrange is a Python library for extracting and converting documents (PDFs, Word, Excel, PowerPoint, images, URLs) into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR capabilities.

The library offers three processing modes:
The library offers two processing modes:
- **Cloud Mode (default)**: Instant conversion using cloud API
- **CPU Mode**: Local processing for privacy
- **GPU Mode**: Local processing with GPU acceleration

## Commands
@@ -117,16 +116,10 @@ python -m twine upload dist/*
- Authentication: `docstrange login` or API key
- Best for: Quick processing without GPU

### CPU Mode
- Force with `cpu=True` parameter
- Uses local neural models
- Requires model downloads (~500MB first run)
- Best for: Privacy-sensitive documents

### GPU Mode
- Force with `gpu=True` parameter
- Requires CUDA-compatible GPU
- Faster than CPU for large documents
- Fastest local processing
- Best for: Batch processing, high-volume workloads

## Authentication & Rate Limits
@@ -215,9 +208,6 @@ structured = result.extract_data(json_schema=schema)

### Force local processing
```python
# CPU mode
extractor = DocumentExtractor(cpu=True)

# GPU mode (requires CUDA)
extractor = DocumentExtractor(gpu=True)
```
35 changes: 8 additions & 27 deletions README.md
@@ -38,7 +38,7 @@ DocStrange converts documents to Markdown, JSON, CSV, and HTML quickly and accur
> Extract documents data instantly with the cloud processing - no complex setup needed

> **🔒 Local Processing !**
> Use `cpu` or `gpu` mode for 100% local processing - no data sent anywhere, everything stays on your machine.
> Use `gpu` mode for 100% local processing - no data sent anywhere, everything stays on your machine.


## **What's New**
@@ -56,7 +56,7 @@ Convert and extract data from PDF, DOCX, images, and more into clean Markdown an

`DocStrange` is a Python library for converting a wide range of document formats—including **PDF**, **DOCX**, **PPTX**, **XLSX**, and **images** — into clean, usable data. It produces LLM-optimized **Markdown**, structured **JSON** (with schema support), **HTML**, and **CSV** outputs, making it an ideal tool for preparing content for RAG pipelines and other AI applications.

The library offers both a powerful cloud API and a 100% private, offline mode that runs locally on your CPU or GPU. Developed by **Nanonets**, DocStrange is built on a powerful pipeline of OCR and layout detection models and currently requires **Python >=3.8**.
The library offers both a powerful cloud API and a 100% private, offline mode that runs locally on your GPU. Developed by **Nanonets**, DocStrange is built on a powerful pipeline of OCR and layout detection models and currently requires **Python >=3.8**.

**To report a bug or request a feature, [please file an issue](https://github.com/NanoNets/docstrange/issues). To ask a question or request assistance, please use the [discussions forum](https://github.com/NanoNets/docstrange/discussions).**

@@ -185,12 +185,9 @@ print(structured_data)

**Local Processing**

For complete privacy and offline capability, run DocStrange entirely on your own machine. You can specify whether to use your CPU or GPU for processing.
For complete privacy and offline capability, run DocStrange entirely on your own machine using GPU processing.

```python
# Force local CPU processing
extractor = DocumentExtractor(cpu=True)

# Force local GPU processing (requires CUDA)
extractor = DocumentExtractor(gpu=True)
```
@@ -201,7 +198,7 @@ extractor = DocumentExtractor(gpu=True)

💡 Want a GUI? Run the simple, drag-and-drop local web interface for private, offline document conversion.

For users who prefer a graphical interface, DocStrange includes a powerful, self-hosted web UI. This allows for easy drag-and-drop conversion of PDF, DOCX, and other files directly in your browser, with 100% private, offline processing on your own CPU or GPU. The interface automatically downloads required models on its first run.
For users who prefer a graphical interface, DocStrange includes a powerful, self-hosted web UI. This allows for easy drag-and-drop conversion of PDF, DOCX, and other files directly in your browser, with 100% private, offline processing on your own GPU. The interface automatically downloads required models on its first run.

### How to get started?

@@ -230,9 +227,9 @@ python -c "from docstrange.web_app import run_web_app; run_web_app()"

- 🖱️ Drag & Drop Interface: Simply drag files onto the upload area.
- 📁 Multiple File Types: Supports PDF, DOCX, XLSX, PPTX, images, and more.
- ⚙️ Processing Modes: Choose between Local CPU and Local GPU processing.
- ⚙️ Processing Modes: Choose between Cloud and Local GPU processing.
- 📊 Multiple Output Formats: Get Markdown, HTML, JSON, CSV, and Flat JSON.
- 🔒 100% Local Processing: No data leaves your machine.
- 🔒 Privacy Options: Choose between cloud processing (default) or local GPU processing.
- 📱 Responsive Design: Works on desktop, tablet, and mobile

### **Supported File Types:**
@@ -245,9 +242,8 @@ python -c "from docstrange.web_app import run_web_app; run_web_app()"

### **Processing Modes:**

- **Local CPU**: Works offline, slower but private (default)
- **Cloud processing:** For instant, zero-setup conversion, you can head over to [docstrange.nanonets.com](http://docstrange.nanonets.com/) **—** no setup (default)
- **Local GPU**: Fastest local processing, requires CUDA support
- **Cloud processing:** For instant, zero-setup conversion, you can head over to [docstrange.nanonets.com](http://docstrange.nanonets.com/) **—** no setup

### **Output Formats:**

@@ -295,7 +291,7 @@ docstrange web --port 8001

- The interface automatically detects GPU availability
- GPU option will be disabled if CUDA is not available
- CPU mode will be selected automatically
- Error will be thrown

3. Model Download Issues:

@@ -367,13 +363,6 @@ csv_data = result.extract_csv()
print(csv_data)
```

**Requirements for enhanced JSON (if using cpu=True):**

- Install: `pip install 'docstrange[local-llm]'`
- [Install Ollama](https://ollama.ai/) and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`

*If Ollama is not available, the library automatically falls back to the standard JSON parser.*

**c. Extract Specific Fields & Structured Data**

@@ -484,11 +473,6 @@ contract_schema = {
contract_data = contract.extract_data(json_schema=contract_schema)
```

**Local extraction requirements (if using cpu=True):**

- Install ollama package: `pip install 'docstrange[local-llm]'`
- [Install Ollama](https://ollama.ai/) and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`

**e. Chain with LLM**

@@ -591,7 +575,6 @@ docstrange document.pdf
docstrange document.pdf --api-key YOUR_API_KEY

# Local processing modes
docstrange document.pdf --cpu-mode
docstrange document.pdf --gpu-mode

# Different output formats
@@ -629,8 +612,6 @@ docstrange document.pdf --output json --extract-fields title author date summary
# Or use API key for 10k docs/month access (alternative to login)
docstrange document.pdf --api-key YOUR_API_KEY --output json --extract-fields title author date summary

# Force local processing with field extraction (requires Ollama)
docstrange document.pdf --cpu-mode --output json --extract-fields key_points conclusions recommendations
```

**Example schema.json file:**
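For callers migrating off `cpu=True`, the practical effect of the README changes above is that the keyword no longer exists. The following is a minimal sketch, assuming the constructor ends with `gpu: bool = False` and accepts no `**kwargs`, as the `extractor.py` signature in this PR suggests. `DocumentExtractorSketch` and `try_legacy_cpu_call` are hypothetical stand-ins for illustration, not part of the library, and the sketch omits the GPU availability check.

```python
from typing import Optional


class DocumentExtractorSketch:
    """Hypothetical stand-in mirroring only the mode-related constructor arguments."""

    def __init__(self, api_key: Optional[str] = None,
                 model: Optional[str] = None, gpu: bool = False):
        self.api_key = api_key
        self.model = model
        self.gpu = gpu
        # Cloud mode is the default unless gpu is specified.
        self.cloud_mode = not self.gpu


def try_legacy_cpu_call() -> str:
    """Return the error message raised when the removed cpu keyword is passed."""
    try:
        DocumentExtractorSketch(cpu=True)
    except TypeError as exc:
        return str(exc)
    return ""
```

Under these assumptions, old call sites that pass `cpu=True` would fail at construction time with a `TypeError` rather than silently switching to cloud processing, which matters for privacy-sensitive workloads.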
24 changes: 6 additions & 18 deletions docstrange/cli.py
@@ -182,9 +182,6 @@ def main():
# Convert with free API key with increased limits
docstrange document.pdf --api-key YOUR_API_KEY

# Force local CPU processing
docstrange document.pdf --cpu-mode

# Force local GPU processing
docstrange document.pdf --gpu-mode

@@ -207,10 +204,10 @@ def main():
# Convert multiple files
docstrange file1.pdf file2.docx file3.xlsx --output markdown

# Extract specific fields using Ollama (CPU mode only) or cloud
# Extract specific fields using cloud processing
docstrange invoice.pdf --output json --extract-fields invoice_number total_amount vendor_name

# Extract using JSON schema (Ollama for CPU mode, cloud for default mode)
# Extract using JSON schema with cloud processing
docstrange document.pdf --output json --json-schema schema.json

# Save output to file
@@ -242,12 +239,6 @@ def main():
)

# Processing mode arguments
parser.add_argument(
"--cpu-mode",
action="store_true",
help="Force local CPU-only processing (disables cloud mode)"
)

parser.add_argument(
"--gpu-mode",
action="store_true",
@@ -280,12 +271,12 @@ def main():
parser.add_argument(
"--extract-fields",
nargs="+",
help="Extract specific fields using Ollama (CPU mode) or cloud (default mode) (e.g., --extract-fields invoice_number total_amount)"
help="Extract specific fields using cloud processing (e.g., --extract-fields invoice_number total_amount)"
)

parser.add_argument(
"--json-schema",
help="JSON schema file for structured extraction using Ollama (CPU mode) or cloud (default mode)"
help="JSON schema file for structured extraction using cloud processing"
)

parser.add_argument(
@@ -361,7 +352,6 @@ def main():
extractor = DocumentExtractor(
api_key=args.api_key,
model=args.model,
cpu=args.cpu_mode,
gpu=args.gpu_mode
)
print_supported_formats(extractor)
@@ -404,12 +394,11 @@ def main():
extractor = DocumentExtractor(
api_key=args.api_key,
model=args.model,
cpu=args.cpu_mode,
gpu=args.gpu_mode
)

if args.verbose:
mode = "local" if (args.cpu_mode or args.gpu_mode) else "cloud"
mode = "local" if args.gpu_mode else "cloud"
print(f"Initialized extractor in {mode} mode:")
print(f" - Output format: {args.output}")
if mode == "cloud":
@@ -418,8 +407,7 @@ def main():
if args.model:
print(f" - Model: {args.model}")
else:
processor_type = "GPU" if args.gpu_mode else "CPU"
print(f" - Local processing: {processor_type}")
print(f" - Local processing: GPU")
print()

# Process inputs
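The net effect of the `cli.py` hunks above can be sketched in isolation: one remaining mode flag, and a two-way mode decision. This is a simplified illustration, not the actual `main()`. The `build_parser` and `resolve_mode` helpers, the program name, and the `--gpu-mode` help string are assumptions (the real help text is elided in the diff).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser that keeps only the surviving mode flag."""
    parser = argparse.ArgumentParser(prog="docstrange-sketch")
    parser.add_argument("input", nargs="*", help="Files or URLs to convert")
    parser.add_argument(
        "--gpu-mode",
        action="store_true",
        help="Force local GPU processing (hypothetical help text)",
    )
    return parser


def resolve_mode(args: argparse.Namespace) -> str:
    """Mirror the verbose-output logic: local only when --gpu-mode is set."""
    return "local" if args.gpu_mode else "cloud"


if __name__ == "__main__":
    parsed = build_parser().parse_args(["document.pdf", "--gpu-mode"])
    print(f"Initialized extractor in {resolve_mode(parsed)} mode")
```

With the `--cpu-mode` argument no longer registered, `argparse` rejects it as an unrecognized argument and exits, so scripts that still pass it would need updating.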
44 changes: 13 additions & 31 deletions docstrange/extractor.py
@@ -34,7 +34,6 @@ def __init__(
ocr_enabled: bool = True,
api_key: Optional[str] = None,
model: Optional[str] = None,
cpu: bool = False,
gpu: bool = False
):
"""Initialize the file extractor.
@@ -45,35 +44,28 @@ def __init__(
ocr_enabled: Whether to enable OCR for image and PDF processing
api_key: API key for cloud processing (optional). Prefer 'docstrange login' for 10k docs/month; API key from https://app.nanonets.com/#/keys is an alternative
model: Model to use for cloud processing (gemini, openapi) - only for cloud mode
cpu: Force local CPU-only processing (disables cloud mode)
gpu: Force local GPU processing (disables cloud mode, requires GPU)

Note:
- Cloud mode is the default unless cpu or gpu is specified
- Cloud mode is the default unless gpu is specified
- Without login or API key, limited calls per day
- For 10k docs/month, run 'docstrange login' (recommended) or use an API key from https://app.nanonets.com/#/keys
"""
self.preserve_layout = preserve_layout
self.include_images = include_images
self.api_key = api_key
self.model = model
self.cpu = cpu
self.gpu = gpu

# Determine processing mode
# Cloud mode is default unless CPU/GPU preference is explicitly set
self.cloud_mode = not (self.cpu or self.gpu)

# Validate CPU/GPU preferences
if self.cpu and self.gpu:
raise ValueError("Cannot specify both cpu and gpu. Choose one or neither.")
# Cloud mode is default unless GPU preference is explicitly set
self.cloud_mode = not self.gpu

# Check GPU availability if GPU preference is set
if self.gpu and not should_use_gpu_processor():
raise RuntimeError(
"GPU preference specified but no GPU is available. "
"Please ensure CUDA is installed and a compatible GPU is present, "
"or use cpu=True for CPU-only processing."
"Please ensure CUDA is installed and a compatible GPU is present."
)

# Default to True if not explicitly set
@@ -157,7 +149,7 @@ def authenticate(self, force_reauth: bool = False) -> bool:
return False

def _setup_local_processors(self):
"""Setup local processors based on CPU/GPU preferences."""
"""Setup local processors based on GPU preferences."""
local_processors = [
PDFProcessor(preserve_layout=self.preserve_layout, include_images=self.include_images, ocr_enabled=self.ocr_enabled),
DOCXProcessor(preserve_layout=self.preserve_layout, include_images=self.include_images),
@@ -169,19 +161,11 @@ def _setup_local_processors(self):
URLProcessor(preserve_layout=self.preserve_layout, include_images=self.include_images),
]

# Add GPU processor based on preferences and availability
gpu_available = should_use_gpu_processor()

if self.cpu:
logger.info("CPU preference specified - using CPU-based processors only")
elif self.gpu:
if gpu_available:
logger.info("GPU preference specified - adding GPU processor with Nanonets OCR")
gpu_processor = GPUProcessor(preserve_layout=self.preserve_layout, include_images=self.include_images, ocr_enabled=self.ocr_enabled)
local_processors.append(gpu_processor)
else:
# This should not happen due to validation in __init__, but just in case
raise RuntimeError("GPU preference specified but no GPU is available")
# Add GPU processor if GPU preference is specified
if self.gpu:
logger.info("GPU preference specified - adding GPU processor with Nanonets OCR")
gpu_processor = GPUProcessor(preserve_layout=self.preserve_layout, include_images=self.include_images, ocr_enabled=self.ocr_enabled)
local_processors.append(gpu_processor)

self.processors.extend(local_processors)

@@ -312,14 +296,12 @@ def get_processing_mode(self) -> str:
"""
if self.cloud_mode and self.api_key:
return "cloud"
elif self.cpu:
return "cpu_forced"
elif self.gpu:
return "gpu_forced"
elif should_use_gpu_processor():
return "gpu_auto"
else:
return "cpu_auto"
return "cloud"

def _get_processor(self, file_path: str):
"""Get the appropriate processor for the file.
Expand All @@ -340,7 +322,7 @@ def _get_processor(self, file_path: str):
gpu_available = should_use_gpu_processor()

# Try GPU processor only if format is supported AND (gpu OR auto-gpu)
if not self.cpu and ext in gpu_supported_formats and (self.gpu or (gpu_available and not self.gpu)):
if ext in gpu_supported_formats and (self.gpu or (gpu_available and not self.gpu)):
for processor in self.processors:
if isinstance(processor, GPUProcessor):
if self.gpu:
@@ -349,7 +331,7 @@ def _get_processor(self, file_path: str):
logger.info(f"Using GPU processor with Nanonets OCR for {file_path} (GPU available and format supported)")
return processor

# Fallback to normal processor selection (CPU processors)
# Fallback to normal processor selection
for processor in self.processors:
if processor.can_process(file_path):
# Skip GPU processor in fallback mode to avoid infinite loops
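The constructor logic after this change reduces to a small decision: cloud by default, GPU only when requested and available, and an error otherwise. A minimal sketch of that decision follows, assuming the availability check is injected as a callable so it can run without CUDA. `select_mode` and its return labels are hypothetical names for illustration; in the diff the check is `should_use_gpu_processor()`.

```python
from typing import Callable


def select_mode(gpu: bool, gpu_available: Callable[[], bool]) -> str:
    """Return 'cloud' unless gpu is requested; raise if it is requested but missing."""
    if not gpu:
        # Cloud mode is the default unless GPU preference is explicitly set.
        return "cloud"
    if not gpu_available():
        # With CPU mode removed there is no local fallback to suggest.
        raise RuntimeError(
            "GPU preference specified but no GPU is available. "
            "Please ensure CUDA is installed and a compatible GPU is present."
        )
    return "gpu"
```

Injecting the check keeps the sketch testable on machines without a GPU; for example, `select_mode(True, lambda: False)` raises `RuntimeError`, matching the error path in `__init__`.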
2 changes: 1 addition & 1 deletion docstrange/pipeline/model_downloader.py
@@ -76,7 +76,7 @@ def download_models(self, force: bool = False, progress: bool = True) -> Path:
if gpu_available:
logger.info("GPU detected - including Nanonets OCR model")
else:
logger.info("No GPU detected - skipping Nanonets OCR model (CPU-only mode)")
logger.info("No GPU detected - skipping Nanonets OCR model (cloud mode)")

models_to_download = [
("Layout Model", self.LAYOUT_MODEL),