DeepSeek-OCR-Instant-PDF

Quick python demo for processing pdf with DeepSeek-OCR and result normalization. For the .mmd files obtained from DeepSeek-OCR, you can use the mmdNormalizer.py to transform it to the following json structure:

{
	"id": filename (without ".mmd"),
	"sections":{
		"section":{
			"sid": incremental int,
			"heading": "" | "heading content...",
			"heading_level": int,
			"contents": "" | [
				"paragraph 1",
				"paragraph 2",
				...
			]
		},
		...
	}
}

where heading_level refers to the last heading the program detects for this section. We also added an approach to enhance compatibility with multi-page pdf files. A sample input and output case is given:

Input

# Chapter 1
Some text split by

<--- Page Split --->

page boundary.

## Section 1.1
More content here.

Output JSON:

{
  "id": "document",
  "sections": [
    {
      "sid": 1,
      "heading": "Chapter 1",
      "heading_level": 1,
      "contents": ["Some text split by page boundary."]
    },
    {
      "sid": 2,
      "heading": "Chapter 1: Section 1.1",
      "heading_level": 2,
      "contents": ["More content here."]
    }
  ]
}

Prerequisites

This is an application based on DeepSeek-OCR. Please follow the instructions on the official installation of DeepSeek-OCR.

git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

# conda env 
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

# install DeepSeek-OCR
wget https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation

# download DeepSeek-OCR model
hf download deepseek-ai/DeepSeek-OCR --local-dir model # change 'model' to your custom path

Scripts Documentation

This directory contains various scripts for processing PDF documents using DeepSeek-OCR and normalizing the results.

Overview

The scripts are organized into three main categories:

PDF Processing - Convert PDFs to Markdown using DeepSeek-OCR
Result Normalization - Clean and structure the markdown output
Configuration - Centralized configuration management

Configuration

`config.py`

Central configuration file for all processing scripts.

Key Settings:

# Image Processing Modes
BASE_SIZE = 1024      # Base image size for global view
IMAGE_SIZE = 640      # Image size for local patches
CROP_MODE = True      # Enable dynamic cropping
MIN_CROPS = 2         # Minimum crop tiles
MAX_CROPS = 6         # Maximum crop tiles (reduce if low GPU memory)

# Processing Settings
MAX_CONCURRENCY = 100  # Batch size (reduce if low GPU memory)
NUM_WORKERS = 64       # Image preprocessing workers
SKIP_REPEAT = True     # Skip pages without EOS token

# Model
MODEL_PATH = 'deepseek-ai/DeepSeek-OCR'  # Path to model

# Prompt
PROMPT = '<image>\n<|grounding|>Convert the document to markdown.'

Preset Modes:

Tiny: base_size=512, image_size=512, crop_mode=False
Small: base_size=640, image_size=640, crop_mode=False
Base: base_size=1024, image_size=1024, crop_mode=False
Large: base_size=1280, image_size=1280, crop_mode=False
Gundam (recommended): base_size=1024, image_size=640, crop_mode=True

PDF Processing Scripts

1. `pdf_processor.py`

Core PDF processing class with DeepSeek-OCR integration.

Features:

PDF to image conversion
OCR inference with layout detection
Bounding box visualization
Image extraction from documents

Usage (as module):

from pdf_processor import PDFProcessor

processor = PDFProcessor()

result = processor.process_pdf(
    input_path="document.pdf",
    output_dir="output/",
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    skip_repeat=True,
    crop_mode=True
)

processor.cleanup()

Output:

output/
├── document.mmd                # Clean markdown
├── document_det.mmd            # Markdown with detection tags
├── document_layouts.pdf        # Annotated PDF with bounding boxes
└── images/                     # Extracted images
    ├── 0_0.jpg
    ├── 0_1.jpg
    └── ...

2. `task_parallel.py`

Multi-GPU batch processing - Distributes PDFs across multiple GPUs for parallel processing.

When to use: Processing multiple PDFs with multiple GPUs available.

Usage:

# Process directory with 2 GPUs
python task_parallel.py -p pdfs/ -o output --num-gpus 2

# Use specific GPUs
python task_parallel.py -p pdfs/ -o output --gpu-ids 0,3

# Process single PDF
python task_parallel.py -p document.pdf -o output

Arguments:

Argument	Description	Default
`-p, --path`	Input PDF file or directory	Required
`-o, --output`	Output directory	Required
`--num-gpus`	Number of GPUs to use	2
`--gpu-ids`	Specific GPU IDs (e.g., '0,1,3')	None
`--prompt`	Custom OCR prompt	From config
`--dpi`	Image conversion DPI	144
`--no-skip-repeat`	Process all pages	False
`--no-crop-mode`	Disable crop mode	False

How it works:

Splits PDF files across multiple GPUs
Each GPU runs a separate process (pdf_worker.py)
Processes PDFs in parallel for faster throughput

3. `tensor_parallel.py`

Tensor parallelism - Splits the model across multiple GPUs to reduce memory usage per GPU.

When to use:

Large PDFs that require high max_model_len
Limited GPU memory per device
Prefer throughput over latency

Usage:

# Use 2 GPUs with tensor parallelism
python tensor_parallel.py -p document.pdf -o output --tensor-parallel-size 2

# Use specific GPUs
python tensor_parallel.py -p document.pdf -o output --tensor-parallel-size 2 --gpu-ids 0,1

# Process directory (sequential)
python tensor_parallel.py -p pdfs/ -o output --tensor-parallel-size 4

Arguments:

Argument	Description	Default
`-p, --path`	Input PDF file or directory	Required
`-o, --output`	Output directory	Required
`--tensor-parallel-size`	Number of GPUs to split model across	2
`--gpu-ids`	Specific GPU IDs	None
`--model-path`	Path to model	From config
`--prompt`	Custom OCR prompt	From config
`--max-model-len`	Maximum sequence length	8192
`--gpu-memory-utilization`	GPU memory ratio	0.9

Memory Comparison:

Mode	Task Parallel	Tensor Parallel
GPU 0	40GB (full model)	20GB (half model)
GPU 1	40GB (full model)	20GB (half model)
Total	80GB, 2 PDFs parallel	40GB, 1 PDF sequential

4. `pdf_worker.py`

Worker script for task_parallel.py - Do not run directly.

This script is automatically launched by task_parallel.py for each GPU worker.

Result Normalization

`mmdNormalizer.py`

Converts markdown output to structured JSON format with proper section hierarchy.

Features:

Removes page split markers
Reconnects split paragraphs across pages
Extracts hierarchical section structure
Combines nested headings (e.g., "Chapter 1: Introduction: Background")

Usage:

# Single file
python mmdNormalizer.py -p output/document.mmd

# Directory
python mmdNormalizer.py -p output/ -o normalized/

Arguments:

Argument	Description
`-p, --path`	Input markdown file or directory
`-o, --output`	Output directory (optional, defaults to input location)

Workflow Examples

Example 1: Process Single PDF

# 1. Process PDF
python -c "
from pdf_processor import PDFProcessor
processor = PDFProcessor()
processor.process_pdf('document.pdf', 'output/')
processor.cleanup()
"

# 2. Normalize result
python mmdNormalizer.py -p output/document.mmd

Example 2: Batch Process with Multiple GPUs

# Process 100 PDFs using 4 GPUs
python task_parallel.py -p pdfs/ -o output --num-gpus 4

# Normalize all results
python mmdNormalizer.py -p output/ -o normalized/

Example 3: Process Large PDF with Limited GPU Memory

# Use tensor parallelism to split model across 2 GPUs
python tensor_parallel.py \
  -p large_document.pdf \
  -o output \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85

# Normalize
python mmdNormalizer.py -p output/large_document.mmd

Choosing the Right Script

Scenario	Recommended Script	Reason
Single PDF, 1 GPU	`pdf_processor.py`	Simple and direct
Multiple PDFs, multiple GPUs	`task_parallel.py`	Parallel processing for speed
Large PDF, limited memory	`tensor_parallel.py`	Distribute model across GPUs
Very large PDF (>100 pages)	`tensor_parallel.py` with high `max_model_len`	More tokens for long documents

Common Prompts

Configure in config.py or pass via --prompt:

# Document OCR (default)
"<image>\n<|grounding|>Convert the document to markdown."

# Free OCR (no layout detection)
"<image>\nFree OCR."

# Figure parsing
"<image>\nParse the figure."

# General description
"<image>\nDescribe this image in detail."

# Object localization
"<image>\nLocate <|ref|>text here<|/ref|> in the image."

Troubleshooting

Out of Memory (OOM)

Reduce MAX_CROPS in config.py (try 4 or 2)
Reduce MAX_CONCURRENCY (try 50 or 25)
Use tensor_parallel.py to split model
Lower --gpu-memory-utilization (try 0.8)

Pages Being Skipped

Set SKIP_REPEAT = False in config.py or use --no-skip-repeat

Poor OCR Quality

Increase IMAGE_SIZE and BASE_SIZE
Enable CROP_MODE = True
Try different prompts
Increase DPI: --dpi 200

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DeepSeek-OCR-Instant-PDF

Prerequisites

Scripts Documentation

Overview

Configuration

`config.py`

PDF Processing Scripts

1. `pdf_processor.py`

2. `task_parallel.py`

3. `tensor_parallel.py`

4. `pdf_worker.py`

Result Normalization

`mmdNormalizer.py`

Workflow Examples

Example 1: Process Single PDF

Example 2: Batch Process with Multiple GPUs

Example 3: Process Large PDF with Limited GPU Memory

Choosing the Right Script

Common Prompts

Troubleshooting

Out of Memory (OOM)

Pages Being Skipped

Poor OCR Quality

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DeepSeek-OCR-Instant-PDF

Prerequisites

Scripts Documentation

Overview

Configuration

config.py

PDF Processing Scripts

1. pdf_processor.py

2. task_parallel.py

3. tensor_parallel.py

4. pdf_worker.py

Result Normalization

mmdNormalizer.py

Workflow Examples

Example 1: Process Single PDF

Example 2: Batch Process with Multiple GPUs

Example 3: Process Large PDF with Limited GPU Memory

Choosing the Right Script

Common Prompts

Troubleshooting

Out of Memory (OOM)

Pages Being Skipped

Poor OCR Quality

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`config.py`

1. `pdf_processor.py`

2. `task_parallel.py`

3. `tensor_parallel.py`

4. `pdf_worker.py`

`mmdNormalizer.py`

Packages