Bug Report: PDF Processing Returns Empty Content
Summary
DocStrange reports the PDF as processed successfully ("1 successful") but returns empty content for every output format. The PDF is valid and readable by other tools.
Environment
- docstrange version: 1.1.5
- OS: macOS (Darwin)
- Python: (from mise/pipx installation)
- Authentication: Authenticated cloud mode (10k/month free calls)
- PDF details:
  - File: 2512.14012.pdf (likely arXiv paper)
  - Size: 1.4 MB
  - Pages: 11 (confirmed with the file command)
  - Format: PDF 1.7
Steps to Reproduce
- Basic markdown conversion (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --verbose
Output:
Processing: /Users/ramarivera/Downloads/2512.14012.pdf
Summary: 1 successful, 0 failed
Initialized extractor in cloud mode:
- Output format: markdown
- Auth: authenticated (10k/month) free calls
[empty output]
- JSON with field extraction (fails):
docstrange ~/Downloads/2512.14012.pdf --output json --extract-fields title abstract authors
Output:
{
"document": {
"raw_content": ""
},
"format": "json_parse_error",
"error": "Expecting value: line 1 column 1 (char 0)"
}
- With OCR enabled (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --ocr-enabled --verbose
Still returns empty content.
- With Gemini model (fails):
docstrange ~/Downloads/2512.14012.pdf --model gemini --output markdown --verbose
Still returns empty content.
- Saving to file (fails):
docstrange ~/Downloads/2512.14012.pdf --output markdown --output-file output.md
Creates output.md with 0 bytes.
Expected Behavior
- Should extract text content from the PDF
- Should return markdown/JSON with the document content
- Should not report "successful" if content extraction failed
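To illustrate the last point, here is a minimal sketch of the kind of check I would expect before a run is counted as successful. It is not DocStrange's actual code: `EmptyExtractionError`, `require_content`, and `summarize` are hypothetical names used only for illustration.

```python
class EmptyExtractionError(RuntimeError):
    """Raised when a conversion finishes but yields no usable content."""


def require_content(content, source):
    """Return content unchanged, or raise if it is None, empty, or whitespace-only."""
    if content is None or not content.strip():
        raise EmptyExtractionError(f"{source}: conversion produced no content")
    return content


def summarize(results):
    """Count successes and failures, treating empty output as a failure.

    `results` maps a source path to the text the converter returned for it.
    """
    ok, failed = 0, 0
    for source, content in results.items():
        try:
            require_content(content, source)
            ok += 1
        except EmptyExtractionError:
            failed += 1
    return f"Summary: {ok} successful, {failed} failed"
```

With a check like this, the failing PDF would be reported as "0 successful, 1 failed" instead of passing silently.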
Actual Behavior
- Reports "1 successful" in summary
- Returns completely empty content (raw_content: "")
- No error messages or warnings about why extraction failed
- Output files are empty (0 bytes for markdown, error JSON for json format)
Additional Testing
✅ DocStrange works with simple text files:
echo "Test document" > test.txt
docstrange test.txt --output markdown
Returns:
# Text Document
Test document
❌ This specific PDF fails consistently
Tried all combinations of:
- Output formats: markdown, json, text, html
- Models: default (nanonets), gemini
- Flags: --ocr-enabled, --include-images, --preserve-layout
- All produce empty content
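In case it helps with reproduction, the format sweep can be scripted roughly as follows. This is a sketch that only uses the `--output` flag shown above and assumes `docstrange` is on the PATH; `run_cli` and `output_sizes` are names I made up for this report.

```python
import subprocess

FORMATS = ["markdown", "json", "text", "html"]  # the formats listed in this report


def run_cli(pdf_path, output_format):
    """Run the docstrange CLI once and return whatever it wrote to stdout."""
    completed = subprocess.run(
        ["docstrange", pdf_path, "--output", output_format],
        capture_output=True,
        text=True,
        check=False,  # keep going even if one invocation exits non-zero
    )
    return completed.stdout


def output_sizes(pdf_path, runner=run_cli):
    """Map each output format to the count of non-whitespace characters returned."""
    return {fmt: len("".join(runner(pdf_path, fmt).split())) for fmt in FORMATS}
```

Calling `output_sizes("2512.14012.pdf")` gives one number per format. Note that stdout may also contain status lines (such as the "Processing" and "Summary" lines shown earlier), and the json format returns an error envelope rather than nothing, so a non-zero count does not by itself mean content was extracted.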
Possible Causes
- Silent failure in PDF text extraction (no error logged)
- Cloud API returning empty response without error
- PDF might have embedded text that's not being detected
- PDF might need OCR but OCR isn't triggering properly
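To help tell the last two causes apart, one option is to check locally whether the PDF carries an embedded text layer. The sketch below assumes the third-party pypdf package is installed; `extract_page_texts` and `text_layer_report` are hypothetical helper names, and the 20-character threshold is an arbitrary assumption.

```python
def extract_page_texts(pdf_path):
    """Return the embedded text of each page, using pypdf (assumed installed)."""
    from pypdf import PdfReader  # imported lazily so the helper below works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def text_layer_report(page_texts, min_chars=20):
    """Classify a document by how many pages carry at least `min_chars` of text."""
    with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    if with_text == 0:
        verdict = "no text layer (OCR needed)"
    elif with_text < len(page_texts):
        verdict = "partial text layer"
    else:
        verdict = "full text layer"
    return with_text, len(page_texts), verdict
```

Running `text_layer_report(extract_page_texts("2512.14012.pdf"))` would show whether plain text extraction should have worked. If the verdict is "full text layer", the OCR theory becomes less likely and the cloud API response is the more probable culprit.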
Diagnostic Commands
# Verify PDF is valid
file 2512.14012.pdf
# Output: PDF document, version 1.7, 11 pages
# Check credentials
ls -la ~/.docstrange/credentials.json
# Exists and was created during authentication
# Test with verbose mode
docstrange 2512.14012.pdf --output json --verbose
# Shows successful authentication but empty content
Request
- Could you investigate why PDFs report "successful" but return empty content?
- Should there be more detailed error logging when extraction silently fails?
- Is there a way to get detailed debug logs to see what's happening during processing?
Sample File
The PDF that fails: 2512.14012.pdf (arXiv paper, 1.4MB, 11 pages)
I can provide the file if needed for debugging.
Note: Similar issues were searched before filing. This report was created with AI assistance.