
[Bug]: Using local LLM to import larger documents fails #836

@papst01

Description


🔍 Bug Summary

Importing multi-page PDFs with a local LLM fails because the document content is too large and is not truncated.

📖 Description

With larger PDFs (roughly 10+ pages) the analysis fails reproducibly.
Paperless-AI runs in an LXC (on Proxmox) installed from https://community-scripts.github.io/ProxmoxVE/scripts?id=paperless-ai

🔄 Steps to Reproduce

Install the LXC via the community script.
Link it to Paperless-ngx.
Use Qwen3-4B-GGUF (served by a Lemonade server).
Import a multi-page PDF into Paperless-ngx.
Trigger Paperless-AI to start processing, or wait for the cron job.
Open the Paperless-AI logs (/opt/paperless-ai/logs/logs.txt).
Wait for the ERROR (for me, within a minute) - it does not appear in the UI, only in the logs.

✅ Expected Behavior

Even larger documents are analyzed without problems.

❌ Actual Behavior

Large documents cannot be analyzed with a local LLM at all.

🏷️ Paperless-AI Version

3.0.9

📜 Docker Logs

Logs from /opt/paperless-ai/logs/logs.txt:
[2026-01-25T10:55:56.976Z] [INFO] [DEBUG] Found own user ID: 4
[2026-01-25T10:55:57.251Z] [INFO] [DEBUG] Fetched page 1, got 35 tags. [DEBUG] Total so far: 35
[2026-01-25T10:55:57.258Z] [INFO] [DEBUG] Found own user ID: 4
[2026-01-25T10:55:57.261Z] [INFO] [DEBUG] Fetched page 1, got 100 documents. [DEBUG] Total so far: 100
[2026-01-25T10:55:57.590Z] [INFO] [DEBUG] Fetched page 2, got 73 documents. [DEBUG] Total so far: 173
[2026-01-25T10:55:57.691Z] [INFO] [DEBUG] Finished fetching. Found 173 documents.
[2026-01-25T10:55:57.764Z] [INFO] [DEBUG] Document 182 rights for AI User - processed
[2026-01-25T10:55:57.885Z] [INFO] Thumbnail not cached, fetching from Paperless
[2026-01-25T10:55:57.919Z] [INFO] [DEBUG] Using character-based token estimation for model: Qwen3-4B-GGUF
[2026-01-25T10:55:57.920Z] [INFO] [DEBUG] Token calculation - Prompt: 605, Reserved: 2605, Available: 125395
[2026-01-25T10:55:57.920Z] [INFO] [DEBUG] Use existing data: yes, Restrictions applied based on useExistingData setting
[2026-01-25T10:55:57.920Z] [INFO] [DEBUG] External API data: none
[2026-01-25T10:55:57.920Z] [INFO] [DEBUG] Using character-based truncation for model: Qwen3-4B-GGUF
[2026-01-25T10:56:51.409Z] [INFO] [DEBUG] [25.01.26, 11:55] Custom OpenAI request sent
[2026-01-25T10:56:51.409Z] [INFO] [DEBUG] [25.01.26, 11:55] Total tokens: 4089
[2026-01-25T10:56:51.410Z] [ERROR] Failed to parse JSON response: SyntaxError: Unexpected token '*', "**Warranty"... is not valid JSON
    at JSON.parse (<anonymous>)
    at CustomOpenAIService.analyzeDocument (/opt/paperless-ai/services/customService.js:219:31)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async processDocument (/opt/paperless-ai/routes/setup.js:1603:16)
    at async /opt/paperless-ai/routes/setup.js:1527:28
[2026-01-25T10:56:51.411Z] [ERROR] Failed to analyze document: Error: Invalid JSON response from API
    at CustomOpenAIService.analyzeDocument (/opt/paperless-ai/services/customService.js:226:15)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async processDocument (/opt/paperless-ai/routes/setup.js:1603:16)
    at async /opt/paperless-ai/routes/setup.js:1527:28
[2026-01-25T10:56:51.411Z] [INFO] Repsonse from AI service: {
  document: { tags: [], correspondent: null },
  metrics: null,
  error: 'Invalid JSON response from API'
}
[2026-01-25T10:56:51.411Z] [ERROR] [ERROR] processing document 182: Error: [ERROR] Document analysis failed: Invalid JSON response from API
    at processDocument (/opt/paperless-ai/routes/setup.js:1607:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async /opt/paperless-ai/routes/setup.js:1527:28
[2026-01-25T10:56:51.411Z] [INFO] [INFO] Task completed
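For context on the failure mode: the model replied with Markdown prose (`**Warranty`...) instead of JSON, so `JSON.parse` in customService.js throws. Independent of the truncation problem, a defensive parser could try to salvage an embedded JSON object before giving up. This is a hypothetical sketch (the helper name `extractJson` is mine, not part of the Paperless-AI codebase):

```javascript
// Hypothetical helper: extract the first JSON object from a model reply
// that may be wrapped in Markdown prose or code fences.
function extractJson(reply) {
  // Fast path: the reply is already valid JSON.
  try {
    return JSON.parse(reply);
  } catch (_) { /* fall through */ }

  // Otherwise, look for the outermost {...} span and try to parse that.
  const start = reply.indexOf('{');
  const end = reply.lastIndexOf('}');
  if (start !== -1 && end > start) {
    try {
      return JSON.parse(reply.slice(start, end + 1));
    } catch (_) { /* fall through */ }
  }
  return null; // caller decides how to handle an unparseable reply
}
```

This would not fix the root cause (the oversized prompt), but it would make replies like the one in the log above recoverable when the model wraps valid JSON in commentary.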

📜 Paperless-ngx Logs

Not relevant.

🖼️ Screenshots of your settings page

(screenshot attached; not reproduced here)

🖥️ Desktop Environment

Windows

💻 OS Version

Win 11

🌐 Browser

Firefox

🔢 Browser Version

No response

🌐 Mobile Browser

No response

📝 Additional Information

  • I have checked existing issues and this is not a duplicate
  • I have tried debugging this issue on my own
  • I can provide a fix and submit a PR
  • I am sure that this problem is affecting everyone, not only me
  • I have provided all required information above

📌 Extra Notes

My approach to fixing this: add a new variable in /opt/paperless-ai/data/.env:
CONTENT_MAX_LENGTH=200

Then in /opt/paperless-ai/services/serviceUtils.js I added the following lines at line 105:

        if (process.env.CONTENT_MAX_LENGTH) {
           // Environment variables are strings, so parse the limit explicitly
           const maxLength = parseInt(process.env.CONTENT_MAX_LENGTH, 10);
           console.log('[DEBUG] Truncating content to max length (CONTENT_MAX_LENGTH):', maxLength);
           const truncatedText = text.substring(0, maxLength);

           // Try to break at a word boundary if possible;
           // if there is no space, keep the hard cut instead of returning an empty string
           const lastSpaceIndex = truncatedText.lastIndexOf(' ');
           return lastSpaceIndex > 0 ? truncatedText.substring(0, lastSpaceIndex) : truncatedText;
        }
        console.log('[DEBUG] CONTENT_MAX_LENGTH not defined, going ahead');

It would be nice to have a way to set CONTENT_MAX_LENGTH in the UI, because every time I change the config I have to add it again manually... And of course to have the bug fixed ;-)
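The truncation logic above can be exercised standalone. A self-contained sketch (the function name `truncateContent` is mine; the real serviceUtils.js code differs):

```javascript
// Hypothetical standalone version of the proposed truncation logic.
// CONTENT_MAX_LENGTH is read from the environment and parsed as an integer;
// a second parameter allows overriding it for testing.
function truncateContent(text, maxLength = parseInt(process.env.CONTENT_MAX_LENGTH || '0', 10)) {
  if (!maxLength || text.length <= maxLength) {
    return text; // no limit configured, or the text already fits
  }
  const truncated = text.substring(0, maxLength);
  // Prefer breaking at the last word boundary inside the limit
  const lastSpace = truncated.lastIndexOf(' ');
  return lastSpace > 0 ? truncated.substring(0, lastSpace) : truncated;
}
```

One design note: a character limit is only a rough proxy for tokens. Since the log shows Paperless-AI already using "character-based token estimation", wiring CONTENT_MAX_LENGTH into that existing calculation (rather than a fixed cut) would adapt the limit per model.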

Metadata

Labels: bug (Something isn't working)
