Automated quality checking for generated PDFs
The PDF validation system automatically scans generated PDFs for rendering issues and structural problems. It detects unresolved references (??), missing citations, warnings, and errors, and it verifies document structure by extracting the first N words.
Following the thin orchestrator pattern, the implementation consists of:
- Business Logic (infrastructure/validation/content/pdf_validator.py): Core validation algorithms
- CLI Interface (infrastructure/validation/cli/main.py): Command-line interface
- Orchestrator (scripts/04_validate_output.py): Pipeline integration
- Tests (tests/infra_tests/validation/test_pdf_validator.py): Coverage with data
- Integration (scripts/execute_pipeline.py): Validation stage after render (see RUN_GUIDE.md; script 04_validate_output.py maps to pipeline "validate")
Core validation module containing all business logic:
- extract_text_from_pdf(pdf_path): Extract text from PDF files using pypdf
- scan_for_issues(text): Scan for rendering problems
  - Unresolved references (??)
  - Warnings
  - Errors
  - Missing citations [?]
- extract_first_n_words(text, n): Extract first N words for structure verification
- validate_pdf_rendering(pdf_path, n_words): Validation report
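The scanning and word-extraction steps can be illustrated with a minimal sketch. The function names come from the list above, but the bodies, the regular expressions, and the dictionary return shape are assumptions made for illustration; the actual module may match different patterns and return a different structure.

```python
import re

# Assumed patterns for illustration; the real module may use different ones.
ISSUE_PATTERNS = {
    "unresolved_references": re.compile(r"\?\?"),
    "missing_citations": re.compile(r"\[\?\]"),
    "warnings": re.compile(r"\bwarning\b", re.IGNORECASE),
    "errors": re.compile(r"\berror\b", re.IGNORECASE),
}


def scan_for_issues(text: str) -> dict[str, int]:
    """Count occurrences of each issue pattern in the extracted text."""
    return {name: len(pattern.findall(text)) for name, pattern in ISSUE_PATTERNS.items()}


def extract_first_n_words(text: str, n: int = 200) -> str:
    """Return the first n whitespace-separated words, joined by single spaces."""
    return " ".join(text.split()[:n])


sample = "See Figure ?? and Table ??. As shown in [?], the warning is benign."
print(scan_for_issues(sample))
print(extract_first_n_words(sample, 3))
```

Counting matches rather than returning a boolean keeps enough detail to print a summary line such as "Unresolved references (??): 11".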
Command-line interface that:
- Imports methods from infrastructure/validation/content/pdf_validator.py
- Handles command-line arguments
- Formats and prints validation reports
- Returns appropriate exit codes:
  - 0: No issues detected
  - 1: Issues found (with detailed report)
  - 2: Error during validation
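The exit-code contract above could be implemented along the following lines. This is a hedged sketch, not the CLI's actual code: `exit_code_for`, `run`, and the stub validators are hypothetical names, and the assumption is that validation yields a mapping of issue names to counts.

```python
import sys

EXIT_OK = 0      # no issues detected
EXIT_ISSUES = 1  # issues found
EXIT_ERROR = 2   # validation itself failed


def exit_code_for(issue_counts: dict[str, int]) -> int:
    """Map issue counts to 0 (clean) or 1 (at least one issue)."""
    return EXIT_ISSUES if any(count > 0 for count in issue_counts.values()) else EXIT_OK


def run(validate, pdf_path: str) -> int:
    """Call validate(pdf_path) and translate the outcome into an exit code."""
    try:
        issue_counts = validate(pdf_path)
    except Exception as exc:  # e.g. a missing or unreadable file
        print(f"Error during validation: {exc}", file=sys.stderr)
        return EXIT_ERROR
    return exit_code_for(issue_counts)


# Hypothetical stand-ins for a real validator, used only to exercise run().
def clean_validator(path: str) -> dict[str, int]:
    return {"unresolved_references": 0}


def failing_validator(path: str) -> dict[str, int]:
    raise FileNotFoundError(path)


print(run(clean_validator, "example.pdf"))
print(run(failing_validator, "missing.pdf"))
```

Keeping "issues found" (1) distinct from "validation failed" (2) lets a calling script tell a bad PDF apart from a broken run.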
# Validate outputs for one project (PDFs, markdown, integrity under projects/{name}/output/)
uv run python scripts/04_validate_output.py --project code_project
# Validate a specific PDF using CLI
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf
# Validate with verbose output
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf --verbose
# Validate markdown files
uv run python -m infrastructure.validation.cli markdown projects/{name}/manuscript/
The core pipeline runs validation after PDF render via 04_validate_output.py:
# Full core pipeline (includes validation after render)
uv run python scripts/execute_pipeline.py --project {name} --core-only
# Or use the interactive menu
./run.sh
04_validate_output.py alone does not clean or re-render; it checks existing artifacts under projects/{name}/output/ (and related paths) for the given --project.
Note: For a full pipeline run, use --skip-validation only when iterating quickly; run validation before release builds.
🔍 Validating PDF: code_project_combined.pdf
======================================================================
📋 PDF VALIDATION REPORT
======================================================================
📄 File: code_project_combined.pdf
⚠️ Found 11 rendering issue(s):
• Unresolved references (??): 11
----------------------------------------------------------------------
📖 First 200 words of document:
----------------------------------------------------------------------
References 1 [1] Alice Brown and Robert Wilson. Advanced optimization
techniques for machine learning. In Proceedings of the International
Conference on Machine Learning, pages 456–467. ICML, 2022...
----------------------------------------------------------------------
======================================================================
- ✅ Test coverage of infrastructure/validation/content/pdf_validator.py
- ✅ Tests with PDFs (no mocks)
- ✅ Tests edge cases and error handling
- ✅ Validates against actual project PDF when available
- ✅ Script existence and executability
- ✅ Import verification
- ✅ End-to-end validation on actual PDFs
- ✅ Error handling for nonexistent files
- ✅ Help text verification
Run tests:
uv run pytest tests/infra_tests/validation/test_pdf_validator.py -v
uv run pytest tests/infra_tests/validation/test_pdf_validator.py \
--cov=infrastructure.validation.content.pdf_validator \
--cov-report=term-missing
LaTeX/Markdown references that didn't resolve properly:
- Missing section labels
- Undefined equation references
- Broken figure/table references
- Bibliography issues
Solution: Ensure all \label{} commands are properly defined and referenced.
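One way to find the offending references before rendering is to compare `\ref` targets against `\label` definitions in the source. The sketch below assumes LaTeX-style `\label{}`, `\ref{}`, and `\eqref{}` commands in the manuscript text; the helper name `undefined_references` is hypothetical and not part of the validation module.

```python
import re

LABEL_RE = re.compile(r"\\label\{([^}]+)\}")
REF_RE = re.compile(r"\\(?:ref|eqref)\{([^}]+)\}")


def undefined_references(tex: str) -> set[str]:
    """Return reference targets that have no matching \\label in the same text."""
    return set(REF_RE.findall(tex)) - set(LABEL_RE.findall(tex))


source = r"\section{Intro}\label{sec:intro} See \ref{sec:intro} and \eqref{eq:loss}."
print(sorted(undefined_references(source)))
```

For a manuscript split across several files, the contents would need to be concatenated first, since a label defined in one file can be referenced from another.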
Bibliography references that couldn't be resolved:
- Missing BibTeX entries
- Incorrect citation keys
- Bibliography file not found
Solution: Check references.bib and ensure all cited keys exist.
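The key check can be automated with a small script. This is a rough sketch under stated assumptions: citations use LaTeX-style `\cite{...}` variants (not Pandoc `[@key]` syntax), BibTeX entries have the usual `@type{key,` form, and `missing_citation_keys` is a hypothetical helper name.

```python
import re

# Matches \cite, \citep, \citet, etc., with optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]+)\}")
# Matches the key in entries such as "@article{key,".
BIB_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def missing_citation_keys(tex: str, bib: str) -> set[str]:
    """Return cited keys that do not appear as entry keys in the BibTeX text."""
    cited = {key.strip() for group in CITE_RE.findall(tex) for key in group.split(",")}
    defined = set(BIB_KEY_RE.findall(bib))
    return cited - defined


tex = r"As shown by \cite{brown2022, smith2020} and \citep[p.~3]{lee2019}."
bib = "@inproceedings{brown2022,\n  title={A}\n}\n@article{lee2019,\n  title={B}\n}\n"
print(sorted(missing_citation_keys(tex, bib)))
```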
First words showing incorrect page order:
- References appearing before title page
- Missing abstract or introduction
- Incorrect page ordering
Solution: Check manuscript source files and preamble order.
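A simple heuristic can flag the "references first" symptom from the extracted first words. The function below is an illustrative assumption, not part of the validator: it checks only whether the very first word is a back-matter heading, and the marker list is a guess.

```python
def starts_with_back_matter(first_words: str, markers=("references", "bibliography")) -> bool:
    """Return True if the document's first word looks like a back-matter heading."""
    words = first_words.split()
    return bool(words) and words[0].lower().strip(":.") in markers


print(starts_with_back_matter("References 1 [1] Alice Brown and Robert Wilson."))
print(starts_with_back_matter("A Study of Optimization Abstract We present"))
```

A document whose first 200 words begin with "References", as in the example report earlier, would be flagged by this check.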
- pypdf>=5.0: PDF text extraction (replaces deprecated PyPDF2)
- reportlab>=4.0: PDF generation for tests
These are automatically managed by uv and defined in pyproject.toml.
Following TDD principles:
- Write tests first in tests/infra_tests/validation/
- Implement business logic in infrastructure/validation/content/pdf_validator.py
- Create CLI interface in infrastructure/validation/cli/main.py
- Integrate into build pipeline via scripts/04_validate_output.py
- Verify test coverage requirements met
Potential improvements:
- Detect more LaTeX warning patterns
- Validate cross-reference consistency
- Check for orphaned figures/tables
- Verify equation numbering sequence
- Generate diff reports between PDF versions
- HTML report generation with highlighted issues
- Integration with CI/CD pipelines (markdown validation + tests in GitHub Actions)
- Configurable issue severity levels
Ensure you're running from the repository root:
cd /path/to/template
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf
Generate PDFs first:
uv run python scripts/execute_pipeline.py --project {name} --core-only
Or run the pipeline:
# Standard build with validation
uv run python scripts/execute_pipeline.py --project {name} --core-only
# With options
uv run python scripts/execute_pipeline.py --project {name} --core-only --verbose --log-file build.log
This typically indicates:
- LaTeX compilation warnings were ignored
- Missing label definitions
- Bibliography not properly processed
Check compilation logs under projects/{name}/output/pdf/ or output/{name}/pdf/ (e.g. *_compile.log or .log next to the TeX build) for detailed LaTeX warnings.
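When a log is long, the undefined-reference and undefined-citation warnings can be pulled out with a short script. The sketch below relies on the standard LaTeX warning wording ("LaTeX Warning: Reference `key' on page N undefined ..."); the function name `undefined_from_log` is hypothetical. Warnings that the TeX engine wrapped across two log lines would be missed by this simple pattern.

```python
import re

UNDEFINED_RE = re.compile(r"LaTeX Warning: (Reference|Citation) `([^']+)' .*?undefined")


def undefined_from_log(log_text: str) -> list[tuple[str, str]]:
    """Return (kind, key) pairs for undefined references and citations in a LaTeX log."""
    return [(kind.lower(), key) for kind, key in UNDEFINED_RE.findall(log_text)]


log = (
    "LaTeX Warning: Reference `fig:missing' on page 2 undefined on input line 57.\n"
    "LaTeX Warning: Citation `smith2020' on page 1 undefined on input line 12.\n"
    "LaTeX Warning: There were undefined references.\n"
)
for kind, key in undefined_from_log(log):
    print(f"{kind}: {key}")
```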