PDF Validation Feature

Automated quality checking for generated PDFs


Overview

The PDF validation system automatically scans generated PDFs for rendering issues and structural problems. It detects unresolved references (??), missing citations ([?]), warnings, and errors, and it verifies document structure by extracting the first N words.

Architecture

Following the thin orchestrator pattern, the implementation consists of:

Business Logic (infrastructure/validation/content/pdf_validator.py): Core validation algorithms
CLI Interface (infrastructure/validation/cli/main.py): Command-line interface
Orchestrator (scripts/04_validate_output.py): Pipeline integration
Tests (tests/infra_tests/validation/test_pdf_validator.py): Unit and integration tests run against PDF files (no mocks)
Integration (scripts/execute_pipeline.py): validation stage after render (see RUN_GUIDE.md; script 04_validate_output.py maps to pipeline “validate”)

Components

infrastructure/validation/content/pdf_validator.py

Core validation module containing all business logic:

extract_text_from_pdf(pdf_path): Extract text from PDF files using pypdf
scan_for_issues(text): Scan for rendering problems
- Unresolved references (??)
- Warnings
- Errors
- Missing citations [?]
extract_first_n_words(text, n): Extract first N words for structure verification
validate_pdf_rendering(pdf_path, n_words): Run the checks above on one PDF and produce a validation report
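
The functions listed above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the regular expressions, the dictionary keys, and the return types are assumptions made for the example. Only the function names and the use of pypdf come from this document.

```python
import re
from typing import Dict


def extract_text_from_pdf(pdf_path: str) -> str:
    """Concatenate the text of every page, using pypdf's PdfReader."""
    from pypdf import PdfReader  # imported lazily so the helpers below work without pypdf

    reader = PdfReader(pdf_path)
    # extract_text() can yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def scan_for_issues(text: str) -> Dict[str, int]:
    """Count rendering-problem markers in extracted text (illustrative patterns)."""
    patterns = {
        "unresolved_references": r"\?\?",
        "missing_citations": r"\[\?\]",
        "warnings": r"\bwarning\b",
        "errors": r"\berror\b",
    }
    return {
        name: len(re.findall(pattern, text, flags=re.IGNORECASE))
        for name, pattern in patterns.items()
    }


def extract_first_n_words(text: str, n: int = 200) -> str:
    """Return the first n whitespace-separated words, joined by single spaces."""
    return " ".join(text.split()[:n])


sample = "See Figure ?? and Table ??. As shown in [?], the error is small."
issues = scan_for_issues(sample)
print(issues)
# → {'unresolved_references': 2, 'missing_citations': 1, 'warnings': 0, 'errors': 1}
print(extract_first_n_words(sample, 3))
# → See Figure ??
```

Matching the bare words "warning" and "error" will also flag ordinary prose that happens to use them, so a real implementation may need narrower patterns.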

infrastructure/validation/cli/main.py

Command-line interface that:

Imports functions from infrastructure/validation/content/pdf_validator.py
Handles command-line arguments
Formats and prints validation reports
Returns appropriate exit codes:
- 0: No issues detected
- 1: Issues found (with detailed report)
- 2: Error during validation

Usage

Standalone Validation

# Validate outputs for one project (PDFs, markdown, integrity under projects/{name}/output/)
uv run python scripts/04_validate_output.py --project code_project

# Validate a specific PDF using CLI
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf

# Validate with verbose output
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf --verbose

# Validate markdown files
uv run python -m infrastructure.validation.cli markdown projects/{name}/manuscript/

Automated Validation

The core pipeline runs validation after PDF render via 04_validate_output.py:

# Full core pipeline (includes validation after render)
uv run python scripts/execute_pipeline.py --project {name} --core-only

# Or use the interactive menu
./run.sh

04_validate_output.py alone does not clean or re-render; it checks existing artifacts under projects/{name}/output/ (and related paths) for the given --project.

Note: For a full pipeline run, use --skip-validation only when iterating quickly; run validation before release builds.

Sample Output

🔍 Validating PDF: code_project_combined.pdf

======================================================================
📋 PDF VALIDATION REPORT
======================================================================
📄 File: code_project_combined.pdf

⚠️  Found 11 rendering issue(s):
   • Unresolved references (??): 11

----------------------------------------------------------------------
📖 First 200 words of document:
----------------------------------------------------------------------
References 1 [1] Alice Brown and Robert Wilson. Advanced optimization 
techniques for machine learning. In Proceedings of the International 
Conference on Machine Learning, pages 456–467. ICML, 2022...
----------------------------------------------------------------------
======================================================================

Test Coverage

Unit Tests (test_pdf_validator.py)

✅ Covers infrastructure/validation/content/pdf_validator.py
✅ Runs against PDF files rather than mocks
✅ Tests edge cases and error handling
✅ Validates against actual project PDF when available

Integration Tests (test_pdf_validator.py)

✅ Script existence and executability
✅ Import verification
✅ End-to-end validation on actual PDFs
✅ Error handling for nonexistent files
✅ Help text verification

Run tests:

uv run pytest tests/infra_tests/validation/test_pdf_validator.py -v

uv run pytest tests/infra_tests/validation/test_pdf_validator.py \
  --cov=infrastructure.validation.content.pdf_validator \
  --cov-report=term-missing

Common Issues Detected

Unresolved References (??)

LaTeX/Markdown references that didn't resolve properly:

Missing section labels
Undefined equation references
Broken figure/table references
Bibliography issues

Solution: Ensure all \label{} commands are properly defined and referenced.
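
When tracking down which label is missing, it can help to see the text around each ?? marker. The helper below is a hypothetical sketch and not part of the validator; it assumes you already have the extracted PDF text as a string.

```python
import re
from typing import List


def unresolved_reference_contexts(text: str, width: int = 30) -> List[str]:
    """Return a short snippet of surrounding text for each '??' marker."""
    snippets = []
    for match in re.finditer(r"\?\?", text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        # Collapse whitespace so each snippet stays on one line.
        snippets.append(" ".join(text[start:end].split()))
    return snippets


text = "Results are summarized in Table ?? and discussed in Section ??."
contexts = unresolved_reference_contexts(text, width=12)
for snippet in contexts:
    print(snippet)
```

Words such as "Table" or "Section" just before the marker usually indicate which kind of \label{} to look for in the source.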

Missing Citations [?]

Bibliography references that couldn't be resolved:

Missing BibTeX entries
Incorrect citation keys
Bibliography file not found

Solution: Check references.bib and ensure all cited keys exist.
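
A quick way to perform that check is to compare the keys cited in the manuscript with the keys defined in the .bib file. The sketch below assumes LaTeX-style \cite{...} commands and standard BibTeX entries. It uses simple regular expressions rather than a full BibTeX parser, so unusual entries (for example @string definitions) may be misread. The citation keys in the sample are made up.

```python
import re
from typing import Set


def bib_keys(bib_text: str) -> Set[str]:
    r"""Collect entry keys, e.g. 'brown2022' from '@article{brown2022,'."""
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))


def cited_keys(source_text: str) -> Set[str]:
    r"""Collect keys from \cite{a,b}; variants such as \citep and \citet also match."""
    keys: Set[str] = set()
    pattern = r"\\cite\w*\s*(?:\[[^\]]*\])*\s*\{([^}]*)\}"
    for group in re.findall(pattern, source_text):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


# Hypothetical bibliography and manuscript fragments.
bib = """
@article{brown2022,
  title = {An example entry},
}
"""
source = r"As shown by \cite{brown2022, wilson2021}, the method converges."

missing = sorted(cited_keys(source) - bib_keys(bib))
print(missing)  # → ['wilson2021']
```

Any key in the resulting list is cited but has no matching entry, which is what produces [?] in the rendered PDF.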

Document Structure Issues

The first words of the document reveal an incorrect page order:

References appearing before title page
Missing abstract or introduction
Incorrect page ordering

Solution: Check manuscript source files and preamble order.
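
The first-N-words excerpt can also be checked mechanically. The heuristic below is purely illustrative: `structure_warnings` and the expected section names are assumptions for this example and do not describe what the validator does.

```python
from typing import List


def structure_warnings(first_words: str, expected: List[str]) -> List[str]:
    """Flag simple ordering problems in the opening words of a document."""
    words = first_words.split()
    lowered = first_words.lower()
    warnings = []
    # A document that opens with its reference list is almost certainly misordered.
    if words and words[0].lower() in {"references", "bibliography"}:
        warnings.append("document starts with the reference list")
    for marker in expected:
        if marker.lower() not in lowered:
            warnings.append(f"expected section not found near the start: {marker}")
    return warnings


excerpt = "References 1 [1] Alice Brown and Robert Wilson. Advanced optimization"
result = structure_warnings(excerpt, expected=["Abstract"])
for warning in result:
    print(warning)
```

Run on the excerpt from the sample report, this flags both the leading reference list and the missing abstract.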

Dependencies

pypdf>=5.0: PDF text extraction (replaces deprecated PyPDF2)
reportlab>=4.0: PDF generation for tests

These are automatically managed by uv and defined in pyproject.toml.

Development Workflow

Following TDD principles:

Write tests first in tests/infra_tests/validation/
Implement business logic in infrastructure/validation/content/pdf_validator.py
Create CLI interface in infrastructure/validation/cli/main.py
Integrate into build pipeline via scripts/04_validate_output.py
Verify test coverage requirements met

Future Enhancements

Potential improvements:

Detect more LaTeX warning patterns
Validate cross-reference consistency
Check for orphaned figures/tables
Verify equation numbering sequence
Generate diff reports between PDF versions
HTML report generation with highlighted issues
Integration with CI/CD pipelines (markdown validation + tests in GitHub Actions)
Configurable issue severity levels

Troubleshooting

"Module pdf_validator not found"

Ensure you're running from the repository root:

cd /path/to/template
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf

"PDF file not found"

Generate the PDFs first by running the core pipeline, which renders and then validates:

uv run python scripts/execute_pipeline.py --project {name} --core-only

To capture more detail while debugging, add logging options:

uv run python scripts/execute_pipeline.py --project {name} --core-only --verbose --log-file build.log

High number of ?? issues

This typically indicates:

LaTeX compilation warnings were ignored
Missing label definitions
Bibliography not properly processed

Check compilation logs under projects/{name}/output/pdf/ or output/{name}/pdf/ (e.g. *_compile.log or .log next to the TeX build) for detailed LaTeX warnings.
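
To pull the undefined labels and citation keys out of such a log, a small script along these lines may help. It is a sketch that assumes the standard LaTeX (and natbib) warning wording. LaTeX wraps long log lines, so very long keys can be split across lines and missed.

```python
import re
from typing import Dict, List

# Matches e.g. "LaTeX Warning: Reference `fig:x' on page 3 undefined on input line 42."
UNDEFINED = re.compile(
    r"(?:LaTeX|Package natbib) Warning: (Reference|Citation) `([^']+)' on page \d+ undefined"
)


def undefined_in_log(log_text: str) -> Dict[str, List[str]]:
    """Group undefined reference labels and citation keys, without duplicates."""
    found: Dict[str, List[str]] = {"Reference": [], "Citation": []}
    for kind, key in UNDEFINED.findall(log_text):
        if key not in found[kind]:
            found[kind].append(key)
    return found


# Hypothetical log excerpt for illustration.
log = (
    "LaTeX Warning: Reference `fig:results' on page 3 undefined on input line 42.\n"
    "LaTeX Warning: Citation `smith2020' on page 1 undefined on input line 10.\n"
    "LaTeX Warning: Reference `fig:results' on page 4 undefined on input line 57.\n"
)
report = undefined_in_log(log)
print(report)
# → {'Reference': ['fig:results'], 'Citation': ['smith2020']}
```

Each label under "Reference" needs a matching \label{} in the source, and each key under "Citation" needs an entry in the bibliography file.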
