PDF Validation Feature

Automated quality checking for generated PDFs

Quick Reference: Pipeline Orchestration | Common Workflows | FAQ

Overview

The PDF validation system automatically scans generated PDFs for rendering issues and structural problems. It detects unresolved references (??), missing citations ([?]), warnings, and errors, and it verifies document structure by extracting the first N words.

Architecture

Following the thin orchestrator pattern, the implementation consists of:

  1. Business Logic (infrastructure/validation/content/pdf_validator.py): Core validation algorithms
  2. CLI Interface (infrastructure/validation/cli/main.py): Command-line interface
  3. Orchestrator (scripts/04_validate_output.py): Pipeline integration
  4. Tests (tests/infra_tests/validation/test_pdf_validator.py): Test coverage using real PDF data (no mocks)
  5. Integration (scripts/execute_pipeline.py): Runs validation as the stage after render (see RUN_GUIDE.md; script 04_validate_output.py corresponds to the pipeline's “validate” stage)

Components

infrastructure/validation/content/pdf_validator.py

Core validation module containing all business logic:

  • extract_text_from_pdf(pdf_path): Extract text from PDF files using pypdf
  • scan_for_issues(text): Scan for rendering problems
    • Unresolved references (??)
    • Warnings
    • Errors
    • Missing citations [?]
  • extract_first_n_words(text, n): Extract first N words for structure verification
  • validate_pdf_rendering(pdf_path, n_words): Run the checks above and produce a validation report
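
To make the two text-level functions concrete, here is a minimal sketch of how scanning and word extraction might work. The regex patterns, the ISSUE_PATTERNS name, the dictionary return shape, and the default of 200 words are assumptions for illustration; the actual module may differ.

```python
import re

# Hypothetical patterns (assumption): the real module may match differently.
ISSUE_PATTERNS = {
    "unresolved_references": re.compile(r"\?\?"),
    "missing_citations": re.compile(r"\[\?\]"),
    "warnings": re.compile(r"\bwarning\b", re.IGNORECASE),
    "errors": re.compile(r"\berror\b", re.IGNORECASE),
}


def scan_for_issues(text: str) -> dict[str, int]:
    """Count non-overlapping matches of each issue pattern in the text."""
    return {
        name: len(pattern.findall(text))
        for name, pattern in ISSUE_PATTERNS.items()
    }


def extract_first_n_words(text: str, n: int = 200) -> str:
    """Return the first n whitespace-separated words, joined by spaces."""
    return " ".join(text.split()[:n])
```

Splitting on whitespace keeps the word extraction independent of how the PDF library breaks lines, which matters because extracted PDF text often contains arbitrary newlines.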

infrastructure/validation/cli/main.py

Command-line interface that:

  • Imports methods from infrastructure/validation/content/pdf_validator.py
  • Handles command-line arguments
  • Formats and prints validation reports
  • Returns appropriate exit codes:
    • 0: No issues detected
    • 1: Issues found (with detailed report)
    • 2: Error during validation
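
The exit-code contract above could be implemented roughly as follows. This is a sketch only: exit_code_for, the constant names, and the assumption that validation yields a mapping of issue names to counts are all hypothetical and not taken from the CLI source.

```python
import sys
from typing import Callable

# Exit codes as documented above.
EXIT_OK = 0
EXIT_ISSUES = 1
EXIT_ERROR = 2


def exit_code_for(validate: Callable[[], dict[str, int]]) -> int:
    """Run a validation callable and map its outcome to an exit code.

    `validate` is a hypothetical zero-argument callable that returns
    issue counts, or raises if the PDF cannot be read.
    """
    try:
        issues = validate()
    except Exception as exc:  # any failure during validation maps to 2
        print(f"Validation failed: {exc}", file=sys.stderr)
        return EXIT_ERROR
    has_issues = any(count > 0 for count in issues.values())
    return EXIT_ISSUES if has_issues else EXIT_OK
```

Keeping "issues found" (1) distinct from "validation itself failed" (2) lets a calling script decide whether to treat rendering problems as warnings while still failing hard on a missing or unreadable PDF.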

Usage

Standalone Validation

# Validate outputs for one project (PDFs, markdown, integrity under projects/{name}/output/)
uv run python scripts/04_validate_output.py --project code_project

# Validate a specific PDF using CLI
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf

# Validate with verbose output
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf --verbose

# Validate markdown files
uv run python -m infrastructure.validation.cli markdown projects/{name}/manuscript/

Automated Validation

The core pipeline runs validation after PDF render via 04_validate_output.py:

# Full core pipeline (includes validation after render)
uv run python scripts/execute_pipeline.py --project {name} --core-only

# Or use the interactive menu
./run.sh

04_validate_output.py alone does not clean or re-render; it checks existing artifacts under projects/{name}/output/ (and related paths) for the given --project.

Note: For a full pipeline run, use --skip-validation only when iterating quickly; run validation before release builds.

Sample Output

🔍 Validating PDF: code_project_combined.pdf

======================================================================
📋 PDF VALIDATION REPORT
======================================================================
📄 File: code_project_combined.pdf

⚠️  Found 11 rendering issue(s):
   • Unresolved references (??): 11

----------------------------------------------------------------------
📖 First 200 words of document:
----------------------------------------------------------------------
References 1 [1] Alice Brown and Robert Wilson. Advanced optimization 
techniques for machine learning. In Proceedings of the International 
Conference on Machine Learning, pages 456–467. ICML, 2022...
----------------------------------------------------------------------
======================================================================

Test Coverage

Unit Tests (test_pdf_validator.py)

  • ✅ Test coverage of infrastructure/validation/content/pdf_validator.py
  • ✅ Tests with real PDFs (no mocks)
  • ✅ Tests edge cases and error handling
  • ✅ Validates against actual project PDF when available

Integration Tests (test_pdf_validator.py)

  • ✅ Script existence and executability
  • ✅ Import verification
  • ✅ End-to-end validation on actual PDFs
  • ✅ Error handling for nonexistent files
  • ✅ Help text verification

Run tests:

uv run pytest tests/infra_tests/validation/test_pdf_validator.py -v

uv run pytest tests/infra_tests/validation/test_pdf_validator.py \
  --cov=infrastructure.validation.content.pdf_validator \
  --cov-report=term-missing

Common Issues Detected

Unresolved References (??)

LaTeX/Markdown references that didn't resolve properly:

  • Missing section labels
  • Undefined equation references
  • Broken figure/table references
  • Bibliography issues

Solution: Ensure every referenced label has a matching \label{} definition and that the key is spelled identically in both places.
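
One way to find the offending keys before rendering is to compare referenced labels against defined labels in the source. The sketch below assumes LaTeX-style \label and \ref commands in the manuscript text; find_undefined_refs and the list of reference commands it recognizes are illustrative choices, not part of the validator.

```python
import re

LABEL_RE = re.compile(r"\\label\{([^}]+)\}")
# Covers \ref, \eqref, \autoref, \pageref, \cref and \Cref (assumed set).
REF_RE = re.compile(r"\\(?:eq|auto|page|c|C)?ref\{([^}]+)\}")


def find_undefined_refs(source: str) -> set[str]:
    """Return referenced keys that have no matching \\label definition."""
    defined = set(LABEL_RE.findall(source))
    referenced: set[str] = set()
    for group in REF_RE.findall(source):
        # Commands such as \cref{a,b} may list several keys at once.
        referenced.update(key.strip() for key in group.split(","))
    return referenced - defined
```

For a multi-file manuscript, the file contents would need to be concatenated first, since a label may be defined in a different file from the one that references it.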

Missing Citations [?]

Bibliography references that couldn't be resolved:

  • Missing BibTeX entries
  • Incorrect citation keys
  • Bibliography file not found

Solution: Check references.bib and ensure all cited keys exist.
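
A quick cross-check of cited keys against the bibliography can be sketched as below. It assumes LaTeX-style \cite commands (including variants such as \citep with optional arguments) and standard BibTeX entry headers; find_missing_bib_keys is a hypothetical helper, and Pandoc-style [@key] citations would need a different pattern.

```python
import re

# \cite, \citep, \citet*, with any number of optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]+)\}")
# Entry headers such as "@article{key," in a .bib file.
BIB_ENTRY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def find_missing_bib_keys(manuscript: str, bibtex: str) -> set[str]:
    """Return cited keys that have no entry in the BibTeX text."""
    cited: set[str] = set()
    for group in CITE_RE.findall(manuscript):
        cited.update(key.strip() for key in group.split(","))
    available = set(BIB_ENTRY_RE.findall(bibtex))
    return cited - available
```

BibTeX treats keys case-insensitively in some toolchains and case-sensitively in others, so this sketch compares keys exactly and may report a key that differs only in case.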

Document Structure Issues

First words showing incorrect page order:

  • References appearing before title page
  • Missing abstract or introduction
  • Incorrect page ordering

Solution: Check manuscript source files and preamble order.
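
The first-words excerpt can also be checked mechanically. The following sketch flags the two symptoms listed above; check_document_start, the marker words, and the message strings are assumptions for illustration and are not part of the validator's documented behavior.

```python
def check_document_start(
    first_words: str,
    expected_markers: tuple[str, ...] = ("abstract", "introduction"),
) -> list[str]:
    """Return a list of structural problems found in the opening words."""
    # Lowercase and strip trailing punctuation so "Abstract." still matches.
    words = [w.strip(".,:;") for w in first_words.lower().split()]
    problems: list[str] = []
    if words and words[0] in {"references", "bibliography"}:
        problems.append("document starts with the reference list")
    if not any(marker in words for marker in expected_markers):
        problems.append("no abstract or introduction in the opening words")
    return problems
```

Applied to the excerpt in Sample Output, which begins with "References 1 [1] Alice Brown", this check would report both problems.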

Dependencies

  • pypdf>=5.0: PDF text extraction (replaces deprecated PyPDF2)
  • reportlab>=4.0: PDF generation for tests

These are automatically managed by uv and defined in pyproject.toml.

Development Workflow

Following TDD principles:

  1. Write tests first in tests/infra_tests/validation/
  2. Implement business logic in infrastructure/validation/content/pdf_validator.py
  3. Create CLI interface in infrastructure/validation/cli/main.py
  4. Integrate into build pipeline via scripts/04_validate_output.py
  5. Verify test coverage requirements met

Future Enhancements

Potential improvements:

  • Detect more LaTeX warning patterns
  • Validate cross-reference consistency
  • Check for orphaned figures/tables
  • Verify equation numbering sequence
  • Generate diff reports between PDF versions
  • HTML report generation with highlighted issues
  • Integration with CI/CD pipelines (markdown validation + tests in GitHub Actions)
  • Configurable issue severity levels

Troubleshooting

"Module pdf_validator not found"

Ensure you're running from the repository root:

cd /path/to/template
uv run python -m infrastructure.validation.cli pdf output/code_project/pdf/code_project_combined.pdf

"PDF file not found"

Generate PDFs first:

uv run python scripts/execute_pipeline.py --project {name} --core-only

To capture more detail from the build, add logging options:

uv run python scripts/execute_pipeline.py --project {name} --core-only --verbose --log-file build.log

High number of ?? issues

This typically indicates:

  1. LaTeX compilation warnings were ignored
  2. Missing label definitions
  3. Bibliography not properly processed

Check compilation logs under projects/{name}/output/pdf/ or output/{name}/pdf/ (e.g. *_compile.log or .log next to the TeX build) for detailed LaTeX warnings.
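
To see which keys are behind the ?? markers, the compile log can be searched for LaTeX's standard "undefined" warnings. This is a sketch under the assumption that the log contains the usual "LaTeX Warning: Reference ... undefined" and "LaTeX Warning: Citation ... undefined" lines; undefined_from_log is a hypothetical helper, and warnings that the TeX engine wraps across lines may be missed.

```python
import re

# Matches e.g.: LaTeX Warning: Reference `fig:x' on page 3 undefined ...
UNDEFINED_RE = re.compile(
    r"LaTeX Warning: (Reference|Citation) `([^']+)' on page (\d+) undefined"
)


def undefined_from_log(log_text: str) -> list[tuple[str, str, int]]:
    """Return (kind, key, page) for each undefined reference or citation."""
    return [
        (kind.lower(), key, int(page))
        for kind, key, page in UNDEFINED_RE.findall(log_text)
    ]
```

Grouping the results by key usually shows that a large ?? count traces back to a handful of missing labels or bibliography entries.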

References