pre-ingest VCF validation: format check, sort order, well-formedness #94

@vineetver

Before committing to a multi-hour ingest, validate the inputs upfront:

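  • Format verification: confirm files are actually VCF/BCF by content, not just by extension. Read the leading bytes (##fileformat=VCF for plain-text VCF, the BCF magic for BCF); compressed inputs need their gzip/BGZF layer decoded first. Currently we trust the extension and only fail mid-parse if it is wrong.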
  • Sort order: check that records within each file are sorted by position. Unsorted input produces unsorted parquet, which degrades downstream query performance. Warn or error.
  • Chromosome ordering across files: when multiple files cover the same chromosome, verify that their position ranges don't overlap and detect interleaved records. Sort files by chromosome before chunking to workers.
  • Well-formedness: validate the first N records (100?) for structural issues beyond what noodles catches: empty REF/ALT, malformed GT fields, unexpected INFO/FORMAT structure.
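
A minimal sketch of the first two checks, using only the Rust standard library so it stays independent of noodles. The names Sniffed, sniff_prefix, sniff_reader and check_sorted are hypothetical, not existing code in this repo. It assumes plain-text VCF lines for the sort check, and it treats any gzip-framed file as "needs decompression before deciding", since bgzipped VCF and BCF share the same outer magic.

```rust
use std::collections::HashSet;
use std::io::{self, Read};

/// Guess based on leading bytes only (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Sniffed {
    PlainVcf,
    Bcf,     // uncompressed BCF stream
    Gzip,    // gzip or BGZF: decompress first, then sniff the payload again
    Unknown, // not something we should hand to the parser
}

/// Classify a byte prefix without trusting the file extension.
pub fn sniff_prefix(prefix: &[u8]) -> Sniffed {
    if prefix.starts_with(b"##fileformat=VCF") {
        Sniffed::PlainVcf
    } else if prefix.starts_with(b"BCF\x02") {
        Sniffed::Bcf
    } else if prefix.starts_with(&[0x1f, 0x8b]) {
        Sniffed::Gzip
    } else {
        Sniffed::Unknown
    }
}

/// Read up to 16 bytes (the length of "##fileformat=VCF") and classify them.
pub fn sniff_reader<R: Read>(mut reader: R) -> io::Result<Sniffed> {
    let mut buf = [0u8; 16];
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break; // short file: classify whatever we got
        }
        filled += n;
    }
    Ok(sniff_prefix(&buf[..filled]))
}

/// Check that POS never decreases within a chromosome and that no chromosome
/// reappears after another one has started. Header and empty lines are skipped.
/// Line numbers in the error are 1-based positions in the input iterator.
pub fn check_sorted<'a>(lines: impl IntoIterator<Item = &'a str>) -> Result<(), String> {
    let mut finished: HashSet<&'a str> = HashSet::new();
    let mut current: Option<(&'a str, u64)> = None;
    for (idx, line) in lines.into_iter().enumerate() {
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut fields = line.split('\t');
        let chrom = fields.next().unwrap_or("");
        let pos: u64 = fields
            .next()
            .and_then(|p| p.parse().ok())
            .ok_or_else(|| format!("line {}: missing or non-numeric POS", idx + 1))?;
        match current {
            Some((c, p)) if c == chrom => {
                if pos < p {
                    return Err(format!(
                        "line {}: {}:{} appears after {}:{}",
                        idx + 1, chrom, pos, c, p
                    ));
                }
            }
            Some((c, _)) => {
                finished.insert(c);
                if finished.contains(chrom) {
                    return Err(format!(
                        "line {}: chromosome {} reappears after other chromosomes",
                        idx + 1, chrom
                    ));
                }
            }
            None => {}
        }
        current = Some((chrom, pos));
    }
    Ok(())
}
```

Whether an out-of-order record should be a warning or a hard error is left to the caller; the sketch only reports the first offending line.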

All checks should run in the fast header-validation phase before spawning workers. Fail early with clear messages, not mid-run after processing half the data.
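
As a rough illustration of the cross-file check and the fail-early behaviour, the sketch below orders files by chromosome and start position, then collects every overlap into a list of messages instead of stopping at the first. FileRange and plan_files are hypothetical names. It assumes one chromosome per file, that the first and last POS of each file were gathered in the pre-ingest scan, and that two files sharing a boundary position count as overlapping. Chromosome names are compared as plain strings, so chr10 sorts before chr2; a real implementation would likely use the contig order from the header.

```rust
/// Hypothetical per-file summary gathered during the pre-ingest scan.
#[derive(Debug, Clone)]
pub struct FileRange {
    pub path: String,
    pub chrom: String,
    pub start: u64, // first POS seen in the file
    pub end: u64,   // last POS seen in the file
}

/// Sort files by (chrom, start) and report overlapping neighbours.
/// Only adjacent files are compared: after sorting by start, any overlap on a
/// chromosome implies at least one overlap between neighbours, so a clean
/// result means no overlaps at all.
pub fn plan_files(mut files: Vec<FileRange>) -> Result<Vec<FileRange>, Vec<String>> {
    files.sort_by(|a, b| (a.chrom.as_str(), a.start).cmp(&(b.chrom.as_str(), b.start)));

    let mut problems = Vec::new();
    for pair in files.windows(2) {
        let (a, b) = (&pair[0], &pair[1]);
        if a.chrom == b.chrom && b.start <= a.end {
            problems.push(format!(
                "{} ({}:{}-{}) overlaps {} ({}:{}-{})",
                a.path, a.chrom, a.start, a.end, b.path, b.chrom, b.start, b.end
            ));
        }
    }

    if problems.is_empty() {
        Ok(files) // ordered list, ready to be chunked to workers
    } else {
        Err(problems) // all messages at once, before any worker starts
    }
}
```

Returning every problem in one pass lets the user fix all inputs before retrying, rather than discovering them one failed run at a time.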

Related: #93, #28

Labels: correctness (statistical accuracy, validation), ingest (VCF/genotype ingest pipeline)