Before committing to a multi-hour ingest, validate the inputs upfront:
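- Format verification: confirm files are actually VCF/BCF by content, not just by extension. Read the leading bytes: plain VCF starts with ##fileformat=VCF, and BCF starts with the BCF magic. Compressed inputs (.vcf.gz and standard BCF) both start with the gzip magic, so the first BGZF block has to be decompressed before the two can be told apart. Currently we trust the extension and only fail mid-parse if it is wrong.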
- Sort order: check that records within each file are sorted by position. Unsorted input produces unsorted Parquet, which degrades downstream query performance. Open question: warn or error.
- Chromosome ordering across files: when multiple files cover the same chromosome, verify that their position ranges don't overlap, and detect interleaved ranges. Sort files by chromosome before chunking them out to workers.
- Well-formedness: validate the first N records (100?) for structural issues beyond what noodles catches: empty REF/ALT, malformed GT fields, unexpected INFO/FORMAT structure.
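A minimal sketch of the format check from the first bullet, using only the standard library. `SniffedFormat`, `sniff_bytes`, and `sniff_file` are hypothetical names, not existing code in this repo. It only inspects leading bytes; telling compressed VCF from BCF would still need the first BGZF block decompressed, which is not shown here.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// What the leading bytes of an input file suggest about its format.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq, Eq)]
pub enum SniffedFormat {
    /// Starts with the `##fileformat=VCF` header line.
    PlainVcf,
    /// Starts with `BCF` followed by major version byte 2 (uncompressed BCF).
    RawBcf,
    /// Starts with the gzip magic `1f 8b`. This may be compressed VCF or BCF;
    /// the first block must be decompressed to decide.
    Gzip,
    /// None of the above.
    Unknown,
}

/// Classify a buffer by its leading bytes only (no decompression).
pub fn sniff_bytes(head: &[u8]) -> SniffedFormat {
    if head.starts_with(b"##fileformat=VCF") {
        SniffedFormat::PlainVcf
    } else if head.starts_with(b"BCF\x02") {
        SniffedFormat::RawBcf
    } else if head.starts_with(&[0x1f, 0x8b]) {
        SniffedFormat::Gzip
    } else {
        SniffedFormat::Unknown
    }
}

/// Read up to 16 bytes from `path` (the length of `##fileformat=VCF`)
/// and classify them.
pub fn sniff_file(path: &Path) -> io::Result<SniffedFormat> {
    let mut head = Vec::with_capacity(16);
    File::open(path)?.take(16).read_to_end(&mut head)?;
    Ok(sniff_bytes(&head))
}
```

A caller could compare the sniffed format against the file extension and report a mismatch before any worker starts.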
All checks should run in the fast header-validation phase before spawning workers. Fail early with clear messages, not mid-run after processing half the data.
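The sort-order and cross-file overlap checks could produce their messages along these lines. This is a sketch under assumptions: `FileRange`, `first_unsorted`, and `find_overlaps` are hypothetical names, positions are assumed to have been collected per chromosome already, and range ends are treated as inclusive.

```rust
/// Position range covered by one input file on one chromosome.
/// (Hypothetical type; a real scan would fill it from the records.)
#[derive(Debug)]
pub struct FileRange {
    pub file: String,
    pub chrom: String,
    pub start: u64,
    pub end: u64, // inclusive
}

/// Index of the first record whose position is lower than its predecessor,
/// or `None` if positions are non-decreasing. Equal positions are allowed,
/// since several records may share a position.
pub fn first_unsorted(positions: &[u64]) -> Option<usize> {
    positions.windows(2).position(|w| w[1] < w[0]).map(|i| i + 1)
}

/// Sort ranges by (chrom, start) and report every range that starts at or
/// before the greatest end seen so far on the same chromosome.
/// The chromosome sort is lexicographic and only groups files for this
/// check; it is not a substitute for contig order from the header.
pub fn find_overlaps(mut ranges: Vec<FileRange>) -> Vec<String> {
    ranges.sort_by(|a, b| (a.chrom.as_str(), a.start).cmp(&(b.chrom.as_str(), b.start)));

    let mut problems = Vec::new();
    // Range with the greatest end seen so far on the current chromosome.
    let mut widest: Option<&FileRange> = None;
    for r in &ranges {
        match widest {
            Some(w) if w.chrom == r.chrom => {
                if r.start <= w.end {
                    problems.push(format!(
                        "{} and {} overlap on {}: {}-{} vs {}-{}",
                        w.file, r.file, r.chrom, w.start, w.end, r.start, r.end
                    ));
                }
                if r.end > w.end {
                    widest = Some(r);
                }
            }
            // First range overall, or first range on a new chromosome.
            _ => widest = Some(r),
        }
    }
    problems
}
```

Returning all problems as a list, instead of stopping at the first one, lets the pre-flight phase print every issue in one pass before exiting.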
Related: #93, #28