direct VCF to sparse_g.bin: skip dense genotype parquet for cohort builds #96

@vineetver

The cohort build path writes a dense genotype parquet (200K samples * N variants * 4 bytes, ~13GB working memory) then reads it back to extract non-zero carriers into sparse_g.bin (~3MB). The dense parquet is a throwaway intermediate.

Proposal: write carrier lists directly to sparse_g.bin during VCF ingest. MAF falls out as a byproduct of counting carriers. This eliminates GenotypeWriter, the FixedSizeListBuilder, and the /4 memory budget formula.
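A minimal sketch of the per-variant step, assuming diploid dosages of 0, 1, or 2 with no missing calls. The function name `carriers_and_maf` and the `(sample_index, dosage)` tuple layout are hypothetical illustrations, not the actual sparse_g.bin encoding:

```python
from typing import List, Tuple


def carriers_and_maf(dosages: List[int]) -> Tuple[List[Tuple[int, int]], float]:
    """Return ([(sample_index, dosage), ...], maf) for one biallelic variant.

    Assumption: every dosage is 0, 1, or 2 and there are no missing calls.
    """
    carriers: List[Tuple[int, int]] = []
    alt_alleles = 0
    for sample_index, dosage in enumerate(dosages):
        if dosage:  # keep only non-zero genotypes
            carriers.append((sample_index, dosage))
            alt_alleles += dosage  # allele count accumulates in the same pass
    if not dosages:
        return carriers, 0.0
    alt_freq = alt_alleles / (2 * len(dosages))  # diploid: two alleles per sample
    return carriers, min(alt_freq, 1.0 - alt_freq)  # fold to the minor allele
```

The carrier list and the allele count come out of the same loop, so no dense row has to be kept after the variant is processed.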

Sparse carrier format is always smaller than dense for biallelic variants (3 bytes per carrier vs 4 bytes per sample, carriers <= total samples). The savings range from 2.7x at MAF 50% to 4000x at MAF 0.01%. The format is lossless at any allele frequency.
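The size comparison can be checked with a few lines of arithmetic. This sketch uses only the two per-entry sizes stated above (4 bytes per sample, 3 bytes per carrier) and ignores any per-variant header or offset table, which is an assumption; the function names are hypothetical:

```python
DENSE_BYTES_PER_SAMPLE = 4    # from the issue: 4 bytes per sample
SPARSE_BYTES_PER_CARRIER = 3  # from the issue: 3 bytes per carrier


def dense_bytes(n_samples: int) -> int:
    """Bytes for one variant stored densely."""
    return n_samples * DENSE_BYTES_PER_SAMPLE


def sparse_bytes(n_carriers: int) -> int:
    """Bytes for one variant stored as a carrier list (headers ignored)."""
    return n_carriers * SPARSE_BYTES_PER_CARRIER


def savings_ratio(n_samples: int, n_carriers: int) -> float:
    """Dense size divided by sparse size for one variant."""
    if n_carriers == 0:
        return float("inf")  # nothing to store in the sparse form
    return dense_bytes(n_samples) / sparse_bytes(n_carriers)
```

With 200,000 samples of which half are carriers, `savings_ratio(200_000, 100_000)` is 8/3, about 2.7. Since carriers never exceed samples, `sparse_bytes` never exceeds `dense_bytes` under these assumptions.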

Dense is a derived view, not a storage format. Any future analysis needing dense (GWAS regression, export) reconstructs it on the fly from sparse: allocate zeros, then scatter the carriers, one variant at a time in a streaming pass.
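A sketch of the zero-then-scatter reconstruction, assuming each variant's carriers are available as `(sample_index, dosage)` pairs. That layout and the names `densify` and `stream_dense` are hypothetical, since the on-disk sparse_g.bin encoding is not specified here:

```python
from typing import Iterable, Iterator, List, Tuple

Carrier = Tuple[int, int]  # (sample_index, dosage), an assumed in-memory layout


def densify(carriers: Iterable[Carrier], n_samples: int) -> List[int]:
    """Rebuild one dense genotype row: allocate zeros, scatter carriers."""
    dense = [0] * n_samples
    for sample_index, dosage in carriers:
        dense[sample_index] = dosage
    return dense


def stream_dense(
    variants: Iterable[Iterable[Carrier]], n_samples: int
) -> Iterator[List[int]]:
    """Yield dense rows one variant at a time; only one row is alive at once."""
    for carriers in variants:
        yield densify(carriers, n_samples)
```

Because `stream_dense` is a generator, a consumer such as a regression loop holds a single dense row at a time, so peak memory is bounded by the sample count rather than by samples times variants.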

Memory impact per worker: ~1.3GB (dense batch builder) → ~3MB (sparse carrier buffer). The OOM problems, budget splitting (#84), and batch sizing (#91) all simplify.

Related: #59, #84, #91, #95

Labels: ingest (VCF/genotype ingest pipeline), performance (Optimization and profiling)
