The cohort build path writes a dense genotype parquet (200K samples * N variants * 4 bytes, ~13GB working memory) then reads it back to extract non-zero carriers into sparse_g.bin (~3MB). The dense parquet is a throwaway intermediate.
Instead, write carrier lists directly to sparse_g.bin during VCF ingest. MAF falls out as a byproduct of counting carriers. This eliminates GenotypeWriter, the FixedSizeListBuilder, and the /4 memory budget formula.
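A minimal sketch of the per-variant step during ingest, assuming genotypes arrive as alt-allele dosages (0/1/2) with no missing calls. The function name `carriers_and_maf` and the `(sample_index, dosage)` tuple layout are hypothetical and are not the actual sparse_g.bin encoding:

```python
def carriers_and_maf(dosages):
    """Return ([(sample_index, dosage), ...], maf) for one biallelic variant.

    Assumes `dosages` holds alt-allele counts (0, 1 or 2) per sample and
    contains no missing calls; both are illustrative assumptions.
    """
    if not dosages:
        return [], 0.0
    # Keep only non-zero entries: these are the carriers.
    carriers = [(i, d) for i, d in enumerate(dosages) if d != 0]
    # The alt allele count comes for free from the carrier list.
    alt_count = sum(d for _, d in carriers)
    alt_freq = alt_count / (2 * len(dosages))
    # Minor allele frequency is the smaller of the two allele frequencies.
    return carriers, min(alt_freq, 1.0 - alt_freq)


carriers, maf = carriers_and_maf([0, 1, 0, 2, 0, 0, 0, 0])
print(carriers, maf)
```

Only the carrier list needs to be buffered per variant, so the dense row never has to be materialized on the write path.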
Sparse carrier format is always smaller than dense for biallelic variants (3 bytes per carrier vs 4 bytes per sample, carriers <= total samples). The savings range from 2.7x at MAF 50% to 4000x at MAF 0.01%. The format is lossless at any allele frequency.
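The size argument can be checked with a few lines of arithmetic. This sketch uses only the two per-entry sizes stated above (4 bytes per sample dense, 3 bytes per carrier sparse) and ignores any per-variant header overhead, which is an assumption; `size_ratio` is a hypothetical helper name:

```python
DENSE_BYTES_PER_SAMPLE = 4    # from the description above
SPARSE_BYTES_PER_CARRIER = 3  # from the description above


def size_ratio(n_samples, n_carriers):
    """Dense size divided by sparse size for one variant (headers ignored)."""
    if n_carriers == 0:
        return float("inf")  # nothing to store in the sparse form
    dense = DENSE_BYTES_PER_SAMPLE * n_samples
    sparse = SPARSE_BYTES_PER_CARRIER * n_carriers
    return dense / sparse


# Worst case: every sample is a carrier, sparse is still smaller (ratio > 1).
print(size_ratio(200_000, 200_000))
# Half of the samples are carriers.
print(size_ratio(200_000, 100_000))
```

With half the samples as carriers the ratio is 8/3, roughly the 2.7x quoted for the high-frequency end; even with every sample a carrier it stays at 4/3.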
Dense is a derived view, not a storage format. Any future analysis that needs dense genotypes (GWAS regression, export) reconstructs them on the fly from sparse: allocate zeros, scatter carriers, streaming one variant at a time.
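The "allocate zeros, scatter carriers" reconstruction could look roughly like this. It is a sketch that assumes each variant's carriers are available as `(sample_index, dosage)` pairs; `densify` and `iter_dense` are hypothetical names, and reading sparse_g.bin itself is out of scope here:

```python
def densify(carriers, n_samples):
    """Rebuild one dense genotype row from a carrier list."""
    dense = [0] * n_samples          # allocate zeros
    for sample_index, dosage in carriers:
        dense[sample_index] = dosage  # scatter carriers
    return dense


def iter_dense(sparse_variants, n_samples):
    """Yield dense rows one variant at a time, never holding the full matrix."""
    for carriers in sparse_variants:
        yield densify(carriers, n_samples)


sparse = [[(1, 1), (3, 2)], [], [(0, 1)]]
for row in iter_dense(sparse, n_samples=5):
    print(row)
```

Because `iter_dense` is a generator, peak memory is one dense row plus whatever the consumer holds, regardless of the number of variants.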
Memory impact per worker: ~1.3GB (dense batch builder) → ~3MB (sparse carrier buffer). The OOM problems, budget splitting (#84), and batch sizing (#91) all simplify.
Related: #59, #84, #91, #95