split memory budget between reader and writer in VCF ingest #84

@vineetver

Reader and writer both derive batch sizes from the same undivided memory budget. They compete for RAM without coordination.

The writer dominates: GenotypeWriter allocates batch_size * n_samples * 4 bytes for the FixedSizeListBuilder, and Arrow's finish() temporarily doubles that. The current workaround is to give the writer budget / 4, which wastes half the memory.
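To make the arithmetic concrete, here is a rough sketch of the writer-side estimate. It assumes the figures quoted above (4 bytes per value, a temporary 2x peak during finish()). The function names `writer_peak_bytes` and `max_batch_size` are hypothetical and do not come from the codebase.

```rust
// Assumption: each genotype value occupies 4 bytes, as in the issue text.
const BYTES_PER_VALUE: usize = 4;
// Assumption: finish() briefly holds about two copies of the batch.
const FINISH_PEAK_FACTOR: usize = 2;

/// Estimated peak bytes for one writer batch, or None on overflow.
pub fn writer_peak_bytes(batch_size: usize, n_samples: usize) -> Option<usize> {
    batch_size
        .checked_mul(n_samples)?
        .checked_mul(BYTES_PER_VALUE)?
        .checked_mul(FINISH_PEAK_FACTOR)
}

/// Largest batch size whose estimated peak fits in `writer_budget` bytes.
/// Returns None when `n_samples` is zero or the per-variant cost overflows.
pub fn max_batch_size(writer_budget: usize, n_samples: usize) -> Option<usize> {
    let per_variant = n_samples
        .checked_mul(BYTES_PER_VALUE)?
        .checked_mul(FINISH_PEAK_FACTOR)?;
    writer_budget.checked_div(per_variant)
}
```

With 50,000 samples, each variant costs 400,000 bytes at peak under these assumptions, so a 1 GiB writer budget allows batches of 2,684 variants.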

Split the budget explicitly: reader gets what it needs (BGZF buffers, line buffer), writer gets the rest for larger batches and fewer flushes. Also: CohortPool claims the full budget during ingest even though DataFusion is idle.
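One possible shape for the explicit split, as a sketch only: the reader's need is computed first from its buffers, and the writer receives whatever is left. `IngestBudget`, `reader_need`, and `split_budget` are hypothetical names, and the buffer counts and sizes are placeholders to be measured in src/ingest/vcf.rs.

```rust
/// Hypothetical result of splitting one ingest budget into two parts.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IngestBudget {
    pub reader_bytes: usize,
    pub writer_bytes: usize,
}

/// Reader requirement: a fixed number of BGZF buffers plus one line buffer.
/// All three inputs are assumptions supplied by the caller.
pub fn reader_need(
    bgzf_buffers: usize,
    bgzf_buffer_bytes: usize,
    line_buffer_bytes: usize,
) -> usize {
    bgzf_buffers
        .saturating_mul(bgzf_buffer_bytes)
        .saturating_add(line_buffer_bytes)
}

/// Give the reader exactly what it needs and the writer the remainder.
/// Returns None when the total budget leaves nothing for the writer.
pub fn split_budget(total: usize, reader_bytes: usize) -> Option<IngestBudget> {
    let writer_bytes = total.checked_sub(reader_bytes)?;
    if writer_bytes == 0 {
        return None;
    }
    Some(IngestBudget {
        reader_bytes,
        writer_bytes,
    })
}
```

For example, four 64 KiB BGZF buffers plus a 1 MiB line buffer come to 1,310,720 bytes, which leaves almost all of a 1 GiB budget for the writer instead of the current quarter.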

Files: src/ingest/vcf.rs, src/staar/genotype.rs, src/resource.rs, src/engine.rs

Related: #74, #82, #59, #83

Metadata

Assignees

No one assigned

    Labels

    ingest (VCF/genotype ingest pipeline), performance (Optimization and profiling)
