Rdata I/O: phenotype input and per-mask output for STAARpipelineSummary #109

@vineetver

Running STAARpipeline-Tutorial end-to-end via favor-cli needs Rdata at both ends:

  • phenotype often shipped as an .Rdata data frame
  • STAARpipelineSummary calls get(load(.)) on per-shard output files and expects named R objects

Current state: phenotype loading supports CSV/TSV only; outputs are written as parquet + JSON metadata.

Tutorial expectations (from STAARpipelineSummary scripts):

  • individual: one data frame per shard, filename <output>_<chr>_<groupid>.Rdata, columns CHR,POS,REF,ALT,ALT_AF,MAC,N,pvalue,Score,SE,Est
  • gene-centric coding/noncoding/ncRNA: list of mask data frames; columns include Gene,Chr,Category,#SNV,cMAC,MAF_cutoff,STAAR-O,ACAT-O,STAAR-S(1,25),STAAR-S(1,1),STAAR-B(1,25),STAAR-B(1,1),STAAR-A(1,25),STAAR-A(1,1), plus per-annotation sub p-values
  • sliding window: same column shape keyed by chr,start_loc,end_loc
  • SCANG: list with SCANG_O/S/B _res, _top1, _emthr
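To make the individual-analysis expectation concrete, here is a minimal sketch of the shard naming and a strict column check. The helper names (`individual_shard_filename`, `check_columns`) and the parameter types are hypothetical and not part of favor-cli. The column list is copied from the bullet above, and the sketch assumes order matters as well as names.

```rust
/// Column names expected in each individual-analysis data frame,
/// copied from the list above (order assumed to be significant).
const INDIVIDUAL_COLUMNS: [&str; 11] = [
    "CHR", "POS", "REF", "ALT", "ALT_AF", "MAC", "N", "pvalue", "Score", "SE", "Est",
];

/// Builds `<output>_<chr>_<groupid>.Rdata` for one shard.
/// Hypothetical helper: `chr` is a string so non-numeric labels also work;
/// `group_id` as `u32` is an assumption.
fn individual_shard_filename(output: &str, chr: &str, group_id: u32) -> String {
    format!("{output}_{chr}_{group_id}.Rdata")
}

/// Compares actual column names against the expected ones, exactly
/// (case-sensitive, same order), and reports the first mismatch.
fn check_columns(actual: &[&str], expected: &[&str]) -> Result<(), String> {
    if actual.len() != expected.len() {
        return Err(format!(
            "expected {} columns, got {}",
            expected.len(),
            actual.len()
        ));
    }
    for (i, (a, e)) in actual.iter().zip(expected).enumerate() {
        if a != e {
            return Err(format!("column {i}: expected `{e}`, got `{a}`"));
        }
    }
    Ok(())
}
```

Running a check like `check_columns(&cols, &INDIVIDUAL_COLUMNS)` just before writing a shard would surface a renamed or reordered column at write time rather than when the summary step loads the file.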

Needs:

  • Rdata reader for phenotype input (serde-rdata or equivalent)
  • Rdata writer for per-shard outputs
  • --output-format flag accepting parquet (default), rdata, or both
  • object and column names match STAARpipelineSummary load sites exactly
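One possible shape for the `--output-format` value, sketched with the standard library only. The `OutputFormat` type and its method names are hypothetical, case-insensitive parsing is an assumption, and wiring this into the actual CLI argument parser is left out.

```rust
use std::str::FromStr;

/// Hypothetical value type for `--output-format`; parquet stays the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum OutputFormat {
    #[default]
    Parquet,
    Rdata,
    Both,
}

impl OutputFormat {
    /// True when parquet (+ JSON metadata) output should be written.
    fn writes_parquet(self) -> bool {
        matches!(self, Self::Parquet | Self::Both)
    }

    /// True when per-shard .Rdata output should be written.
    fn writes_rdata(self) -> bool {
        matches!(self, Self::Rdata | Self::Both)
    }
}

impl FromStr for OutputFormat {
    type Err = String;

    /// Accepts `parquet`, `rdata`, or `both` (case-insensitive by assumption).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "parquet" => Ok(Self::Parquet),
            "rdata" => Ok(Self::Rdata),
            "both" => Ok(Self::Both),
            other => Err(format!(
                "unknown output format `{other}` (expected parquet, rdata, or both)"
            )),
        }
    }
}
```

Keeping the two `writes_*` predicates on the enum lets each writer ask one question instead of matching on all three variants, so `both` needs no special handling at the call sites.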

Labels: enhancement (New feature or request), staar (STAAR rare-variant association)