Sparse ancestry-adjusted GRM builder (FastSparseGRM / Lin-Dey) #99

@vineetver

Goal

Make sparse ancestry-adjusted GRM a first-class stage in the STAAR pipeline. Default path uses FastSparseGRM-style construction (Lin/Dey 2024). Auto-invoked by favor staar with cache + skip conditions, not a mandatory pre-step users must remember.

Reference: Lin X, Dey R, Li X, Li Z. Scalable analysis of large multi-ancestry biobanks by leveraging sparse ancestry-adjusted sample-relatedness. Res Sq Preprint 2024. doi:10.21203/rs.3.rs-5343361/v1. PMC11601839.
Software: rounakdey/FastSparseGRM (R).

What FastSparseGRM Produces

Three artifacts feed the existing null-model fit (the GMMAT glmmkin port). FastSparseGRM supplies inputs to that fit; it is not a replacement solver.

artifact                     consumed by            purpose
sparse K (block-diagonal)    --kinship              random-effect covariance
PCs (n x k)                  phenotype covariates   fixed-effect ancestry adjustment
variant subset used          run manifest           provenance

Algorithm:

  1. LD-prune common variants
  2. Compute genetic PCs (hdpca, bias-corrected)
  3. Regress each SNP on top-k PCs, residualize
  4. Pairwise kinship on residualized genotypes
  5. Threshold at cutoff (default 0.022) -> block-sparse K
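
Steps 3 to 5 can be sketched in a few lines of Rust. This is a minimal illustration, not the FastSparseGRM estimator: it assumes the PC columns are mean-zero and orthonormal (so the regression reduces to dot products), and it uses a plain standardized cross-product as a stand-in kinship estimate. The names `residualize`, `standardize`, and `sparse_kinship` are hypothetical.

```rust
/// Step 3: residualize one variant's genotypes on an intercept and the PCs.
/// ASSUMPTION: each PC column is mean-zero and the columns are orthonormal,
/// so the least-squares coefficients are plain dot products.
fn residualize(genotypes: &[f64], pcs: &[Vec<f64>]) -> Vec<f64> {
    let n = genotypes.len() as f64;
    let mean = genotypes.iter().sum::<f64>() / n;
    let mut resid: Vec<f64> = genotypes.iter().map(|g| g - mean).collect();
    for pc in pcs {
        let coef: f64 = pc.iter().zip(genotypes).map(|(p, g)| p * g).sum();
        for (r, p) in resid.iter_mut().zip(pc) {
            *r -= coef * p;
        }
    }
    resid
}

/// Scale residuals to unit variance; None if nothing is left after adjustment.
fn standardize(resid: &[f64]) -> Option<Vec<f64>> {
    let n = resid.len() as f64;
    let var = resid.iter().map(|r| r * r).sum::<f64>() / n;
    if var <= 1e-12 {
        return None;
    }
    let sd = var.sqrt();
    Some(resid.iter().map(|r| r / sd).collect())
}

/// Steps 4 and 5: pairwise kinship on residualized genotypes, thresholded.
/// Returns (i, j, kinship) triplets with i <= j; the diagonal is always kept.
/// ASSUMPTION: stand-in estimator K_ij = sum_m z_mi * z_mj / (2 * M).
fn sparse_kinship(
    variants: &[Vec<f64>],
    pcs: &[Vec<f64>],
    cutoff: f64,
) -> Vec<(usize, usize, f64)> {
    let z: Vec<Vec<f64>> = variants
        .iter()
        .filter_map(|g| standardize(&residualize(g, pcs)))
        .collect();
    let mut triplets = Vec::new();
    if z.is_empty() {
        return triplets;
    }
    let n_samples = z[0].len();
    let m = z.len() as f64;
    for i in 0..n_samples {
        for j in i..n_samples {
            let k = z.iter().map(|v| v[i] * v[j]).sum::<f64>() / (2.0 * m);
            if i == j || k >= cutoff {
                triplets.push((i, j, k));
            }
        }
    }
    triplets
}
```

The dense double loop is only for illustration; a real builder would restrict the pairwise step to candidate pairs and group the surviving pairs into blocks.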

Pipeline Integration

New stage EnsureGrm, parallel to existing EnsureStore / EnsureScoreCache:

stage            auto-invoked?   skip if?                              cache key
EnsureStore      yes             cohort manifest exists                 VCF + annotation hash
EnsureGrm        yes             --kinship provided                     (store hash, grm config) hash
                                 OR relatedness probe returns none
                                 OR cache hit
FitNullModel     yes             null cache hit                         (pheno, covar, K) hash
EnsureScoreCache yes             score cache exists                     (mask, MAF, store) hash
  • favor staar runs EnsureGrm automatically when --kinship is not supplied
  • fast pre-probe on a subset of samples to detect any relatedness; if none, skip GRM build and fall through to Glm/Logistic null
  • cached by input hash; rerun is free
  • --dry-run surfaces the stage
  • --format json emits the decision (built / cache-hit / skipped-unrelated / provided-by-user)
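
A cache key over (store hash, grm config) could be derived as below. This is a sketch under assumptions: the `GrmConfig` fields and the `grm_cache_key` name are hypothetical, and `DefaultHasher` is used only to keep the example dependency-free. Its algorithm is not guaranteed stable across Rust releases, so an on-disk cache would likely want a fixed hash instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical GRM configuration; the real field set is not specified in this issue.
struct GrmConfig {
    kinship_cutoff: f64,
    n_pcs: u32,
}

/// Combine the store hash and the GRM config into one hex cache key.
/// ASSUMPTION: DefaultHasher stands in for a stable content hash.
fn grm_cache_key(store_hash: &str, cfg: &GrmConfig) -> String {
    let mut h = DefaultHasher::new();
    store_hash.hash(&mut h);
    // f64 does not implement Hash, so hash its bit pattern instead.
    cfg.kinship_cutoff.to_bits().hash(&mut h);
    cfg.n_pcs.hash(&mut h);
    format!("{:016x}", h.finish())
}
```

Any change to the store or to a config field yields a different key, which is what makes a rerun with identical inputs a cache hit.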

Also expose standalone favor grm subcommand for pre-building, scripting, or inspection. Same builder, same cache.

Correctness Default

Auto-invoke matters: a user with a related multi-ancestry cohort who forgets the GRM silently gets inflated type-I error. With the pre-probe and the cache, the default path is both correct and cheap on reruns.

Skip Conditions (explicit)

  • --kinship <path> supplied: trust the user, skip build
  • relatedness probe on random sample subset shows zero pairs above kinship cutoff: skip build, log decision
  • cache hit on (store hash, grm config): reuse, skip build
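
The skip conditions above map naturally onto a small decision function. The sketch below is illustrative only: `GrmInputs`, `GrmDecision`, `decide`, and the JSON field names are hypothetical, and checking the conditions in the order listed is an assumption (a real stage might consult the cache before running the probe).

```rust
use std::path::PathBuf;

/// Outcome of the EnsureGrm stage, matching the four decisions named in this issue.
#[derive(Debug, PartialEq)]
enum GrmDecision {
    ProvidedByUser,
    SkippedUnrelated,
    CacheHit,
    Built,
}

impl GrmDecision {
    /// Label as it might appear in --format json output.
    fn label(&self) -> &'static str {
        match self {
            GrmDecision::ProvidedByUser => "provided-by-user",
            GrmDecision::SkippedUnrelated => "skipped-unrelated",
            GrmDecision::CacheHit => "cache-hit",
            GrmDecision::Built => "built",
        }
    }

    /// ASSUMPTION: the JSON keys "stage" and "decision" are placeholders.
    fn to_json(&self) -> String {
        format!("{{\"stage\":\"EnsureGrm\",\"decision\":\"{}\"}}", self.label())
    }
}

/// Hypothetical inputs gathered before the stage runs.
struct GrmInputs {
    kinship: Option<PathBuf>,   // --kinship <path>, if supplied
    probe_related_pairs: usize, // pairs above the kinship cutoff in the probe
    cache_hit: bool,            // (store hash, grm config) key found in the cache
}

fn decide(inputs: &GrmInputs) -> GrmDecision {
    if inputs.kinship.is_some() {
        return GrmDecision::ProvidedByUser;
    }
    if inputs.probe_related_pairs == 0 {
        return GrmDecision::SkippedUnrelated;
    }
    if inputs.cache_hit {
        return GrmDecision::CacheHit;
    }
    GrmDecision::Built
}
```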

Needs

  • favor grm subcommand that produces K + PCs + manifest
  • EnsureGrm stage wired into src/staar/pipeline.rs with the same run.json + cache discipline as other stages
  • relatedness probe (cheap pairwise kinship on a sample subset) to drive the skip decision
  • machine-readable output honoring --format json
  • docs: when FastSparseGRM helps (related + mixed ancestry), when it is skipped (unrelated, single ancestry, user-provided K)
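
The relatedness probe listed above could be as small as an early-exit scan over a sample subset. This is a sketch, not a specification: `probe_finds_relatedness` is a hypothetical name, the input is assumed to be already-standardized genotypes in variant-major layout, the kinship estimate is a plain standardized cross-product, and choosing the random subset is left to the caller.

```rust
/// Returns true as soon as one pair in `subset` reaches the kinship cutoff.
/// `z[m][i]` is the standardized genotype of sample i at variant m.
/// ASSUMPTION: stand-in estimator K_ij = sum_m z_mi * z_mj / (2 * M).
fn probe_finds_relatedness(z: &[Vec<f64>], subset: &[usize], cutoff: f64) -> bool {
    if z.is_empty() {
        return false;
    }
    let m = z.len() as f64;
    for (a, &i) in subset.iter().enumerate() {
        // Only visit each unordered pair once.
        for &j in &subset[a + 1..] {
            let k = z.iter().map(|v| v[i] * v[j]).sum::<f64>() / (2.0 * m);
            if k >= cutoff {
                return true;
            }
        }
    }
    false
}
```

Exiting on the first hit keeps the probe cheap for related cohorts; only a fully unrelated subset pays for every pair.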

Out Of Scope

  • replacing the null solver. glmmkin port stays.
  • rebuilding PCA from scratch. Reuse hdpca ideas or existing Rust/faer PCA; no need to port the R code verbatim.

Metadata

Labels: enhancement (New feature or request), staar (STAAR rare-variant association)