Goal
Make sparse ancestry-adjusted GRM a first-class stage in the STAAR pipeline. Default path uses FastSparseGRM-style construction (Lin/Dey 2024). Auto-invoked by favor staar with cache + skip conditions, not a mandatory pre-step users must remember.
Reference: Lin X, Dey R, Li X, Li Z. Scalable analysis of large multi-ancestry biobanks by leveraging sparse ancestry-adjusted sample-relatedness. Res Sq Preprint 2024. doi:10.21203/rs.3.rs-5343361/v1. PMC11601839.
Software: rounakdey/FastSparseGRM (R).
What FastSparseGRM Produces
Three artifacts feed the existing null-model fit (GMMAT glmmkin port). Not a replacement solver.
| artifact | consumed by | purpose |
|---|---|---|
| sparse K (block-diagonal) | --kinship | random-effect covariance |
| PCs (n x k) | phenotype covariates | fixed-effect ancestry adjustment |
| variant subset used | run manifest | provenance |
Algorithm:
- LD-prune common variants
- Compute genetic PCs (hdpca, bias-corrected)
- Regress each SNP on top-k PCs, residualize
- Pairwise kinship on residualized genotypes
- Threshold at cutoff (default 0.022) -> block-sparse K
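The final thresholding step can be sketched as follows. This is a minimal illustration, not the FastSparseGRM implementation: it assumes pairwise kinship estimates are already available as `(i, j, kinship)` triples, treats a pair as related when its kinship is at or above the cutoff, and uses a union-find to group samples into the blocks of a block-diagonal K. The names `DisjointSet` and `sparse_blocks` are hypothetical.

```rust
use std::collections::BTreeMap;

/// Union-find over sample indices, used to group related samples into blocks.
struct DisjointSet {
    parent: Vec<usize>,
}

impl DisjointSet {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Find the representative of `x`, halving the path along the way.
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Merge the sets containing `a` and `b`.
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }
}

/// Keep only pairs with kinship >= `cutoff` (assumed convention) and return
/// the resulting blocks of sample indices. Unrelated samples form blocks of
/// size one. `pairs` holds (i, j, kinship) with i != j.
fn sparse_blocks(n: usize, pairs: &[(usize, usize, f64)], cutoff: f64) -> Vec<Vec<usize>> {
    let mut sets = DisjointSet::new(n);
    for &(i, j, kinship) in pairs {
        if kinship >= cutoff {
            sets.union(i, j);
        }
    }
    // Group samples by representative; BTreeMap keeps the output order stable.
    let mut blocks: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for sample in 0..n {
        let root = sets.find(sample);
        blocks.entry(root).or_default().push(sample);
    }
    blocks.into_values().collect()
}
```

With five samples and pairs `(0, 1, 0.25)`, `(1, 2, 0.03)`, `(3, 4, 0.01)` at cutoff 0.022, samples 0, 1 and 2 end up in one block while 3 and 4 stay singletons, since their kinship falls below the cutoff.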
Pipeline Integration
New stage EnsureGrm, parallel to existing EnsureStore / EnsureScoreCache:
| stage | auto-invoked? | skip if? | cache key |
|---|---|---|---|
| EnsureStore | yes | cohort manifest exists | VCF + annotation hash |
| EnsureGrm | yes | --kinship provided OR relatedness probe returns none OR cache hit | (store hash, grm config) hash |
| FitNullModel | yes | null cache hit | (pheno, covar, K) hash |
| EnsureScoreCache | yes | score cache exists | (mask, MAF, store) hash |
- favor staar runs EnsureGrm automatically when --kinship is not supplied
- fast pre-probe on a subset of samples to detect any relatedness; if none, skip GRM build and fall through to Glm/Logistic null
- cached by input hash; rerun is free
- --dry-run surfaces the stage
- --format json emits the decision (built / cache-hit / skipped-unrelated / provided-by-user)
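One way the "cached by input hash" key for EnsureGrm might be derived from (store hash, grm config) is sketched below. The `GrmConfig` struct, its fields, and `grm_cache_key` are hypothetical names for illustration. The standard-library `DefaultHasher` is used only to keep the sketch dependency-free; its algorithm is not guaranteed stable across Rust releases, so an on-disk cache would need a fixed hash function instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical GRM build settings; the field names are illustrative only.
#[derive(Hash)]
struct GrmConfig {
    n_pcs: u32,
    /// Kinship cutoff stored as raw bits, because f64 does not implement Hash.
    kinship_cutoff_bits: u64,
}

impl GrmConfig {
    fn new(n_pcs: u32, kinship_cutoff: f64) -> Self {
        Self {
            n_pcs,
            kinship_cutoff_bits: kinship_cutoff.to_bits(),
        }
    }
}

/// Combine the store hash and the GRM config into a single hex cache key.
/// Any change to either input yields a different key with overwhelming
/// probability, which is what forces a rebuild.
fn grm_cache_key(store_hash: &str, config: &GrmConfig) -> String {
    let mut hasher = DefaultHasher::new();
    store_hash.hash(&mut hasher);
    config.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}
```

Identical inputs map to the same key, so a rerun finds the cached artifacts; changing the cutoff or the number of PCs changes the key and triggers a rebuild.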
Also expose standalone favor grm subcommand for pre-building, scripting, or inspection. Same builder, same cache.
Correctness Default
Auto-invoke matters: a user with a related multi-ancestry cohort who forgets the GRM gets inflated type-I error silently. Pre-probe + cache means the default path is correct and cheap on reruns.
Skip Conditions (explicit)
- --kinship <path> supplied: trust the user, skip build
- relatedness probe on random sample subset shows zero pairs above kinship cutoff: skip build, log decision
- cache hit on (store hash, grm config): reuse, skip build
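The skip conditions above could be encoded as a small decision function. This is a sketch under assumptions: `GrmDecision`, `GrmInputs`, and `decide` are hypothetical names, the labels mirror the four decisions listed for --format json, and checking the cache before the probe result is an assumed ordering (a cache lookup is cheaper than a probe), not something the list prescribes.

```rust
/// Outcome of the EnsureGrm stage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GrmDecision {
    Built,
    CacheHit,
    SkippedUnrelated,
    ProvidedByUser,
}

impl GrmDecision {
    /// Label intended for machine-readable output.
    fn label(self) -> &'static str {
        match self {
            GrmDecision::Built => "built",
            GrmDecision::CacheHit => "cache-hit",
            GrmDecision::SkippedUnrelated => "skipped-unrelated",
            GrmDecision::ProvidedByUser => "provided-by-user",
        }
    }
}

/// Hypothetical inputs gathered before the stage runs.
struct GrmInputs<'a> {
    /// Value of --kinship, if the user supplied one.
    kinship_path: Option<&'a str>,
    /// Whether artifacts exist for the (store hash, grm config) key.
    cache_hit: bool,
    /// Number of pairs above the kinship cutoff found by the probe.
    related_pairs_in_probe: usize,
}

/// Apply the skip conditions in order; build only when none of them holds.
fn decide(inputs: &GrmInputs) -> GrmDecision {
    if inputs.kinship_path.is_some() {
        GrmDecision::ProvidedByUser
    } else if inputs.cache_hit {
        GrmDecision::CacheHit
    } else if inputs.related_pairs_in_probe == 0 {
        GrmDecision::SkippedUnrelated
    } else {
        GrmDecision::Built
    }
}
```

A user-supplied kinship matrix always wins, so an explicit --kinship is never overridden by a stale cache entry or by the probe.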
Needs
- favor grm subcommand that produces K + PCs + manifest
- EnsureGrm stage wired into src/staar/pipeline.rs with the same run.json + cache discipline as other stages
- relatedness probe (cheap pairwise kinship on a sample subset) to drive the skip decision
- machine-readable output honoring --format json
- docs: when FastSparseGRM helps (related + mixed ancestry), when it is skipped (unrelated, single ancestry, user-provided K)
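The relatedness probe in the list above might look roughly like this. It is a sketch, not a specified design: it assumes each sample's genotypes are already standardized per variant, estimates kinship as half the mean cross-product (half the GRM entry), and leaves the choice of the random subset to the caller. The function names are hypothetical.

```rust
/// Estimate kinship between two samples from standardized genotypes as half
/// the mean cross-product. This estimator is an assumption for illustration.
fn kinship(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "samples must cover the same variants");
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    0.5 * dot / a.len() as f64
}

/// Count pairs within `subset` whose estimated kinship exceeds `cutoff`.
/// `genotypes[i]` holds the standardized genotypes of sample `i`; `subset`
/// lists the sample indices to probe. Cost is quadratic in the subset size
/// only, not in the cohort size.
fn probe_related_pairs(genotypes: &[Vec<f64>], subset: &[usize], cutoff: f64) -> usize {
    let mut count = 0;
    for (pos, &i) in subset.iter().enumerate() {
        for &j in &subset[pos + 1..] {
            if kinship(&genotypes[i], &genotypes[j]) > cutoff {
                count += 1;
            }
        }
    }
    count
}
```

A return value of zero corresponds to the "skip build, log decision" branch; any positive count means the GRM build should proceed.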
Out Of Scope
- replacing the null solver. glmmkin port stays.
- rebuilding PCA from scratch. Reuse hdpca ideas or existing Rust/faer PCA; no need to port the R code verbatim.