Core RNA-seq analysis and workflow orchestration for METAINFORMANT.
- Architecture
- Submodules
- Key Classes
- Usage
- Workflow Steps
- Download Strategy: ENA-First with NCBI Fallback
- Index Complexity Management
- GWAS Integration
- Related
```mermaid
graph TD
    subgraph "RNA Module"
        E[engine/] --> |workflow.py| W[Workflow Execution]
        E --> |monitoring.py| M[Progress Monitoring]
        E --> |discovery.py| D[Species Discovery]
        E --> |streaming_orchestrator.py| SO[Multi-Species Pipeline]
        A[amalgkit/] --> |amalgkit.py| AK[Amalgkit Wrapper]
        A --> |genome_prep.py| G[Genome Preparation]
        A --> |metadata_filter.py| MD[Metadata Handling]
        C[core/] --> |configs.py| CF[Configuration]
        C --> |cleanup.py| CL[Cleanup Utilities]
        R[retrieval/] --> |ena_downloader.py| ENA[ENA Download]
        AN[analysis/] --> |expression_core.py| EX[Expression Analysis]
    end
```
| Module | Purpose |
|---|---|
| `engine/` | Workflow execution, monitoring, orchestration |
| `amalgkit/` | Amalgkit tool wrapper and API |
| `core/` | Configuration, cleanup, dependencies |
| `retrieval/` | ENA FASTQ data retrieval |
| `analysis/` | Expression matrix analysis, QC, validation |
| `deconvolution/` | Cell-type deconvolution from bulk RNA-seq |
| `splicing/` | Alternative splicing analysis |
- `AmalgkitWorkflowConfig`: Workflow configuration loaded from YAML
- `StreamingPipelineOrchestrator`: Multi-species ENA-first orchestrator
- `StreamingPipeline`: Per-sample download→quant→cleanup pipeline
- `ProgressTracker`: Real-time progress state management
- `AmalgkitParams`: Typed parameter container for amalgkit CLI calls
- `build_amalgkit_command()`: CLI command builder
- `run_amalgkit()`: Execute any amalgkit step
- `GenomePreparator`: Reference genome download and kallisto indexing
- `TissueNormalizer`: Tissue label normalization via mappings
```python
from metainformant.rna.engine.workflow import AmalgkitWorkflowConfig, execute_workflow

# Load configuration
config = AmalgkitWorkflowConfig.load("config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml")

# Execute workflow
result = execute_workflow(config, steps=["getfastq", "quant", "merge"])
```

| Step | Description |
|---|---|
| `metadata` | Fetch sample metadata from NCBI |
| `select` | Filter to valid RNA-seq samples |
| `getfastq` | Download SRA → extract FASTQ |
| `quant` | Quantify with kallisto |
| `merge` | Combine abundance files |
| `curate` | Quality control and filtering |
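The steps in the table run in a fixed order, since each one consumes the output of the one before it. A minimal sketch of how a requested subset could be checked and sorted into that order follows. `STEP_ORDER` and `order_steps` are hypothetical names used only for illustration; they are not part of the METAINFORMANT API, and `execute_workflow` may handle ordering differently.

```python
# Hypothetical helper, not part of metainformant.
# The step names are taken from the workflow steps table.
STEP_ORDER = ["metadata", "select", "getfastq", "quant", "merge", "curate"]


def order_steps(requested):
    """Return the requested steps sorted into pipeline order.

    Raises ValueError if any requested name is not a known step.
    """
    unknown = [step for step in requested if step not in STEP_ORDER]
    if unknown:
        raise ValueError(f"Unknown workflow step(s): {unknown}")
    wanted = set(requested)
    return [step for step in STEP_ORDER if step in wanted]


print(order_steps(["merge", "getfastq", "quant"]))
# ['getfastq', 'quant', 'merge']
```

Validating the names up front means a typo in a step name fails immediately rather than after a long download.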
All species use a two-tier download strategy managed by the `StreamingPipelineOrchestrator`:

- **ENA primary**: Direct FTP/HTTP downloads of `.fastq.gz` from the European Nucleotide Archive. Bypasses slow `prefetch` + `fasterq-dump` extraction.
- **NCBI fallback**: If the ENA download fails, falls back to `fasterq-dump` from NCBI SRA.

Operational details:

- Entry point: `scripts/rna/run_workflow.py` → `StreamingPipelineOrchestrator`
- Concurrency: Up to 16 parallel workers with SQLite-backed progress tracking
- Scheduling: Size-ordered (smallest samples first) for maximum throughput
- Monitoring: Real-time TUI via `scripts/rna/monitor_tui.py`
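The ENA-first logic and smallest-first scheduling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `StreamingPipelineOrchestrator` implementation: `download_with_fallback`, `run_all`, and the stub fetchers are hypothetical names, the real downloaders are replaced by injected callables, and the SQLite progress tracking is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def download_with_fallback(run_id, ena_fetch, ncbi_fetch):
    """Try ENA first; on any failure, fall back to NCBI.

    Returns (run_id, source). An NCBI failure propagates to the caller.
    """
    try:
        ena_fetch(run_id)
        return run_id, "ENA"
    except Exception:
        ncbi_fetch(run_id)
        return run_id, "NCBI"


def run_all(samples, ena_fetch, ncbi_fetch, max_workers=16):
    """Submit samples smallest-first to a bounded worker pool.

    `samples` is a list of (run_id, size_bytes) tuples. Results are
    returned in submission order (smallest sample first).
    """
    ordered = sorted(samples, key=lambda sample: sample[1])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(download_with_fallback, run_id, ena_fetch, ncbi_fetch)
            for run_id, _size in ordered
        ]
        return [future.result() for future in futures]


# Stub fetchers standing in for the real ENA and NCBI downloaders.
def fake_ena(run_id):
    if run_id == "SRR2":
        raise IOError("simulated ENA failure")


def fake_ncbi(run_id):
    pass


samples = [("SRR1", 300), ("SRR2", 100), ("SRR3", 200)]
print(run_all(samples, fake_ena, fake_ncbi))
# [('SRR2', 'NCBI'), ('SRR3', 'ENA'), ('SRR1', 'ENA')]
```

Passing the fetchers in as arguments keeps the fallback decision testable without network access.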
When the NCBI fallback is triggered across hundreds of parallel workers, `fasterq-dump` writes to its internal scratch directories. In the containerized Docker context (`ghcr.io/docxology/metainformant/pipeline`), this places hundreds of gigabytes in the unmapped overlay filesystem (`/app/fasterq.tmp.*` and `/tmp/sra-cache/`), which lives on the VM's OS disk (`/dev/root`) rather than on the mapped data volumes. If the root partition reaches 100%, the entire pipeline deadlocks and OS calls fail silently. Recovery requires rebooting the VM and manually purging those directories (`rm -rf`).
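One way to guard against this failure mode is to check free space before launching a fallback download and to point `fasterq-dump` scratch files at a mapped volume. The sketch below is a suggestion, not something the pipeline is documented to do: `has_headroom` and `fasterq_dump_command` are hypothetical helpers, the 10% threshold and the `/data/...` paths are assumptions, and only the `--outdir` and `--temp` options of `fasterq-dump` are used. It does not relocate `/tmp/sra-cache/`.

```python
import shutil


def has_headroom(path, min_free_fraction=0.10):
    """Return True if the filesystem holding `path` has enough free space.

    `min_free_fraction` is an assumed threshold, not a pipeline setting.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction


def fasterq_dump_command(run_id, outdir, scratch_dir):
    """Build a fasterq-dump call whose scratch files go to `scratch_dir`."""
    return ["fasterq-dump", run_id, "--outdir", outdir, "--temp", scratch_dir]


# Example: refuse to start a fallback download if the root disk is nearly full.
if has_headroom("/"):
    # Hypothetical paths on a mapped data volume.
    print(fasterq_dump_command("SRR1", "/data/fastq", "/data/scratch"))
else:
    print("Root filesystem is low on space; skipping NCBI fallback")
```

The command is returned as a list so it can be passed to `subprocess.run` without shell quoting.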
For genomes with high repetitive content (e.g., Harpegnathos saltator), a standard `kallisto index` may stall.

Symptoms:

- `kallisto quant` processes hang indefinitely with 100% CPU.
- `Max EC size` > 3000 in index stats.

Solution: `IndexComplexityManager` in `amalgkit/index_prep.py`:

- Automatically filters `XR_` and `NR_` (non-coding RNA) transcripts.
- Removes transcripts < 200bp and duplicates.
- Rebuilds index with reduced complexity.
This strategy solved the Harpegnathos stall (Max EC: ~3015) by reducing index size and complexity. It is now applied automatically for any species.
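The filtering rules above can be illustrated with a small FASTA filter. This is a sketch under assumptions, not the `IndexComplexityManager` code: `filter_transcripts` is a hypothetical function, record IDs are assumed to start with the RefSeq-style prefix, and "duplicates" is interpreted here as records with identical sequences.

```python
def filter_transcripts(fasta_text, min_length=200, excluded_prefixes=("XR_", "NR_")):
    """Return FASTA text without excluded-prefix, short, or duplicate records."""
    # Parse the FASTA text into (header, sequence) pairs.
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    # Apply the three filters; the first copy of a duplicated sequence is kept.
    seen = set()
    kept = []
    for header, seq in records:
        if header.startswith(excluded_prefixes):
            continue
        if len(seq) < min_length:
            continue
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        kept.append(f">{header}\n{seq}")
    return "\n".join(kept) + ("\n" if kept else "")


example = "\n".join([
    ">XM_001 mRNA", "A" * 250,
    ">XR_002 ncRNA", "C" * 250,
    ">XM_003 short", "G" * 50,
    ">XM_004 duplicate of XM_001", "A" * 250,
])
kept_ids = [
    line[1:].split()[0]
    for line in filter_transcripts(example).splitlines()
    if line.startswith(">")
]
print(kept_ids)
# ['XM_001']
```

The filtered FASTA would then be passed to `kallisto index` to rebuild the index.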
RNA expression data can be integrated with GWAS variants for eQTL analysis:

```python
from metainformant.multiomics.analysis import integration
from metainformant.gwas.finemapping.colocalization import eqtl_coloc

# Prepare expression data for integration
rna_data = integration.from_rna_expression(
    expression_df,
    normalize=True
)

# Run colocalization with GWAS summary statistics
result = eqtl_coloc(
    gwas_z=gwas_zscores,
    eqtl_z=expression_zscores,
    gene_id="LOC123456"
)
```

See `metainformant.multiomics` for comprehensive integration methods.

- Orchestration & Performance Guide: ENA-first amalgkit streaming pipeline
- Troubleshooting & Hacks: IO contention & SRA setup fixes
- API Reference: Type signatures, error codes, data structures
- Agent Coordination Hub: Multi-agent orchestration patterns, workflows, safety
- `scripts/rna/`: Workflow scripts
- `config/amalgkit/`: Configuration files
- `metainformant.multiomics`: GWAS-expression integration