RNA Module

Overview

Core RNA-seq analysis and workflow orchestration for METAINFORMANT.

Table of Contents

- Architecture
- Submodules
- Key Classes
  - Workflow Engine
  - Amalgkit Wrapper
- Usage
- Workflow Steps
- Download Strategy: ENA-First with NCBI Fallback
  - SRA Cache Management (The Docker Overlay Danger)
- Index Complexity Management
- GWAS Integration
- Related

Architecture

graph TD
    subgraph "RNA Module"
        E[engine/] --> |workflow.py| W[Workflow Execution]
        E --> |monitoring.py| M[Progress Monitoring]
        E --> |discovery.py| D[Species Discovery]
        E --> |streaming_orchestrator.py| SO[Multi-Species Pipeline]
        
        A[amalgkit/] --> |amalgkit.py| AK[Amalgkit Wrapper]
        A --> |genome_prep.py| G[Genome Preparation]
        A --> |metadata_filter.py| MD[Metadata Handling]
        
        C[core/] --> |configs.py| CF[Configuration]
        C --> |cleanup.py| CL[Cleanup Utilities]
        
        R[retrieval/] --> |ena_downloader.py| ENA[ENA Download]
        
        AN[analysis/] --> |expression_core.py| EX[Expression Analysis]
    end

Submodules

| Module | Purpose |
|--------|---------|
| `engine/` | Workflow execution, monitoring, orchestration |
| `amalgkit/` | Amalgkit tool wrapper and API |
| `core/` | Configuration, cleanup, dependencies |
| `retrieval/` | ENA FASTQ data retrieval |
| `analysis/` | Expression matrix analysis, QC, validation |
| `deconvolution/` | Cell-type deconvolution from bulk RNA-seq |
| `splicing/` | Alternative splicing analysis |

Key Classes

Workflow Engine

- `AmalgkitWorkflowConfig`: Workflow configuration loaded from YAML
- `StreamingPipelineOrchestrator`: Multi-species ENA-first orchestrator
- `StreamingPipeline`: Per-sample download → quant → cleanup pipeline
- `ProgressTracker`: Real-time progress state management

Amalgkit Wrapper

- `AmalgkitParams`: Typed parameter container for amalgkit CLI calls
- `build_amalgkit_command()`: CLI command builder
- `run_amalgkit()`: Executes any amalgkit step
- `GenomePreparator`: Reference genome download and kallisto indexing
- `TissueNormalizer`: Tissue label normalization via mappings
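To illustrate the idea of a typed parameter container feeding a command builder, here is a minimal sketch. It is not the module's actual `AmalgkitParams` or `build_amalgkit_command()`: the names `SketchParams` and `build_command_sketch`, the fields, and the flag spellings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SketchParams:
    """Hypothetical typed container; not the real AmalgkitParams."""

    out_dir: str
    threads: int = 1
    # Any additional options, keyed by flag name without the leading dashes.
    extra: dict[str, str] = field(default_factory=dict)


def build_command_sketch(step: str, params: SketchParams) -> list[str]:
    """Turn a step name and typed parameters into an argv list."""
    # Flag spellings here are illustrative assumptions.
    command = [
        "amalgkit", step,
        "--out_dir", params.out_dir,
        "--threads", str(params.threads),
    ]
    # Sort the extra options so the command line is reproducible.
    for name, value in sorted(params.extra.items()):
        command += [f"--{name}", value]
    return command


cmd = build_command_sketch("quant", SketchParams(out_dir="output/amalgkit", threads=8))
print(" ".join(cmd))
```

Building an argv list rather than a shell string avoids quoting problems when the command is later passed to `subprocess.run`.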

Usage

from metainformant.rna.engine.workflow import AmalgkitWorkflowConfig, execute_workflow

# Load configuration
config = AmalgkitWorkflowConfig.load("config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml")

# Execute workflow
result = execute_workflow(config, steps=["getfastq", "quant", "merge"])

Workflow Steps

| Step | Description |
|------|-------------|
| `metadata` | Fetch sample metadata from NCBI |
| `select` | Filter to valid RNA-seq samples |
| `getfastq` | Download SRA → extract FASTQ |
| `quant` | Quantify with kallisto |
| `merge` | Combine abundance files |
| `curate` | Quality control and filtering |
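When only a subset of steps is requested, it can help to validate the names and run them in a fixed order. The sketch below assumes the order of the table above is the execution order; `STEP_ORDER` and `order_steps` are hypothetical helpers, not part of the module.

```python
# Assumed canonical order, taken from the table above.
STEP_ORDER = ["metadata", "select", "getfastq", "quant", "merge", "curate"]


def order_steps(requested: list[str]) -> list[str]:
    """Validate step names and return them in canonical order."""
    unknown = [step for step in requested if step not in STEP_ORDER]
    if unknown:
        raise ValueError(f"Unknown workflow steps: {unknown}")
    wanted = set(requested)
    return [step for step in STEP_ORDER if step in wanted]


print(order_steps(["merge", "getfastq", "quant"]))
```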

Download Strategy: ENA-First with NCBI Fallback

All species use a two-tier download strategy managed by the StreamingPipelineOrchestrator:

1. ENA primary: Direct FTP/HTTP downloads of `.fastq.gz` files from the European Nucleotide Archive, bypassing the slow prefetch + fasterq-dump extraction.
2. NCBI fallback: If the ENA download fails, the pipeline falls back to fasterq-dump from NCBI SRA.
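The two-tier logic can be sketched as a small wrapper that tries one downloader and switches to the other on failure. This is an illustration only, not the orchestrator's implementation: `download_with_fallback`, the stand-in downloaders, and the accessions are hypothetical.

```python
from typing import Callable

# A downloader takes an accession and returns the path of the FASTQ it wrote.
Downloader = Callable[[str], str]


def download_with_fallback(
    accession: str, primary: Downloader, fallback: Downloader
) -> tuple[str, str]:
    """Try the primary source first; use the fallback only on failure."""
    try:
        return "ENA", primary(accession)
    except Exception as error:  # a real pipeline would catch narrower errors
        print(f"ENA failed for {accession}: {error}; falling back to NCBI")
        return "NCBI", fallback(accession)


def fake_ena(accession: str) -> str:
    """Stand-in for an ENA download; fails for accessions ending in '2'."""
    if accession.endswith("2"):
        raise ConnectionError("simulated timeout")
    return f"{accession}.fastq.gz"


def fake_ncbi(accession: str) -> str:
    """Stand-in for a fasterq-dump extraction."""
    return f"{accession}.fastq"


results = [
    download_with_fallback(acc, fake_ena, fake_ncbi)
    for acc in ["SRR0000001", "SRR0000002"]
]
print(results)
```

Passing the downloaders in as callables keeps the fallback logic testable without network access.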

- Entry point: `scripts/rna/run_workflow.py` → `StreamingPipelineOrchestrator`
- Concurrency: Up to 16 parallel workers with SQLite-backed progress tracking
- Scheduling: Size-ordered (smallest samples first) for maximum throughput
- Monitoring: Real-time TUI via `scripts/rna/monitor_tui.py`
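A rough sketch of size-ordered scheduling with a bounded worker pool and a SQLite progress table follows. The table schema, the `process_sample` placeholder, and the function names are assumptions for illustration; the actual orchestrator may be structured quite differently.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16  # upper bound stated above


def process_sample(sample: tuple[str, int]) -> str:
    """Placeholder for the per-sample download -> quant -> cleanup work."""
    accession, _size_bytes = sample
    return accession


def run_size_ordered(samples: list[tuple[str, int]], db_path: str = ":memory:"):
    """Process samples smallest-first and record completion in SQLite."""
    ordered = sorted(samples, key=lambda sample: sample[1])
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(accession TEXT PRIMARY KEY, status TEXT)"
    )
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # map() submits tasks in iteration order, so small samples start first.
        for accession in pool.map(process_sample, ordered):
            # Only the main thread writes, so one connection is sufficient.
            db.execute(
                "INSERT OR REPLACE INTO progress VALUES (?, 'done')",
                (accession,),
            )
    db.commit()
    done = [
        row[0]
        for row in db.execute(
            "SELECT accession FROM progress WHERE status = 'done' ORDER BY accession"
        )
    ]
    db.close()
    return [sample[0] for sample in ordered], done


order, done = run_size_ordered([("A", 300), ("B", 100), ("C", 200)])
print(order, done)
```

Persisting status in a database file instead of `:memory:` would let an interrupted run skip samples already marked as done.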

SRA Cache Management (The Docker Overlay Danger)

When the NCBI fallback is triggered across many parallel workers, fasterq-dump writes to internal scratch directories. In a containerized Docker context (ghcr.io/docxology/metainformant/pipeline), this dumps hundreds of gigabytes into the unmapped overlay filesystem (/app/fasterq.tmp.* and /tmp/sra-cache/), which lives on the VM's OS disk (/dev/root) rather than on the mapped data volumes. If the root partition reaches 100% usage, the entire pipeline deadlocks and OS calls fail silently. Recovery requires rebooting the VM and manually purging these directories (rm -rf).
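One possible mitigation, sketched here under assumptions, is to point fasterq-dump's scratch space at a mapped volume with its `--temp` option and to check free space before starting a fallback extraction. The helper names, the 10% threshold, and the `/data/...` paths are hypothetical. This sketch does not relocate the `/tmp/sra-cache/` cache.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def can_start_fallback(temp_dir: str, min_free: float = 0.10) -> bool:
    """Refuse to start an extraction when scratch space is nearly full."""
    return free_fraction(temp_dir) >= min_free


def build_fasterq_dump_command(
    accession: str, out_dir: str, temp_dir: str, threads: int = 4
) -> list[str]:
    """Build an argv list that keeps scratch files off the overlay filesystem."""
    return [
        "fasterq-dump", accession,
        "--outdir", out_dir,    # where the FASTQ files are written
        "--temp", temp_dir,     # scratch directory (defaults to the current dir)
        "--threads", str(threads),
    ]


# Hypothetical accession and mapped-volume paths.
command = build_fasterq_dump_command("SRR0000001", "/data/fastq", "/data/scratch")
print(" ".join(command))
print("enough free space here:", can_start_fallback("."))
```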

Index Complexity Management

For genomes with high repetitive content (e.g., Harpegnathos saltator), a standard kallisto index may cause quantification to stall.

Symptoms:

- kallisto quant processes hang indefinitely at 100% CPU.
- Max EC size > 3000 in the index stats.

Solution (IndexComplexityManager in amalgkit/index_prep.py):

- Automatically filters out XR_ and NR_ (non-coding RNA) transcripts.
- Removes transcripts shorter than 200 bp and duplicates.
- Rebuilds the index with reduced complexity.

This strategy resolved the Harpegnathos stall (Max EC: ~3015) by reducing index size and complexity. It is now applied automatically for every species.
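The filtering rules above can be sketched in plain Python. This is not the actual `IndexComplexityManager`: the function names are hypothetical, and it assumes that "duplicates" means transcripts with identical sequences and that accession prefixes appear at the start of each FASTA header.

```python
def read_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def reduce_complexity(records, min_length=200, skip_prefixes=("XR_", "NR_")):
    """Drop non-coding, short, and duplicate-sequence transcripts."""
    kept, seen = [], set()
    for header, sequence in records:
        if header.startswith(skip_prefixes):
            continue  # non-coding RNA accession
        if len(sequence) < min_length:
            continue  # shorter than the length threshold
        key = sequence.upper()
        if key in seen:
            continue  # identical sequence already kept
        seen.add(key)
        kept.append((header, sequence))
    return kept


# Toy input with made-up accessions, for illustration only.
fasta = "\n".join([
    ">XM_000001.1 coding", "A" * 250,
    ">XR_000002.1 ncRNA", "C" * 250,
    ">NR_000003.1 ncRNA", "G" * 250,
    ">XM_000004.1 short", "T" * 50,
    ">XM_000005.1 duplicate", "A" * 250,
])
kept = reduce_complexity(read_fasta(fasta))
print([header.split()[0] for header, _ in kept])
```

The filtered records would then be written back to FASTA and passed to `kallisto index` to rebuild the index.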

GWAS Integration

RNA expression data can be integrated with GWAS variants for eQTL analysis:

from metainformant.multiomics.analysis import integration
from metainformant.gwas.finemapping.colocalization import eqtl_coloc

# Prepare expression data for integration
rna_data = integration.from_rna_expression(
    expression_df,
    normalize=True
)

# Run colocalization with GWAS summary statistics
result = eqtl_coloc(
    gwas_z=gwas_zscores,
    eqtl_z=expression_zscores,
    gene_id="LOC123456"
)
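The snippet above takes `gwas_zscores` and `expression_zscores` as given. One way such inputs could be prepared is sketched below, assuming the two vectors must cover the same variants in the same order and that z-scores are computed as effect size divided by standard error. The helper names and the numbers are made up for illustration.

```python
def zscores(effects: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map each variant ID to beta / standard error."""
    return {variant: beta / se for variant, (beta, se) in effects.items()}


def align(gwas: dict[str, float], eqtl: dict[str, float]):
    """Keep only variants present in both studies, in a shared order."""
    shared = sorted(set(gwas) & set(eqtl))
    return shared, [gwas[v] for v in shared], [eqtl[v] for v in shared]


# Toy (beta, standard error) pairs per variant; not real results.
gwas_effects = {"rs1": (1.0, 0.5), "rs2": (-0.5, 0.25), "rs3": (0.75, 0.25)}
eqtl_effects = {"rs2": (1.0, 0.25), "rs3": (0.5, 0.5), "rs4": (0.25, 0.125)}

variants, gwas_z, eqtl_z = align(zscores(gwas_effects), zscores(eqtl_effects))
print(variants, gwas_z, eqtl_z)
```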

See `metainformant.multiomics` for comprehensive integration methods.

Related

- Orchestration & Performance Guide: ENA-first amalgkit streaming pipeline
- Troubleshooting & Hacks: IO contention and SRA setup fixes
- API Reference: Type signatures, error codes, data structures
- Agent Coordination Hub: Multi-agent orchestration patterns, workflows, safety
- `scripts/rna/`: Workflow scripts
- `config/amalgkit/`: Configuration files
- `config/amalgkit/amalgkit_faq.md`: FAQ
- `metainformant.multiomics`: GWAS-expression integration
