Amalgkit Integration

METAINFORMANT wraps amalgkit, a command-line toolkit for large-scale RNA-seq meta-analysis, and orchestrates all 11 processing steps across 23 ant and bee species.

What Amalgkit Does

Amalgkit handles the full pipeline from public sequence databases to a cross-species expression matrix:

Downloads SRA metadata from NCBI
Filters samples by quality criteria
Downloads FASTQ files (in this project, directly from ENA with wget)
Quantifies with kallisto pseudoalignment
Merges per-sample abundances into expression matrices
Normalizes across species (CSTMM) and runs cross-species correlation analysis (CSCA)

How METAINFORMANT Uses It

scripts/rna/run_all_species.py
 iterates species configs → run_workflow.py per species
 per sample: download_ena.py (wget) → amalgkit quant → delete FASTQ
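
The per-sample line of the diagram keeps disk usage bounded: only one sample's FASTQ files exist at a time. A minimal Python sketch of that pattern follows. It is not the project's implementation; `process_samples`, `download`, and `quantify` are hypothetical names, with the two callables standing in for download_ena.py and `amalgkit quant`. Deleting the FASTQ directory even when a step fails is an assumption of this sketch.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable


def process_samples(
    run_ids: Iterable[str],
    fastq_root: Path,
    download: Callable[[str, Path], None],
    quantify: Callable[[str, Path], None],
) -> list[str]:
    """Download, quantify, and clean up one sample at a time.

    `download` and `quantify` are hypothetical stand-ins for the
    project's download and quantification steps.
    """
    completed = []
    for run_id in run_ids:
        sample_dir = fastq_root / run_id
        sample_dir.mkdir(parents=True, exist_ok=True)
        try:
            download(run_id, sample_dir)   # fetch FASTQ into sample_dir
            quantify(run_id, sample_dir)   # quantify from the local FASTQ
            completed.append(run_id)
        finally:
            # Assumption: FASTQ is removed whether or not the sample
            # succeeded, so a failure never leaves large files behind.
            shutil.rmtree(sample_dir, ignore_errors=True)
    return completed
```

Because cleanup sits in `finally`, an exception from either step still propagates to the caller after the sample directory is removed.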

The orchestrator (run_all_species.py) drives sequential per-species execution. Each species config lives in config/amalgkit/amalgkit_<species>.yaml.
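
As a rough illustration of the sequential orchestration described above, the sketch below discovers species configs by the `amalgkit_<species>.yaml` naming pattern and runs run_workflow.py once per config. The function names are hypothetical and the real run_all_species.py may select or order configs differently; the glob also assumes every matching file in the directory is a species config.

```python
import subprocess
import sys
from pathlib import Path

CONFIG_DIR = Path("config/amalgkit")
WORKFLOW_SCRIPT = Path("scripts/rna/run_workflow.py")


def find_species_configs(config_dir: Path = CONFIG_DIR) -> list[Path]:
    """Return configs matching amalgkit_<species>.yaml, sorted by name."""
    return sorted(config_dir.glob("amalgkit_*.yaml"))


def species_name(config: Path) -> str:
    """Extract <species> from an amalgkit_<species>.yaml path."""
    return config.stem.removeprefix("amalgkit_")


def build_commands(config_dir: Path = CONFIG_DIR) -> list[list[str]]:
    """Build one run_workflow.py command line per species config."""
    return [
        [sys.executable, str(WORKFLOW_SCRIPT), str(config)]
        for config in find_species_configs(config_dir)
    ]


def run_all(config_dir: Path = CONFIG_DIR) -> dict[str, int]:
    """Run each species in turn and record its exit code."""
    results = {}
    for config in find_species_configs(config_dir):
        cmd = [sys.executable, str(WORKFLOW_SCRIPT), str(config)]
        # Sequential on purpose: the next species starts only after
        # the current one exits.
        results[species_name(config)] = subprocess.run(cmd).returncode
    return results
```

Calling `build_commands()` first is a cheap way to preview which species would run before starting a long job with `run_all()`.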

Directory Structure

docs/rna/amalgkit/
 README.md ← This file
 guide.md ← Quick-start guide
 amalgkit.md ← Wrapper details and configuration
 commands.md ← Genome setup script reference
 cross_species_pipeline.md ← CSTMM + CSCA cross-species steps
 monitoring.md ← How to check pipeline progress
 genome_preparation.md ← Genome download and kallisto index build
 genome_setup_guide.md ← Step-by-step genome setup
 FUNCTIONS.md ← Python function index
 PATH_RESOLUTION.md ← Path resolution reference
 R_INSTALLATION.md ← R environment setup
 r_packages.md ← R package management
 testing_coverage.md ← Test coverage details
 AGENTS.md ← AI contribution notes
 PAI.md ← Persistent AI context
 steps/ ← Per-step documentation (01–11)

11-Step Pipeline

| # | Step | Purpose | Key Output |
|---|------|---------|------------|
| 1 | `metadata` | Fetch SRA sample metadata from NCBI | `work/metadata/metadata.tsv` |
| 2 | `config` | Generate amalgkit config files | `work/config_base/` |
| 3 | `select` | Filter samples by quality/tissue criteria | `work/metadata/pivot_qualified.tsv` |
| 4 | `getfastq` | Download FASTQ from ENA (wget) | `fastq/getfastq/<SRR>/` |
| 5 | `integrate` | Integrate local FASTQ paths into metadata | `work/metadata/metadata_integrated.tsv` |
| 6 | `quant` | Quantify with kallisto (per sample) | `work/quant/<SRR>/abundance.tsv` |
| 7 | `merge` | Merge sample abundances into matrix | `merged/merged_abundance.tsv` |
| 8 | `cstmm` | Cross-species TMM normalization | `cstmm/` |
| 9 | `curate` | Quality control, outlier removal | `curate/` |
| 10 | `csca` | Cross-species correlation analysis | `csca/` |
| 11 | `sanity` | Validate outputs | `work/sanity/` |

See steps/README.md for detailed per-step documentation.
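
The key outputs in the table can double as a quick progress check. The sketch below reports, per step, whether the listed output exists. The table does not say what the paths are relative to, so the sketch assumes a single per-species output directory passed in as `root`; `KEY_OUTPUTS`, `output_present`, and `step_status` are hypothetical names, and `*` stands in for `<SRR>`.

```python
from pathlib import Path

# (step, key output) pairs copied from the table above.
# "*" replaces the <SRR> placeholder so any run accession matches.
KEY_OUTPUTS = [
    ("metadata", "work/metadata/metadata.tsv"),
    ("config", "work/config_base"),
    ("select", "work/metadata/pivot_qualified.tsv"),
    ("getfastq", "fastq/getfastq/*"),
    ("integrate", "work/metadata/metadata_integrated.tsv"),
    ("quant", "work/quant/*/abundance.tsv"),
    ("merge", "merged/merged_abundance.tsv"),
    ("cstmm", "cstmm"),
    ("curate", "curate"),
    ("csca", "csca"),
    ("sanity", "work/sanity"),
]


def output_present(root: Path, pattern: str) -> bool:
    """True if the path exists or, for wildcards, at least one match exists."""
    if "*" in pattern:
        return any(root.glob(pattern))
    return (root / pattern).exists()


def step_status(root: Path) -> dict[str, bool]:
    """Map each step name to whether its key output is present under root."""
    return {step: output_present(root, pattern) for step, pattern in KEY_OUTPUTS}
```

Two caveats: an existing but empty directory counts as present, and a missing `getfastq` output does not mean the step never ran, since FASTQ files are deleted after quantification.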

Installation

# Install amalgkit into the project venv
uv pip install git+https://github.com/kfuku52/amalgkit

# Verify
amalgkit --help
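
The same verification can run from Python, for example as a pre-flight check before a long job. This is a small sketch rather than part of the project; `cli_available` is a hypothetical helper that looks a command up on `PATH` and confirms it runs with the given flag.

```python
import shutil
import subprocess


def cli_available(name: str, flag: str = "--help") -> bool:
    """Return True if `name` is on PATH and exits with status 0 for `flag`."""
    executable = shutil.which(name)
    if executable is None:
        return False
    result = subprocess.run([executable, flag], capture_output=True, text=True)
    return result.returncode == 0
```

For this project the call would be `cli_available("amalgkit")`, mirroring the `amalgkit --help` check above.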

Quick Reference

# Run a single species end-to-end
python3 scripts/rna/run_workflow.py config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

# Run all 23 species sequentially (background)
nohup python3 scripts/rna/run_all_species.py \
  > output/amalgkit/run_all_species_incremental.log 2>&1 &

# Check status
python3 scripts/rna/run_workflow.py config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml --status

# Monitor live progress
tail -f output/amalgkit/run_all_species_incremental.log
ps aux | grep wget | grep -v grep | wc -l  # active download workers
python3 scripts/rna/report_completed.py     # quant counts per species
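
For intuition about what a per-species quant count involves, the sketch below counts non-empty `abundance.tsv` files under each species directory. It is not the code of report_completed.py. The layout `<output_root>/<species>/work/quant/<SRR>/abundance.tsv` is an assumption pieced together from the pipeline table, and both function names are hypothetical.

```python
from pathlib import Path


def count_quantified(species_dir: Path) -> int:
    """Count samples with a non-empty abundance.tsv under work/quant/."""
    return sum(
        1
        for path in species_dir.glob("work/quant/*/abundance.tsv")
        if path.stat().st_size > 0
    )


def quant_counts(output_root: Path) -> dict[str, int]:
    """Map each species directory name to its quantified-sample count."""
    return {
        species_dir.name: count_quantified(species_dir)
        for species_dir in sorted(output_root.iterdir())
        if species_dir.is_dir()  # skip log files and other stray entries
    }
```

Skipping zero-byte files avoids counting samples whose quantification was interrupted before any output was written.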
