Frequently asked questions and solutions for the amalgkit RNA-seq workflow.
The pipeline uses per-sample concurrency within chunks. Each chunk of 6 samples is processed simultaneously using a ThreadPoolExecutor. For each sample, the steps are:
- getfastq: Download and extract FASTQ files
- quant: Quantify with kallisto
- cleanup: Delete FASTQ files immediately
This means as soon as one sample finishes downloading, it starts quantifying while other samples continue downloading.
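The per-sample concurrency described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `getfastq`, `quant`, `cleanup`, `process_sample`, and `process_chunk` are hypothetical stand-ins that only return strings, and the accession IDs are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for the three per-sample steps.
# The real pipeline would invoke amalgkit here instead of returning strings.
def getfastq(sample_id):
    return f"{sample_id}: FASTQ downloaded"

def quant(sample_id):
    return f"{sample_id}: quantified"

def cleanup(sample_id):
    return f"{sample_id}: FASTQ deleted"

def process_sample(sample_id):
    """Run the three steps in order for a single sample."""
    return [step(sample_id) for step in (getfastq, quant, cleanup)]

def process_chunk(sample_ids, max_workers=6):
    """Process one chunk; each sample moves through its steps independently."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_sample, s): s for s in sample_ids}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

if __name__ == "__main__":
    chunk = [f"SRR{i:07d}" for i in range(1, 7)]  # made-up accession IDs
    for sample, log in sorted(process_chunk(chunk).items()):
        print(sample, log)
```

Because each worker runs all three steps for its own sample, a sample that finishes downloading early moves on to quantification without waiting for the rest of the chunk.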
Recommended: 6 samples per chunk (default in run_all_species.sh --chunk-size 6).
| Chunk Size | Threads/Sample | Disk Usage | Stability |
|---|---|---|---|
| 4 | 4 | ~40 GB temp | Very stable |
| 6 | 2 | ~60 GB temp | Stable (default) |
| 8 | 2 | ~80 GB temp | May hit disk limits |
| 16 | 1 | ~160 GB temp | Not recommended |
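The numbers in this table are consistent with a fixed thread budget split across concurrent samples and roughly 10 GB of temp space per sample. The sketch below reproduces them under those assumptions. The 16-thread total and the 10 GB figure are inferred from the table and are not stated by the pipeline, and the function names are hypothetical.

```python
TOTAL_THREADS = 16       # assumption: inferred from the table, not documented
TEMP_GB_PER_SAMPLE = 10  # assumption: the table implies ~10 GB temp per sample

def threads_per_sample(chunk_size, total_threads=TOTAL_THREADS):
    """Split the thread budget evenly, giving each sample at least one thread."""
    return max(1, total_threads // chunk_size)

def temp_disk_gb(chunk_size, gb_per_sample=TEMP_GB_PER_SAMPLE):
    """Approximate temp space needed when every sample in a chunk is on disk."""
    return chunk_size * gb_per_sample

if __name__ == "__main__":
    for size in (4, 6, 8, 16):
        print(f"chunk={size} threads/sample={threads_per_sample(size)} "
              f"temp=~{temp_disk_gb(size)} GB")
```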
Minimum: 80 GB free for 6 concurrent samples
| Component | Size |
|---|---|
| Per sample temp (FASTQ) | 5-15 GB |
| Per sample output (quant) | ~2 MB |
| Kallisto index | ~100-500 MB |
| Reference genome | ~200-500 MB |
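As a rough pre-flight check, the per-sample FASTQ range above can be turned into a temp-space estimate and compared with the free space on the output volume. This is only a sketch: `temp_range_gb` and `has_enough_space` are hypothetical helpers, and the path to check (for example `/home`) depends on where `output/` lives.

```python
import shutil

def temp_range_gb(chunk_size, min_gb=5, max_gb=15):
    """Best and worst case temp usage if every sample in a chunk is on disk at once."""
    return chunk_size * min_gb, chunk_size * max_gb

def has_enough_space(path, required_gb):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb, free_gb

if __name__ == "__main__":
    low, high = temp_range_gb(6)
    ok, free_gb = has_enough_space(".", 80)  # 80 GB is the documented minimum
    print(f"6 samples need roughly {low}-{high} GB temp; free: {free_gb:.0f} GB; ok: {ok}")
```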
| Rate | Assessment |
|---|---|
| <4/hr | Slow - check network/disk |
| 4-8/hr | Normal |
| >10/hr | Excellent |
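To place a run in this table, divide the number of completed samples by the elapsed time in hours. The helper names below are hypothetical, and because the table does not label rates above 8/hr and up to 10/hr, the sketch reports that range explicitly rather than guessing.

```python
def samples_per_hour(completed, elapsed_seconds):
    """Throughput in samples per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completed * 3600 / elapsed_seconds

def assess(rate):
    """Map a rate onto the assessment table."""
    if rate < 4:
        return "Slow - check network/disk"
    if rate <= 8:
        return "Normal"
    if rate > 10:
        return "Excellent"
    # The table leaves this range unlabeled.
    return "Between Normal and Excellent (not listed in the table)"

if __name__ == "__main__":
    rate = samples_per_hour(completed=12, elapsed_seconds=2 * 3600)
    print(f"{rate:.1f}/hr: {assess(rate)}")
```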
Cause: fasterq-dump ran out of temp space.
Solutions:
- Reduce chunk size (6 → 4) in run_all_species.sh
- Free disk space
- These samples are logged for retry on next run
Cause: Race condition with multiple samples accessing same directory.
Solution: Reduce chunk size in run_all_species.sh or ensure sequential directory creation.
Cause: Previous download interrupted, lock file remains.
Solution: Delete stale lock files:
find output/amalgkit/*/fastq -name "*.lock" -delete

Cause: Sample download/extraction exceeded time limit.
Solutions:
- Network issues - retry later
- Very large sample (>5 GB) - may need manual download
Cause: Process crashed without cleanup.
Solution:
# Stop the pipeline
pkill -f run_all_species
# Clean orphaned files
rm -rf output/amalgkit/*/fastq/getfastq/SRR*
# Restart
nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &

The pipeline automatically resumes thanks to redo: no in the configs:
nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &

It scans quant/ for completed samples and skips them.
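The resume behaviour can be pictured with the sketch below. It is an illustration under stated assumptions, not the pipeline's implementation: it assumes each finished sample has its own subdirectory under quant/ containing a file whose name ends in `abundance.tsv`, and `completed_samples` and `pending_samples` are hypothetical helpers.

```python
from pathlib import Path

def completed_samples(quant_dir):
    """IDs of samples whose quant subdirectory holds an abundance file (assumed layout)."""
    quant_dir = Path(quant_dir)
    if not quant_dir.is_dir():
        return set()
    done = set()
    for sample_dir in quant_dir.iterdir():
        if sample_dir.is_dir() and any(sample_dir.glob("*abundance.tsv")):
            done.add(sample_dir.name)
    return done

def pending_samples(all_ids, quant_dir):
    """Samples still to process, in their original order."""
    done = completed_samples(quant_dir)
    return [s for s in all_ids if s not in done]

if __name__ == "__main__":
    quant = "output/amalgkit/apis_mellifera_all/work/quant"
    print(pending_samples(["SRR0000001", "SRR0000002"], quant))  # made-up IDs
```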
- Address the issue (usually disk space or network)
- Restart the pipeline; it will retry any sample not yet in quant/
# 1. Stop pipeline
pkill -f run_all_species
# 2. Clean all temp files
rm -rf output/amalgkit/*/fastq/getfastq/SRR*
# 3. Check disk
df -h /home
# 4. Restart with smaller chunk if needed (edit run_all_species.sh --chunk-size 4)
nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &

- Total samples: ~7,342
- Estimated time: Very long at current throughput
- Large samples: Some metagenomic samples (PRJNA1364028) are 5-7 GB each
- Total samples: 110
- Notes: Some samples may fail due to SRA issues, not workflow bugs
| Script | Purpose |
|---|---|
| scripts/rna/run_all_species.sh | Main pipeline runner (sequential species, concurrent samples) |
| scripts/rna/run_workflow.py | Per-species workflow orchestrator |
| scripts/package/generate_custom_summary.py | Progress monitoring with ETAs |
# Run full pipeline (background)
nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &
# Check progress
.venv/bin/python scripts/package/generate_custom_summary.py
# Check which processes are active
ps -fC amalgkit | grep SRR

- output/amalgkit/*/fastq/getfastq/SRR* - temp FASTQ files (always safe to delete after quant)
- output/amalgkit/*/work/quant/ - final quantification results
- output/amalgkit/*/work/metadata/ - sample metadata
- output/amalgkit/*/work/index/ - kallisto index
Use the tissue normalization script:
.venv/bin/python scripts/rna/normalize_tissue_metadata.py \
--input output/amalgkit/apis_mellifera_all/work/metadata/metadata.tsv \
--mapping config/amalgkit/tissue_mapping.yaml \
--patches config/amalgkit/tissue_patches.yaml \
--output output/amalgkit/apis_mellifera_all/work/metadata/metadata_normalized.tsv

Edit config/amalgkit/tissue_mapping.yaml:
brain:
- brain
- Brain
- whole brain
- mushroom body

Edit config/amalgkit/tissue_patches.yaml:
samples:
  SRR12345678: brain # From manual research
bioprojects:
  PRJEB100586: brain # All samples in project

- Amalgkit GitHub
- config/amalgkit/README.md - Configuration guide
- config/amalgkit/AGENTS.md - Agent directives
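To illustrate how the tissue mapping and patch files shown earlier might be combined, here is a minimal sketch. It is not the logic of normalize_tissue_metadata.py: the precedence (sample patch, then bioproject patch, then mapping), the exact-string matching, and the function names are all assumptions, and the YAML contents are mirrored as plain dictionaries so the example has no dependencies.

```python
# Mirrors the example YAML files as plain dicts (no YAML parser needed).
TISSUE_MAPPING = {"brain": ["brain", "Brain", "whole brain", "mushroom body"]}
PATCHES = {
    "samples": {"SRR12345678": "brain"},
    "bioprojects": {"PRJEB100586": "brain"},
}

def build_lookup(mapping):
    """Invert {canonical: [synonyms]} into {synonym: canonical}."""
    return {syn: canon for canon, syns in mapping.items() for syn in syns}

def normalize_tissue(run, bioproject, raw_tissue, lookup, patches):
    """Assumed precedence: sample patch > bioproject patch > synonym mapping."""
    if run in patches.get("samples", {}):
        return patches["samples"][run]
    if bioproject in patches.get("bioprojects", {}):
        return patches["bioprojects"][bioproject]
    return lookup.get(raw_tissue)  # None when the raw label is unmapped

if __name__ == "__main__":
    lookup = build_lookup(TISSUE_MAPPING)
    # Made-up run and project IDs for the unpatched case.
    print(normalize_tissue("SRR00000001", "PRJNA000001", "whole brain", lookup, PATCHES))
    print(normalize_tissue("SRR12345678", "PRJNA000001", "unknown", lookup, PATCHES))
```

Returning None for unmapped labels keeps unknown tissues visible, so they can be added to the mapping or patch file rather than silently guessed.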