Amalgkit FAQ

Frequently asked questions and solutions for the amalgkit RNA-seq workflow.

Performance & Optimization

Q: How does the pipeline process samples?

The pipeline uses per-sample concurrency within chunks. Each chunk of 6 samples is processed simultaneously using a ThreadPoolExecutor. For each sample, the steps are:

getfastq — Download and extract FASTQ files
quant — Quantify with kallisto
cleanup — Delete FASTQ files immediately

This means as soon as one sample finishes downloading, it starts quantifying while other samples continue downloading.
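
The flow above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names (`getfastq`, `quant`, `cleanup`, `process_sample`, `run`) are hypothetical stand-ins for the real steps, and it assumes a chunk finishes before the next one starts.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 6  # samples processed concurrently per chunk


# Placeholder steps: the real pipeline calls amalgkit here (assumption).
def getfastq(sample):
    return f"{sample}: fastq ready"


def quant(sample):
    return f"{sample}: quantified"


def cleanup(sample):
    return f"{sample}: fastq deleted"


def process_sample(sample):
    """Run the three steps for one sample, strictly in order."""
    return [step(sample) for step in (getfastq, quant, cleanup)]


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run(samples, chunk_size=CHUNK_SIZE):
    """Process samples chunk by chunk, one worker thread per sample."""
    results = {}
    for chunk in chunks(samples, chunk_size):
        with ThreadPoolExecutor(max_workers=chunk_size) as pool:
            # Each thread moves its own sample through all three steps,
            # so one sample can quantify while others still download.
            for sample, log in zip(chunk, pool.map(process_sample, chunk)):
                results[sample] = log
    return results
```

Because each worker owns one sample from download through cleanup, a fast sample never waits for a slow neighbour to finish downloading.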

Q: How many concurrent samples should I use?

Recommended: 6 samples per chunk (default in run_all_species.sh --chunk-size 6).

Chunk Size	Threads/Sample	Disk Usage	Stability
4	4	~40 GB temp	Very stable
6	2	~60 GB temp	Stable (default)
8	2	~80 GB temp	May hit disk limits
16	1	~160 GB temp	Not recommended
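
The rows above follow a simple pattern, sketched here as a rough sizing helper. Two assumptions are inferred from the table rather than stated anywhere: about 10 GB of temp space per concurrent sample, and a budget of 16 threads split across the chunk. The names `plan` and `largest_chunk_that_fits` are illustrative only.

```python
TEMP_GB_PER_SAMPLE = 10  # assumption: table rows scale at ~10 GB per sample
TOTAL_THREADS = 16       # assumption: inferred from chunk size x threads/sample


def plan(chunk_size):
    """Estimate threads per sample and temp disk use for a chunk size."""
    return {
        "chunk_size": chunk_size,
        "threads_per_sample": max(1, TOTAL_THREADS // chunk_size),
        "temp_gb": chunk_size * TEMP_GB_PER_SAMPLE,
    }


def largest_chunk_that_fits(free_gb, candidates=(4, 6, 8, 16)):
    """Return the largest candidate chunk size whose temp use fits, or None."""
    fitting = [c for c in candidates if c * TEMP_GB_PER_SAMPLE <= free_gb]
    return max(fitting) if fitting else None
```

Fitting on disk is only one constraint; the Stability column still argues for staying at 6 or below even when more space is available.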

Q: How much disk space do I need?

Minimum: 80 GB free for 6 concurrent samples

Component	Size
Per sample temp (FASTQ)	5-15 GB
Per sample output (quant)	~2 MB
Kallisto index	~100-500 MB
Reference genome	~200-500 MB
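
A quick pre-flight check along these lines can catch a full disk before a run starts. This is a sketch using only the Python standard library; the threshold and per-sample range come from the figures above, while the function names are hypothetical and not part of the pipeline.

```python
import shutil

MIN_FREE_GB = 80               # minimum quoted above for 6 concurrent samples
FASTQ_GB_PER_SAMPLE = (5, 15)  # per-sample temp range from the table


def temp_range_gb(concurrent_samples):
    """Best-case and worst-case temp FASTQ usage for N concurrent samples."""
    low, high = FASTQ_GB_PER_SAMPLE
    return concurrent_samples * low, concurrent_samples * high


def free_gb(path="."):
    """Free space on the filesystem holding `path` (GiB, fine for a rough check)."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path=".", minimum_gb=MIN_FREE_GB):
    """True if the filesystem holding `path` has at least `minimum_gb` free."""
    return free_gb(path) >= minimum_gb
```

For example, `temp_range_gb(6)` gives `(30, 90)`, so the 80 GB minimum covers most runs but not a chunk made up entirely of the largest samples.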

Q: What's a good processing rate?

Rate (samples/hr)	Assessment
<4/hr	Slow - check network/disk
4-8/hr	Normal
>10/hr	Excellent

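To compare a run against these bands, the rate can be computed from a completed-sample count and the elapsed time. The sketch below is illustrative only; the function names are hypothetical, and because the table does not label rates from 8 to 10 per hour, that range is reported as unlabeled rather than guessed.

```python
def samples_per_hour(completed, elapsed_seconds):
    """Convert a completed-sample count and elapsed time into samples/hr."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completed * 3600 / elapsed_seconds


def assess(rate):
    """Map a samples/hr rate onto the assessment bands from the table."""
    if rate < 4:
        return "Slow - check network/disk"
    if rate <= 8:
        return "Normal"
    if rate > 10:
        return "Excellent"
    # The table leaves 8-10/hr unlabeled, so no assessment is invented here.
    return "Above normal (8-10/hr is not labeled in the table)"
```

For instance, 12 samples finished in two hours is 6 samples/hr, which falls in the Normal band.
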
Common Errors

Q: "disk-limit exceeded" during extraction

Cause: fasterq-dump ran out of temp space.

Solutions:

Reduce chunk size (6 → 4) in run_all_species.sh
Free disk space
Affected samples are logged and retried on the next run

Q: "prefetch failed: path not found"

Cause: Race condition with multiple samples accessing the same directory.

Solution: Reduce chunk size in run_all_species.sh or ensure sequential directory creation.

Q: "prefetch failed: Lock file exists"

Cause: Previous download interrupted, lock file remains.

Solution: Delete stale lock files:

find output/amalgkit/*/fastq -name "*.lock" -delete
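
The find command removes every lock file, including any held by a download that is still running. A more cautious variant, sketched here, lists only locks older than a cutoff and deletes nothing unless asked. The one-hour cutoff and the function names are assumptions for illustration, not pipeline behaviour.

```python
import time
from pathlib import Path


def find_stale_locks(root, max_age_seconds=3600, now=None):
    """Return *.lock files under <root>/*/fastq older than the cutoff."""
    now = time.time() if now is None else now
    stale = []
    # "**" also matches zero directories, so locks directly in fastq/ count.
    for lock in Path(root).glob("*/fastq/**/*.lock"):
        if lock.is_file() and now - lock.stat().st_mtime > max_age_seconds:
            stale.append(lock)
    return sorted(stale)


def delete_locks(locks, dry_run=True):
    """Delete the given lock files unless dry_run is set; return the count."""
    for lock in locks:
        if not dry_run:
            lock.unlink()
    return len(locks)
```

Calling `find_stale_locks("output/amalgkit")` first and reviewing the list before passing it to `delete_locks(..., dry_run=False)` keeps active downloads intact.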

Q: "Timeout" errors

Cause: Sample download/extraction exceeded time limit.

Solutions:

Network issues - retry later
Very large sample (>5 GB) - may need manual download

Q: Samples stuck for hours

Cause: Process crashed without cleanup.

Solution:

# Stop the pipeline
pkill -f run_all_species

# Clean orphaned files
rm -rf output/amalgkit/*/fastq/getfastq/SRR*

# Restart
nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &

Recovery Procedures

Q: How do I resume after a crash?

The pipeline resumes automatically because the configs set redo: no. Rerun the same command:

nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &

It scans quant/ for completed samples and skips them.
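
The skip logic can be pictured roughly as follows. This sketch assumes one non-empty subdirectory per finished sample under quant/, which may not match the real output layout exactly; `completed_samples` and `pending_samples` are hypothetical names.

```python
from pathlib import Path


def completed_samples(quant_dir):
    """Return IDs of samples that already have a non-empty quant directory."""
    quant_dir = Path(quant_dir)
    if not quant_dir.is_dir():
        return set()
    return {
        entry.name
        for entry in quant_dir.iterdir()
        if entry.is_dir() and any(entry.iterdir())
    }


def pending_samples(all_samples, quant_dir):
    """Keep the original order, dropping samples that are already quantified."""
    done = completed_samples(quant_dir)
    return [sample for sample in all_samples if sample not in done]
```

Treating an empty directory as unfinished means a sample that crashed before writing any output is picked up again on the next run.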

Q: How do I retry failed samples?

Address the issue (usually disk space or network)
Restart the pipeline — it will retry any sample not yet in quant/

Q: How do I recover from disk full?

# 1. Stop pipeline
pkill -f run_all_species

# 2. Clean all temp files
rm -rf output/amalgkit/*/fastq/getfastq/SRR*

# 3. Check disk
df -h /home

# 4. Restart with smaller chunk if needed (edit run_all_species.sh --chunk-size 4)
nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &

Species-Specific Notes

Apis mellifera (Honeybee)

Total samples: ~7,342
Estimated time: very long; at the 4-8 samples/hr rated as normal above, roughly 38-76 days of continuous running
Large samples: Some metagenomic samples (PRJNA1364028) are 5-7 GB each

Pogonomyrmex barbatus (Harvester Ant)

Total samples: 110
Notes: Some samples may fail due to SRA issues, not workflow bugs

Scripts Reference

Script	Purpose
`scripts/rna/run_all_species.sh`	Main pipeline runner (sequential species, concurrent samples)
`scripts/rna/run_workflow.py`	Per-species workflow orchestrator
`scripts/package/generate_custom_summary.py`	Progress monitoring with ETAs

Running the Pipeline

# Run full pipeline (background)
nohup bash scripts/rna/run_all_species.sh > output/amalgkit/run_all_species_incremental.log 2>&1 &

# Check progress
.venv/bin/python scripts/package/generate_custom_summary.py

# Check which processes are active
ps -fC amalgkit | grep SRR

Disk Cleanup

Safe to delete

output/amalgkit/*/fastq/getfastq/SRR* - temp FASTQ files (always safe after quant)

NOT safe to delete

output/amalgkit/*/work/quant/ - final quantification results
output/amalgkit/*/work/metadata/ - sample metadata
output/amalgkit/*/work/index/ - kallisto index

Tissue Normalization

Q: How do I normalize tissue metadata?

Use the tissue normalization script:

.venv/bin/python scripts/rna/normalize_tissue_metadata.py \
  --input output/amalgkit/apis_mellifera_all/work/metadata/metadata.tsv \
  --mapping config/amalgkit/tissue_mapping.yaml \
  --patches config/amalgkit/tissue_patches.yaml \
  --output output/amalgkit/apis_mellifera_all/work/metadata/metadata_normalized.tsv

Q: How do I add tissue mappings?

Edit config/amalgkit/tissue_mapping.yaml:

brain:
  - brain
  - Brain
  - whole brain
  - mushroom body

Q: How do I patch missing tissue values?

Edit config/amalgkit/tissue_patches.yaml:

samples:
  SRR12345678: brain  # From manual research
bioprojects:
  PRJEB100586: brain  # All samples in project
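
How the two files might combine is sketched below with plain dictionaries that mirror the YAML shown above. The precedence is an assumption, not the documented behaviour of normalize_tissue_metadata.py: the mapping is tried on the raw value first, and patches only fill values that are missing or unmapped, with a sample patch taking priority over a bioproject patch. Matching is exact, since the mapping lists both "brain" and "Brain".

```python
# Mirrors tissue_mapping.yaml: canonical tissue -> raw metadata values.
MAPPING = {"brain": ["brain", "Brain", "whole brain", "mushroom body"]}

# Mirrors tissue_patches.yaml.
PATCHES = {
    "samples": {"SRR12345678": "brain"},
    "bioprojects": {"PRJEB100586": "brain"},
}


def build_lookup(mapping):
    """Invert the mapping into raw value -> canonical tissue."""
    return {raw: canon for canon, raws in mapping.items() for raw in raws}


def normalize(run, bioproject, raw_tissue, lookup, patches):
    """Resolve a tissue label; returns None when nothing applies."""
    if raw_tissue and raw_tissue in lookup:
        return lookup[raw_tissue]
    # Assumed precedence: sample-level patch beats bioproject-level patch.
    if run in patches.get("samples", {}):
        return patches["samples"][run]
    return patches.get("bioprojects", {}).get(bioproject)
```

With these inputs, a raw value of "mushroom body" resolves to "brain", and a run in PRJEB100586 with no tissue value also resolves to "brain" through the bioproject patch.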

Related Resources

Amalgkit GitHub
config/amalgkit/README.md - Configuration guide
config/amalgkit/AGENTS.md - Agent directives
