The eQTL (expression Quantitative Trait Loci) Integration Pipeline bridges the gap between genomic variation and transcriptomic expression, linking the Amalgkit RNA-seq Pipeline with the GWAS DNA Pipeline.
By analyzing how genetic variants (from VCFs) correlate with gene expression levels (from Amalgkit quantification), this pipeline identifies genetic markers that regulate gene transcription.
Note: eQTL is a cross-cutting integration pipeline and does not have its own `src/metainformant/eqtl/` module. Instead, the core logic lives in `metainformant.gwas.finemapping.eqtl`, `metainformant.gwas.visualization.eqtl_visualization`, and `metainformant.multiomics.analysis.integration`. Scripts are in `scripts/eqtl/`.
| Doc | Description |
|---|---|
| Pipeline Guide | Step-by-step walkthrough of transcriptome SNP calling |
| Configuration Reference | All YAML config options |
| This page | Overview, architecture, and integration scripts |
```mermaid
graph TD
    subgraph "Upstream Pipelines"
        RNA[Amalgkit Pipeline\nabundance.tsv]
        DNA[GWAS Pipeline\nPopulation VCF]
    end
    subgraph "eQTL Integration Pipeline"
        L[Load Matrices] --> F[Filter & Normalize]
        F --> S[cis-eQTL Scan]
        S --> M[Multiple Testing Correction]
        M --> A[Annotate Results]
    end
    subgraph "Outputs"
        RES[cis_eqtl_annotated.tsv]
        VOL[Volcano Plots]
        BOX[Top eQTL Boxplots]
    end
    RNA --> L
    DNA --> L
    A --> RES
    A --> VOL
    A --> BOX
```
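The diagram names a multiple testing correction step without saying which method is used. As a minimal sketch, assuming the Benjamini-Hochberg false discovery rate procedure (an assumption, not necessarily what the pipeline implements), adjusted p-values for a list of scan results could be computed as follows. The helper name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Hypothetical helper for illustration; not part of metainformant.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four variant-gene pairs.
pvals = [0.001, 0.04, 0.03, 0.2]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
```

With these toy inputs only the first pair passes a 5% FDR cutoff, because the second and third pairs share an adjusted value slightly above 0.05.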
The repository provides end-to-end execution scripts in the `scripts/eqtl/` directory, covering both synthetic and real-world usage.
The real-data workflow uses real, quantified Amalgkit transcriptomic data generated by the pipeline (e.g., from Apis mellifera). Because matched population-scale whole-genome sequencing is often unavailable for the same RNA samples, the script generates context-aware synthetic genotypes linked to the real gene positions.
```bash
uv run python scripts/eqtl/run_eqtl_real.py
```

Key Steps:
- Loads real Kallisto expression data from Amalgkit `work/quant/` subdirectories.
- Filters out low-expression genes (mean TPM < threshold).
- Parses actual gene loci.
- Synthesizes genotype variants surrounding these genes based on realistic allele frequencies.
- Runs cis-eQTL scanning (500 kb windows).
- Generates full summary statistics, volcano plots, and effect size boxplots.
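The filtering and scanning steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `metainformant.gwas.finemapping.eqtl`: the function names, the 1.0 TPM threshold, and the choice to measure the 500 kb window on either side of the gene start are all hypothetical. The association test is a simple linear regression of expression on genotype dosage. A p-value would come from a t distribution with n - 2 degrees of freedom, which is left out here to keep the example dependency-free.

```python
import math
from statistics import mean

CIS_WINDOW = 500_000  # assumed: 500 kb on either side of the gene start


def filter_low_expression(tpm_by_gene, min_mean_tpm=1.0):
    """Keep genes whose mean TPM across samples reaches the threshold."""
    return {g: v for g, v in tpm_by_gene.items() if mean(v) >= min_mean_tpm}


def cis_pairs(gene_pos, variant_pos, window=CIS_WINDOW):
    """Yield (gene, variant) pairs on the same chromosome within the window."""
    for gene, (g_chrom, g_start) in gene_pos.items():
        for var, (v_chrom, v_pos) in variant_pos.items():
            if g_chrom == v_chrom and abs(v_pos - g_start) <= window:
                yield gene, var


def associate(dosages, expression):
    """Regress expression on dosage; return (slope, Pearson r, t statistic)."""
    n = len(dosages)
    mx, my = mean(dosages), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        return None  # monomorphic variant or constant expression: untestable
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / max(1e-12, 1.0 - r * r))
    return slope, r, t


# Toy inputs: two genes, three variants, six samples (all values made up).
tpm = {
    "geneA": [5.0, 7.1, 9.2, 4.8, 7.0, 9.1],
    "geneB": [0.1, 0.0, 0.2, 0.1, 0.0, 0.1],
}
gene_pos = {"geneA": ("chr1", 1_000_000), "geneB": ("chr1", 5_000_000)}
variant_pos = {
    "snp1": ("chr1", 1_200_000),
    "snp2": ("chr1", 3_000_000),
    "snp3": ("chr2", 1_000_100),
}
dosage = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [1, 1, 0, 2, 0, 1],
    "snp3": [0, 0, 1, 1, 2, 2],
}

kept = filter_low_expression(tpm)
kept_pos = {g: gene_pos[g] for g in kept}
results = {
    (gene, var): associate(dosage[var], kept[gene])
    for gene, var in cis_pairs(kept_pos, variant_pos)
}
```

In this toy data `geneB` is dropped by the expression filter, and only `snp1` falls inside the window around `geneA`, so `results` holds a single tested pair.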
The demo workflow is a fully synthetic pipeline designed for rapid testing, methods development, and CI/CD validation. It simulates 100 samples, 50 genes, and 500 variants with a defined set of "true" eQTL effects.
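To make the idea of planted "true" effects concrete, here is a rough sketch of how such a dataset could be simulated with the standard library alone. It is not the demo script's code. The function name `simulate`, the number of planted effects (5), the effect size, the allele frequency range, and the Hardy-Weinberg sampling of genotypes are all assumptions made for illustration.

```python
import random


def simulate(n_samples=100, n_genes=50, n_variants=500,
             n_true=5, effect=1.5, seed=0):
    """Simulate genotypes, expression, and a known list of true eQTL pairs.

    Hypothetical helper; parameter defaults other than the three sizes
    are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Dosages (0/1/2): two independent allele draws per sample per variant.
    genotypes = []
    for _ in range(n_variants):
        maf = rng.uniform(0.05, 0.5)
        genotypes.append(
            [(rng.random() < maf) + (rng.random() < maf)
             for _ in range(n_samples)]
        )

    # Baseline expression is pure Gaussian noise.
    expression = [
        [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        for _ in range(n_genes)
    ]

    # Plant an additive effect of one random variant on each chosen gene.
    true_pairs = []
    for gene in rng.sample(range(n_genes), n_true):
        variant = rng.randrange(n_variants)
        for s in range(n_samples):
            expression[gene][s] += effect * genotypes[variant][s]
        true_pairs.append((gene, variant))

    return genotypes, expression, true_pairs


genotypes, expression, true_pairs = simulate()
```

Because `true_pairs` records which gene-variant pairs carry a planted effect, a scan run on this data can be checked for whether it recovers them.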
```bash
uv run python scripts/eqtl/run_eqtl_demo.py
```

The RNA SNP pipeline extracts SNP variants directly from RNA-seq data by re-downloading FASTQs, aligning with HISAT2, and calling variants with bcftools. It produces per-sample VCFs and population genetics summaries.

```bash
# With CLI args
uv run python scripts/eqtl/rna_snp_pipeline.py --species amellifera --n-samples 3
# With YAML config
uv run python scripts/eqtl/rna_snp_pipeline.py --config config/eqtl/eqtl_amellifera.yaml
```

See Pipeline Guide and Configuration Reference for details.
Under the hood, the eQTL workflows rely on the highly optimized functions located in:
- `metainformant.gwas.finemapping.eqtl`: Core statistical scanning, matrix operations, effect size calculation, and `load_transcriptome_variants()` for VCF→matrix conversion.
- `metainformant.gwas.visualization.eqtl_visualization`: Plotting utilities for volcano plots, summary grids, and genotype/expression boxplots.
- `metainformant.multiomics.analysis.integration`: Helper functions to convert and harmonize VCFs and expression `DataFrame`s.
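For readers unfamiliar with VCF→matrix conversion, the following sketch shows the general idea: each `GT` field is turned into an alternate-allele dosage, giving one row per variant and one column per sample. It does not reproduce the signature or behavior of `load_transcriptome_variants()`. The names `gt_to_dosage` and `vcf_to_dosage_matrix` are hypothetical, and the sketch assumes diploid calls and counts every non-reference allele as 1.

```python
def gt_to_dosage(gt):
    """Count non-reference alleles in a GT string; None if any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    return sum(a != "0" for a in alleles)


def vcf_to_dosage_matrix(lines):
    """Parse VCF text lines into (sample names, {variant key: dosages})."""
    samples, matrix = [], {}
    for line in lines:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        key = f"{chrom}:{pos}:{ref}:{alt}"
        matrix[key] = [
            gt_to_dosage(col.split(":")[gt_index]) for col in fields[9:]
        ]
    return samples, matrix


# A tiny made-up VCF with three samples and two variants.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:12\t0/1:9\t1|1:15",
    "chr1\t250\t.\tC\tT\t40\tPASS\t.\tGT\t./.\t0|1\t0/0",
]
samples, dosages = vcf_to_dosage_matrix(vcf)
```

Missing calls are kept as `None` here so that a later step can decide whether to drop or impute them.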
- Amalgkit Total Pipeline: Upstream transcriptomics.
- GWAS Total Pipeline: Upstream genomic variants.
- Multiomics: Advanced omics integration (Joint PCA, NMF).