A Snakemake workflow for the identification of variants in bacterial genomes using nanopore long-read sequencing.
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).
This workflow provides a simple and easy-to-use framework for the identification of structural and small nucleotide variants in bacterial genomes using nanopore long-read sequencing data.
The snakemake-ont-bacterial-variants workflow is built using snakemake and consists of the following steps:
- Quality check of sequencing data (
FastQC) - Filtering of input sequencing data by read length and quality (
Filtlong) - Mapping to reference genome (
NGMLR) - Calling of structural and single nucleotide variants (SVs:
cuteSVandSniffles2; SNVs:Clair3andMedaka) - Filtering of identified variants (e.g., by variant quality or genomic regions;
BCFtoolsandVCFtools) - Generate report with final results (
R markdown,igv-reports, andMultiQC)
If you would like to contribute, report issues, or suggest features, please get in touch on GitHub.
Step 1: Install snakemake with conda in a new conda environment.
conda create -n <ENV> snakemake pandasStep 2: Activate conda environment with snakemake
conda activate <ENV>Important note:
All other dependencies for the workflow are automatically pulled as conda environments by snakemake, when running the workflow with the --use-conda parameter (recommended).
When run without automatically built conda environments, all packages need to be installed manually:
NanoPlotMultiQCFiltlongNGMLRminimap2samtoolsbedtoolsMedakaClair3cuteSVSniffles2bcftoolsVCFtoolsr-tidyverser-rmarkdownr-dtigv-reports
The workflow requires the following files to be located in the data directory:
- Whole-genome sequencing data in
*.fastq.gzformat indata/fastq - Reference genome(s) in
*.faformat indata/reference
Optionally, users can provide:
- Reference genome annotation in
*.gffformat indata/annotation(for feature annotation in IGV report) - A
*.bedfile with genomic regions to ignore for variant calling indata/masked_region
Please ensure that the chromosome names in *.gff and *.bed files are the same as in the reference genome.
Input data files are provided in the samples.tsv table, whose location is inidcated in the config.yml file. The samplesheet must comply with the following structure:
sampledefines the sample name that will be used throughout the workflow and thus needs to be unique.fastqprovides the path to the sample's*.fastq.gzfile.referenceprovides the path to the reference genome*.fafile (may be the same for several / all samples).annotationprovides the path to the optional reference genome annotation in*.gfffile (may be the same for several / all samples). If no annotation is provided, you must entern/a!masked_regionsprovides the path to the optional*.bedfile for filtering genomic regions (may be the same for several / all samples). If no*.bedfile is provided, you must entern/a!
| sample | fastq | reference | annotation | masked_regions |
|---|---|---|---|---|
| <sample1> | data/fastq/<fastq1>.fastq.gz | data/reference/<ref1>.fa | data/annotation/<anno1>.gff | data/masked_region/<region1>.bed |
| <sample2> | data/fastq/<fastq2>.fastq.gz | data/reference/<ref2>.fa | data/annotation/<anno2>.gff | data/masked_region/<region2>.bed |
| ... | ... | ... | ... | ... |
| <sampleN> | data/fastq/<fastqN>.fastq.gz | data/reference/<refN>.fa | data/annotation/<annoN>.gff | data/masked_region/<regionN>.bed |
Before executing the workflow, you may want to adjust several options and parameters in the default config file config/config.yml:
- Directories:
indir: Input directory for all input files,databy default (see above)outdir: Output directory (relative to working directory),resultsby default
- Sample information:
samples: Path to samplesheet (relative to working directory),samplesheet/samples.tsvby defaultlibprepkit: Kit from ONT used for library preparation, e.g.SQK-NBD114.24basecalling_model: Model used for basecalling of raw sequencing data (required for variant calling usingMedaka), currently supported models are:r1041_e82_400bps_sup_v4.2.0r1041_e82_400bps_sup_v4.3.0
- Tool parameters:
- The number of cores can be adjusted here for the following tools:
NGMLR,NanoPlot,MultiQC,Medaka,Clair3,Sniffles2, andcuteSV - You may further adjust the run parameters for the following tools (please refer to the reference provided for more details on run parameters):
Filtlong: By default, reads are filtered for a minimum length of 500 bp and a mean accuracy of at least 90% (Q10), with 90% of the longest and highest-quailty reads to be kept.Clair3: Variants are called on all contigs in a haploid-sensitive, ONT-specific mode using--include_all_ctgs --haploid_sensitive --platform ont.cuteSV: Variants are called with the suggested parameters for ONT data (--max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3) and the genotyping option enabled (--genotype).
- The number of cores can be adjusted here for the following tools:
- Filtering of variants:
- The variant quality thresholds can be adjusted here for all four variant callers
remove_common_variants: IfTrue, variants which have been identified in all samples with the same reference genome by one tool are filtered out. This is helpful in case all samples derive from a strain, whose genome sequence already differs from the used reference sequence. IfFalse, all variants are reported.
- Reporting options:
igv_region_length: Neighboring variants with a maximum bp distance indicated here [1 by default] are reported in one region in the IGV variant report. Increasing this parameter will reduce the file size of the resulting IGV HTML report, if hotspots / regions with many variants exist in a sample.
To run the workflow from the command line, change the working directory:
cd /path/to/snakemake-ont-bacterial-variantsBefore running the entire workflow, you may perform a dry run using:
snakemake --dry-runTo run the complete workflow with your files using conda, execute the following command (the definition of the number of cores is mandatory):
snakemake --cores 10 --sdm condaYou may also run the workflow on the provided test data using:
snakemake --cores 10 --sdm conda --directory .testThe most important output files and reports are found in variant_reports for each sample in a separate sub-directory variant_reports/<sample>/:
<sample>_overview.html: Custom overview HTML report comprising read and mapping statistics as well as identified variants.<sample>_IGV.html: HTML report with alignment data for all variants listed in<sample>_overview.html(generated withigv-reports).<sample>.<tool>.vcf: File with all variants for each tool used (after filtering).
Further, variants_reports contains MultiQC.html, an interactive HTML report generated by MultiQC with read statistics on raw, filtered and aligned reads.
Log files from the generation of above reports can be found in variant_reports/logs/:
<sample>.collect_vcfs.log: Copying of variant files to above destinations<sample>.igv_reports.log: Generation of IGV HTML report<sample>.prepare_vcfs.log: Merging of neighboring variants in genomic regions forigv-reports<sample>.report.log: Generation of final overview HTML report using R Markdown
In addition, the workflow generates the following output files in the corresponding directories:
filtered_reads
<sample>.fastq.gz: Length- and quality-filtered readslogs/<sample>.log: Filtering log file(s)
mapping
<sample>.bam: Alignment file of filtered reads to sample-specific reference genome (produced byNGMLR)<sample>.bam.bai: Alignment index filelogs/<sample>.*.log: Log files forNGMLR,samtools, andbedtools(calculation of genome coverage and read end distributions [as temporary files only])
qc
Read statistics on all raw, filtered and aligned reads are found here (generated with NanoPlot):
raw_reads/<sample>/: Directory with output files with read statistics on all raw reads (log files infiltered_reads/logs/<sample>.log).filtered_reads/<sample>/: Directory with output files with read statistics on all filtered reads (log files infiltered_reads/logs/<sample>.log).aligned_reads/<sample>/: Directory with output files with read statistics on all aligned reads (log files inaligned_reads/logs/<sample>.log).
The raw output from MultiQC - based on above NanoPlot results - is located in the directory multiqc, containing the interactive HTML report multiqc/multiqc_report.html.
SNV
For both tools (medaka and clair3):
<tool>/<sample>/: Directory with all raw output files from variant calling tool<tool>/<sample>.vcf: File with identified variants<tool>/<sample>.filtered.vcf: File with variants filtered by variant quality, overlap with other samples from same reference genome (only ifremove_common_variants: Trueinconfig.yml), and by genomic region if provided (comparemasked_regionin samplesheet)<tool>/logs/<sample>.*: Standard error (stderr) and standard output (stdout) files for the identification of variants using eithermedakaorclair3<tool>/logs/<sample>_filtering.log: Log file produced byVCFtoolsfor final filtering step to generate<tool>/<sample>.filtered.vcf
If remove_common_variants: True in config.yml, the following files are additionally produced:
<tool>/common_variants.vcf: File with variants called by the tool which are shared by all samples with the same reference genome<tool>/common_variants.txt: Tab-delimited file deduced from<tool>/common_variants.vcflisting contig and variant start position only<tool>/logs/<sample>_indexing.*: Standard error (stderr) and standard output (stdout) for indexing of*.vcffiles usingbcftools<tool>/logs/bcftools.*: Standard error (stderr) and standard output (stdout) for identification of shared variants across samples uisngbcftools
For clair3, data from the downloaded model is additionally found in the directory clair3/model/.
SV
For both tools (cutesv and sniffles2):
<tool>/<sample>.vcf: File with identified variants<tool>/<sample>.filtered.vcf: File with variants filtered by variant quality, overlap with other samples from same reference genome (only ifremove_common_variants: Trueinconfig.yml), and by genomic region if provided (comparemasked_regionin samplesheet)<tool>/logs/<sample>.log: Log file for the identification of variants using eithercutesvorsniffles2<tool>/logs/<sample>_filtering.log: Log file produced byVCFtoolsfor final filtering step to generate<tool>/<sample>.filtered.vcf
If remove_common_variants: True in config.yml, the following files are additionally produced:
<tool>/common_variants.vcf: File with variants called by the tool which are shared by all samples with the same reference genome<tool>/common_variants.txt: Tab-delimited file deduced from<tool>/common_variants.vcflisting contig and variant start position only<tool>/logs/<sample>_indexing.*: Standard error (stderr) and standard output (stdout) for indexing of*.vcffiles usingbcftools<tool>/logs/bcftools.*: Standard error (stderr) and standard output (stdout) for identification of shared variants across samples uisngbcftools
For cutesv, all raw output files from the initial variant calling can be found in the directory cutesv/<sample>/.
Dr. Thomas Fabian Wulff
- Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
- ORCID: https://orcid.org/0000-0001-7166-0899
Please also visit the MPUSP GitHub page at https://github.com/MPUSP for further information on this workflow and other projects!
The following software and tools have been used in this workflow:
BCFtools
Danecek, P., Bonfield, J.K., Liddle, J. et al. Twelve years of SAMtools and BCFtools. GigaScience 10(2), giab008, 2021. (https://doi.org/10.1093/gigascience/giab008)
bedtools
Quinlan, A.R., Hall, I.M., BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841-842, 2010. ( https://doi.org/10.1093/bioinformatics/btq033)
Clair3
Zheng, Z., Li, S., Su, J. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nature Computational Science 2, 797–803, 2022. (https://doi.org/10.1038/s43588-022-00387-x)
cuteSV
Jiang, T., Liu, Y., Jiang, Y. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biology 21, 189, 2020. (https://doi.org/10.1186/s13059-020-02107-y)
Filtlong
igv-reports
MultiQC
Ewels, P., Magnusson, M., Lundin, S., et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32(19) 3047–3048, 2016. (https://doi.org/10.1093/bioinformatics/btw354)
NanoPlot
Coster, W.D., Rademakers, R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics 39(5), btad311, 2023. (https://doi.org/10.1093/bioinformatics/btad311)
NGMLR
Sedlazeck, F.J., Rescheneder, P., Smolka, M. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods 15, 461–468, 2018. (https://doi.org/10.1038/s41592-018-0001-7)
SAMtools
Li, H., Handsaker, B., Wysoker, A. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–2079, 2009. (https://doi.org/10.1093/bioinformatics/btp352)
Snakemake
Mölder, F., Jablonski, K.P., Letcher, B. et al. Sustainable data analysis with Snakemake. F1000Research 10:33, 2021. (https://doi.org/10.12688/f1000research.29032.2)
Sniffles2
Smolka, M., Paulin, L.F., Grochowski, C.M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nature Biotechnology 2024. (https://doi.org/10.1038/s41587-023-02024-y)
VCFtools
Danecek, P., Auton, A., Abecasis, G. et al. The variant call format and VCFtools. Bioinformatics 27(15), 2156–2158, 2011. (https://doi.org/10.1093/bioinformatics/btr330)