Skip to content

Commit 20e70b2

Browse files
committed
feat: added panaroo as a genome comparison tool, closes #6
1 parent 938ac49 commit 20e70b2

File tree

9 files changed

+129
-11
lines changed

9 files changed

+129
-11
lines changed

.test/config/config.yml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,3 +21,8 @@ quast:
2121
reference_fasta: ""
2222
reference_gff: ""
2323
extra: ""
24+
25+
panaroo:
26+
remove_source: "cmsearch"
27+
remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
28+
extra: "--clean-mode strict --remove-invalid-genes"

README.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
[![GitHub actions status](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml)
55
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
66
[![run with apptainer](https://img.shields.io/badge/run%20with-apptainer-1D355C.svg?labelColor=000000)](https://apptainer.org/)
7+
[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/MPUSP/snakemake-assembly-postprocessing)
78

89
A Snakemake workflow for the post-processing of microbial genome assemblies.
910

@@ -20,6 +21,8 @@ If you use this workflow in a paper, don't forget to give credits to the authors
2021
1. NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). Note: needs to be installed manually
2122
2. [prokka](https://github.com/tseemann/prokka), a fast and light-weight prokaryotic annotation tool
2223
3. [bakta](https://github.com/oschwengers/bakta), a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from zenodo (light: 1.5 GB, full: 40 GB)
24+
3. Create a QC report for the assemblies using [Quast](https://github.com/ablab/quast)
25+
4. Create a pangenome analysis (orthologs/homologs) using [Panaroo](https://gthlab.au/panaroo/)
2326

2427
## Requirements
2528

@@ -64,6 +67,14 @@ conda activate snakemake-assembly-postprocessing
6467

6568
## References
6669

70+
> Seemann T. _Prokka: rapid prokaryotic genome annotation_. Bioinformatics. **2014** Jul 15;30(14):2068-9. PMID: 24642063. https://doi.org/10.1093/bioinformatics/btu153.
71+
72+
> Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. _Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification_. Microb Genom, 7(11):000685 **2021**. PMID: 34739369. https://doi.org/10.1099/mgen.0.000685.
73+
6774
> Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. _RefSeq: Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation._ Nucleic Acids Res, **2021** Jan 8;49(D1):D1020-D1028. https://doi.org/10.1093/nar/gkaa1105
6875
76+
> Gurevich A, Saveliev V, Vyahhi N, Tesler G. _QUAST: quality assessment tool for genome assemblies_. Bioinformatics. 29(8):1072-5, **2013**. PMID: 23422339. https://doi.org/10.1093/bioinformatics/btt086.
77+
78+
> Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. _Producing polished prokaryotic pangenomes with the Panaroo pipeline_. Genome Biol. 21(1):180, **2020**. PMID: 32698896. https://doi.org/10.1186/s13059-020-02090-4.
79+
6980
> Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. _Sustainable data analysis with Snakemake_. F1000Research, 10:33, 10, 33, **2021**. https://doi.org/10.12688/f1000research.29032.2.

config/config.yml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,3 +21,8 @@ quast:
2121
reference_fasta: ""
2222
reference_gff: ""
2323
extra: ""
24+
25+
panaroo:
26+
remove_source: "cmsearch"
27+
remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
28+
extra: "--clean-mode strict --remove-invalid-genes"

config/schemas/config.schema.yml

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,18 @@ properties:
8080
extra:
8181
type: string
8282
description: Extra command-line arguments for QUAST
83+
panaroo:
84+
type: object
85+
properties:
86+
remove_source:
87+
type: string
88+
description: Source types to remove in Panaroo (regex supported)
89+
remove_feature:
90+
type: string
91+
description: Feature types to remove in Panaroo (regex supported)
92+
extra:
93+
type: string
94+
description: Extra command-line arguments for Panaroo
8395

8496
required:
8597
- samplesheet

workflow/Snakefile

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -67,8 +67,5 @@ onerror:
6767
# -----------------------------------------------------
6868
rule all:
6969
input:
70-
expand(
71-
"results/qc/quast/{tool}/report.txt",
72-
tool=config["tool"],
73-
),
70+
get_final_input,
7471
default_target: True

workflow/envs/panaroo.yml

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
name: panaroo
2+
channels:
3+
- conda-forge
4+
- bioconda
5+
- nodefaults
6+
dependencies:
7+
- numpy=1.26.4
8+
- scipy=1.11.4
9+
- panaroo=1.5.2

workflow/rules/annotate.smk

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -138,13 +138,13 @@ rule get_bakta_db:
138138
"""
139139
if [ {params.download_db} != 'none' ]; then
140140
echo 'The most recent of the following available Bakta DBs is downloaded:' > {log};
141-
bakta_db list >> {log};
141+
bakta_db list &>> {log};
142142
bakta_db download --output {params.outdir} --type {params.download_db} &>> {log};
143143
else
144144
echo 'Using Bakta DB from supplied input dir: {params.existing_db}' > {log};
145145
ln -s {params.existing_db} {output.db};
146-
echo 'Update ARMFinderPlus DB using supplied input dir: {params.existing_db}' > {log};
147-
amrfinder_update --force_update --database {params.existing_db}/amrfinderplus-db >> {log}
146+
echo 'Update ARMFinderPlus DB using supplied input dir: {params.existing_db}' >> {log};
147+
amrfinder_update --force_update --database {params.existing_db}/amrfinderplus-db &>> {log}
148148
fi
149149
"""
150150

workflow/rules/common.smk

Lines changed: 24 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,6 @@ def get_fasta(wildcards):
2323
sample = wildcards.sample
2424
if sample not in samples.index:
2525
raise ValueError(f"Sample {sample} not found in samplesheet.")
26-
# return the fasta file path
2726
return samples.loc[sample, "file"]
2827

2928

@@ -35,9 +34,31 @@ def get_quast_fasta(wildcards):
3534
)
3635

3736

38-
def get_quast_gff(wildcards):
37+
def get_panaroo_gff(wildcards):
3938
return expand(
40-
"results/annotation/{tool}/{sample}/{sample}.gff",
39+
"results/qc/panaroo/{tool}/prepare/{sample}.gff",
4140
tool=wildcards.tool,
4241
sample=samples.index,
4342
)
43+
44+
45+
def get_panaroo_fasta(wildcards):
46+
return expand(
47+
"results/qc/panaroo/{tool}/prepare/{sample}.fna",
48+
tool=wildcards.tool,
49+
sample=samples.index,
50+
)
51+
52+
53+
def get_final_input(wildcards):
54+
inputs = []
55+
inputs += expand(
56+
"results/qc/quast/{tool}/report.txt",
57+
tool=config["tool"],
58+
)
59+
if len(samples.index) > 1:
60+
inputs += expand(
61+
"results/qc/panaroo/{tool}/summary_statistics.txt",
62+
tool=config["tool"],
63+
)
64+
return inputs

workflow/rules/qc.smk

Lines changed: 59 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ rule quast:
2020
else []
2121
),
2222
extra=config["quast"]["extra"],
23-
threads: workflow.cores * 0.25
23+
threads: 4
2424
log:
2525
"results/qc/quast/{tool}/quast.log",
2626
shell:
@@ -34,3 +34,61 @@ rule quast:
3434
{input.fasta} \
3535
> {log} 2>&1
3636
"""
37+
38+
39+
rule prepare_panaroo:
40+
input:
41+
fasta="results/annotation/{tool}/{sample}/{sample}.fna",
42+
gff="results/annotation/{tool}/{sample}/{sample}.gff",
43+
output:
44+
fasta="results/qc/panaroo/{tool}/prepare/{sample}.fna",
45+
gff="results/qc/panaroo/{tool}/prepare/{sample}.gff",
46+
conda:
47+
"../envs/panaroo.yml"
48+
message:
49+
"""--- Prepare input files for pan-genome alignment ---"""
50+
params:
51+
remove_source=config["panaroo"]["remove_source"],
52+
remove_feature=config["panaroo"]["remove_feature"],
53+
log:
54+
"results/qc/panaroo/{tool}/prepare/{sample}.log",
55+
shell:
56+
"""
57+
echo 'Preparing annotation for Panaroo:' > {log};
58+
echo ' - formatting seqnames in FASTA files' >> {log};
59+
awk '{{ sub(/>.*\\|/, ">"); sub(/[[:space:]].*$/, ""); print }}' \
60+
{input.fasta} > {output.fasta} 2>> {log};
61+
echo ' - removing sequences and selected features in GFF files' >> {log};
62+
awk ' /^##FASTA/ {{exit}} $2 !~ /{params.remove_source}/ && $3 !~ /{params.remove_feature}/ {{print}}' \
63+
{input.gff} > {output.gff} 2>> {log}
64+
"""
65+
66+
67+
rule panaroo:
68+
input:
69+
gff=get_panaroo_gff,
70+
fasta=get_panaroo_fasta,
71+
output:
72+
stats="results/qc/panaroo/{tool}/summary_statistics.txt",
73+
conda:
74+
"../envs/panaroo.yml"
75+
message:
76+
"""--- Running PANAROO to create pangenome from all annotations ---"""
77+
params:
78+
outdir=lambda wc, output: os.path.dirname(output.stats),
79+
extra=config["panaroo"]["extra"],
80+
threads: 4
81+
log:
82+
"results/qc/panaroo/{tool}/panaroo.log",
83+
shell:
84+
"""
85+
printf '%s\n' {input.gff} | \
86+
paste -d ' ' - <(printf '%s\n' {input.fasta}) \
87+
> {params.outdir}/input_files.txt;
88+
panaroo \
89+
-i {params.outdir}/input_files.txt \
90+
-o {params.outdir} \
91+
-t {threads} \
92+
{params.extra} \
93+
> {log} 2>&1
94+
"""

0 commit comments

Comments
 (0)