Hello,
I am analysing PacBio Sequel II full-length 16S rRNA CCS data (~1450 bp) using the DADA2 long-read workflow, but I am observing an unusually high number of unique sequences.
Dataset
Platform: PacBio Sequel II
Data type: CCS reads
Target: full-length 16S (~1450 bp)
~10,000–13,000 reads per sample
Primers trimmed using cutadapt
CCS reads filtered to a minimum of 5 passes
Primers:
F: AGRGTTTGATYNTGGCTCAG
R: TASGGHTACCTTGTTASGACTT
During dereplication with derepFastq(), almost all reads appear to be unique (for example, ~11,200 unique sequences from ~12,300 reads).
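To make that ratio concrete, here is a minimal sketch of how it can be computed from a dereplicated sample. The helper name summarise_uniques is hypothetical and the file path in the comments is a placeholder. The only thing assumed about DADA2 is that the object returned by derepFastq() carries a `$uniques` vector of per-sequence abundances.

```r
# Summarise how repetitive a dereplicated sample is.
# `uniques` is a named integer vector of per-sequence abundances, such as
# the `$uniques` element of the object returned by dada2::derepFastq().
summarise_uniques <- function(uniques) {
  n_reads  <- sum(uniques)
  n_unique <- length(uniques)
  list(
    n_reads        = n_reads,
    n_unique       = n_unique,
    unique_ratio   = n_unique / n_reads,           # close to 1: almost no repeats
    singleton_frac = sum(uniques == 1) / n_reads,  # share of reads seen exactly once
    max_abundance  = max(uniques)                  # size of the largest cluster
  )
}

# Toy example: 5 reads collapsing to 3 unique sequences.
toy <- c(seqA = 3L, seqB = 1L, seqC = 1L)
str(summarise_uniques(toy))

# With real data (hypothetical path; requires the dada2 package):
# drp <- dada2::derepFastq("sample1.trimmed.fastq.gz")
# str(summarise_uniques(drp$uniques))
```

For the sample quoted above, unique_ratio would be roughly 11,200 / 12,300, i.e. about 0.91.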
After running learnErrors(), dada(), and removeBimeraDenovo(), only a very small number of reads remain per sample.
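For context, these steps can be sketched for a single sample with the long-read settings from the DADA2 PacBio tutorial. This is an illustration under assumptions, not necessarily the exact script that was run: the wrapper name run_pacbio_dada2 and both file paths are hypothetical, and the parameter values are the tutorial's, not values tuned for this dataset.

```r
# Sketch of a single-sample long-read run; needs the dada2 package when called.
# `fn_trimmed` is the primer-trimmed FASTQ, `fn_filtered` the output path
# for the filtered reads (both hypothetical).
run_pacbio_dada2 <- function(fn_trimmed, fn_filtered, multithread = TRUE) {
  # Length and quality filter; 1000-1600 bp brackets the ~1450 bp amplicon.
  track <- dada2::filterAndTrim(fn_trimmed, fn_filtered,
                                minQ = 3, minLen = 1000, maxLen = 1600,
                                maxN = 0, rm.phix = FALSE, maxEE = 2)

  # Error model learned with the PacBio-specific estimation function.
  err <- dada2::learnErrors(fn_filtered,
                            errorEstimationFunction = dada2::PacBioErrfun,
                            BAND_SIZE = 32, multithread = multithread)

  # Dereplicate, denoise, and build a one-sample sequence table.
  drp <- dada2::derepFastq(fn_filtered)
  dd  <- dada2::dada(drp, err = err, BAND_SIZE = 32, multithread = multithread)
  st  <- dada2::makeSequenceTable(dd)

  # Chimera removal with the tutorial's stricter fold-abundance requirement.
  st_nochim <- dada2::removeBimeraDenovo(st, method = "consensus",
                                         minFoldParentOverAbundance = 3.5,
                                         multithread = multithread)

  # Read counts at each stage, to see where reads are lost.
  list(filter = track, n_denoised = sum(st), n_nonchimeric = sum(st_nochim))
}

# Usage (hypothetical paths):
# run_pacbio_dada2("sample1.trimmed.fastq.gz", "filtered/sample1.fastq.gz")
```

Tracking the read count after each stage shows whether the loss happens at filtering, denoising, or chimera removal.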
This seems similar to issue #1975, where it was noted that DADA2 cannot infer ASVs when almost every read is unique, because denoising requires repeated observations of the same sequence.
Before concluding that the data is unsuitable for DADA2, I wanted to ask:
Is such a high unique/read ratio expected for PacBio full-length 16S CCS data?
Could this behaviour be caused by mixed read orientation or incomplete primer trimming?
Are there recommended filtering parameters for PacBio full-length 16S data prior to denoising?
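On the orientation question, a simple check that could be run on the untrimmed reads is sketched below in base R. It converts the two primers listed above into regular expressions and counts which primer each read contains as written: the forward primer suggests forward orientation, the reverse primer suggests a reverse-complemented read. The function names are hypothetical, matching is exact (no mismatches or indels allowed), and the FASTQ path in the comments is a placeholder, so this is only a rough diagnostic.

```r
# IUPAC nucleotide codes expanded to regular-expression character classes.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Turn a degenerate primer into a regex, e.g. "AGR" -> "AG[AG]".
primer_regex <- function(primer) {
  paste(iupac[strsplit(primer, "")[[1]]], collapse = "")
}

# Reverse complement that also handles IUPAC ambiguity codes.
revcomp <- function(x) {
  comp <- chartr("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Label each read by which primer it contains as written (exact matches only).
classify_orientation <- function(reads, fwd_primer, rev_primer) {
  has_fwd <- grepl(primer_regex(fwd_primer), reads)
  has_rev <- grepl(primer_regex(rev_primer), reads)
  label <- ifelse(has_fwd & has_rev, "both",
           ifelse(has_fwd, "forward",
           ifelse(has_rev, "reverse", "none")))
  table(factor(label, levels = c("forward", "reverse", "both", "none")))
}

fwd_primer <- "AGRGTTTGATYNTGGCTCAG"
rev_primer <- "TASGGHTACCTTGTTASGACTT"

# Toy reads: one forward-oriented amplicon and its reverse complement.
fwd_read <- paste0("AGAGTTTGATCATGGCTCAG", "ACGTACGTAC",
                   revcomp("TACGGATACCTTGTTACGACTT"))
rev_read <- revcomp(fwd_read)
print(classify_orientation(c(fwd_read, rev_read), fwd_primer, rev_primer))

# With real data (hypothetical path; uncompressed four-line FASTQ records):
# reads <- readLines("sample1.raw.fastq")[c(FALSE, TRUE, FALSE, FALSE)]
# print(classify_orientation(reads, fwd_primer, rev_primer))
```

Running the same function on the cutadapt output would also flag incomplete trimming, since any read still counted as "forward" or "reverse" retains a primer. A sizeable "reverse" count in the raw reads would point to mixed orientation.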
I can provide example FASTQ files or the full script if helpful.
Thank you for any suggestions.