PacBio full-length 16S CCS data: almost all reads unique during dereplication and very few reads retained after denoising

Hello,

I am analysing PacBio Sequel II full-length 16S rRNA CCS data (~1450 bp) using the DADA2 long-read workflow, but I am observing an unusually high number of unique sequences.

Dataset

Platform: PacBio Sequel II

Data type: CCS reads

Target: full-length 16S (~1450 bp)

Read depth: ~10,000–13,000 reads per sample

Primer trimming: cutadapt

CCS filter: at least 5 passes

Primers:

F: AGRGTTTGATYNTGGCTCAG
R: TASGGHTACCTTGTTASGACTT

During dereplication:

derepFastq()

almost all reads appear to be unique: for example, ~11,200 unique sequences from ~12,300 reads (about 91%).
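To make the unique-to-read ratio easy to reproduce outside R, here is a minimal Python sketch that counts exact-duplicate sequences in a FASTQ file. It is not part of DADA2 and only approximates what derepFastq() reports. The function names `read_fastq_sequences` and `dereplication_summary` are hypothetical, and the sketch assumes standard four-line FASTQ records.

```python
import gzip
from collections import Counter


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or gzip)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record holds the bases
                # Upper-casing is an assumption so that case differences
                # are not counted as distinct sequences.
                yield line.strip().upper()


def dereplication_summary(path):
    """Count total reads, exact-unique sequences, and singletons."""
    counts = Counter(read_fastq_sequences(path))
    total = sum(counts.values())
    unique = len(counts)
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "reads": total,
        "unique": unique,
        "singletons": singletons,
        "unique_fraction": unique / total if total else 0.0,
    }
```

Calling `dereplication_summary("sample1.fastq.gz")` (a hypothetical file name) returns the read count, unique count, singleton count, and unique fraction for one sample.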

After running:

learnErrors()
dada()
removeBimeraDenovo()

only a very small number of reads remain per sample.

This seems similar to issue #1975, which noted that DADA2 cannot infer ASVs when almost every read is unique, because it relies on repeated observations of the same sequence.

Before concluding that the data is unsuitable for DADA2, I wanted to ask:

Is such a high unique/read ratio expected for PacBio full-length 16S CCS data?

Could this behaviour be caused by mixed read orientation or incomplete primer trimming?

Are there recommended filtering parameters for PacBio full-length 16S data prior to denoising?
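For the orientation and primer question, a check along these lines could be run on the reads before and after cutadapt. This is a minimal Python sketch, independent of DADA2 and cutadapt. It uses the two primers listed above and looks for exact IUPAC-aware matches near the read ends. It does not tolerate mismatches or indels, so it may under-count primer hits. The function names and the 60 bp search window are assumptions for illustration.

```python
import re
from collections import Counter

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Primers as given in the dataset description.
FWD = "AGRGTTTGATYNTGGCTCAG"
REV = "TASGGHTACCTTGTTASGACTT"


def revcomp(seq):
    """Reverse complement, preserving IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a primer with IUPAC codes into an exact-match regex."""
    return re.compile("".join(IUPAC[base] for base in primer))


FWD_RE = primer_regex(FWD)
REV_RE = primer_regex(REV)
FWD_RC_RE = primer_regex(revcomp(FWD))
REV_RC_RE = primer_regex(revcomp(REV))


def classify_read(seq, window=60):
    """Label a read by which primer appears near its ends.

    forward:   FWD near the 5' end or revcomp(REV) near the 3' end
    reverse:   REV near the 5' end or revcomp(FWD) near the 3' end
    no_primer: neither found (fully trimmed, or primers carry mismatches)
    """
    seq = seq.upper()
    head, tail = seq[:window], seq[-window:]
    fwd_hit = bool(FWD_RE.search(head) or REV_RC_RE.search(tail))
    rev_hit = bool(REV_RE.search(head) or FWD_RC_RE.search(tail))
    if fwd_hit and rev_hit:
        return "ambiguous"
    if fwd_hit:
        return "forward"
    if rev_hit:
        return "reverse"
    return "no_primer"


def tally_orientations(sequences):
    """Count orientation labels over an iterable of read sequences."""
    return Counter(classify_read(seq) for seq in sequences)
```

On untrimmed CCS reads, a mix of `forward` and `reverse` labels would point to mixed orientation. On the trimmed reads, anything other than `no_primer` would suggest that some primer sequence survived trimming.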

I can provide example FASTQ files or the full script if helpful.

Thank you for any suggestions.
