PacBio full-length 16S CCS data: almost all reads unique during dereplication and very few reads retained after denoising #2182

@alitcd

Hello,

I am analysing PacBio Sequel II full-length 16S rRNA CCS data (~1450 bp) using the DADA2 long-read workflow, but I am observing an unusually high number of unique sequences.

Dataset

Platform: PacBio Sequel II
Data type: CCS reads (minimum of 5 passes)
Target: full-length 16S (~1450 bp)
Depth: ~10,000–13,000 reads per sample
Primer trimming: cutadapt

Primers:

F: AGRGTTTGATYNTGGCTCAG
R: TASGGHTACCTTGTTASGACTT

During dereplication:

derepFastq()

almost all reads appear to be unique (for example, ~11,200 unique sequences from ~12,300 reads, i.e. about 91% unique).

After running:

learnErrors()
dada()
removeBimeraDenovo()

only a very small number of reads remain per sample.

This seems similar to issue #1975, where it was noted that DADA2 cannot infer ASVs when almost every read is unique, because it requires repeated observations of the same sequence.

Before concluding that the data is unsuitable for DADA2, I wanted to ask:

Is such a high unique/read ratio expected for PacBio full-length 16S CCS data?

Could this behaviour be caused by mixed read orientation or incomplete primer trimming?

Are there recommended filtering parameters for PacBio full-length 16S data prior to denoising?
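
To separate the orientation question from the other two, one quick check is to compare the unique/read ratio before and after collapsing each read with its reverse complement. The Python sketch below is only an illustration and is not part of DADA2; the function names (`reverse_complement`, `read_fastq_sequences`, `unique_fraction`) are hypothetical, and it assumes standard 4-line FASTQ records. Mixed orientation can at most double the number of unique sequences, so a ratio that stays near 0.9 after collapsing would suggest that orientation alone is not the cause.

```python
import gzip

# Complement table for A/C/G/T/N; any other character is left unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a 4-line FASTQ file.

    Assumes one sequence line per record; files ending in .gz are
    opened with gzip.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                yield line.strip().upper()


def unique_fraction(seqs, canonical=False):
    """Fraction of reads that are distinct sequences.

    With canonical=True, each read is replaced by the lexicographically
    smaller of itself and its reverse complement, so the same molecule
    read in either orientation counts once.
    """
    seqs = list(seqs)
    if not seqs:
        return 0.0
    if canonical:
        seqs = [min(s, reverse_complement(s)) for s in seqs]
    return len(set(seqs)) / len(seqs)
```

As a usage example, with a hypothetical file name, `unique_fraction(read_fastq_sequences("sample1.fastq.gz"))` gives the ratio as sequenced and `unique_fraction(read_fastq_sequences("sample1.fastq.gz"), canonical=True)` gives it after collapsing orientation. A large drop between the two would point to mixed orientation; little or no drop would point elsewhere, for example to residual errors or untrimmed primers.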

I can provide example FASTQ files or the full script if helpful.

Thank you for any suggestions.
