Add ONT telomere test data (GIAB HG002) #1947

Merged
pinin4fjords merged 1 commit into nf-core:modules from pinin4fjords:telogator2-test-data
Mar 24, 2026

@pinin4fjords
Member

Summary

  • Add 17 real ONT telomeric reads from GIAB HG002 for testing telomere analysis tools (e.g. telogator2)
  • Source: GIAB 2025.01 ONT release, SUP basecalling, R10.4.1
  • Regions: last 10 kb of chr1 and chr2 (telomeric ends), downsampled to ~17 reads
  • Files: data/genomics/homo_sapiens/nanopore/bam/HG002_ont_telomere/HG002_ont_tel_sub.bam (~491 KB) + index
  • Includes README with regeneration instructions
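
The "last 10 kb" regions above can be written as samtools-style region strings once the chromosome lengths are known. This is only a minimal sketch, not the README's actual regeneration procedure: the helper name `end_region` is hypothetical, and the lengths are the GRCh38 values for chr1 and chr2, which is an assumption because the PR does not state which reference the GIAB alignments use.

```python
# Hypothetical helper: build a 1-based, inclusive region string covering
# the last `window` bases of a chromosome.
def end_region(chrom: str, length: int, window: int = 10_000) -> str:
    start = max(1, length - window + 1)  # clamp for contigs shorter than the window
    return f"{chrom}:{start}-{length}"

# Assumed GRCh38 lengths; substitute the lengths of the reference actually
# used (for example, from its .fai index).
CHROM_LENGTHS = {"chr1": 248_956_422, "chr2": 242_193_529}

regions = [end_region(chrom, length) for chrom, length in CHROM_LENGTHS.items()]
print(regions)
```

Strings of this form could then be passed to a region-aware extraction step before downsampling.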

Context

Needed for the new telogator2 nf-core module (nf-core/modules#11033). Existing PacBio test BAMs don't contain telomere reads, so the module's tests can only exercise telogator2's "no telomere reads" fallback path.

🤖 Generated with Claude Code

17 real telomeric ONT reads for testing telogator2 and other telomere
analysis tools. Downsampled from GIAB 2025.01 ONT release (SUP
basecalling, R10.4.1).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

fellen31 left a comment

💯

@pinin4fjords
Member Author

Thanks @fellen31!

pinin4fjords merged commit e8f3e14 into nf-core:modules Mar 24, 2026
1 check passed
pinin4fjords deleted the telogator2-test-data branch March 24, 2026 12:08
pinin4fjords added a commit to nf-core/modules that referenced this pull request Mar 24, 2026
- tlens: tlens_by_allele.tsv (primary result)
- plots: *.png (allele and violin plots, optional)
- qc: qc directory (stats, read lengths, metadata)

Also revert modules_testdata_base_path now that
nf-core/test-datasets#1947 is merged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
github-merge-queue bot pushed a commit to nf-core/modules that referenced this pull request Mar 24, 2026
* Add new module: telogator2

Add nf-core module for telogator2, a tool for allele-specific telomere
length estimation and TVR characterization from long-read sequencing
data (ONT/PacBio).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* add telomere test data and temporarily point to fork

- Add ONT telomere reads test (exercises real analysis path)
- Keep PacBio no-telomere test (exercises graceful fallback)
- Temporarily override modules_testdata_base_path to
  pinin4fjords/test-datasets#telogator2-test-data (revert before merge)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: combine fasta and fai into single reference channel

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: simplify process script and assert failure on no telomere reads

Remove error-catching wrapper from telogator2 process script. When no
telomere reads are found the tool now fails with a clear error message,
which the no-telomere test asserts against.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor: emit individual files instead of directory

- tlens: tlens_by_allele.tsv (primary result)
- plots: *.png (allele and violin plots, optional)
- qc: qc directory (stats, read lengths, metadata)

Also revert modules_testdata_base_path now that
nf-core/test-datasets#1947 is merged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: exclude non-deterministic rng.txt from snapshot

The qc/rng.txt file contains a random seed that differs across runs.
Assert qc output exists but don't snapshot it.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix: handle non-deterministic telogator2 outputs in tests

Set fixed random seed (--rng 42) via test config. Assert tlens header
structure rather than md5 since TL values vary across runs due to
minimap2 non-determinism. Assert plots and qc exist without
snapshotting.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
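
The idea of asserting header structure rather than an md5 can be sketched in plain Python. This is an illustration only, not the module's nf-test code: the column names and the two sample tables below are hypothetical, and the real `tlens_by_allele.tsv` header may differ.

```python
import csv
import hashlib
import io

# Hypothetical column names, used only for illustration.
EXPECTED_PREFIX = ["chrom", "allele", "length"]

def header_matches(tsv_text: str, expected_prefix: list) -> bool:
    """Return True if the first TSV row starts with the expected columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return header[: len(expected_prefix)] == expected_prefix

def md5_of(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Two made-up runs whose values differ slightly, as described above.
run_a = "chrom\tallele\tlength\nchr1p\t0\t5123\n"
run_b = "chrom\tallele\tlength\nchr1p\t0\t5140\n"

print(md5_of(run_a) == md5_of(run_b))            # checksums disagree
print(header_matches(run_a, EXPECTED_PREFIX),
      header_matches(run_b, EXPECTED_PREFIX))    # structure still agrees
```

A checksum comparison would fail between such runs, while the structural check stays stable.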

* fix: address PR review feedback

- Use module_args pattern for ext.args in test config
- Snapshot output file names (not md5s) for non-deterministic outputs
- Remove PNGs from stub (plots are optional)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
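
Snapshotting file names instead of md5s amounts to comparing a sorted list of base names. A minimal Python sketch of that idea follows; it is not the nf-test implementation, the helper `snapshot_names` and the `work/ab/` paths are hypothetical, and only the file names themselves are taken from the commit messages in this PR.

```python
from pathlib import PurePosixPath

def snapshot_names(paths, exclude=()):
    """Return sorted base names, skipping any names listed in `exclude`."""
    names = {PurePosixPath(p).name for p in paths}
    return sorted(names - set(exclude))

# Hypothetical work-directory paths for one run.
outputs = [
    "work/ab/violin_atl.png",
    "work/ab/all_final_alleles.png",
    "work/ab/qc/rng.txt",
]

# rng.txt is left out here, mirroring the earlier commit that excludes it
# from the snapshot.
print(snapshot_names(outputs, exclude=["rng.txt"]))
```

Because only names are compared, byte-level differences between runs do not invalidate the snapshot.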

* fix: make plots a required output

The two main plots (all_final_alleles.png, violin_atl.png) are always
produced on a successful run. Remove optional flag and add them back
to the stub.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* refactor: split QC directory into individual output channels

Emit cmd, stats, qc_readlens, readlens, and rng as separate channels
instead of a single qc directory. Touch all QC files in stub.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>