
feat(ci): optimize HMF reference data download with Magic Cache #256

Draft
edmundmiller wants to merge 2 commits into dev from feat/hmf-reference-download

Conversation

@edmundmiller

Summary

Optimizes the nf-test CI workflow to download ~25GB of HMF reference data once and share it across all matrix jobs using runs-on Magic Cache (S3-backed storage).

Changes

  • Enable Magic Cache: Added extras=s3-cache and a runs-on/action@v2 step to all jobs
  • New download-reference job: Downloads reference data once before matrix jobs start
  • Cache sharing: All matrix jobs restore from the same S3-backed cache
  • Remove redundant downloads: Matrix jobs no longer download independently

Performance Impact

Before

  • 42 matrix jobs × 15 min download = 630 minutes of download time 🔴

After

  • 1 download: 15 min
  • 42 cache restores: 42 × 3 min = 126 min
  • Total: 141 minutes 🟢

Savings

~489 minutes (~8 hours) per workflow run! 🚀
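
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in this description (42 jobs, ~15 min per download, ~3 min per cache restore); the function and constant names are illustrative, and the inputs are estimates rather than measured timings. The totals are aggregate job minutes summed across the matrix, not wall-clock time, since matrix jobs run in parallel.

```python
def aggregate_download_minutes(jobs: int, download_min: float) -> float:
    """Total minutes when every matrix job downloads the reference data itself."""
    return jobs * download_min


def aggregate_cached_minutes(jobs: int, download_min: float, restore_min: float) -> float:
    """Total minutes when one job downloads once and each matrix job restores from cache."""
    return download_min + jobs * restore_min


# Estimates quoted in the PR description (assumed, not measured).
JOBS = 42
DOWNLOAD_MIN = 15
RESTORE_MIN = 3

before = aggregate_download_minutes(JOBS, DOWNLOAD_MIN)
after = aggregate_cached_minutes(JOBS, DOWNLOAD_MIN, RESTORE_MIN)
saved = before - after

print(f"before={before} min, after={after} min, saved={saved} min")
```

The saving grows with the gap between download and restore time: with these inputs each matrix job saves 12 minutes, and the single up-front download costs 15 minutes once.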

Benefits

  • No size limits: runs-on Magic Cache uses an S3 backend (no 10GB GitHub limit)
  • Fast restores: S3-backed cache is much faster than downloading from R2
  • Persistent cache: Cache persists across workflow runs
  • Significant time savings: Cuts ~8 hours of aggregate download time (summed across the 42 matrix jobs) per workflow run

Related Issues

Addresses the need for every nf-test job to download the ~25GB of WiGiTS toolkit reference data independently.

Testing

  • Verify download-reference job completes successfully
  • Verify all matrix jobs restore cache correctly
  • Verify total workflow runtime is reduced

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Add automatic download of ~25GB HMF reference data from Hartwig's
R2 CDN before running nf-tests. This ensures tests have access to
the required GRCh38_hmf genome and WGS resource files.

Changes:
- Add Nextflow setup and reference download steps to nf-test workflow
- Use prepare_reference mode to download from R2 CDN
- Configure tests/nextflow.config to detect and use local reference data
- Fall back to remote URLs for local development

The download adds ~10-15 minutes to CI runs but ensures tests can
access all required reference files without caching (GitHub Actions
cache is limited to 10GB).
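
The "detect local data, otherwise fall back to remote URLs" behaviour described in this commit can be illustrated with a small sketch. The actual logic lives in tests/nextflow.config and is not shown in this conversation, so the following Python is only a language-neutral illustration; the function name, the directory name, and the URL are hypothetical placeholders, not values from the pipeline.

```python
from pathlib import Path


def resolve_reference_base(local_dir: str, remote_url: str) -> str:
    """Prefer a local reference directory when it exists, else use the remote URL."""
    path = Path(local_dir)
    if path.is_dir():
        # CI case: the reference data was downloaded (or restored) beforehand.
        return str(path)
    # Local development case: no pre-downloaded data, so use the remote location.
    return remote_url


# Hypothetical values for illustration only.
base = resolve_reference_base("reference_data", "https://example.org/hmf-reference")
print(base)
```

Checking for the directory keeps a single config usable both in CI, where the data is already on disk, and on a developer machine, where it is not.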
@nf-core-bot
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@github-actions

github-actions bot commented Oct 29, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 9b66daa

  • ✅ 215 tests passed
  • ❔ 10 tests were ignored
  • ❗ 17 tests had warnings

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_unchanged - LICENSE does not match the template
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • schema_params - Schema param panel not found from nextflow config
  • schema_params - Schema param fastp_umi_location not found from nextflow config
  • schema_params - Schema param fastp_umi_length not found from nextflow config
  • schema_params - Schema param fastp_umi_skip not found from nextflow config
  • schema_params - Schema param redux_umi_duplex_delim not found from nextflow config
  • schema_params - Schema param genome_version not found from nextflow config
  • schema_params - Schema param genome_type not found from nextflow config
  • schema_params - Schema param ref_data_hmf_data_path not found from nextflow config
  • schema_params - Schema param ref_data_panel_data_path not found from nextflow config
  • schema_params - Schema param ref_data_genome_gtf not found from nextflow config
  • schema_params - Schema param ref_data_hla_slice_bed not found from nextflow config
  • schema_description - No description provided in schema for parameter: prepare_reference_only

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.3.2
  • Run at 2025-10-29 13:25:25

Refactor nf-test workflow to download ~25GB HMF reference data once
and share across all matrix jobs using runs-on Magic Cache (S3-backed).

Changes:
- Enable Magic Cache (extras=s3-cache) on all jobs
- Add dedicated download-reference job that runs once before matrix
- Use actions/cache with runs-on Magic Cache for S3-backed storage
- Matrix jobs now restore from cache instead of downloading individually
- Add runs-on/action@v2 to all jobs for Magic Cache support

Performance impact:
- Before: 42 jobs × 15 min = 630 minutes of download time
- After: 15 min download + (42 × 3 min restore) = 141 minutes
- Saves ~489 minutes (~8 hours) per workflow run

Benefits:
- No GitHub 10GB cache limit (uses S3 backend)
- Fast cache restore across all matrix jobs
- Cache persists across workflow runs
- Significant CI time savings
@edmundmiller force-pushed the feat/hmf-reference-download branch from ac53b87 to 9b66daa on October 29, 2025 13:23
@edmundmiller marked this pull request as draft on November 7, 2025 16:32
@rhassaine self-assigned this on Nov 12, 2025
3 participants