Optimize daily_package_downloads with partitioning, clustering, and bootstrap migration #22
Draft
…clustering Co-authored-by: brabster <[email protected]>
Co-authored-by: brabster <[email protected]>
…data Co-authored-by: brabster <[email protected]>
Copilot AI changed the title from "[WIP] Refactor daily_package_downloads table for efficient queries" to "Optimize daily_package_downloads with partitioning, clustering, and bootstrap migration" on Nov 23, 2025
The daily_package_downloads table lacks partitioning and clustering, causing 153GB scans for 14-day test queries. This adds ~$20/week in unnecessary costs and will worsen as the table grows.

Changes

New optimized model daily_package_downloads_optimised.sql with:
- PARTITION BY download_date for temporal pruning
- CLUSTER BY package, package_version for package-specific queries

```sql
{% if is_incremental() %}
  -- Standard incremental: new data only
  SELECT ... FROM {{ ref('file_downloads') }}
  WHERE download_date >= '{{ latest_partition_date }}'
{% else %}
  -- Bootstrap: copy existing + new data
  SELECT * FROM {{ ref('daily_package_downloads') }}
  UNION ALL
  SELECT ... FROM {{ ref('file_downloads') }}
  WHERE download_date > '{{ old_table_latest_date }}'
{% endif %}
```

Test optimization
- where config (153GB → 2.5GB)
- downloads_and_vulnerabilities.sql and test files

Documentation
- MIGRATION_STRATEGY.md: Bootstrap rationale, deployment steps, rollback path
- PIPELINE_REFACTORING_ANALYSIS.md: Recommends separate daily pipeline in same repo for fresher data at similar cost

Expected impact
Original table remains unchanged for safe rollback.
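
For reference, the partitioning and clustering listed under Changes might be declared in the new model's config block along these lines. This is a minimal sketch that assumes the project runs on the dbt-bigquery adapter and that download_date is a DATE column. Neither is confirmed in this description, and the actual config in daily_package_downloads_optimised.sql may differ.

```sql
-- Hypothetical config block for daily_package_downloads_optimised.sql.
-- Assumes the dbt-bigquery adapter; the column names come from the PR description.
{{
  config(
    materialized='incremental',
    -- Daily partitions on download_date allow date filters to prune partitions
    partition_by={
      'field': 'download_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    -- Clustering narrows the blocks scanned by package-specific lookups
    cluster_by=['package', 'package_version']
  )
}}
```

With a config like this, a query filtered on download_date should only read the matching partitions, which is the mechanism behind the reduced scan size reported for the test queries.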