Dedupe replay improvements: lookup dependent WACZ files #308

Merged
ikreymer merged 4 commits into main from req-crawls-dependency-loading
Feb 17, 2026
Conversation

@ikreymer
Member

  • Load and store dependent WACZ files in WACZFile
  • Look up from a dependency if a revisit record, or no record, is encountered
  • Fix revisit handling: update source when 'resolving' revisits, and store origUrl and origTs

- load the list of required WACZ files from the 'relation.requires' entry and store it in WACZFile.reqFiles
- when a 'warc/revisit' record is encountered in a WACZ file, attempt to load the original via each file in WACZFile.reqFiles
- ensure the relation.requires WACZ files are loaded when revisits are encountered
- fix revisit resolution by storing the original loading path in the resolved entry
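
The lookup flow described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `SketchWACZ`, `ArchiveRecord`, `lookupWithDeps`, and the `refersToUrl` field are hypothetical names, the lookup is synchronous and in-memory (real WACZ loading would be asynchronous), and the meaning given to `source`, `origUrl`, and `origTs` is an assumption based on the description.

```typescript
// Hypothetical record shape, for illustration only.
interface ArchiveRecord {
  url: string;
  ts: string; // capture timestamp
  mime: string; // "warc/revisit" marks a revisit record
  payload?: string;
  refersToUrl?: string; // for revisits: URL of the original capture (assumed field)
  source?: string; // assumed: name of the WACZ the payload was loaded from
  origUrl?: string; // assumed: URL of the original capture
  origTs?: string; // assumed: timestamp of the original capture
}

// Hypothetical stand-in for a WACZ file with its list of required files.
class SketchWACZ {
  name: string;
  reqFiles: string[]; // names taken from a 'relation.requires'-style entry
  records: Map<string, ArchiveRecord>;

  constructor(name: string, reqFiles: string[], records: ArchiveRecord[]) {
    this.name = name;
    this.reqFiles = reqFiles;
    this.records = new Map(records.map((r): [string, ArchiveRecord] => [r.url, r]));
  }

  get(url: string): ArchiveRecord | undefined {
    return this.records.get(url);
  }
}

// Look up a URL in `file`; on a revisit or a missing record, try each required file in order.
function lookupWithDeps(
  url: string,
  file: SketchWACZ,
  all: Map<string, SketchWACZ>,
): ArchiveRecord | null {
  const rec = file.get(url);
  if (rec && rec.mime !== "warc/revisit") {
    return { ...rec, source: file.name };
  }
  const targetUrl = rec?.refersToUrl ?? url;
  for (const reqName of file.reqFiles) {
    const dep = all.get(reqName);
    if (!dep) continue; // dependency not available, try the next one
    const found = dep.get(targetUrl);
    if (found && found.mime !== "warc/revisit") {
      return {
        ...found,
        url: rec ? rec.url : url,
        ts: rec ? rec.ts : found.ts,
        source: dep.name, // record where the payload actually came from
        origUrl: found.url,
        origTs: found.ts,
      };
    }
  }
  return null;
}

// Usage: crawl.wacz holds only a revisit; the payload lives in base.wacz.
const base = new SketchWACZ("base.wacz", [], [
  { url: "https://example.com/", ts: "20260101000000", mime: "text/html", payload: "<html>hi</html>" },
]);
const crawl = new SketchWACZ("crawl.wacz", ["base.wacz"], [
  { url: "https://example.com/", ts: "20260201000000", mime: "warc/revisit", refersToUrl: "https://example.com/" },
]);
const allFiles = new Map<string, SketchWACZ>([[base.name, base], [crawl.name, crawl]]);
const resolved = lookupWithDeps("https://example.com/", crawl, allFiles);
```

In this sketch the resolved entry keeps the revisit's own URL and timestamp, while `source`, `origUrl`, and `origTs` point at the dependency that supplied the payload.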
@ikreymer ikreymer merged commit 6d55214 into main Feb 17, 2026
4 checks passed