Output warc-record-id and warc-ip-address in the CDX index#48

Open
lfoppiano wants to merge 3 commits into cc from bugfix/warc-id-warc-ip

@lfoppiano

Ref #39

@lfoppiano lfoppiano force-pushed the bugfix/warc-id-warc-ip branch from cec9c62 to c5a4bd6 Compare February 27, 2026 17:06
@lfoppiano lfoppiano linked an issue Feb 27, 2026 that may be closed by this pull request
@lfoppiano
Author

@sebastian-nagel I ran some manual tests with the local Nutch (the same ones we did together) and added some simple unit tests (e.g. checking that the ID prefix is removed correctly). The test records may still have some imperfections, as they were built manually.

I did not have time to work on the more detailed integration test we discussed, which would deserialize the Java objects from the segments, but I will open an issue for it, since it should benefit more use cases.

@lfoppiano lfoppiano marked this pull request as ready for review March 1, 2026 06:53

@sebastian-nagel sebastian-nagel left a comment

Thanks, @lfoppiano!

The change is fine.

But the PR should be kept open for a while, until the related changes mentioned in #39 are also implemented. I'd prefer to put everything into production in one go.

Development

Successfully merging this pull request may close these issues.

Add WARC-Record-ID and WARC-IP-Address to CDX files

2 participants