fix(crawler): canonicalize Apple documentation URLs #201
imwyvern wants to merge 1 commit into mihaelamj:main
Thanks for picking this up @imwyvern. Quick status: The case-lowering half landed in develop independently while this PR was open (see the #200 entry in CHANGELOG). Apologies for the miss on the comms. The underscore→dash half I deliberately decided against. Two reasons:
If you have specific URL pairs where the same content is served under both forms, please share. Happy to revisit. Closing this for now.
Reopening. Closed too fast. Looking at the diff again, the underscore→dash collapse is the part I'm not taking, but you also wrote two pieces I didn't credit:
Both are valuable independent of the URL-canonicalization debate. Two options if you're game:
Either way, sorry for the abrupt close, and thanks. The dedup work is the kind of thing nobody asks for and everybody benefits from.
Correcting myself: I got the queue-dedup attribution wrong. Just re-checked the code.
What's actually still novel in your patch:
Narrowed merge offer: if you strip the
Or, same as before, if you'd rather not edit, I can cherry-pick that one hunk with you as co-author. Apologies again for the noise.
Apple doc links with different path casing, or the old underscore framework form, normalized to different strings. That let the crawler enqueue the same page multiple times and let directory indexing produce separate framework/URI entries for the same content.
URLUtilities.normalize() now lowercases Apple documentation paths and maps underscores to dashes. The crawler normalizes restored, seeded, and newly discovered queue entries before de-duping; directory indexing canonicalizes framework/URI keys and keeps the newest duplicate by crawledAt.
Tests:
swift test --package-path Packages --filter 'CrawlerTests/urlNormalize'
swift test --package-path Packages --filter 'SearchTests'
swift build --package-path Packages -c release --arch arm64
Full swift test --package-path Packages still fails in existing MCP integration tests waiting for server responses, then crashes with Index out of range.
Fixes #200
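For readers following the description, here is a minimal sketch of the behavior it describes. It is not the code in this PR. `canonicalizeAppleDocURL`, `dedupedQueue`, `IndexEntry`, and `newestPerCanonicalURL` are hypothetical names, the sketch keys the index on the whole canonical URL string rather than on separate framework/URI keys, and the example paths are made up for illustration.

```swift
import Foundation

// Hypothetical stand-in for the described URLUtilities.normalize() behavior:
// lowercase the path and map underscores to dashes, for Apple documentation URLs only.
func canonicalizeAppleDocURL(_ url: URL) -> URL {
    guard var components = URLComponents(url: url, resolvingAgainstBaseURL: false),
          components.host?.lowercased() == "developer.apple.com",
          components.path.lowercased().hasPrefix("/documentation/")
    else { return url } // anything else is returned unchanged
    components.path = components.path
        .lowercased()
        .replacingOccurrences(of: "_", with: "-")
    return components.url ?? url
}

// Normalize every queue entry first, then drop repeats while keeping first-seen order.
func dedupedQueue(_ urls: [URL]) -> [URL] {
    var seen = Set<String>()
    var result: [URL] = []
    for url in urls.map(canonicalizeAppleDocURL) {
        if seen.insert(url.absoluteString).inserted {
            result.append(url)
        }
    }
    return result
}

// Hypothetical index record; only `crawledAt` is named in the description.
struct IndexEntry {
    let url: URL
    let crawledAt: Date
}

// Group entries by canonical URL and keep the most recently crawled one per key.
func newestPerCanonicalURL(_ entries: [IndexEntry]) -> [String: IndexEntry] {
    var newest: [String: IndexEntry] = [:]
    for entry in entries {
        let key = canonicalizeAppleDocURL(entry.url).absoluteString
        if let existing = newest[key], existing.crawledAt >= entry.crawledAt {
            continue // an equally new or newer entry is already stored
        }
        newest[key] = entry
    }
    return newest
}
```

With this sketch, a queue holding both a mixed-case and an underscore variant of the same path collapses to one entry, which is the multiple-enqueue case the description targets. Whether the underscore mapping is safe for every documentation path is the open question in the conversation, and the sketch takes no position on it.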