
fix(crawler): canonicalize Apple documentation URLs #201

Open
imwyvern wants to merge 1 commit into mihaelamj:main from imwyvern:clawoss/fix/200-url-canonicalization


@imwyvern

Apple doc links with different path casing, or the old underscore framework form, normalized to different strings. That let the crawler enqueue the same page multiple times and let directory indexing produce separate framework/URI entries for the same content.

URLUtilities.normalize() now lowercases Apple documentation paths and maps underscores to dashes. The crawler normalizes restored, seeded, and newly discovered queue entries before de-duping; directory indexing canonicalizes framework/URI keys and keeps the newest duplicate by crawledAt.

Tests:

  • swift test --package-path Packages --filter 'CrawlerTests/urlNormalize'
  • swift test --package-path Packages --filter 'SearchTests'
  • swift build --package-path Packages -c release --arch arm64

Full swift test --package-path Packages still fails in existing MCP integration tests waiting for server responses, then crashes with Index out of range.

Fixes #200

@mihaelamj
Owner

Thanks for picking this up @imwyvern. Quick status:

The case-lowering half landed in develop independently while this PR was open (see the #200 entry in CHANGELOG). Apologies for the miss on the comms.

The underscore→dash half I deliberately decided against. Two reasons:

  1. installer_js is a real Apple framework whose canonical URL uses underscore. Collapsing dashes/underscores there 404s.
  2. The dash/underscore pairs I sampled (e.g. professional-video-applications vs professional_video_applications) turned out to be Apple serving distinct documentation under similar-looking slugs, not URL aliases.
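To make the distinction above concrete, here is a minimal sketch of case-only canonicalization that leaves underscores alone. The function name `canonicalizeDocPath` and the `developer.apple.com` host check are assumptions for illustration; this is not the project's actual `URLUtilities.normalize()`.

```swift
import Foundation

/// Hypothetical helper (not the project's URLUtilities.normalize()).
/// Lowercases the host and path of an Apple documentation URL and
/// leaves underscores untouched, so slugs such as installer_js survive.
func canonicalizeDocPath(_ urlString: String) -> String {
    guard var components = URLComponents(string: urlString),
          components.host?.lowercased() == "developer.apple.com", // assumed host
          components.path.lowercased().hasPrefix("/documentation/")
    else {
        return urlString // not an Apple documentation URL: leave as-is
    }
    components.host = components.host?.lowercased()
    components.path = components.path.lowercased() // case only; no "_" -> "-" mapping
    return components.string ?? urlString
}
```

With this shape, `.../documentation/SwiftUI/View` and `.../documentation/swiftui/view` collapse to one key, while `installer_js` and a dashed look-alike stay distinct.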

If you have specific URL pairs where the same content is served under both forms, please share. Happy to revisit. Closing this for now.

@mihaelamj mihaelamj closed this May 4, 2026
@mihaelamj mihaelamj reopened this May 4, 2026
@mihaelamj
Owner

Reopening. Closed too fast. Looking at the diff again, the underscore→dash collapse is the part I'm not taking, but you also wrote two pieces I didn't credit:

  1. Queue-membership dedup at enqueue in Crawler.shouldEnqueue. That's exactly what #206 (Crawler: queue dedup at enqueue time) asks for as a separate bug, where the crawl is currently running ~72% duplicate enqueues.
  2. deduplicateDocFilesByCanonicalURL save-layer pass in SearchIndexBuilder. The CHANGELOG had explicitly marked save-layer dedup as future work; you've done it.

Both are valuable independent of the URL-canonicalization debate.

Two options if you're game:

  • Strip the underscore→dash transform from canonicalPathComponent (drop the replacingOccurrences(of: "_", with: "-") call) and keep the rest. installer_js survives, the dedup work lands, and we close #206 (Crawler: queue dedup at enqueue time) in the same PR. I'd merge that.
  • Or, if you'd rather not edit, I can cherry-pick the queue-dedup + save-layer hunks with you as co-author on the commits.

Either way, sorry for the abrupt close, and thanks. The dedup work is the kind of thing nobody asks for and everybody benefits from.

@mihaelamj
Owner

Correcting myself: I got the queue-dedup attribution wrong. Just re-checked the code.

Crawler already has queue-membership dedup at enqueue via an enqueued: Set<String> with O(1) insert-check. The comment at Crawler.swift:342-348 references #206 by number and the 72% duplicate framing. So #206 was fixed in develop independently; my earlier comment crediting you for that work was wrong. Sorry for the misdirection.
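For readers unfamiliar with the pattern, enqueue-time dedup with a `Set<String>` might look roughly like the sketch below. The `CrawlQueue` type and its members are hypothetical and only illustrate the idea; the real code lives in Crawler.swift and may differ.

```swift
/// Hypothetical queue type, for illustration only.
struct CrawlQueue {
    private(set) var pending: [String] = []
    private var enqueued: Set<String> = []

    /// Returns true if the URL was newly queued, false if it had
    /// already been enqueued earlier in the crawl.
    @discardableResult
    mutating func enqueue(_ normalizedURL: String) -> Bool {
        // Set.insert reports whether the element was new, so the
        // membership check and the insert are one O(1) average-case step.
        guard enqueued.insert(normalizedURL).inserted else { return false }
        pending.append(normalizedURL)
        return true
    }
}
```

The set only helps if callers pass an already-normalized URL, which is why normalization and dedup come up together in this thread.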

What's actually still novel in your patch:

  • deduplicateDocFilesByCanonicalURL in SearchIndexBuilder. That save-layer dedup pass is not implemented anywhere on develop. The CHANGELOG had marked it as future work, and your version does it.
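As a rough illustration of what a save-layer pass of this kind does, the sketch below groups files by a canonical key and keeps the entry with the latest crawledAt, as the PR description states. The `DocFile` struct, the `deduplicateKeepingNewest` name, and the `canonicalKey` closure are assumptions; the actual deduplicateDocFilesByCanonicalURL in the patch may be structured differently.

```swift
import Foundation

/// Hypothetical stand-in for an indexed documentation file.
struct DocFile {
    let url: String
    let crawledAt: Date
}

/// Keeps one file per canonical key, preferring the most recently crawled.
/// `canonicalKey` is supplied by the caller (for example, path lowercasing).
func deduplicateKeepingNewest(
    _ files: [DocFile],
    canonicalKey: (String) -> String
) -> [DocFile] {
    var newest: [String: DocFile] = [:]
    for file in files {
        let key = canonicalKey(file.url)
        // Skip this file if an equally new or newer duplicate is already stored.
        if let existing = newest[key], existing.crawledAt >= file.crawledAt {
            continue
        }
        newest[key] = file
    }
    // Sort for deterministic output; dictionary order is unspecified.
    return newest.values.sorted { $0.url < $1.url }
}
```

Passing a case-only key such as `{ $0.lowercased() }` merges casing variants while keeping underscore and dash slugs separate, which matches the narrowed merge offer.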

Narrowed merge offer: if you strip the replacingOccurrences(of: "_", with: "-") line from canonicalPathComponent and drop the Crawler-side normalize/dedup hunks (already covered upstream), the SearchIndexBuilder dedup pass is what I'd take.

Or, same as before, if you'd rather not edit, I can cherry-pick that one hunk with you as co-author.

Apologies again for the noise.


Development

Successfully merging this pull request may close these issues.

URL canonicalization is incomplete: case + dash/underscore variants treated as distinct pages

2 participants