trawl 0.4.0

@dongwhee dongwhee released this 20 Apr 11:36
· 19 commits to main since this release
56ed03e

Fourth tagged release. Closes the C6 (hybrid retrieval) follow-up chain surfaced in 0.3.0's "retrieval still struggles on code_heavy_query" note.

Headline

Playwright shadow-DOM unwrap for code-block custom elements (default on). MDN's post-2024 redesign wraps every code example in <mdn-code-example> backed by Shadow DOM, which Playwright's page.content() does not traverse. This release inlines each matching element's shadowRoot.querySelector('pre > code').textContent into <pre><code>…</code></pre> in the light DOM before extraction, so html_to_markdown / Trafilatura sees a proper code block.

Result: claude_code_mdn_fetch_api flipped to PASS; code_heavy_query 16-pattern slice now 16/16 on develop. Parity 15/15 in both on/off modes. Zero effect on the 15 non-MDN patterns (allow-list is narrow).

  • Env var: TRAWL_SHADOW_DOM_UNWRAP (default "1"; set "0" to disable).
  • Allow-list: SHADOW_DOM_UNWRAP_TAGS = ("mdn-code-example",) in src/trawl/fetchers/playwright.py. Additions go through the same measurement gate.
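
The unwrap step described above can be sketched roughly as follows. This is an illustration only, not the code in src/trawl/fetchers/playwright.py: the helper names unwrap_enabled and unwrap_shadow_code_blocks, the JavaScript body, and the choice to replace the custom element outright are all assumptions. The page argument is expected to behave like a Playwright sync-API Page (anything with an evaluate(expression, arg) method), and only open shadow roots are reachable this way.

```python
import os

# Allow-list and env var names are taken from the release notes;
# everything else in this sketch is hypothetical.
SHADOW_DOM_UNWRAP_TAGS = ("mdn-code-example",)

# Runs in the browser. For each allow-listed element with an open shadow
# root, copy the text of its `pre > code` node into a light-DOM
# <pre><code> pair. Replacing the element is an assumption of this sketch.
_UNWRAP_JS = """
(tags) => {
  let count = 0;
  for (const tag of tags) {
    for (const el of document.querySelectorAll(tag)) {
      const code = el.shadowRoot && el.shadowRoot.querySelector('pre > code');
      if (!code) continue;
      const pre = document.createElement('pre');
      const inner = document.createElement('code');
      inner.textContent = code.textContent;
      pre.appendChild(inner);
      el.replaceWith(pre);
      count += 1;
    }
  }
  return count;
}
"""


def unwrap_enabled(env=None):
    """Default on; only the literal value "0" disables the unwrap."""
    env = os.environ if env is None else env
    return env.get("TRAWL_SHADOW_DOM_UNWRAP", "1") != "0"


def unwrap_shadow_code_blocks(page, tags=SHADOW_DOM_UNWRAP_TAGS, env=None):
    """Inline shadow-DOM code blocks; returns how many elements were unwrapped."""
    if not unwrap_enabled(env):
        return 0
    return page.evaluate(_UNWRAP_JS, list(tags))
```

In a fetcher, such a helper would be called after navigation and before page.content(), so the serialized HTML already contains the inlined code blocks.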

What else is in 0.4.0

  • fix — Two Stack Exchange code_heavy_query URLs were resolving to unrelated questions (slug-ID mismatch, SE resolves by ID). Replaced SF #378860 → SF #87056 and SO #44488350 → SO #42639984. Both flip to PASS. Not an extraction defect.
  • research × 4 — RRF-k tuning (PR #29), identifier-aware BM25 tokenizer (PR #31), HyDE → BM25 extras (PR #32), MDN reranker diagnostic (PR #33). Three spikes reached gate (b) reject at the pre-registered thresholds; each rejection narrowed the search and eventually pointed at Shadow DOM as the actual bottleneck. Runners + design docs preserved for future reuse.
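
For readers unfamiliar with the k that PR #29 tuned: reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over the input rankings, so a larger k flattens the advantage of top-ranked hits. The sketch below shows the standard formula only. The function name rrf_fuse and the tie-breaking rule are assumptions, and this is not trawl's own fusion code.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    `rankings` is an iterable of lists ordered best-first. Ranks are 1-based.
    Ties are broken by document id so the output is deterministic
    (an assumption of this sketch).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, fusing a dense ranking ["a", "b", "c"] with a BM25 ranking ["a", "c", "d"] at k=60 puts "a" first (2/61), then "c" (1/63 + 1/62), then "b" (1/62), then "d" (1/63).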

Known caveats

  • Reranker :8083 intermittently returns HTTP 500 during sweeps. Cosine fallback keeps assertions passing; reliability investigation queued separately.
  • SHADOW_DOM_UNWRAP_TAGS covers mdn-code-example only. Docusaurus / GitBook / other React-based docs sites with similar wrappers will need per-tag measurement before addition.
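
The cosine fallback mentioned in the first caveat could look roughly like the sketch below. It is a guess at the general pattern, not trawl's implementation: RerankerUnavailable, score_with_fallback, and the idea that the reranker client raises on an HTTP 500 are all assumptions made for illustration.

```python
import math


class RerankerUnavailable(Exception):
    """Hypothetical error a reranker client might raise on an HTTP 500."""


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_with_fallback(query_vec, doc_vecs, rerank):
    """Return (scores, source).

    `rerank` is a zero-argument callable that returns one score per document
    or raises RerankerUnavailable. On failure, fall back to cosine similarity
    between the query embedding and each document embedding.
    """
    try:
        return list(rerank()), "reranker"
    except RerankerUnavailable:
        return [cosine(query_vec, d) for d in doc_vecs], "cosine"
```

Returning the source alongside the scores makes it easy to count how often a sweep hit the fallback path, which may help the queued reliability investigation.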

Full entry: CHANGELOG.md ## [0.4.0].

PRs merged since 0.3.0

#    Kind      Title
#29  research  RRF k tuning sweep — retain k=60 (measurement-driven)
#30  fix       Correct SE URLs (+ argparse dup fix)
#31  research  BM25 identifier-aware tokenizer — gate (b) rejected
#32  research  HyDE compound identifier → BM25 query — gate (b) rejected
#33  research  MDN reranker diagnostic — D1 / shadow DOM root cause
#34  feat      Inline shadow-DOM code blocks for known custom elements (default on)
#35  release   0.4.0 (develop → main)