Fourth tagged release. Closes the C6 (hybrid retrieval) follow-up chain surfaced in 0.3.0's "retrieval still struggles on code_heavy_query" note.
Headline
Playwright shadow-DOM unwrap for code-block custom elements (default on). MDN's post-2024 redesign wraps every code example in <mdn-code-example> backed by Shadow DOM, which Playwright's page.content() does not traverse. This release inlines each matching element's shadowRoot.querySelector('pre > code').textContent into <pre><code>…</code></pre> in the light DOM before extraction, so html_to_markdown / Trafilatura sees a proper code block.
Result: claude_code_mdn_fetch_api flipped to PASS; code_heavy_query 16-pattern slice now 16/16 on develop. Parity 15/15 in both on/off modes. Zero effect on the 15 non-MDN patterns (allow-list is narrow).
- Env var: `TRAWL_SHADOW_DOM_UNWRAP` (default `"1"`; set `"0"` to disable).
- Allow-list: `SHADOW_DOM_UNWRAP_TAGS = ("mdn-code-example",)` in `src/trawl/fetchers/playwright.py`. Additions go through the same measurement gate.
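The unwrap step described above could be sketched roughly as follows. This is a minimal illustration, not the code in `src/trawl/fetchers/playwright.py`: the helper names `unwrap_enabled`, `build_unwrap_script`, and `unwrap_shadow_code` are hypothetical, treating any value other than `"0"` as enabled is an assumption, and so is replacing the host element rather than appending inside it. `page` is assumed to be a Playwright sync-API `Page`, whose `evaluate` method runs a JavaScript function expression in the page. Only open shadow roots are reachable this way.

```python
import json
import os

# Allow-list as given in the release notes.
SHADOW_DOM_UNWRAP_TAGS = ("mdn-code-example",)


def unwrap_enabled(environ=None):
    """Return True unless TRAWL_SHADOW_DOM_UNWRAP is "0" (assumed semantics)."""
    env = os.environ if environ is None else environ
    return env.get("TRAWL_SHADOW_DOM_UNWRAP", "1") != "0"


def build_unwrap_script(tags=SHADOW_DOM_UNWRAP_TAGS):
    """Build a JS function expression that inlines shadow-root code blocks."""
    selector = ", ".join(tags)
    return (
        "() => {\n"
        f"  const hosts = document.querySelectorAll({json.dumps(selector)});\n"
        "  let replaced = 0;\n"
        "  for (const host of hosts) {\n"
        "    const root = host.shadowRoot;  // null for closed shadow roots\n"
        "    const code = root && root.querySelector('pre > code');\n"
        "    if (!code) continue;\n"
        "    const pre = document.createElement('pre');\n"
        "    const inner = document.createElement('code');\n"
        "    inner.textContent = code.textContent;  // plain text, no re-escaping\n"
        "    pre.appendChild(inner);\n"
        "    host.replaceWith(pre);  // assumption: swap the host for <pre><code>\n"
        "    replaced += 1;\n"
        "  }\n"
        "  return replaced;\n"
        "}"
    )


def unwrap_shadow_code(page, environ=None):
    """Run the unwrap before page.content(); return the number of hosts replaced."""
    if not unwrap_enabled(environ):
        return 0
    return page.evaluate(build_unwrap_script())
```

Calling `unwrap_shadow_code(page)` just before `page.content()` would leave ordinary `<pre><code>` blocks in the serialized HTML for the downstream extractor.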
What else is in 0.4.0
- fix: Two Stack Exchange `code_heavy_query` URLs were resolving to unrelated questions (slug-ID mismatch, SE resolves by ID). Replaced `SF #378860` → `SF #87056` and `SO #44488350` → `SO #42639984`. Both flip to PASS. Not an extraction defect.
- research × 4: RRF-k tuning (PR #29), identifier-aware BM25 tokenizer (PR #31), HyDE → BM25 extras (PR #32), MDN reranker diagnostic (PR #33). Three spikes reached gate (b) reject at the pre-registered thresholds; each rejection narrowed the search and eventually pointed at Shadow DOM as the actual bottleneck. Runners + design docs preserved for future reuse.
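For readers unfamiliar with the parameter tuned in PR #29: reciprocal rank fusion scores each document as the sum of `1 / (k + rank)` over every ranking it appears in. The sketch below shows that standard formula with `k=60`. It is a generic illustration rather than trawl's retrieval code; the function name `rrf_fuse` and the sample chunk IDs are made up.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion (1-based ranks)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a lexical and a dense retriever.
bm25_ranking = ["chunk_a", "chunk_b", "chunk_c"]
dense_ranking = ["chunk_c", "chunk_a", "chunk_d"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

A larger `k` flattens the difference between adjacent ranks, so agreement between the two rankings matters more than a top position in one of them.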
Known caveats
- Reranker `:8083` intermittently returns HTTP 500 during sweeps. Cosine fallback keeps assertions passing; reliability investigation queued separately.
- `SHADOW_DOM_UNWRAP_TAGS` covers `mdn-code-example` only. Docusaurus / GitBook / other React-based docs sites with similar wrappers will need per-tag measurement before addition.
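The cosine fallback mentioned in the first caveat can be pictured as follows. This is a sketch under assumptions, not the project's implementation: `rerank_fn` is a hypothetical stand-in for whatever client calls the reranker service, and embeddings are plain Python lists.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_with_fallback(query_vec, chunks, rerank_fn):
    """Order chunk IDs with rerank_fn; fall back to cosine similarity if it raises.

    chunks: list of (chunk_id, embedding) pairs.
    rerank_fn: hypothetical callable returning chunk IDs in ranked order and
               raising an exception on failure (for example an HTTP 500).
    Returns (ranked_ids, source) where source is "reranker" or "cosine".
    """
    try:
        return rerank_fn(chunks), "reranker"
    except Exception:
        ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
        return [chunk_id for chunk_id, _ in ranked], "cosine"
```

Returning the source alongside the ranking makes it possible to count how often a sweep fell back, which is useful when investigating reranker reliability.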
Full entry: CHANGELOG.md ## [0.4.0].
PRs merged since 0.3.0
| # | Kind | Title |
|---|---|---|
| #29 | research | RRF k tuning sweep — retain k=60 (measurement-driven) |
| #30 | fix | Correct SE URLs (+ argparse dup fix) |
| #31 | research | BM25 identifier-aware tokenizer — gate (b) rejected |
| #32 | research | HyDE compound identifier → BM25 query — gate (b) rejected |
| #33 | research | MDN reranker diagnostic — D1 / shadow DOM root cause |
| #34 | feat | Inline shadow-DOM code blocks for known custom elements (default on) |
| #35 | release | 0.4.0 (develop → main) |