@@ -9,6 +9,134 @@ not yet follow semver strictly — expect breaking changes before
 
 _No changes yet._
 
+## [0.4.0] — 2026-04-20
+
+Fourth tagged release. Closes the C6 (hybrid retrieval) follow-up
+chain surfaced in 0.3.0's "retrieval still struggles on
+`code_heavy_query`" note. Headline: Playwright shadow-DOM unwrap for
+code-block custom elements. MDN's post-2024 redesign wraps every
+code example in `<mdn-code-example>` backed by Shadow DOM, which
+Playwright's `page.content()` does not traverse. Inlining the shadow
+`<pre><code>` content before extraction flips
+`claude_code_mdn_fetch_api` to PASS and brings the 16-pattern
+`code_heavy_query` slice to 16/16.
+
+Also includes two Stack Exchange URL corrections (PR #30) and four
+research spikes (PR #29/#31/#32/#33) whose conclusions and runners
+are kept as reusable artefacts. Three of those spikes rejected their
+hypothesis at the pre-registered gate. The measurement discipline
+ended up being as valuable as the one hypothesis that stuck, because
+each rejection narrowed the search to the actual bottleneck (Shadow
+DOM).
+
+### Added
+
+- **Shadow-DOM unwrap for code-block custom elements (default on).**
+  `src/trawl/fetchers/playwright.py` introduces
+  `SHADOW_DOM_UNWRAP_TAGS` (initial allow-list: `mdn-code-example`)
+  and `_unwrap_shadow_dom()`, called between the content-ready wait
+  and `page.content()`. For each matching element, pulls
+  `shadowRoot.querySelector('pre > code').textContent`, HTML-escapes
+  it, and inlines `<pre><code>{text}</code></pre>` into the light
+  DOM so `html_to_markdown` / Trafilatura sees a proper code block.
+  Using `textContent` (rather than the full shadow `innerHTML`)
+  avoids the syntax-highlight `<span>` scaffolding that would
+  otherwise split identifiers like `JSON.stringify` across tag
+  boundaries during markdown conversion. Falls back to the full
+  `shadowRoot.innerHTML` when no `pre > code` exists. Idempotent;
+  JS eval exceptions are swallowed so extraction never fails on
+  account of unwrap.
+  * New env var: `TRAWL_SHADOW_DOM_UNWRAP` (default `"1"`; set to
+    `"0"` to disable).
+  * New module-level constant: `SHADOW_DOM_UNWRAP_TAGS` in
+    `fetchers/playwright.py`. Additions must go through the same
+    measurement gate (fix a specific pattern and not regress the
+    other 15).
+  * Measurement runner: `benchmarks/shadow_dom_sweep.py` (2 modes
+    × 16 patterns × 2 iter + 15-case parity per mode).
+  * Design doc:
+    `docs/superpowers/specs/2026-04-20-playwright-shadow-dom-design.md`.
+  * Measurement: `shadow_dom_off` 15/16 → `shadow_dom_on` 16/16;
+    `flipped_to_pass = [claude_code_mdn_fetch_api]`;
+    `flipped_to_fail = []`; `top1_identity_changed = 1/16` (MDN
+    only, `n_chunks_total` 22 → 24); parity 15/15 both modes;
+    `retrieval_ms` regression within noise. Raw at
+    `benchmarks/results/shadow-dom-sweep/2026-04-20T10-26-17Z/`
+    (gitignored).
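The unwrap step described in the entry above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/trawl/fetchers/playwright.py`: the JavaScript body, the helper names `unwrap_enabled` and `inline_code_block`, and the `env` parameter are hypothetical. Only `SHADOW_DOM_UNWRAP_TAGS`, `TRAWL_SHADOW_DOM_UNWRAP`, and the overall behaviour come from the entry. The sketch assumes a Playwright `page` object and open shadow roots, and it treats only the literal `"0"` as "disabled".

```python
import html
import os

# Allow-list named in the entry; everything else below is illustrative.
SHADOW_DOM_UNWRAP_TAGS = ("mdn-code-example",)

# Hypothetical JS: copy each shadow root's code text into the light DOM.
_UNWRAP_JS = """
(tags) => {
  const esc = (s) => s.replace(/&/g, '&amp;')
                      .replace(/</g, '&lt;')
                      .replace(/>/g, '&gt;');
  let unwrapped = 0;
  for (const tag of tags) {
    for (const el of document.querySelectorAll(tag)) {
      const root = el.shadowRoot;  // null for closed or absent roots
      if (!root) continue;
      const code = root.querySelector('pre > code');
      el.innerHTML = code
        ? '<pre><code>' + esc(code.textContent) + '</code></pre>'
        : root.innerHTML;  // fallback when no pre > code exists
      unwrapped += 1;
    }
  }
  return unwrapped;
}
"""


def unwrap_enabled(env=None):
    """Default on; in this sketch only the literal "0" disables it."""
    env = os.environ if env is None else env
    return env.get("TRAWL_SHADOW_DOM_UNWRAP", "1") != "0"


def inline_code_block(text):
    """Python mirror of the JS branch: escaped text in <pre><code>."""
    return f"<pre><code>{html.escape(text, quote=False)}</code></pre>"


def unwrap_shadow_dom(page, tags=SHADOW_DOM_UNWRAP_TAGS, env=None):
    """Run the unwrap on a Playwright page; never raise."""
    if not unwrap_enabled(env):
        return 0
    try:
        return int(page.evaluate(_UNWRAP_JS, list(tags)) or 0)
    except Exception:
        # Swallow JS eval errors so extraction can still proceed.
        return 0
```

With Playwright, a call such as `unwrap_shadow_dom(page)` would sit just before `page.content()`. Re-running it rewrites the same light-DOM content, which is one way to get the idempotence the entry mentions.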
+
+### Fixed
+
+- **Two Stack Exchange `code_heavy_query` URLs resolved to
+  unrelated questions.** `claude_code_serverfault_nginx_reverse_proxy`
+  pointed at `serverfault.com/questions/378860` (resolves to an
+  apache-vhosts / cookie question, not the nginx reverse-proxy Host
+  header question). `claude_code_stackoverflow_python_async_subprocess`
+  pointed at SO #44488350, which is an *answer* ID whose parent
+  question is about CSV escaping. Stack Exchange resolves by ID
+  alone, ignoring the slug, so both patterns had been failing
+  against content unrelated to their query since the coding shard
+  was introduced. Replaced with the canonical questions (SF #87056
+  and SO #42639984); both flip to PASS. Not an extraction defect:
+  `benchmarks/stackexchange_extraction_diag.py` confirmed trawl's
+  extraction was intact. Also removes a duplicate argparse flag
+  registration in `tests/test_agent_patterns.py` left behind by a
+  stack-merge union resolver.
+
+### Research (no code change, shipped as reusable runners + design docs)
+
+- **C6 RRF-k tuning spike** (PR #29). Measured
+  `TRAWL_HYBRID_RRF_K ∈ {10, 30, 60, 100}` on the 16
+  `code_heavy_query` patterns with hybrid retrieval on. All four k
+  values produced identical assertion pass rates and identical top-1
+  reshuffles across three patterns: the reranker stabilises the
+  pre-rerank ordering, so RRF k is effectively invisible
+  downstream. Gate (b): retain `k=60`. Runner:
+  `benchmarks/c6_rrf_k_sweep.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-c6-rrf-k-tuning-design.md`.
+- **Identifier-aware BM25 tokenizer spike** (PR #31). Hypothesised
+  that emitting compound tokens for dotted (`asyncio.gather`) /
+  hyphenated (`Content-Type`) identifiers would let the sparse
+  ranker boost code-heavy chunks. Measurement (3 modes × 16
+  patterns): `net_assertion_delta = 0`, `top1_identity_changed =
+  0/16`. Corpus-side compound emission alone is insufficient when
+  queries don't contain the compound identifier (the MDN query
+  describes intent, "send a POST request", not symbols). Gate
+  (b). Runner: `benchmarks/bm25_id_aware_sweep.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-bm25-id-aware-tokenizer-design.md`.
+- **HyDE → BM25 query spike** (PR #32). Hypothesised that the
+  HyDE hypothetical answer (which does emit compound identifiers
+  under the current Gemma prompt) could feed the sparse query if
+  routed into BM25 in addition to the dense path. Measurement (3
+  modes × 16 patterns): `net_delta = 0`. HyDE produced the right
+  identifiers, but the MDN failure survived because, as the next
+  spike showed, the underlying chunks didn't contain those
+  identifiers in the first place (they were in Shadow DOM). Gate
+  (b). Runner: `benchmarks/hyde_compound_id_sweep.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-hyde-compound-identifier-design.md`.
+- **MDN reranker diagnostic** (PR #33). One-shot diagnostic to
+  locate the MDN assertion-keyword chunk's rank across raw /
+  reranked / HyDE modes. Found the keyword chunk at rank 14 even
+  in `raw` mode (no reranker), so the reranker was not the bottleneck.
+  Direct inspection of the HTML returned by Playwright showed 23
+  `<mdn-code-example>` tags whose light-DOM `innerHTML` was empty; the
+  real code lived in Shadow DOM. Decision hint `D1`, which set up
+  PR #34. Runner: `benchmarks/mdn_reranker_diag.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-mdn-reranker-diagnostic-design.md`.
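For readers unfamiliar with the knob swept in the RRF-k spike, reciprocal rank fusion scores each document as the sum of `1 / (k + rank)` over the input rankings. The sketch below is a generic version of that formula, not trawl's implementation: `rrf_fuse` and the toy rankings are invented for illustration. In this toy case the fused order happens to be the same for every k, which is a simpler mechanism than the reranker effect the entry describes.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Invented toy rankings (not trawl data).
dense = ["a", "b", "c", "d"]
sparse = ["c", "a", "d", "b"]

# Sweep the same k values as the spike and collect the fused orders.
fused_by_k = {k: rrf_fuse([dense, sparse], k) for k in (10, 30, 60, 100)}
```

Smaller k gives top ranks more weight relative to lower ranks, so k only matters when the input rankings disagree strongly enough for that weighting to reorder candidates.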
+
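The compound-token idea from the BM25 tokenizer spike can be illustrated with a small tokenizer that emits both the parts and the joined identifier. This is a sketch under assumptions: the regex, the `tokenize` name, and the sample strings are hypothetical and are not taken from trawl. It shows why corpus-side emission cannot help when the query names no symbols: the compound token never appears on the query side.

```python
import re

# A run of word characters, optionally joined by "." or "-".
_IDENT = re.compile(r"[A-Za-z0-9_]+(?:[.\-][A-Za-z0-9_]+)*")


def tokenize(text):
    """Emit lowercased parts, plus the whole compound when joined."""
    tokens = []
    for match in _IDENT.finditer(text):
        word = match.group(0).lower()
        parts = re.split(r"[.\-]", word)
        tokens.extend(parts)
        if len(parts) > 1:
            tokens.append(word)  # e.g. "asyncio.gather"
    return tokens


chunk_tokens = tokenize("await asyncio.gather(*tasks)")
query_tokens = tokenize("send a POST request")

# Terms a sparse ranker could match on; empty for an intent-only query.
shared = set(chunk_tokens) & set(query_tokens)
```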
+### Known caveats
+
+- **Reranker `:8083` intermittently returns HTTP 500** during
+  sweeps (observed across PR #31/#32/#33/#34 measurements). The
+  client falls back to cosine-only scoring per the existing
+  `reranker unavailable, falling back to cosine: ...` log line, so
+  assertions still pass on the 16-pattern slice and on the 15-case
+  parity matrix. Flagged here but not treated as a 0.4.0 gate
+  failure; a separate reliability investigation is queued.
+- **`SHADOW_DOM_UNWRAP_TAGS` allow-list is narrow.** Only
+  `mdn-code-example` ships. Other docs sites that use similar
+  Shadow-DOM wrappers (Docusaurus / GitBook variants) are not yet
+  covered; each addition will come with its own measurement PR.
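The cosine fallback mentioned in the reranker caveat might look something like the sketch below. It is a guess at the shape of that behaviour, not trawl's client code: `rank_chunks`, `cosine`, the `rerank` callable, and the data layout are all hypothetical. Only the log message prefix is taken from the entry.

```python
import logging
import math

log = logging.getLogger("rerank_sketch")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if degenerate)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec, chunks, rerank):
    """Order chunk ids by reranker score, or by cosine if it fails.

    chunks: list of (chunk_id, embedding) pairs.
    rerank: callable returning one score per chunk id; it may raise,
    for example when the reranker answers with HTTP 500.
    """
    ids = [cid for cid, _ in chunks]
    try:
        scores = rerank(ids)
    except Exception as exc:
        log.warning("reranker unavailable, falling back to cosine: %s", exc)
        scores = [cosine(query_vec, emb) for _, emb in chunks]
    # Highest score first; sorted() is stable, so ties keep input order.
    return [cid for _, cid in sorted(zip(scores, ids), key=lambda p: -p[0])]
```

A fallback of this kind keeps results flowing during reranker outages, at the cost of silently changing ranking quality, which is presumably why the entry flags it for a separate reliability investigation.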
+
 ## [0.3.0] — 2026-04-20
 
 Third tagged release. Packs up the six C-series follow-ups and the