Commit 56ed03e

Merge pull request #35 from bbulb/develop
release: trawl 0.4.0 (develop → main)
2 parents 4a1dad1 + 012a49f commit 56ed03e

18 files changed

Lines changed: 5020 additions & 28 deletions

CHANGELOG.md

Lines changed: 128 additions & 0 deletions
@@ -9,6 +9,134 @@ not yet follow semver strictly — expect breaking changes before
 
 _No changes yet._
 
+## [0.4.0] - 2026-04-20
+
+Fourth tagged release. Closes the C6 (hybrid retrieval) follow-up
+chain surfaced in 0.3.0's "retrieval still struggles on
+`code_heavy_query`" note. Headline: Playwright shadow-DOM unwrap for
+code-block custom elements. MDN's post-2024 redesign wraps every
+code example in `<mdn-code-example>` backed by Shadow DOM, which
+Playwright's `page.content()` does not traverse. Inlining the shadow
+`<pre><code>` content before extraction flips
+`claude_code_mdn_fetch_api` to PASS and brings the 16-pattern
+`code_heavy_query` slice to 16/16.
+
+Also includes two Stack Exchange URL corrections (PR #30) and four
+research spikes (PR #29/#31/#32/#33) whose conclusions and runners
+are kept as reusable artefacts. Three of those spikes rejected their
+hypothesis at the pre-registered gate; the measurement discipline
+ended up being as valuable as the one hypothesis that stuck, because
+each rejection narrowed the search to the actual bottleneck (Shadow
+DOM).
+
+### Added
+
+- **Shadow-DOM unwrap for code-block custom elements (default on).**
+  `src/trawl/fetchers/playwright.py` introduces
+  `SHADOW_DOM_UNWRAP_TAGS` (initial allow-list: `mdn-code-example`)
+  and `_unwrap_shadow_dom()`, called between the content-ready wait
+  and `page.content()`. For each matching element, pulls
+  `shadowRoot.querySelector('pre > code').textContent`, HTML-escapes
+  it, and inlines `<pre><code>{text}</code></pre>` into the light
+  DOM so `html_to_markdown` / Trafilatura sees a proper code block.
+  Using `textContent` (rather than the full shadow `innerHTML`)
+  avoids the syntax-highlight `<span>` scaffolding that would
+  otherwise split identifiers like `JSON.stringify` across tag
+  boundaries during markdown conversion. Falls back to the full
+  `shadowRoot.innerHTML` when no `pre > code` exists. Idempotent;
+  JS eval exceptions are swallowed so extraction never fails on
+  account of unwrap.
+  * New env var: `TRAWL_SHADOW_DOM_UNWRAP` (default `"1"`; set to
+    `"0"` to disable).
+  * New module-level constant: `SHADOW_DOM_UNWRAP_TAGS` in
+    `fetchers/playwright.py`. Additions must go through the same
+    measurement gate (fix a specific pattern and not regress the
+    other 15).
+  * Measurement runner: `benchmarks/shadow_dom_sweep.py` (2 modes
+    × 16 patterns × 2 iter + 15-case parity per mode).
+  * Design doc:
+    `docs/superpowers/specs/2026-04-20-playwright-shadow-dom-design.md`.
+  * Measurement: `shadow_dom_off` 15/16 → `shadow_dom_on` 16/16;
+    `flipped_to_pass = [claude_code_mdn_fetch_api]`;
+    `flipped_to_fail = []`; `top1_identity_changed = 1/16` (MDN
+    only, `n_chunks_total` 22 → 24); parity 15/15 both modes;
+    retrieval_ms regression within noise. Raw at
+    `benchmarks/results/shadow-dom-sweep/2026-04-20T10-26-17Z/`
+    (gitignored).
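The unwrap step described in the Added entry can be sketched roughly as follows. This is a minimal illustration, not the code shipped in `src/trawl/fetchers/playwright.py`: the function name `unwrap_shadow_dom_sketch`, the `data-shadow-unwrapped` marker used for idempotency, and building the `<pre><code>` from DOM nodes (so that serialisation performs the HTML escaping) are all assumptions. `page` is assumed to be a Playwright sync-API `Page`; only its `evaluate(expression, arg)` method is used. Note that `shadowRoot` is readable only for open shadow roots.

```python
import os

# Allow-list and env var names come from the changelog; the rest is illustrative.
SHADOW_DOM_UNWRAP_TAGS = ("mdn-code-example",)

# JavaScript run inside the page. It copies shadow-root code into the light DOM
# so that a later page.content() call serialises it.
_UNWRAP_JS = """
(tags) => {
  let count = 0;
  for (const tag of tags) {
    for (const el of document.querySelectorAll(tag)) {
      const root = el.shadowRoot;  // null for closed shadow roots
      if (!root || el.dataset.shadowUnwrapped) continue;  // skip if already done
      const code = root.querySelector('pre > code');
      if (code) {
        const pre = document.createElement('pre');
        const inner = document.createElement('code');
        inner.textContent = code.textContent;  // plain text, escaped on serialise
        pre.appendChild(inner);
        el.replaceChildren(pre);
      } else {
        el.innerHTML = root.innerHTML;  // fallback when no pre > code exists
      }
      el.dataset.shadowUnwrapped = '1';  // hypothetical idempotency marker
      count += 1;
    }
  }
  return count;
}
"""


def unwrap_shadow_dom_sketch(page, tags=SHADOW_DOM_UNWRAP_TAGS, environ=os.environ):
    """Inline shadow-root code blocks; return how many elements were unwrapped."""
    # Default is on; only the literal "0" disables the step.
    if environ.get("TRAWL_SHADOW_DOM_UNWRAP", "1") == "0":
        return 0
    try:
        return int(page.evaluate(_UNWRAP_JS, list(tags)))
    except Exception:
        # Swallow JS evaluation errors so extraction never fails on unwrap.
        return 0
```

In a fetcher this would be called once the page is ready and before `page.content()`, so the serialised HTML already carries the inlined code blocks.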
+
+### Fixed
+
+- **Two Stack Exchange `code_heavy_query` URLs resolved to
+  unrelated questions.** `claude_code_serverfault_nginx_reverse_proxy`
+  pointed at `serverfault.com/questions/378860` (resolves to an
+  apache-vhosts / cookie question, not the nginx reverse-proxy Host
+  header question). `claude_code_stackoverflow_python_async_subprocess`
+  pointed at SO #44488350, which is an *answer* ID whose parent
+  question is about CSV escaping. Stack Exchange resolves by ID
+  alone, ignoring the slug, so both patterns had been failing
+  against content unrelated to their query since the coding shard
+  was introduced. Replaced with the canonical questions (SF #87056
+  and SO #42639984); both flip to PASS. Not an extraction defect:
+  `benchmarks/stackexchange_extraction_diag.py` confirmed trawl's
+  extraction was intact. Also removes a duplicate argparse flag
+  registration in `tests/test_agent_patterns.py` left behind by a
+  stack-merge union resolver.
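Because the slug is ignored, a readable URL is no evidence that it points at the intended question. A small check along these lines could compare the numeric ID in a pattern URL against the ID the author meant. This is only a sketch: `stack_exchange_post_id` is a hypothetical helper, not part of trawl, and the slugs in the usage note are made up for illustration.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches /questions/<id>, /q/<id> and /a/<id>, with or without a trailing slug.
_POST_ID_RE = re.compile(r"^/(?:questions|q|a)/(\d+)(?:/|$)")


def stack_exchange_post_id(url: str) -> Optional[int]:
    """Return the numeric post ID in a Stack Exchange URL, ignoring the slug."""
    match = _POST_ID_RE.match(urlparse(url).path)
    return int(match.group(1)) if match else None


def points_at(url: str, expected_id: int) -> bool:
    """True only when the URL's ID is the one the test pattern intends."""
    return stack_exchange_post_id(url) == expected_id
```

For example, `points_at("https://serverfault.com/questions/378860/nginx-reverse-proxy-host-header", 87056)` is false despite the plausible-looking slug, which is the shape of the mistake fixed above.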
+
+### Research (no code change, shipped as reusable runners + design docs)
+
+- **C6 RRF-k tuning spike** (PR #29). Measured
+  `TRAWL_HYBRID_RRF_K ∈ {10, 30, 60, 100}` on the 16
+  `code_heavy_query` patterns with hybrid retrieval on. All four k
+  values produced identical assertion pass rate and identical top-1
+  reshuffles across three patterns; the reranker stabilises the
+  pre-rerank ordering, so RRF k is effectively invisible
+  downstream. Gate (b): retain `k=60`. Runner:
+  `benchmarks/c6_rrf_k_sweep.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-c6-rrf-k-tuning-design.md`.
+- **Identifier-aware BM25 tokenizer spike** (PR #31). Hypothesised
+  that emitting compound tokens for dotted (`asyncio.gather`) /
+  hyphenated (`Content-Type`) identifiers would let the sparse
+  ranker boost code-heavy chunks. Measurement (3 modes × 16
+  patterns): `net_assertion_delta = 0`, `top1_identity_changed =
+  0/16`. Corpus-side compound emission alone is insufficient when
+  queries don't contain the compound identifier (the MDN query
+  describes intent, "send a POST request", not symbols). Gate
+  (b). Runner: `benchmarks/bm25_id_aware_sweep.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-bm25-id-aware-tokenizer-design.md`.
+- **HyDE → BM25 query spike** (PR #32). Hypothesised that the
+  HyDE hypothetical answer (which does emit compound identifiers
+  under the current Gemma prompt) could feed the sparse query if
+  routed into BM25 in addition to the dense path. Measurement (3
+  modes × 16 patterns): `net_delta = 0`. HyDE produced the right
+  identifiers, but the MDN failure survived because, as the next
+  spike proved, the underlying chunks didn't contain those
+  identifiers in the first place (they were in Shadow DOM). Gate
+  (b). Runner: `benchmarks/hyde_compound_id_sweep.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-hyde-compound-identifier-design.md`.
+- **MDN reranker diagnostic** (PR #33). One-shot diagnostic to
+  locate the MDN assertion-keyword chunk's rank across raw /
+  reranked / HyDE modes. Found the keyword chunk at rank 14 even
+  in `raw` mode (no reranker), so the reranker was not the bottleneck.
+  Direct inspection of the HTML returned by Playwright showed 23
+  `<mdn-code-example>` tags with `innerHTML`-empty light DOM; the
+  real code lived in Shadow DOM. Decision hint `D1`, which set up
+  PR #34. Runner: `benchmarks/mdn_reranker_diag.py`; design doc:
+  `docs/superpowers/specs/2026-04-20-mdn-reranker-diagnostic-design.md`.
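Two of the ideas measured in these spikes are easy to picture in code. The sketch below shows textbook reciprocal rank fusion, where each document scores the sum of `1 / (k + rank)` over the rankings it appears in, and a toy identifier-aware tokenizer that emits compound tokens alongside plain words. Neither is trawl's implementation: `rrf_fuse`, `tokenize_id_aware` and both regular expressions are hypothetical names and choices made for illustration.

```python
import re
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion (1-based ranks)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Plain word tokens, plus dotted / hyphenated compounds such as
# asyncio.gather or Content-Type.
_WORD_RE = re.compile(r"[A-Za-z0-9_]+")
_COMPOUND_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:[.-][A-Za-z_][A-Za-z0-9_]*)+")


def tokenize_id_aware(text):
    """Lower-cased word tokens followed by any compound identifier tokens."""
    tokens = [word.lower() for word in _WORD_RE.findall(text)]
    tokens += [compound.lower() for compound in _COMPOUND_RE.findall(text)]
    return tokens
```

Tokenizing an intent-style query such as "send a POST request" yields no compound tokens at all, so compound tokens emitted on the corpus side have nothing to match against, which is consistent with the zero delta reported for PR #31.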
+
+### Known caveats
+
+- **Reranker `:8083` intermittently returns HTTP 500** during
+  sweeps (observed across PR #31/#32/#33/#34 measurements). The
+  client falls back to cosine-only scoring per the existing
+  `reranker unavailable, falling back to cosine: ...` log line, so
+  assertions still pass on the 16-pattern slice and on the 15-case
+  parity matrix. Flagged here but not treated as a 0.4.0 gate
+  failure; a separate reliability investigation is queued.
+- **`SHADOW_DOM_UNWRAP_TAGS` allow-list is narrow.** Only
+  `mdn-code-example` ships. Other docs sites that use similar
+  Shadow-DOM wrappers (Docusaurus / GitBook variants) are not yet
+  covered; each addition will come with its own measurement PR.
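The fallback behaviour in the first caveat follows a common pattern: try the reranker, and on any failure log and score by cosine similarity instead. The sketch below assumes a hypothetical `rerank` callable standing in for the HTTP client; only the log message text is taken from the changelog, and `score_chunks_sketch` is not a trawl function.

```python
import logging
import math

logger = logging.getLogger("trawl.sketch")  # hypothetical logger name


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_chunks_sketch(query_vec, chunk_vecs, rerank=None):
    """Return one score per chunk, preferring the reranker when it responds.

    `rerank` is an assumed zero-argument callable that returns a list of
    scores or raises (for example on an HTTP 500 from the reranker service).
    """
    if rerank is not None:
        try:
            return list(rerank())
        except Exception as exc:
            logger.warning("reranker unavailable, falling back to cosine: %s", exc)
    return [cosine(query_vec, vec) for vec in chunk_vecs]
```

With this shape a transient 500 degrades ranking quality rather than failing the fetch, which matches the observation that assertions still pass during sweeps.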
+
 ## [0.3.0] - 2026-04-20
 
 Third tagged release. Packs up the six C-series follow-ups and the

CLAUDE.md

Lines changed: 27 additions & 6 deletions
@@ -12,12 +12,15 @@ trawl directory. Humans should read `README.md` first, then
 
 ## Current status
 
-- **Version**: 0.3.0 (2026-04-20). Highlights since the narrow
-  `v0.2.0` tag (raw passthrough / Docker / WCXB, 2026-04-15): C6
-  BM25 hybrid retrieval (opt-in), C7 PDF HEAD probe, C8 per-fetch
-  cache (default on), C9 per-host adaptive ceiling (default on),
-  C16 compositional payload enrichment, longform chunk budget
-  prefilter (opt-in). Full list in `CHANGELOG.md`.
+- **Version**: 0.4.0 (2026-04-20). Highlights since `v0.3.0`:
+  shadow-DOM unwrap for code-block custom elements (default on;
+  MDN-style `<mdn-code-example>` pages now extract their code
+  bodies), Stack Exchange URL corrections (SO/SF patterns resolved
+  to unrelated questions pre-fix), and four research spikes whose
+  measurements are preserved as reusable runners (C6 RRF-k,
+  id-aware BM25 tokenizer, HyDE → BM25 extras, MDN reranker
+  diagnostic). C6 follow-up chain closed at `code_heavy_query`
+  16/16. Full list in `CHANGELOG.md`.
 - **Parity matrix**: 15/15 cases pass (see `tests/test_cases.yaml`).
   `kbo_schedule` pinned to a historical game day to survive KBO
   off-days.
@@ -129,6 +132,24 @@ trawl directory. Humans should read `README.md` first, then
   identity preserved. `PipelineResult.n_chunks_embedded` reports the
   post-prefilter count. See
   `docs/superpowers/specs/2026-04-20-longform-retrieval-cost-design.md`.
+- **Shadow-DOM unwrap for code-block custom elements** (default on).
+  `fetchers/playwright.py` inlines each matching element's
+  `shadowRoot`'s `pre > code` `textContent` (wrapped in a fresh
+  `<pre><code>`) into the light DOM before `page.content()`.
+  Initial allow-list: `mdn-code-example`. Playwright's default
+  content capture skips shadow roots, so pages that render code in
+  Shadow DOM (notably MDN post-2024 redesign) previously fed the
+  extractor empty `<mdn-code-example></mdn-code-example>` tags;
+  the MDN fetch pattern's assertion keywords (`JSON.stringify`,
+  `method:`, `application/json`) were simply not in the extracted
+  markdown. Measurement on the 16 `code_heavy_query` patterns:
+  baseline 15/16 → on 16/16 (`claude_code_mdn_fetch_api` flipped
+  to PASS), top1 changed on 1/16 (MDN only, `n_chunks_total`
+  22 → 24), parity 15/15 in both modes. Disable via
+  `TRAWL_SHADOW_DOM_UNWRAP=0`. Grow `SHADOW_DOM_UNWRAP_TAGS` only
+  with a companion measurement: each addition must fix a specific
+  pattern and not regress the other 15. See
+  `docs/superpowers/specs/2026-04-20-playwright-shadow-dom-design.md`.
 
 ## Quick Reference
 