Commit 1a674f9

release: v0.2.0
Release v0.2.0: raw passthrough + Docker cleanup + WCXB benchmark
2 parents 92ddd7a + 8c933c3 commit 1a674f9

39 files changed: 10,572 additions & 66 deletions

.env.example

Lines changed: 11 additions & 3 deletions

@@ -6,7 +6,7 @@
 # ---- Embeddings (required for retrieval) ----
 # bge-m3 served by llama-server with --embeddings
 TRAWL_EMBED_URL=http://localhost:8081/v1
-TRAWL_EMBED_MODEL=bge-m3-Q8_0.gguf
+TRAWL_EMBED_MODEL=bge-m3

 # ---- Cross-encoder reranker (optional; falls back to cosine-only) ----
 # bge-reranker-v2-m3 served by llama-server with --reranking --pooling rank
@@ -16,11 +16,12 @@ TRAWL_RERANK_MODEL=bge-reranker-v2-m3
 # ---- HyDE query expansion (optional; off by default) ----
 # Small utility LLM (e.g. Gemma 4B)
 TRAWL_HYDE_URL=http://localhost:8082/v1
-TRAWL_HYDE_MODEL=gemma-4-E4B-it-Q8_0.gguf
+TRAWL_HYDE_MODEL=gemma
 # Pin to a specific llama-server slot for KV-cache reuse (optional)
 # TRAWL_HYDE_SLOT=1

-# ---- VLM page profiling (optional; required only for profile_page) ----
+# ---- VLM page profiling (required only for profile_page) ----
+# When unset, the MCP server hides profile_page from its tool list.
 TRAWL_VLM_URL=http://localhost:8080/v1
 TRAWL_VLM_MODEL=gemma
 TRAWL_VLM_TIMEOUT=120
@@ -34,3 +35,10 @@ TRAWL_VLM_MAX_TOKENS=2048
 # ---- Benchmark only (benchmarks/run_benchmark.py) ----
 # Get a free key at https://jina.ai/reader/
 # JINA_API_KEY=
+
+# ---- Raw passthrough (optional; off by default) ----
+# Hard cap on raw-passthrough response size in bytes. When fetch_page
+# receives JSON/XML/RSS/Atom, the body is returned as-is (no extraction,
+# no chunking, no embedding) up to this many bytes. Default: 262144
+# (256 KB ≈ 64K tokens — fits most local LLM context windows).
+# TRAWL_PASSTHROUGH_MAX_BYTES=262144

.gitignore

Lines changed: 6 additions & 0 deletions

@@ -18,3 +18,9 @@ tests/results/*.json

 # benchmark outputs
 benchmarks/results/
+
+# WCXB benchmark data (downloaded, not redistributed)
+benchmarks/wcxb/data/
+
+# git worktrees
+.worktrees/

CLAUDE.md

Lines changed: 15 additions & 0 deletions

@@ -76,6 +76,11 @@ trawl directory. Humans should read `README.md` first, then
   pin requests to a specific llama-server slot (via `id_slot`) to
   avoid evicting other consumers' KV cache on shared servers with
   prompt caching.
+- **Raw passthrough** — JSON/XML/RSS/Atom responses are returned as-is
+  without extraction. URL suffixes (`.json`, `.xml`, `.rss`, `.atom`)
+  take an httpx fast path; suffix-less API endpoints are detected by
+  response `Content-Type`. Byte cap via `TRAWL_PASSTHROUGH_MAX_BYTES`
+  (default 256 KB).

 ## Quick Reference

@@ -117,6 +122,9 @@ from trawl import fetch_relevant
 r = fetch_relevant('https://example.com/', 'what is this')
 print(r.chunks)
 "
+
+# WCXB external extraction benchmark (one-shot)
+python benchmarks/wcxb/fetch.py && python benchmarks/wcxb/run.py
 ```

 ## Architecture pointer
@@ -170,6 +178,12 @@ benchmarks/
   run_benchmark.py        trawl (base/profile/cached) vs Jina runner
   profile_eval_cases.yaml 36 cases for VLM profile eval
   profile_eval.py         profile generation quality evaluator
+  wcxb/                   external WCXB extraction benchmark (Phase 1)
+    fetch.py              snapshot download + hash verify
+    run.py                runner (trawl + Trafilatura baseline)
+    aggregate.py          summary + report rendering
+    evaluate.py           vendored WCXB word-F1 evaluator
+    manifest.json         pinned SHA-256 manifest of dev split
   results/                gitignored benchmark outputs

 examples/
@@ -216,6 +230,7 @@ change them, run `tests/test_pipeline.py` before AND after.
 | `pipeline.py retrieve_k multiplier` | `2` | Retrieves 2x candidates for reranking; fewer reduces rerank benefit, more adds latency |
 | `profiles/mapper.py DEFAULT_MAX_CANDIDATES_PER_ANCHOR` | `5` | Enough headroom to find non-noise candidates after sidebar/nav filtering |
 | `profiles/mapper.py NOISE_CLS_RE` | `nav\|sidebar\|toc\|...` | Noise region detection for anchor filtering; too broad catches content, too narrow misses sidebars |
+| `fetchers/passthrough.py` | `PASSTHROUGH_MAX_BYTES` env default `262144` | 256 KB ≈ 64K tokens; weather-like APIs fit, larger than local LLM contexts |

 ## In / out of scope
221236

Dockerfile

Lines changed: 28 additions & 14 deletions

@@ -5,25 +5,39 @@ FROM mcr.microsoft.com/playwright/python:v1.47.0-jammy
 WORKDIR /app

 # Install Python deps first so source changes do not invalidate the dep layer.
+# Stub packages let `pip install -e .` resolve deps before real source is copied.
 COPY pyproject.toml README.md ./
-COPY src ./src
+RUN mkdir -p src/trawl src/trawl_mcp && \
+    touch src/trawl/__init__.py src/trawl_mcp/__init__.py && \
+    pip install --no-cache-dir -e .

-RUN pip install --no-cache-dir -e . && \
-    playwright install --with-deps chromium
+# Real source — only this layer rebuilds on code changes.
+COPY src ./src

-# Playwright browsers are installed above under the default path.
+# Chromium + runtime libs are already in the base image at /ms-playwright.
 ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright

-# trawl expects these at runtime. Compose overrides these at service
-# definition time to point at the host llama-servers.
-ENV TRAWL_EMBED_URL=http://host.docker.internal:8081/v1
-ENV TRAWL_EMBED_MODEL=bge-m3-Q8_0.gguf
-ENV TRAWL_RERANK_URL=http://host.docker.internal:8083/v1
-ENV TRAWL_RERANK_MODEL=bge-reranker-v2-m3
-ENV TRAWL_HYDE_URL=http://host.docker.internal:8082/v1
-ENV TRAWL_HYDE_MODEL=gemma-4-E4B-it-Q8_0.gguf
-ENV TRAWL_VLM_URL=http://host.docker.internal:8080/v1
-ENV TRAWL_VLM_MODEL=gemma
+# trawl runtime config — inject via `docker run -e ...`, compose
+# `environment:`, or `--env-file .env`. Not baked into the image so the
+# same image works across local-dev, LAN llama-servers, and remote hosts.
+# See .env.example for the full list.
+#
+# Required:
+#   TRAWL_EMBED_URL     e.g. http://host.docker.internal:8081/v1
+#   TRAWL_EMBED_MODEL   e.g. bge-m3
+#
+# Optional (feature degrades or is unused when absent):
+#   TRAWL_RERANK_URL / TRAWL_RERANK_MODEL — cross-encoder reranker;
+#     falls back to cosine-only
+#   TRAWL_HYDE_URL / TRAWL_HYDE_MODEL — HyDE query expansion (off by default)
+#   TRAWL_VLM_URL / TRAWL_VLM_MODEL — required for profile_page;
+#     unset = tool hidden from MCP list
+#   TRAWL_PASSTHROUGH_MAX_BYTES — default 262144 (256 KB)
+#   TRAWL_HYDE_SLOT / TRAWL_VLM_SLOT — llama-server slot pinning
+#
+# Profile/visit cache is persisted at /root/.cache/trawl via VOLUME below.
+# Mount from host to retain state across container lifecycle:
+#   docker run -v ~/.cache/trawl:/root/.cache/trawl ...

 EXPOSE 8765
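Since configuration is no longer baked into the image, a compose file becomes the natural place for it. A hypothetical `environment:` wiring for the two required variables plus the cache mount; the service name, image tag, and host-gateway mapping are illustrative assumptions, not taken from the repo's compose file:

```yaml
services:
  trawl:
    image: trawl:latest                     # illustrative tag
    ports:
      - "8765:8765"                         # matches EXPOSE above
    environment:
      TRAWL_EMBED_URL: http://host.docker.internal:8081/v1
      TRAWL_EMBED_MODEL: bge-m3
    volumes:
      - ~/.cache/trawl:/root/.cache/trawl   # persist profile/visit cache
    extra_hosts:
      - "host.docker.internal:host-gateway" # needed on Linux engines
```

On Docker Desktop, `host.docker.internal` resolves out of the box; the `extra_hosts` entry is only needed on plain Linux Docker engines.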

README.md

Lines changed: 15 additions & 0 deletions

@@ -50,6 +50,21 @@ that matter for your query, at ~1k tokens of output.
 trawl wins on every token-efficiency axis and runs entirely on your
 own infrastructure. In exchange you pay a real cost elsewhere:

+### External: WCXB dev (1,497 pages)
+
+Beyond the internal 12-case parity matrix, trawl's extraction stage is
+cross-validated against the [WCXB](https://github.com/Murrough-Foley/web-content-extraction-benchmark)
+public benchmark (CC-BY-4.0, 1,497 dev pages across 7 page types).
+
+| Extractor                      | F1    |
+|--------------------------------|-------|
+| trawl (`html_to_markdown`)     | 0.777 |
+| Trafilatura (same environment) | 0.750 |
+
+Per-page-type breakdown and error counts: see
+[`benchmarks/wcxb/README.md`](benchmarks/wcxb/README.md) and run the
+benchmark locally to regenerate.
+
 ### When *not* to use trawl

 - **You want the whole page verbatim.** Selective retrieval is the
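The F1 figures in the added table come from the vendored WCXB word-F1 evaluator (`benchmarks/wcxb/evaluate.py`). As a rough sketch of the metric only — a simplified stand-in that skips the real evaluator's tokenization and normalization details — word F1 compares extracted and gold text as word multisets:

```python
from collections import Counter

def word_f1(predicted: str, reference: str) -> float:
    """F1 over whitespace-token multisets. Simplified illustration of
    the word-F1 idea, NOT the vendored WCXB evaluator's implementation."""
    pred = Counter(predicted.split())
    ref = Counter(reference.split())
    overlap = sum((pred & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Under a metric of this shape, 0.777 vs 0.750 means trawl's extractor recovers slightly more gold-text words per page (with less spurious text) than the Trafilatura baseline run in the same environment.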
