Merged · 32 commits
- `92e2e92` chore: ignore .worktrees/ directory (Apr 14, 2026)
- `f79d887` docs: add RESEARCH.md (C1-C5 improvement candidates) (Apr 14, 2026)
- `69c2cf8` docs(wcxb): add design spec and implementation plan with Task 0 findings (Apr 14, 2026)
- `3ca1fdc` feat(wcxb): vendor WCXB evaluate.py with attribution and unit tests (Apr 14, 2026)
- `f2ef571` test(wcxb): add synthetic fixtures (article/product/empty) (Apr 14, 2026)
- `87a33fa` feat(wcxb): add single-page trawl evaluation function (Apr 14, 2026)
- `e348310` feat(wcxb): add Trafilatura baseline path and snippet hit counts (Apr 14, 2026)
- `80b440a` feat(wcxb): add aggregation and markdown report rendering (Apr 14, 2026)
- `8c14106` feat(wcxb): add run_all orchestrator, argparse CLI, progress logging (Apr 14, 2026)
- `0b73bd6` feat(wcxb): add Trafilatura default-mode sanity field (Apr 14, 2026)
- `a320865` feat(wcxb): add fetch.py with pinned manifest for dev split (Apr 14, 2026)
- `2eee073` docs(wcxb): add README, gitignore data dir, register in CLAUDE.md (Apr 14, 2026)
- `449deea` docs: report external WCXB dev F1 in README (Apr 14, 2026)
- `e8015ff` docs(wcxb): fix sanity-check target from article-only 0.958 to dev-to… (Apr 14, 2026)
- `69d0358` docs(late-chunking): add brainstorm-approved design spec (Apr 14, 2026)
- `e4ba66a` docs(late-chunking): add implementation plan (7 tasks) (Apr 14, 2026)
- `38391fa` docs: mark C1 late-chunking as rejected with NO-GO summary, update re… (Apr 14, 2026)
- `78d8aeb` docs(passthrough): add raw passthrough design for JSON/XML responses (Apr 15, 2026)
- `4ed90d5` docs(passthrough): add implementation plan (9 TDD tasks) (Apr 15, 2026)
- `62bb80a` feat(pipeline): add content_type and truncated fields to PipelineResult (Apr 15, 2026)
- `7c54eb4` feat(passthrough): add URL and Content-Type detection predicates (Apr 15, 2026)
- `3d8d678` feat(passthrough): add httpx-based fetch with streaming + byte cap (Apr 15, 2026)
- `7631ecb` chore(docker): drop redundant playwright install, fix layer caching (Apr 15, 2026)
- `8246828` feat(playwright): capture response Content-Type on fetch (Apr 15, 2026)
- `6535d8b` feat(pipeline): short-circuit to raw passthrough for structured data … (Apr 15, 2026)
- `81529f6` feat(pipeline): detect passthrough via Playwright Content-Type post-c… (Apr 15, 2026)
- `d6d736e` test(mcp): exercise raw-passthrough path via stdio (Apr 15, 2026)
- `23c66de` docs: document TRAWL_PASSTHROUGH_MAX_BYTES and passthrough behaviour (Apr 15, 2026)
- `87019fa` Merge branch 'feat/passthrough' into develop (Apr 15, 2026)
- `5217941` refactor: genericize model defaults; hide profile_page without TRAWL_… (Apr 15, 2026)
- `621da62` docs(docker): inject runtime config via -e/compose, remove hardcoded ENV (Apr 15, 2026)
- `8c933c3` chore: bump version to 0.2.0 (Apr 15, 2026)
14 changes: 11 additions & 3 deletions .env.example
@@ -6,7 +6,7 @@
# ---- Embeddings (required for retrieval) ----
# bge-m3 served by llama-server with --embeddings
TRAWL_EMBED_URL=http://localhost:8081/v1
-TRAWL_EMBED_MODEL=bge-m3-Q8_0.gguf
+TRAWL_EMBED_MODEL=bge-m3

# ---- Cross-encoder reranker (optional; falls back to cosine-only) ----
# bge-reranker-v2-m3 served by llama-server with --reranking --pooling rank
@@ -16,11 +16,12 @@ TRAWL_RERANK_MODEL=bge-reranker-v2-m3
# ---- HyDE query expansion (optional; off by default) ----
# Small utility LLM (e.g. Gemma 4B)
TRAWL_HYDE_URL=http://localhost:8082/v1
-TRAWL_HYDE_MODEL=gemma-4-E4B-it-Q8_0.gguf
+TRAWL_HYDE_MODEL=gemma
# Pin to a specific llama-server slot for KV-cache reuse (optional)
# TRAWL_HYDE_SLOT=1

-# ---- VLM page profiling (optional; required only for profile_page) ----
+# ---- VLM page profiling (required only for profile_page) ----
+# When unset, the MCP server hides profile_page from its tool list.
TRAWL_VLM_URL=http://localhost:8080/v1
TRAWL_VLM_MODEL=gemma
TRAWL_VLM_TIMEOUT=120
@@ -34,3 +35,10 @@ TRAWL_VLM_MAX_TOKENS=2048
# ---- Benchmark only (benchmarks/run_benchmark.py) ----
# Get a free key at https://jina.ai/reader/
# JINA_API_KEY=

# ---- Raw passthrough (optional; off by default) ----
# Hard cap on raw-passthrough response size in bytes. When fetch_page
# receives JSON/XML/RSS/Atom, the body is returned as-is (no extraction,
# no chunking, no embedding) up to this many bytes. Default: 262144
# (256 KB ≈ 64K tokens — fits most local LLM context windows).
# TRAWL_PASSTHROUGH_MAX_BYTES=262144
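The byte cap described in the comment above can be sketched as a stream accumulator that stops at the limit and reports truncation. This is an illustrative sketch, not trawl's actual implementation; `cap_stream` and `DEFAULT_CAP` are assumed names:

```python
import os

DEFAULT_CAP = 262_144  # 256 KB, matching the documented default

def cap_stream(chunks, max_bytes=None):
    """Accumulate streamed byte chunks up to a hard cap.

    Returns (body, truncated): the capped body and whether any
    bytes past the cap were discarded.
    """
    cap = max_bytes if max_bytes is not None else int(
        os.environ.get("TRAWL_PASSTHROUGH_MAX_BYTES", DEFAULT_CAP)
    )
    buf = bytearray()
    truncated = False
    for chunk in chunks:
        remaining = cap - len(buf)
        if remaining <= 0:
            truncated = True
            break
        # Keep only the portion that fits under the cap.
        buf.extend(chunk[:remaining])
        if len(chunk) > remaining:
            truncated = True
            break
    return bytes(buf), truncated
```

A streaming HTTP client (the PR mentions httpx) would feed its chunk iterator into such a function, so the full body never needs to be buffered past the cap.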
6 changes: 6 additions & 0 deletions .gitignore
@@ -18,3 +18,9 @@ tests/results/*.json

# benchmark outputs
benchmarks/results/

# WCXB benchmark data (downloaded, not redistributed)
benchmarks/wcxb/data/

# git worktrees
.worktrees/
15 changes: 15 additions & 0 deletions CLAUDE.md
@@ -76,6 +76,11 @@ trawl directory. Humans should read `README.md` first, then
pin requests to a specific llama-server slot (via `id_slot`) to
avoid evicting other consumers' KV cache on shared servers with
prompt caching.
- **Raw passthrough** — JSON/XML/RSS/Atom responses are returned as-is
without extraction. URL suffixes (`.json`, `.xml`, `.rss`, `.atom`)
take an httpx fast path; suffix-less API endpoints are detected by
response `Content-Type`. Byte cap via `TRAWL_PASSTHROUGH_MAX_BYTES`
(default 256 KB).

## Quick Reference

@@ -117,6 +122,9 @@ from trawl import fetch_relevant
r = fetch_relevant('https://example.com/', 'what is this')
print(r.chunks)
"

# WCXB external extraction benchmark (one-shot)
python benchmarks/wcxb/fetch.py && python benchmarks/wcxb/run.py
```

## Architecture pointer
@@ -170,6 +178,12 @@ benchmarks/
run_benchmark.py trawl (base/profile/cached) vs Jina runner
profile_eval_cases.yaml 36 cases for VLM profile eval
profile_eval.py profile generation quality evaluator
wcxb/ external WCXB extraction benchmark (Phase 1)
fetch.py snapshot download + hash verify
run.py runner (trawl + Trafilatura baseline)
aggregate.py summary + report rendering
evaluate.py vendored WCXB word-F1 evaluator
manifest.json pinned SHA-256 manifest of dev split
results/ gitignored benchmark outputs

examples/
@@ -216,6 +230,7 @@ change them, run `tests/test_pipeline.py` before AND after.
| `pipeline.py retrieve_k multiplier` | `2` | Retrieves 2x candidates for reranking; fewer reduces rerank benefit, more adds latency |
| `profiles/mapper.py DEFAULT_MAX_CANDIDATES_PER_ANCHOR` | `5` | Enough headroom to find non-noise candidates after sidebar/nav filtering |
| `profiles/mapper.py NOISE_CLS_RE` | `nav\|sidebar\|toc\|...` | Noise region detection for anchor filtering; too broad catches content, too narrow misses sidebars |
| `fetchers/passthrough.py` | `PASSTHROUGH_MAX_BYTES` env default `262144` | 256 KB ≈ 64K tokens; weather-like API payloads fit, anything larger exceeds local LLM contexts |

## In / out of scope

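The two detection paths named in the CLAUDE.md raw-passthrough bullet above (URL-suffix fast path vs. post-fetch `Content-Type`) can be sketched as predicates. Function names and the exact media-type list are assumptions for illustration, not trawl's real `fetchers/passthrough.py` API:

```python
from urllib.parse import urlsplit

PASSTHROUGH_SUFFIXES = (".json", ".xml", ".rss", ".atom")
PASSTHROUGH_TYPES = {
    "application/json", "application/xml", "text/xml",
    "application/rss+xml", "application/atom+xml",
}

def is_passthrough_url(url: str) -> bool:
    """Fast path: the URL path suffix alone decides, no fetch needed."""
    return urlsplit(url).path.lower().endswith(PASSTHROUGH_SUFFIXES)

def is_passthrough_content_type(content_type: str) -> bool:
    """Post-fetch path for suffix-less API endpoints: match the media
    type, ignoring parameters like `; charset=utf-8`."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in PASSTHROUGH_TYPES
```

Using `urlsplit(...).path` rather than the raw URL keeps query strings like `?format=full` from defeating the suffix check.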
42 changes: 28 additions & 14 deletions Dockerfile
@@ -5,25 +5,39 @@ FROM mcr.microsoft.com/playwright/python:v1.47.0-jammy
WORKDIR /app

# Install Python deps first so source changes do not invalidate the dep layer.
# Stub packages let `pip install -e .` resolve deps before real source is copied.
COPY pyproject.toml README.md ./
COPY src ./src
RUN mkdir -p src/trawl src/trawl_mcp && \
touch src/trawl/__init__.py src/trawl_mcp/__init__.py && \
pip install --no-cache-dir -e .

RUN pip install --no-cache-dir -e . && \
playwright install --with-deps chromium
# Real source — only this layer rebuilds on code changes.
COPY src ./src

# Playwright browsers are installed above under the default path.
# Chromium + runtime libs are already in the base image at /ms-playwright.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright

# trawl expects these at runtime. Compose overrides these at service
# definition time to point at the host llama-servers.
ENV TRAWL_EMBED_URL=http://host.docker.internal:8081/v1
ENV TRAWL_EMBED_MODEL=bge-m3-Q8_0.gguf
ENV TRAWL_RERANK_URL=http://host.docker.internal:8083/v1
ENV TRAWL_RERANK_MODEL=bge-reranker-v2-m3
ENV TRAWL_HYDE_URL=http://host.docker.internal:8082/v1
ENV TRAWL_HYDE_MODEL=gemma-4-E4B-it-Q8_0.gguf
ENV TRAWL_VLM_URL=http://host.docker.internal:8080/v1
ENV TRAWL_VLM_MODEL=gemma
# trawl runtime config — inject via `docker run -e ...`, compose
# `environment:`, or `--env-file .env`. Not baked into the image so the
# same image works across local-dev, LAN llama-servers, and remote hosts.
# See .env.example for the full list.
#
# Required:
# TRAWL_EMBED_URL e.g. http://host.docker.internal:8081/v1
# TRAWL_EMBED_MODEL e.g. bge-m3
#
# Optional (feature degrades or is unused when absent):
# TRAWL_RERANK_URL / TRAWL_RERANK_MODEL — cross-encoder reranker;
# falls back to cosine-only
# TRAWL_HYDE_URL / TRAWL_HYDE_MODEL — HyDE query expansion (off by default)
# TRAWL_VLM_URL / TRAWL_VLM_MODEL — required for profile_page;
# unset = tool hidden from MCP list
# TRAWL_PASSTHROUGH_MAX_BYTES — default 262144 (256 KB)
# TRAWL_HYDE_SLOT / TRAWL_VLM_SLOT — llama-server slot pinning
#
# Profile/visit cache is persisted at /root/.cache/trawl via VOLUME below.
# Mount from host to retain state across container lifecycle:
# docker run -v ~/.cache/trawl:/root/.cache/trawl ...

EXPOSE 8765

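The runtime-injection pattern the Dockerfile comments above describe looks like this in practice; the image tag and port mapping are assumptions, not taken from the PR:

```shell
# Inject required config at run time instead of baking it into the image;
# mount the cache dir so profile/visit state survives container restarts.
docker run --rm -p 8765:8765 \
  -e TRAWL_EMBED_URL=http://host.docker.internal:8081/v1 \
  -e TRAWL_EMBED_MODEL=bge-m3 \
  -v ~/.cache/trawl:/root/.cache/trawl \
  trawl:0.2.0
```

Equivalently, `--env-file .env` passes the whole `.env.example`-style file at once, which keeps compose and plain `docker run` invocations in sync.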
15 changes: 15 additions & 0 deletions README.md
@@ -50,6 +50,21 @@ that matter for your query, at ~1k tokens of output.
trawl wins on every token-efficiency axis and runs entirely on your
own infrastructure. In exchange you pay a real cost elsewhere:

### External: WCXB dev (1,497 pages)

Beyond the internal 12-case parity matrix, trawl's extraction stage is
cross-validated against the [WCXB](https://github.com/Murrough-Foley/web-content-extraction-benchmark)
public benchmark (CC-BY-4.0, 1,497 dev pages across 7 page types).

| Extractor | F1 |
|-----------------------------------|--------|
| trawl (`html_to_markdown`) | 0.777 |
| Trafilatura (same environment) | 0.750 |

Per-page-type breakdown and error counts: see
[`benchmarks/wcxb/README.md`](benchmarks/wcxb/README.md) and run the
benchmark locally to regenerate.

### When *not* to use trawl

- **You want the whole page verbatim.** Selective retrieval is the
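For context on the README F1 table above: extraction benchmarks in the WCXB style score each page with word-level F1 between the extracted text and the gold text. A minimal sketch of that metric (this is illustrative, not the vendored `evaluate.py`):

```python
from collections import Counter

def word_f1(predicted: str, gold: str) -> float:
    """Word-level F1: harmonic mean of precision and recall over
    whitespace-tokenized word multisets."""
    pred = Counter(predicted.split())
    ref = Counter(gold.split())
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Over a dataset, the per-page scores are averaged, which is why an extractor that over-extracts boilerplate (hurting precision) or drops body text (hurting recall) loses F1 even when the main article is found.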