Merged
25 commits
fa597d6
chore: untrack RESEARCH.md; add notes/ as gitignored scratch dir
Apr 15, 2026
153e308
docs: spec for C3 reranker title-injection spike
Apr 15, 2026
50fc968
docs: implementation plan for C3 reranker title-injection spike
Apr 15, 2026
245c60b
feat(extraction): add extract_title() helper for reranker title injec…
Apr 15, 2026
d89d124
fix(extraction): use get_text() for titles with nested tags
Apr 15, 2026
5f2537e
feat(reranking): title + section fields in reranker input, flag-gated
Apr 15, 2026
b7d6778
feat(pipeline): thread page_title into reranker; expose on PipelineRe…
Apr 15, 2026
a28e453
docs: document TRAWL_RERANK_INCLUDE_TITLE flag
Apr 15, 2026
a79e37c
docs: telemetry spec for C4 prerequisite data collection
Apr 15, 2026
fcbf11f
docs: C4 telemetry implementation plan
Apr 15, 2026
d744d56
feat(telemetry): opt-in module scaffold with no-op default
Apr 15, 2026
ff5d103
feat(telemetry): build event dict from PipelineResult
Apr 15, 2026
410236b
feat(telemetry): append events to JSONL with 0600/0700 perms
Apr 15, 2026
7851e2d
feat(telemetry): single-generation size-based rotation
Apr 15, 2026
41ddb18
test(telemetry): lock in silent-failure contract
Apr 15, 2026
164291c
test(telemetry): root-guard + helper ordering cleanup
Apr 15, 2026
48e87c1
feat(pipeline): record telemetry on fetch_relevant completion
Apr 15, 2026
5136929
docs: document TRAWL_TELEMETRY opt-in collector
Apr 15, 2026
41d9eb6
refactor(telemetry): default path ~/.cache/trawl/telemetry.jsonl
Apr 15, 2026
4c07f9f
docs(telemetry): add CLAUDE.md pointer + tidy permission conventions in spec
Apr 15, 2026
b257b7a
feat(profiles): VLM prompt v3: contiguous-run anchor rule
Apr 15, 2026
29ad273
feat(passthrough): pass suffix-less JSON APIs straight through via HEAD probe
Apr 15, 2026
b24072f
fix(playwright): clean up Playwright context on partial-init failure
Apr 15, 2026
9948739
chore: pin Playwright to 1.58.0 + bump Docker base image alongside
Apr 15, 2026
dac0557
chore: clean up ruff check violations (I001/E402/E741)
Apr 16, 2026
10 changes: 10 additions & 0 deletions .env.example
@@ -12,6 +12,10 @@ TRAWL_EMBED_MODEL=bge-m3
# bge-reranker-v2-m3 served by llama-server with --reranking --pooling rank
TRAWL_RERANK_URL=http://localhost:8083/v1
TRAWL_RERANK_MODEL=bge-reranker-v2-m3
# Include page title + section heading as labelled fields in the
# reranker input (DeepQSE-style). Default on. Set to 0 to restore
# the legacy heading-only format.
# TRAWL_RERANK_INCLUDE_TITLE=1

# ---- HyDE query expansion (optional; off by default) ----
# Small utility LLM (e.g. Gemma 4B)
@@ -42,3 +46,9 @@ TRAWL_VLM_MAX_TOKENS=2048
# no chunking, no embedding) up to this many bytes. Default: 262144
# (256 KB ≈ 64K tokens — fits most local LLM context windows).
# TRAWL_PASSTHROUGH_MAX_BYTES=262144

# ---- Telemetry (opt-in; used to inform C4 decision) ----
# Set TRAWL_TELEMETRY=1 to append one JSON line per fetch_relevant() call.
# TRAWL_TELEMETRY=1
# TRAWL_TELEMETRY_PATH=~/.cache/trawl/telemetry.jsonl
# TRAWL_TELEMETRY_MAX_BYTES=67108864
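
The `TRAWL_RERANK_INCLUDE_TITLE` comment in the first hunk above describes a flag-gated reranker input with labelled title and section fields. A minimal sketch of how such a flag might be read and applied is shown below. The helper names, the `Title:`/`Section:` labels, and the rule that only `0` disables the flag are assumptions for illustration; they are not taken from trawl's reranking code.

```python
import os


def include_title_enabled(env=None):
    """Return True unless TRAWL_RERANK_INCLUDE_TITLE is explicitly "0".

    Assumption: default-on, and only "0" turns it off, as the
    .env.example comment suggests. The real parsing rule may differ.
    """
    env = os.environ if env is None else env
    return env.get("TRAWL_RERANK_INCLUDE_TITLE", "1").strip() != "0"


def build_rerank_input(chunk_text, page_title="", section="", include_title=True):
    """Hypothetical helper: prefix a chunk with labelled title/section fields."""
    if not include_title:
        # Assumed shape of the legacy heading-only format: heading, then text.
        return f"{section}\n{chunk_text}" if section else chunk_text
    fields = []
    if page_title:
        fields.append(f"Title: {page_title}")
    if section:
        fields.append(f"Section: {section}")
    fields.append(chunk_text)
    return "\n".join(fields)
```

With the flag on, a chunk from section "Intro" of a page titled "Page" would be sent to the reranker as three lines: the title field, the section field, and the chunk text.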
3 changes: 3 additions & 0 deletions .gitignore
@@ -24,3 +24,6 @@ benchmarks/wcxb/data/

# git worktrees
.worktrees/

# local notes / research scratchpads (not for upstream)
notes/
16 changes: 16 additions & 0 deletions ARCHITECTURE.md
@@ -396,6 +396,22 @@ Full results: `benchmarks/results/`.
Anything on this list that's not justified by real-usage data is
speculation. Don't implement speculatively.

## Telemetry (optional)

Opt-in JSONL collector for `fetch_relevant()` calls. Off by default.
Activated with `TRAWL_TELEMETRY=1`; writes to `~/.cache/trawl/telemetry.jsonl`
(override with `TRAWL_TELEMETRY_PATH`). Single-generation size rotation
at `TRAWL_TELEMETRY_MAX_BYTES` (default 64 MB) — older data moves to
`telemetry.jsonl.1`.

Each line captures host, URL (plaintext), query SHA-1 prefix (query
plaintext is never stored), fetcher path, profile hit/miss, rerank and
HyDE flags, and latency/size breakdown. Full schema: see
`docs/superpowers/specs/2026-04-15-c4-telemetry-design.md`.

Purpose: feed the C4 (`notes/RESEARCH.md`) decision on whether
index-based extraction as a profile fallback has a problem to solve.

## Provenance

trawl is the packaged form of work that lived across three spikes:
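
The telemetry section in this hunk describes an append-only JSONL file with single-generation, size-based rotation and a hashed query, and the commit titles in this PR mention 0600/0700 permissions and a silent-failure contract. A rough sketch of that behaviour follows. The helper names `query_digest` and `append_event` and the 12-character prefix length are assumptions; the actual `src/trawl/telemetry.py` may be structured differently.

```python
import hashlib
import json
import os
from pathlib import Path

DEFAULT_MAX_BYTES = 64 * 1024 * 1024  # 67108864, the default in .env.example


def query_digest(query, length=12):
    """SHA-1 hex prefix of the query (prefix length is an assumption)."""
    return hashlib.sha1(query.encode("utf-8")).hexdigest()[:length]


def append_event(event, path, max_bytes=DEFAULT_MAX_BYTES):
    """Append one JSON line to `path`, rotating to '<name>.1' when full.

    Returns True on success and False on failure, so a telemetry
    problem never propagates into the caller.
    """
    try:
        path = Path(path).expanduser()
        # mode applies to the leaf directory only if it is newly created,
        # and is still subject to the process umask.
        path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
        if path.exists() and path.stat().st_size >= max_bytes:
            # Single generation: any earlier '.1' file is overwritten.
            os.replace(path, path.with_name(path.name + ".1"))
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
        with os.fdopen(fd, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event, ensure_ascii=False) + "\n")
        return True
    except (OSError, TypeError, ValueError):
        return False
```

Rotation happens before the write, so the active file can exceed the limit by at most one event; that seems acceptable for a diagnostic log but is a design choice of this sketch, not a documented property of trawl.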
5 changes: 5 additions & 0 deletions CLAUDE.md
@@ -81,6 +81,11 @@ trawl directory. Humans should read `README.md` first, then
take an httpx fast path; suffix-less API endpoints are detected by
response `Content-Type`. Byte cap via `TRAWL_PASSTHROUGH_MAX_BYTES`
(default 256 KB).
- **Telemetry** (opt-in) — `TRAWL_TELEMETRY=1` appends one JSON line
per `fetch_relevant()` call to `~/.cache/trawl/telemetry.jsonl`
(override via `TRAWL_TELEMETRY_PATH`). Single-generation rotation
at 64 MB. Purpose: feed the C4 decision in `notes/RESEARCH.md`.
Schema: `src/trawl/telemetry.py` + the C4 spec doc.

## Quick Reference

5 changes: 4 additions & 1 deletion Dockerfile
@@ -1,6 +1,9 @@
 # trawl MCP server — HTTP transport, for HTTP-only MCP clients.
 # Base image ships chromium + runtime libs pre-installed for Playwright.
-FROM mcr.microsoft.com/playwright/python:v1.47.0-jammy
+# The tag version MUST match the `playwright==` pin in pyproject.toml —
+# the base image's /ms-playwright/ browsers only work with the matching
+# Python package revision. Bump both together.
+FROM mcr.microsoft.com/playwright/python:v1.58.0-jammy
 
 WORKDIR /app
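
The new Dockerfile comment requires the base image tag and the `playwright==` pin in `pyproject.toml` to be bumped together. One way to guard that invariant is a small check that could run in CI, sketched below. The function names are hypothetical and nothing like this is part of the PR; the regexes assume the `FROM` line and the pin look like the ones shown in this diff.

```python
import re


def image_playwright_version(dockerfile_text):
    """Extract X.Y.Z from 'FROM mcr.microsoft.com/playwright/python:vX.Y.Z-...'."""
    m = re.search(
        r"^FROM\s+mcr\.microsoft\.com/playwright/python:v(\d+\.\d+\.\d+)",
        dockerfile_text,
        re.MULTILINE,
    )
    return m.group(1) if m else None


def pinned_playwright_version(pyproject_text):
    """Extract X.Y.Z from a 'playwright==X.Y.Z' requirement.

    The lookbehind skips packages such as 'pytest-playwright'.
    """
    m = re.search(r"(?<![\w-])playwright==(\d+\.\d+\.\d+)", pyproject_text)
    return m.group(1) if m else None


def versions_match(dockerfile_text, pyproject_text):
    """True only if both versions were found and are identical."""
    image = image_playwright_version(dockerfile_text)
    pin = pinned_playwright_version(pyproject_text)
    return image is not None and image == pin
```

A test could read both files from the repository root and assert `versions_match(...)`, turning a mismatch into a failing build rather than a runtime browser-launch error.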

293 changes: 0 additions & 293 deletions RESEARCH.md

This file was deleted.
