Stack: Python 3.11+ · Playwright (async) · Anthropic SDK · Pydantic v2
Constraint: No browser-use library. No LangChain. Custom agent loop only.
A site-agnostic browser agent that discovers sample lists by navigating any web UI, then executes structured browser tasks and collects reviewable evidence from each sample in parallel — driven by a custom ReAct loop with Claude, zero hardcoded site logic, and a swappable JSON task spec per target system.
- Project Requirements
- Task Analysis
- Repository Structure
- System Overview
- Layer 1 — Orchestrator
- Layer 2 — Discovery
- Layer 3 — Worker
- Layer 4 — Agent Loop
- Layer 5 — DOM Extractor
- Layer 6 — Vision Module
- Layer 7 — Action System
- Layer 8 — Output & Evidence
- Task Spec Schema
- Tech Stack
- Build Order
This is a Browser Workflow Agent for structured browser workflows. It started from audit/compliance evidence collection, where analysts manually navigate systems like Workday, GitHub, Jira, and LinkedIn, take screenshots, extract data into spreadsheets, and download artifacts. It has since expanded into a broader DOM-first browser agent for extraction, navigation, enrichment, form workflows, judgments, and long-horizon evidence collection.
The core design target is now:
- General-purpose across accessible DOM-first web tasks
- Evidence-oriented by default — screenshots, structured outputs, traces, checkpoints
- Task-configurable — discovery, extraction, judgments, and form workflows via JSON task specs
- System-agnostic within scope — works across a broad class of accessible DOM-first websites via browser UI, no site-specific Python code
- Two modes — Manual UI (default): navigate, extract, screenshot, download. Integration-assisted (optional): use exports/APIs if available
- Evidence-grade outputs — screenshots, CSVs, judgments, per-sample folders
- Scalable — 50 to 1,000+ samples per batch
- Natural language input — "Go to Microsoft's GitHub and get all users' GitHub usernames"
- Broad workflow coverage — supports audit tasks, enrichment, form fill, graph traversal, and long-horizon collection
- Accuracy
- Generalizability
- Scalability to thousands of samples
- Consistency between samples
- Speed
- Custom agent loop — real ReAct design, not an LLM-in-a-loop
- Sub-agent coordination — brief explicitly says "many subagents"
- Evidence quality — deterministic, hashed, audit-reviewable
- Scalability — 1,000 samples without degradation
- System-agnostic — reasons over live DOM, not hardcoded selectors
| Aspect | Detail |
|---|---|
| Input | Excel/CSV list of Linear ticket URLs |
| Navigation | Single-page per sample — visit URL, extract, screenshot |
| Extract | ticket_number, assignee, due_date |
| Output | CSV + one folder per ticket with screenshot |
| Challenge | React SPA — must wait for hydration. Distinguish "unassigned" from "failed to find" |
| Aspect | Detail |
|---|---|
| Input | 60 commit SHAs or "last 60 commits on main" |
| Navigation | Graph traversal: commit page → PR → checks → CI failure → Jira ticket |
| Extract | commit SHA, PR creator, approver, merger, check pass/fail, CI failure details, Jira URL |
| Output | CSV + per-commit folder with 3-5 screenshots |
| Challenge | SVG status icons (need vision). Cross-domain (GitHub → Jira). Some commits have no PR |
| Aspect | Detail |
|---|---|
| Input | CSV of names (no URLs — agent must search) |
| Navigation | Search → disambiguation → profile extraction |
| Extract | linkedin_url, school, current_company, tenure |
| Output | Original CSV with 4 columns appended |
| Challenge | Auth wall. Rate limiting. Ambiguous names need disambiguation |
| Aspect | Detail |
|---|---|
| Input | Code string (function signature or variable name) |
| Navigation | Find file → Blame View → recent commit → assess materiality |
| Extract | file_path, last_modified_date, author, material_change (yes/no/inconclusive) |
| Output | CSV + screenshots + judgment |
| Challenge | Requires genuine reasoning about code changes, not just extraction |
| Aspect | Detail |
|---|---|
| Input | URL + field values to fill |
| Navigation | Fill form → screenshot → submit → download report → iterate tabs |
| Extract | form_submitted, report downloaded, attachment count |
| Output | Screenshots of empty/filled/result + downloaded files |
| Challenge | Custom widgets (date pickers, cascading dropdowns). Download detection |
playwright_agent/
├── main.py # entry point: discovery → execution → merge CSV
├── discover.py # phase 1: navigate start URL, paginate, write samples.csv
├── worker.py # phase 2: one BrowserContext per sample + agent_loop
├── agent_loop.py # THE core: observe → decide → act → repeat (orchestrates helpers below)
├── agent_prompt.py # Token-budget history + user message construction
├── agent_recovery.py # Smart termination, stagnation escalation, final consolidation
├── agent_merge.py # deep_merge for checkpoints (id-aware list merge)
├── agent_llm_retry.py # Retryable LLM error classification
├── agent_navigation.py # Pagination detection, batch DOM safety
├── agent_dispatch.py # Playwright execution for one action
├── memory.py # Long-term memory: patterns + failures, LLM-distilled
├── config.py # Environment config (.env settings)
│
├── core/
│ ├── dom_extractor.py # a11y pruner, dom_confidence, serializer
│ └── vision.py # screenshot capture, Claude vision, hybrid dispatch
│
├── tools/
│ ├── browser.py # thin Playwright wrappers (goto, click, type, scroll)
│ └── output.py # save_screenshot (SHA-256), write_metadata, write_csv
│
├── models/
│ ├── task.py # TaskSpec (Pydantic), loaded from tasks/*.json
│ └── actions.py # AgentAction schema, ActionResult
│
├── tasks/
│ ├── github_discovery.json
│ ├── github_profile.json
│ ├── github_commit_audit.json
│ ├── linear_tickets.json
│ └── _template.json
│
├── evidence/ # output root — one folder per run
│ └── run_YYYY-MM-DD_HHMMSS/
│ ├── memory/
│ │ ├── patterns.json # learned only within this run
│ │ └── failures.json # failure warnings only within this run
│ ├── samples.csv # optional manual discovery output
│ ├── discovered_samples.csv # planner-driven discovery output
│ ├── combined.csv
│ └── {sample_id}/
│ ├── 01_{label}.png
│ ├── result.json # extracted fields + artifact manifest
│ └── action_log.json # every step: thinking, action, outcome
│
├── .env # ANTHROPIC_API_KEY, credentials
└── requirements.txt
A small set of focused Python modules. Everything site-specific lives in tasks/*.json.
Phase 1: DISCOVERY (sequential, one browser)
"Go to github.com/orgs/microsoft/people"
→ agent navigates, paginates, collects member URLs
→ writes a run-local samples file (the work queue)
Phase 2: EXECUTION (parallel, N browsers)
For each row in the run-local samples file:
→ worker launches isolated BrowserContext
→ agent_loop runs the task spec against that sample
→ writes evidence/{sample_id}/result.json + screenshots
After all workers:
→ merge all result.json → run-local combined.csv
Input (prompt or CSV)
│
▼
┌─────────┐ ┌───────────┐ ┌──────────┐
│ main.py │────▶│ discover │────▶│ samples │
│ │ │ .py │ │ .csv │
└─────────┘ └───────────┘ └──────────┘
│ │
▼ ▼
┌─────────┐ ┌───────────┐ ┌──────────────────┐
│ main.py │────▶│ worker │────▶│ evidence/ │
│ (gather) │ │ .py │ │ {sample_id}/ │
│ │ │ × N parallel │ 01_page.png │
└─────────┘ └───────────┘ │ result.json │
│ │ action_log.json│
▼ └──────────────────┘
┌──────────┐
│ combined │
│ .csv │
└──────────┘
main.py
├── discover.py ──── agent_loop.py
├── worker.py ────── agent_loop.py
│ ├── core/dom_extractor.py
│ ├── core/vision.py
│ ├── tools/browser.py
│ ├── tools/output.py
│ └── memory.py
└── (merge CSV)
Coordinates discovery → execution → output merge.
1. Parse input → load matching task spec JSON
2. If discovery task exists → run discover.py → writes samples.csv
3. If samples.csv provided as input → skip discovery
4. Read samples.csv
5. Skip samples where evidence/{sample_id}/result.json has status:"done" ← idempotency
6. asyncio.gather(*[worker(s) for s in pending], return_exceptions=True)
7. Merge all result.json → combined.csv
Key decisions:
- `return_exceptions=True` — one bad sample never kills the run
- Idempotency — restart after crash picks up where it stopped
- `asyncio.Semaphore(N)` — default N=5, tunable per task
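The orchestration decisions above (skip finished samples, bounded concurrency, exception isolation) can be sketched as follows; `run_batch` and the injected `worker` callable are illustrative names, not the actual main.py API:

```python
import asyncio
import json
from pathlib import Path

async def run_batch(samples, worker, evidence_dir: Path, concurrency: int = 5):
    """Skip already-finished samples, then run the rest under a semaphore.
    `worker` stands in for worker.run_sample; samples are dicts with sample_id."""
    sem = asyncio.Semaphore(concurrency)

    def is_done(sample_id: str) -> bool:
        # Idempotency: a sample with status "done" is never re-run
        result = evidence_dir / sample_id / "result.json"
        return result.exists() and json.loads(result.read_text()).get("status") == "done"

    pending = [s for s in samples if not is_done(s["sample_id"])]

    async def one(sample):
        async with sem:  # at most `concurrency` browsers at once
            return await worker(sample)

    # return_exceptions=True: one bad sample never kills the run
    return await asyncio.gather(*(one(s) for s in pending), return_exceptions=True)
```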
One sequential browser session. Builds the work queue.
Runs agent_loop with the discovery task spec. The loop paginates until the agent calls done with an empty list.
| Challenge | Solution |
|---|---|
| Link noise (nav/footer/org links) | LLM filters using goal + a11y context |
| Pagination termination | Agent calls done when page returns zero target links |
| Deduplication | `seen` set checked before appending |
| Rate limiting | asyncio.sleep(1) between pages + 429 detection |
| URL vs click pagination | LLM reasons from live page state — no hardcoding |
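The dedup and termination rules above can be sketched as a plain loop (hypothetical helper; in the real system the LLM does the target filtering):

```python
def collect_links(pages, is_target):
    """Accumulate unique target links across paginated pages.

    `pages` is an iterable of link lists (one per page); `is_target` stands in
    for the LLM's goal-based filter. Discovery stops on the first page that
    yields zero new target links.
    """
    seen: set[str] = set()
    samples: list[str] = []
    for links in pages:
        new = [u for u in links if is_target(u) and u not in seen]
        if not new:          # pagination termination: nothing new on this page
            break
        for u in new:
            seen.add(u)      # dedup before appending
            samples.append(u)
    return samples
```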
Output — samples.csv (columns vary by task):
# URL-based tasks (tickets, commits, profiles):
sample_id,url,discovered_at
torvalds,https://github.com/torvalds,2026-03-27T10:00:00Z
# Name-based tasks (LinkedIn enrichment):
sample_id,name,company,url
person_001,Satya Nadella,Microsoft,
# Code-based tasks (blame review):
sample_id,code_string,repo_url,url
blame_001,"class ExchangeRate",https://github.com/org/repo,
The url column is optional. Tasks like LinkedIn enrichment start with a name and the agent discovers the URL. The task spec's input_schema defines which columns are expected:
{
"input_schema": {
"name": "string",
"company": "string | null"
}
}
Owns one sample's full lifecycle. Isolated context per sample.
async def run_sample(browser, sem, sample, task_spec, output_dir):
async with sem:
ctx = await browser.new_context() # isolated: own cookies, session
page = await ctx.new_page()
await page.emulate_media(color_scheme="light") # consistent white screenshots
try:
await agent_loop(page, sample, task_spec, output_dir)
except Exception as e:
write_metadata(output_dir, {"status": "failed", "reason": str(e)})
finally:
            await ctx.close()
- Each worker gets an isolated `BrowserContext` — no session bleed
- If `task_spec.auth_profile` is set, load `storage_state` JSON into the context (saved cookies from a prior manual login — handles LinkedIn, Workday, etc.)
- Exceptions are caught and written to `result.json`
- Semaphore slot always released in `finally`
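Loading a saved auth profile might look like this sketch; `context_kwargs` and the profile directory layout are assumptions, but `storage_state` is the real Playwright mechanism (the file is produced by `context.storage_state(path=...)` after a manual login):

```python
from pathlib import Path

def context_kwargs(task_spec, profiles_dir: Path) -> dict:
    """Build kwargs for browser.new_context(); loads saved cookies when the
    task names an auth profile. Hypothetical helper around Playwright's
    storage_state option."""
    kwargs: dict = {}
    profile = getattr(task_spec, "auth_profile", None)
    if profile:
        state = profiles_dir / f"{profile}.json"
        if state.exists():
            kwargs["storage_state"] = str(state)  # cookies + localStorage
    return kwargs
```

Usage: `ctx = await browser.new_context(**context_kwargs(task_spec, Path("profiles")))`.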
This is the entire agent. Everything else is scaffolding.
for step in range(task_spec.max_steps): ← default 25, configurable
1. OBSERVE
page_state = dom_extractor.snapshot(page, task_spec.keywords)
if dom_confidence(page_state) < 0.6:
page_state += vision.analyze(page, targeted_question)
2. DECIDE
response = anthropic.messages.create(
system = task_spec.system_prompt ← static, prompt-cached
messages = build_prompt(page_state, fitted_history, task_spec) # 5-25 items, budget-fitted
tools = action_tool_schema(include_reflection=config.REFLECTION_MODE == "full")
tool_choice = {"type": "any"} ← forces structured output
)
action = AgentAction(**response.tool_input)
# Reflection fields optional in schema when REFLECTION_MODE=light
history.append(action)
3. ACT
result = dispatch(action, page, output_manager)
# result is always ActionResult — never raises
4. CHECK TERMINATION
if action.action == "done":
# Machine-checkable completion — not just prompt-driven
missing = [f for f in task_spec.required_fields if f not in action.extracted]
if missing:
inject "You called done but these required fields are missing: {missing}"
continue # force agent to try again
write_result(action.extracted); return
if action.action == "fail": write_result(status="failed"); return
5. LOOP DETECTION
if same (url, action) seen 3+ times:
inject recovery nudge into next observation
# Reached max_steps without done/fail:
write_result(status="failed", reason="max_steps_exceeded")
SYSTEM (split into cached static + uncached dynamic):
Block 1 [CACHED — ephemeral TTL]: {task_spec.system_prompt} ← identical across all steps, cache hit every time
Block 2 [NOT CACHED]: {memory_hints, sample.extra} ← changes per sample, doesn't invalidate Block 1
USER (rebuilt every turn):
## Current page state
URL: https://github.com/torvalds
Title: Linus Torvalds (torvalds)
[0] [heading] "Linus Torvalds"
[1] [text] "Portland, OR"
[2] [link] "linux" → https://github.com/torvalds/linux
[3] [link] "subsurface" → https://github.com/torvalds/subsurface
[4] [button] "Follow"
[5] [text] "231k followers · 0 following"
## Actions taken so far (recent budget-fitted window)
Step 1: goto https://github.com/torvalds → success
Step 2: screenshot "profile" → saved 01_profile.png
## Goal
{task_spec.goal}
## Output schema (populate when calling done)
{task_spec.output_schema}
Take the single best next action.
The prompt construction is split across agent_prompt.py with shared code paths ensuring the budget estimate and the actual prompt are always aligned.
Layer 1: Prompt Cache Split (build_system_blocks)
System prompt is split into two blocks:
- Block 1 (CACHED): Static task instructions → `cache_control: {type: "ephemeral"}` → cached across all steps. On a 10-step run, 9 cache hits on the largest prompt part.
- Block 2 (NOT CACHED): Dynamic context (memory hints for steps 1-3, sample metadata) → changes don't invalidate Block 1.
Cache hits are verified via response usage telemetry: cache_read_input_tokens and cache_creation_input_tokens logged per step.
Layer 2: Budget-Fitted History (fit_history)
History is not a sliding window. Each entry is scored by:
- Action importance (`save_progress`/`done` = 3, `extract` = 2, `scroll` = 0)
- Recency bonus (newer entries score higher)
- Token cost (expensive entries deprioritized when budget is tight)
Meta messages (nudges, recovery prompts) are excluded BEFORE selection — they served their purpose and should not displace real navigation context. Recent meta (last 3 entries) is preserved in chronological position.
Budget is computed from the exact rendered prompt via estimate_fixed_prompt_tokens, which calls the same _build_base_message_parts + build_system_blocks used for the real prompt. No divergence possible.
Result: 5-25 history items fitted to a token budget (30% of prompt capacity, capped at 24K tokens total).
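A minimal sketch of the score-then-fit idea, assuming a simplified entry shape (`action`, `tokens`); the real `fit_history` also excludes meta messages and derives the budget from the rendered prompt:

```python
# Importance weights as described above; unknown actions get a middle score.
IMPORTANCE = {"save_progress": 3, "done": 3, "extract": 2, "scroll": 0}

def fit_history(entries, budget_tokens):
    """Greedy budget fit: score = importance + recency bonus, take the
    best-scoring entries that fit, then restore chronological order."""
    n = len(entries)
    scored = []
    for i, e in enumerate(entries):
        score = IMPORTANCE.get(e["action"], 1) + i / max(n, 1)  # recency bonus
        scored.append((score, i, e))
    scored.sort(key=lambda t: (-t[0], -t[1]))  # best first, newer wins ties
    chosen, used = [], 0
    for _, i, e in scored:
        if used + e["tokens"] <= budget_tokens:
            chosen.append((i, e))
            used += e["tokens"]
    return [e for _, e in sorted(chosen)]  # back to chronological order
```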
Layer 3: Microcompact (zero cost, at render time in build_messages)
Stale history entries (older than last 5 steps) are compacted to short stubs:
- `Step 3: screenshot → [01_commit_page.png]` instead of the full SHA-256 description
- `Step 5: extract → [1432 chars saved]` instead of the full extraction text
- Stale meta messages dropped entirely
Recent results (last 5) stay full — the agent needs them for decision-making.
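The stubbing rule could look like this sketch (entry shape and stub wording assumed; the real compaction happens at render time and mutates nothing):

```python
def microcompact(history, keep_recent=5):
    """Render-time compaction: stale entries become one-line stubs, stale
    meta messages are dropped, and the last `keep_recent` stay full."""
    cutoff = len(history) - keep_recent
    out = []
    for i, e in enumerate(history):
        if i >= cutoff:
            out.append(e["full"])                     # recent: full text
        elif e.get("meta"):
            continue                                  # stale meta: dropped
        elif e["action"] == "screenshot":
            out.append(f"Step {e['step']}: screenshot → [{e['file']}]")
        elif e["action"] == "extract":
            out.append(f"Step {e['step']}: extract → [{len(e['full'])} chars saved]")
        else:
            out.append(e["full"])
    return out
```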
Layer 4: Step Summaries (LLM-generated, periodic)
Every 10 steps, older history is summarized by the fast model (Haiku) into a structured FOUND/GAPS/NEXT format. The last 3 summaries are injected as "Earlier steps (condensed)". Meta messages are excluded from summaries. A mechanical fallback applies if the LLM call fails.
Run-state capping:
- Progress lists (pages_visited, artifacts, failed_urls) capped at last 10 entries with total count shown
- Accumulated checkpoint JSON truncated at 2000 chars in prompt
- Memory hints only injected for steps 1-3 (absorbed early, not repeated)
class AgentAction(BaseModel):
    action: Literal[
        "goto",           # navigate to a URL
        "click",          # click an element by index or text
        "type",           # fill an input field
        "scroll",         # scroll up or down
        "screenshot",     # capture full-page evidence screenshot
        "extract",        # read text from an element into history
        "wait",           # wait for an element to appear
        "download",       # click a download trigger and save the file
        "select_option",  # select from a native <select> dropdown
        "save_progress",  # checkpoint partial data without stopping
        "done",           # task complete — write extracted data
        "fail",           # unrecoverable — write reason and stop
    ]
    selector: str | None = None    # click, type, extract, wait, download, select_option — element index or text
    value: str | None = None       # select_option: visible option text or value
    url: str | None = None         # goto
    text: str | None = None        # type
    direction: str | None = None   # scroll: "up" | "down"
    extracted: dict | None = None  # done: the structured output matching output_schema
    note: str | None = None        # fail: reason string
    label: str | None = None       # screenshot: filename label (e.g. "profile", "checks")
    # Structured reflection (per-step self-evaluation)
    evaluation_previous_step: str | None = None  # "Did my last action work?"
    memory_update: str | None = None             # "Key fact to remember"
    next_goal: str | None = None                 # "What I'll do next"

Claude always returns one or more of these (multi-action batching when enabled) via `tool_choice={"type": "any"}`. No free-form prose. If it can't proceed, it returns `fail` with a note — never hangs.
When task_spec.judgment_required = true, the agent includes judgment fields in its done call:
{
"judgment_required": true,
"judgment_question": "Did this code change materially affect the calculation?",
"judgment_output_schema": {
"answer": "yes | no | inconclusive",
"confidence": "0.0-1.0",
"reasoning": "string",
"evidence_refs": ["array of screenshot filenames"]
}
}Same loop, same LLM, same tool dispatch. The done action's extracted dict includes both data fields and judgment fields. One result.json, full provenance.
A (url, action_name) counter is maintained. At count >= 3:
[NOTICE] You have taken the same action on this URL 3 times without progress.
Try a different approach or call fail().
Beyond simple loop counting, the agent tracks page signature stability (URL + DOM hash). Same page state with no new data triggers escalating responses:
| Level | Trigger | Response |
|---|---|---|
| 1 | 3 stagnant steps | Gentle nudge |
| 2 | 5 stagnant steps | "CHANGE YOUR STRATEGY NOW" + checkpoint |
| 3 | 8 stagnant steps | "MUST call done or fail" |
Budget warnings at 75% and 90% of step budget. Final step restricts tools to done/fail only.
After 3 consecutive ActionResult(success=False), inject a recovery prompt listing all currently-visible interactive elements so the agent can try a different path.
Converts raw Playwright accessibility snapshot (2000+ nodes) into compact, task-relevant context (~20-40 nodes) for the LLM prompt.
Pass 1 — prune dead nodes (~2000 → ~600)
Drop: no role, no name, hidden=true, role in {none, presentation, generic} with no children.
Pass 2 — keep semantic roles (~600 → ~150)
Whitelist: button link textbox checkbox radio tab menuitem heading table row cell listitem combobox option status alert img (img only if has alt text).
Pass 3 — task-aware keyword scoring (~150 → ~40)
Boost nodes whose name/value matches keywords from task_spec.keywords. Keep all boosted. Trim zero-score nodes to budget of 20 by tree order.
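Pass 3 could be sketched as follows, assuming nodes are dicts with a `name` key; the real pruner scores against both name and value:

```python
def keyword_pass(nodes, keywords, budget=20):
    """Pass 3 sketch: keep every keyword-matched node; let unmatched
    (zero-score) nodes fill the remaining budget in tree order."""
    kws = [k.lower() for k in keywords]
    matched = {i for i, n in enumerate(nodes)
               if any(k in n["name"].lower() for k in kws)}
    kept, slots = [], max(0, budget - len(matched))
    for i, n in enumerate(nodes):
        if i in matched:
            kept.append(n)       # boosted nodes are always kept
        elif slots > 0:
            kept.append(n)       # zero-score nodes only up to the budget
            slots -= 1
    return kept
```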
Pass 4 — viewport bias (~40 → ~20)
Prefer elements with bounding box y < viewport height. Deprioritize below-fold.
Computed before passing to LLM. If score < 0.6, vision activates automatically.
def dom_confidence(snapshot, page_metrics) -> float:
    total = max(len(snapshot), 1)
    interactive = len([n for n in snapshot if n.role in INTERACTIVE_ROLES])
    interactive = max(interactive, 1)
    score = 1.0
    score -= 0.3 * (page_metrics.canvas_count / total)               # canvas = invisible to DOM
    score -= 0.2 * (page_metrics.missing_aria_labels / interactive)  # unlabeled buttons
    score -= 0.1 * (page_metrics.svg_icon_count / interactive)       # SVG status icons
    return max(0.0, score)

Why not just count nodes: GitHub's CI checks page has 15+ meaningful DOM nodes but status icons are pure SVG — node count says "DOM is fine" but the actual pass/fail data is invisible to the a11y tree. The canvas/SVG ratio catches this.
[0] [heading] "Overview / Repositories / Stars"
[1] [tab] "Repositories" (selected=false)
[2] [tab] "Overview" (selected=true)
[3] [link] "linux" → https://github.com/torvalds/linux
[4] [text] "The Linux kernel"
[5] [button] "Follow"
[6] [text] "Portland, OR"
[7] [status] "231k followers · 0 following"
Integer indices ([0], [1]...) are the primary way the LLM references elements in actions: click(selector="3") means click element at index 3.
Activated when dom_confidence < 0.6 or task spec requires a screenshot.
await page.emulate_media(color_scheme="light") # consistent white background
await page.set_viewport_size({"width": 1280, "height": 900})
data = await page.screenshot(full_page=full_page, type="png", animations="disabled")

Never "describe this page." Always specific:
"What is the status icon next to 'build / test'? Pass (green check), fail (red X), or pending?""Is the form field labeled 'Start Date' filled? If yes, what value?""Which user approved this PR? Look for a green checkmark next to a name."
Targeted questions reduce hallucination. The model answers what it can see.
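A sketch of how one screenshot and one targeted question could be packaged for the Anthropic Messages API; the payload shape follows the SDK's base64 image content blocks, while the wrapper function itself is hypothetical:

```python
import base64

def vision_request(png_bytes: bytes, question: str, model: str = "claude-sonnet-4-6") -> dict:
    """Build kwargs for anthropic.Anthropic().messages.create(**...):
    one image block plus one targeted text question."""
    return {
        "model": model,
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(png_bytes).decode()}},
                {"type": "text", "text": question},
            ],
        }],
    }
```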
1. DOM: finds [button "Show all checks"] → agent clicks it
2. DOM: finds check names but SVG icons only → dom_confidence = 0.4
3. Vision activates: "What is the icon next to 'build / test (ubuntu)'?"
4. Vision: "Red X — failure"
5. Combined: check_status = "failed"
DOM provides structure + interactable elements (fast, cheap). Vision provides visual interpretation (accurate, targeted). Neither used for what it does poorly.
12 actions. Pure functions. Always return ActionResult, never raise.
| Action | Playwright Call | Error Policy |
|---|---|---|
| `goto` | `page.goto(url, wait_until="networkidle")` | Timeout → `ActionResult(success=False)` |
| `click` | 3-strategy resolution (see below) | Not found after all strategies → list visible elements |
| `type` | `page.fill(selector, text)` | Not editable → fail clearly |
| `scroll` | `page.mouse.wheel(0, ±600)` | At limit → report scroll position |
| `wait` | `page.wait_for_selector(sel, timeout=10000)` | Timeout → `ActionResult(success=False)` |
| `screenshot` | `page.screenshot(full_page=True)` | Always succeeds |
| `extract` | `page.inner_text(selector)` | Not found → empty string + warning |
| `download` | `page.expect_download()` + artifact save | Timeout / no file → `ActionResult(success=False)` |
| `select_option` | `locator.select_option(label \| value)` | |
| `save_progress` | Checkpoint data, continue loop | Always succeeds |
| `done` | Write result + signal loop exit | Always succeeds |
| `fail` | Write failure + signal loop exit | Always succeeds |
Tried in order until one succeeds:
1. Index-based — `click(selector="3")` resolves index 3 from the DOM extractor's map. Fastest, most reliable.
2. Text-based — `click(selector="Show all checks")` → `page.get_by_text(value)`. Case-insensitive, partial match fallback.
3. CSS selector — last resort, explicitly discouraged in prompts.
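The fallback chain could be sketched as follows; the function name and `index_map` (DOM-extractor index → concrete locator) are assumptions, while `get_by_text` and `click` are real Playwright calls:

```python
async def resolve_click(page, selector: str, index_map: dict[int, str]):
    """Resolve a click in order: element index, visible text, raw CSS."""
    # 1. Index-based: "3" -> locator recorded by the DOM extractor
    if selector.isdigit() and int(selector) in index_map:
        return await page.click(index_map[int(selector)])
    # 2. Text-based: partial match on visible text
    try:
        return await page.get_by_text(selector, exact=False).first.click()
    except Exception:
        # 3. CSS selector: last resort, discouraged in prompts
        return await page.click(selector)
```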
await page.wait_for_load_state("networkidle")  # mandatory — never skip

This is the single biggest source of flakiness if omitted.
Zero Playwright. Pure file I/O. All paths relative to evidence/{sample_id}/.
def save_screenshot(self, data: bytes, label: str, source_url: str) -> dict:
    self._counter += 1
    filename = f"{self._counter:02d}_{label}.png"
    path = self.sample_dir / filename
    path.write_bytes(data)
    sha256 = hashlib.sha256(data).hexdigest()
    return {"filename": filename, "sha256": sha256,
            "source_url": source_url, "timestamp": datetime.utcnow().isoformat()}

{
"sample_id": "torvalds",
"status": "done",
"steps": 4,
"extracted": {
"display_name": "Linus Torvalds",
"bio": "Just a simple coder",
"company": "Linux Foundation",
"location": "Portland, OR",
"followers": 231400,
"pinned_repos": ["linux", "subsurface"]
},
"artifacts": [
{
"filename": "01_profile.png",
"sha256": "a3f9c2...",
"source_url": "https://github.com/torvalds",
"timestamp": "2026-03-27T10:01:03Z"
}
],
"judgment": null,
"flagged": false,
"notes": [],
"started_at": "2026-03-27T10:01:00Z",
"finished_at": "2026-03-27T10:01:18Z"
}

Separate from result.json. Every step: thinking, action, outcome. Fully replayable.
[
{
"step": 1,
"thinking": "Profile page loaded. Take a screenshot first.",
"action": "screenshot",
"params": {"label": "profile"},
"result": "Saved 01_profile.png",
"timestamp": "2026-03-27T10:01:02Z"
}
]

Single merge at batch end — main.py reads all result.json files after all workers finish and writes one CSV in sorted order. No concurrent writes, no filelock needed. Simpler, deterministic.
# in main.py, after all workers complete:
results = []
for sample_dir in evidence_dir.iterdir():
result_file = sample_dir / "result.json"
if result_file.exists():
results.append(json.loads(result_file.read_text()))
results.sort(key=lambda r: r["sample_id"])
write_csv("combined.csv", results, task_spec.output_schema)

Every sample, every task type, same structure:
evidence/
└── {sample_id}/
├── 01_{label}.png # sequential, deterministic naming
├── 02_{label}.png
├── result.json # extracted fields + artifacts + SHA-256
└── action_log.json # every agent step
combined.csv # one row per sample
Idempotent: crash at sample 47 → restart picks up at 48. Partial evidence preserved on failure.
All site-specific knowledge lives here. Agent code never changes.
{
"task_id": "unique_name",
"phase": "discovery | execution",
"start_url": "https://...",
"system_prompt": "You are a browser agent. Your job is to [goal]. Extract only what you can see. Set missing fields to null. Never guess.",
"goal": "Natural language: what to collect, when to stop.",
"keywords": ["relevant", "terms", "for", "pruning"],
"output_schema": {
"field_name": "string | null",
"count": "number | null"
},
"max_steps": 25,
"required_fields": ["field_name"],
"required_artifacts": ["screenshot"],
"judgment_required": false,
"judgment_question": null,
"judgment_output_schema": null,
"pagination": false,
"stop_condition": "natural language description of done state",
"input_schema": {},
"auth_profile": null
}

{
"task_id": "github_commit_audit",
"phase": "execution",
"start_url": "https://github.com/{org}/{repo}/commit/{sha}",
"system_prompt": "You are a browser audit agent. For each commit: screenshot the commit page, navigate to its PR, screenshot the PR, expand checks and screenshot. If any check failed AND PR was merged, navigate into the CI failure and screenshot. If Jira link in PR description, follow it and screenshot.",
"goal": "Collect: commit SHA, PR number, creator, approvers, merger, check statuses, CI failure details, Jira URL. Screenshot every significant page.",
"keywords": ["commit", "PR", "checks", "passed", "failed", "approve", "merge", "jira"],
"output_schema": {
"commit_sha": "string",
"pr_number": "string | null",
"pr_creator": "string | null",
"approvers": "array | null",
"merger": "string | null",
"checks_passed": "number | null",
"checks_failed": "number | null",
"merged_with_failures": "boolean | null",
"ci_failure_details": "string | null",
"jira_url": "string | null"
},
"max_steps": 30,
"judgment_required": true,
"judgment_question": "Was this PR merged with failing CI checks? Were failures material or non-material?",
"judgment_output_schema": {
"answer": "yes | no | inconclusive",
"confidence": "0.0-1.0",
"reasoning": "string",
"evidence_refs": ["array of screenshot filenames"]
},
"stop_condition": "all fields extracted, all screenshots taken, judgment recorded"
}

Write one JSON file. No Python changes.
The LLM reasons about the live DOM. It doesn't know what site it's on — only what's in the a11y tree and what the task spec says to look for.
| Layer | Library | Why |
|---|---|---|
| Browser | `playwright` (async) | Direct control, a11y tree, screenshot, download |
| LLM | `anthropic` SDK | Tool use, structured output, prompt caching |
| Model (primary) | `claude-sonnet-4-6` | Fast (<2s/turn), accurate, cost-effective |
| Model (vision) | `claude-sonnet-4-6` | Same model, multimodal for screenshot analysis |
| Concurrency | `asyncio` + `Semaphore` | No Redis, no broker, stdlib only |
| Schemas | `pydantic` v2 | Task spec + action schema + result models |
| File locking | `filelock` | Safe concurrent CSV writes |
| Hashing | `hashlib` (stdlib) | SHA-256 per artifact |
| Logging | `loguru` | Per-sample bound context |
| Progress | `rich` | Live terminal dashboard |
| Credentials | `python-dotenv` | `.env` file support |
| Downloads | `httpx` | File downloads with session cookies |
# requirements.txt
playwright>=1.42
anthropic>=0.25
pydantic>=2.0
filelock>=3.13
loguru>=0.7
rich>=13.0
python-dotenv>=1.0
httpx>=0.27
| Package | Why Excluded |
|---|---|
| `browser-use` | Project requires custom agent loop — no agent frameworks |
| `langchain` | Unnecessary abstraction |
| `selenium` | Playwright is async-native, more reliable |
| `beautifulsoup4` | DOM via a11y tree, not HTML parsing |
| # | Time | What | Test |
|---|---|---|---|
| 1 | 1.5h | `tools/browser.py` + `tools/output.py` | Call each tool, verify files written |
| 2 | 1h | `core/dom_extractor.py` | Print pruned tree for a real GitHub page |
| 3 | 2h | `agent_loop.py` + `models/actions.py` | Run loop on one GitHub profile |
| 4 | 0.5h | `worker.py` + `main.py` | Run 3 samples in parallel |
| 5 | 1h | `discover.py` | Discover members from github.com/orgs/microsoft/people |
| 6 | 1h | Task specs + judgment mode | Run commit audit on 5 commits |
| 7 | rest | Scale test 50 samples, fix flakiness, README | — |
Start with step 1. Working tools let you test everything else in isolation.
Per-domain intervals applied before every goto:
RATE_LIMITS = {
"linkedin.com": 3.0, # seconds between requests
"github.com": 0.5,
"atlassian.net": 1.0,
"default": 0.2,
}

The agent code contains zero knowledge of GitHub, LinkedIn, Jira, Linear, or Workday.
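Applying those intervals before each goto could be sketched as follows; the class name and suffix-matching rule are assumptions:

```python
import asyncio
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Sleep just enough before each goto to honor per-domain intervals."""

    def __init__(self, limits: dict[str, float]):
        self.limits = limits
        self.last: dict[str, float] = {}  # domain -> last request time

    def _interval(self, domain: str) -> float:
        for suffix, seconds in self.limits.items():
            if domain.endswith(suffix):
                return seconds
        return self.limits.get("default", 0.0)

    async def wait(self, url: str):
        domain = urlparse(url).netloc
        interval = self._interval(domain)
        elapsed = time.monotonic() - self.last.get(domain, 0.0)
        if elapsed < interval:
            await asyncio.sleep(interval - elapsed)  # throttle this domain only
        self.last[domain] = time.monotonic()
```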
The only things that know about a specific site:
- `tasks/{site}.json` — goal, keywords, output schema, system prompt
- `.env` — credentials
To add a new site: write one JSON file. No Python changes.