Status: merged into main
Standard tasks (profile extraction, single-page audit) complete in 2-10 steps. Long-horizon tasks — multi-page audits, cross-link navigation chains, paginated data collection — need 30-50+ steps.
Before these changes, the agent had:
- Hard `max_steps` ceiling (only termination)
- Fixed 5-action history (forgets everything older)
- All-or-nothing output (`done` or `failed`, no partial saves)
- No awareness of how much work is left
Files changed: agent_loop.py, models/actions.py, tools/output.py
save_progress — the 10th agent action:
The agent can now call save_progress(extracted, note) to checkpoint partial data without stopping. Data is deep-merged across calls — arrays append, dicts recurse.
Step 8: save_progress({ "prs": [{ title: "Fix editor", author: "alice" }] }, "PR #1 done")
Step 16: save_progress({ "prs": [{ title: "Refactor sync", author: "bob" }] }, "PR #2 done")
Step 22: done({ "total_audited": 2 })
→ accumulated data merged with final extraction → result.json
If the agent crashes at step 20, checkpoint.json has all data from steps 8 and 16.
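The merge semantics described above (arrays append, dicts recurse, scalars overwrite) can be sketched as a small helper. This is a minimal illustration; `deep_merge` is a hypothetical name, not necessarily the function in tools/output.py:

```python
def deep_merge(base: dict, new: dict) -> dict:
    """Merge `new` into `base`: lists append, dicts recurse, scalars overwrite."""
    for key, value in new.items():
        if isinstance(value, list) and isinstance(base.get(key), list):
            base[key].extend(value)       # arrays append
        elif isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)  # dicts recurse
        else:
            base[key] = value             # scalars (and new keys) overwrite
    return base

# Two save_progress calls accumulate into one buffer:
acc: dict = {}
deep_merge(acc, {"prs": [{"title": "Fix editor", "author": "alice"}]})
deep_merge(acc, {"prs": [{"title": "Refactor sync", "author": "bob"}]})
# acc["prs"] now holds both PR entries
```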
Live checkpoint.json:
Written to the sample's evidence folder every 5 steps and on every save_progress call:
{
"sample_id": "pr_chain_audit",
"status": "in_progress",
"step": 16,
"max_steps": 50,
"accumulated_data": { "prs": [{ "title": "Fix editor", "author": "alice" }, ...] },
"progress_notes": ["PR #1 done", "PR #2 done"],
"artifacts_so_far": [{ "filename": "01_pr_overview.png", "sha256": "..." }],
"steps_logged": 16,
"updated_at": "2026-03-27T18:30:00Z"
}

action_log.json is also flushed at checkpoint time, so long-running tasks keep a live step trace on disk instead of only writing it at final completion.
Watch it live while the agent runs:
watch -n 2 cat evidence/run_XXXX/sample_id/checkpoint.json

Step budget awareness:
Every prompt now includes: Step 16 of 40 (24 remaining) — the agent knows how much time it has left.
Accumulated data in prompt:
The full accumulated dict (from all save_progress calls) is shown in the prompt as a JSON block under "Data collected so far." This means when the agent is on PR #5, it can see what it collected from PRs #1-4.
Files changed: agent_loop.py
LLM-powered step summaries:
Every 10 steps, Claude (Haiku — fast/cheap model) summarizes old history:
Steps 1-10: Navigated to the merged PR list, clicked into PR #305569.
Extracted title, author, reviewer. Took screenshot and saved progress.
Navigated back to the list.
This replaces the raw step list for old history. Summaries now use a structured FOUND/GAPS/NEXT format (see Part 5). The agent sees:
- Structured LLM summaries of earlier work (findings, gaps, next actions)
- Dynamic budget-fitted recent actions (5-25 items, importance-scored)
- Full accumulated data (what was collected)
- Structured run state (failures, dead ends, blocked selectors)
Cost: < $0.001 per summary. Falls back to mechanical concatenation if the LLM call fails.
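The fallback behavior can be sketched roughly as follows, assuming a generic `llm_call` wrapper; `summarize_history` and `failing_llm` are hypothetical names for illustration only:

```python
from typing import Callable

def summarize_history(steps: list[str], llm_call: Callable[[str], str]) -> str:
    """Summarize old steps via a cheap model; fall back to concatenation on failure."""
    prompt = "Summarize as FOUND/GAPS/NEXT:\n" + "\n".join(steps)
    try:
        return llm_call(prompt)
    except Exception:
        # Mechanical fallback: keep the raw step lines (truncated) so nothing is lost.
        return "\n".join(steps)[:2000]

def failing_llm(prompt: str) -> str:
    raise RuntimeError("LLM unreachable")

# With an unreachable LLM, the mechanical fallback preserves the raw history:
summary = summarize_history(
    ["Step 1: goto contributors page", "Step 2: click torvalds"], failing_llm)
```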
Extract → accumulated buffer:
Every extract action result is now stored in accumulated["extracted_texts"]. This means if the agent extracts text from a page, that text persists in memory even after the recent-action window slides past it.
Files changed: agent_loop.py, task_planner.py, main.py
Auto-pagination:
When the agent clicks "Next", "Load more", "Page 2", etc., the system detects it via keyword matching and grants +3 bonus steps. Pagination doesn't eat the task's working budget.
Detection keywords: next, next page, load more, show more, older, newer, », ›, →
Step 15 | click("Next page") → Pagination detected → +3 bonus (effective_max=43)
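The keyword matching and bonus grant might look like this sketch (`is_pagination_click` and `grant_bonus` are illustrative names, not confirmed function names from the codebase):

```python
PAGINATION_KEYWORDS = ("next", "next page", "load more", "show more",
                       "older", "newer", "»", "›", "→")

def is_pagination_click(element_text: str) -> bool:
    """Keyword match on the clicked element's text."""
    text = element_text.strip().lower()
    return any(kw in text for kw in PAGINATION_KEYWORDS)

def grant_bonus(effective_max: int, element_text: str, bonus: int = 3) -> int:
    """Grant bonus steps when the click looks like pagination."""
    return effective_max + bonus if is_pagination_click(element_text) else effective_max

print(grant_bonus(40, "Next page"))  # → 43
```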
Watchdog (stall detection):
If 5 consecutive steps produce no new data (no save_progress with genuinely new data, or successful extract), the watchdog injects:
WARNING: You have not produced new data in 5 steps.
You have 12 steps left. Either extract/save_progress with data,
or call done with what you have, or call fail.
Also writes a checkpoint so no accumulated data is lost if the stall continues.
Batch chunking:
task_planner.py now has plan_chunked() that detects large-scale tasks (10+ items with individual URLs). It generates a discovery spec alongside the execution spec. main.py auto-runs discovery → collects URLs → distributes as parallel samples.
Files changed: agent_loop.py, models/task.py, models/actions.py, main.py
Before every step, _check_termination() evaluates multiple signals:
| Trigger | Status | When |
|---|---|---|
| done + all requirements met | done | Agent satisfied all fields + artifacts |
| done + array count < expected | partial_success | Got some items but not all |
| Wall-clock timeout | partial_success / failed | max_time_seconds exceeded |
| Network circuit breaker | partial_success / failed | 5 consecutive infra errors |
| Watchdog stall | warning injected | 5 steps, no new data |
| max_steps exhausted | failed | Hard ceiling (accumulated data saved) |
| LLM API error | failed | Claude unreachable |
| fail(reason) | failed | Agent gives up intentionally |
Infrastructure error classification:
_is_infra_error() distinguishes network/browser failures from logic errors:
- Infra: timeout, DNS, connection refused, page crashed, SSL, browser closed
- Logic: element not found, click failed, selector mismatch
Only infra errors count toward the circuit breaker. A wrong CSS selector does NOT trigger early termination.
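A plausible sketch of this classification, using substring matching on the error message (the marker list here is illustrative, not the exact one in the code):

```python
INFRA_MARKERS = ("timeout", "dns", "connection refused", "page crashed",
                 "ssl", "browser closed", "net::")

def is_infra_error(message: str) -> bool:
    """True for network/browser failures; logic errors (bad selector) return False."""
    msg = message.lower()
    return any(marker in msg for marker in INFRA_MARKERS)

# Infra errors feed the circuit breaker; logic errors do not:
is_infra_error("Timeout 30000ms exceeded")          # infra → counts
is_infra_error("element not found: div.sidebar")    # logic → ignored
```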
New partial_success status:
When the agent collected some data but couldn't finish (site down, timeout, incomplete items). The accumulated data is preserved in result.json. Better than binary done/failed.
New TaskSpec fields:
{
"max_steps": 50,
"max_time_seconds": 300,
"expected_items": 5,
"max_consecutive_network_errors": 5
}

| File | Purpose |
|---|---|
| tasks/github_pr_audit_chain.json | PR audit task spec (50 steps, judgment, save_progress) |
| tasks/github_contributor_deep_audit.json | Contributor deep audit (40 steps, cross-page navigation) |
| tasks/inputs/github_pr_chain.csv | Input CSV for PR audit |
| tasks/inputs/github_pr_audit.csv | Alternative PR input |
| tasks/inputs/github_contributors.csv | Input CSV for contributor audit |
cd playwright_agent
python main.py --task tasks/github_contributor_deep_audit.json \
  --input tasks/inputs/github_contributors.csv --no-headless

What happens: Agent visits vscode contributors page → clicks top 3 contributor profiles → extracts name/company/location/followers from each → screenshots each → save_progress after each → navigates back → repeats.
What to watch for:
- checkpoint.json appearing and growing after each contributor
- Agent navigating back to the contributors list between profiles
- Console showing save_progress #1, save_progress #2, save_progress #3
- Final result.json with all 3 contributors merged
Monitor checkpoint live (in another terminal):
watch -n 2 type playwright_agent\evidence\run_*\vscode_contributors\checkpoint.json

cd playwright_agent
python main.py --task tasks/github_pr_audit_chain.json \
  --input tasks/inputs/github_pr_chain.csv --no-headless

What happens: Agent navigates merged PR list → clicks into each PR → extracts title/author/reviewers/CI status → clicks "Files changed" tab → extracts file count → screenshots → checkpoints → navigates back → repeats for 3-5 PRs. Includes judgment at the end.
cd playwright_agent
python main.py --prompt "Go to the GitHub repository microsoft/vscode. \
Visit the top 3 contributors' profiles. For each, extract their name, \
company, and followers count. Take a screenshot of each profile. \
Use save_progress after each contributor." --no-headless

What to watch for: Planner generates a task spec with max_steps=40 and instructs the agent to use save_progress. Execution flow should mirror Test 1.
At step 16 of 40, the agent sees:
## Current page state
URL: https://github.com/user123
Title: user123 (John Doe)
[0] [heading] "John Doe"
[1] [text] "Software Engineer at Google"
[2] [text] "1.2k followers"
...
**Step 16 of 40** (24 remaining)
## Data collected so far (via save_progress)
{
"contributors": [
{ "username": "torvalds", "name": "Linus Torvalds", "company": null, "followers": "293k" },
{ "username": "gvanrossum", "name": "Guido van Rossum", "company": null, "followers": "25.9k" }
]
}
## Earlier steps (condensed)
Steps 1-10:
FOUND: Extracted name, company, followers for torvalds and gvanrossum. Screenshots taken.
GAPS: 1 of 3 contributors still not visited. user123 profile not started.
NEXT: Navigate to user123's profile and extract the same fields.
## Run state
Pages visited: github.com/microsoft/vscode/graphs/contributors, github.com/torvalds, github.com/gvanrossum, github.com/user123
Screenshots taken: contributors_list, profile_torvalds, profile_gvanrossum
Exhausted pages (all data taken): github.com/torvalds, github.com/gvanrossum
## Action history (recent items, budget-fitted)
Step 12: goto → Navigated to https://github.com/user123
Step 13: screenshot(profile_user123) → Screenshot saved: 04_profile_user123.png
Step 14: extract(1) → Extracted 45 chars
Step 15: extract(2) → Extracted 12 chars
Step 16: save_progress → Progress saved (2/3 items). 24 steps remaining.
## Goal
Audit the top 3 contributors: visit each profile, extract details...
## Output schema
{ "contributors": "array", "total_audited": "number", "repo_name": "string" }
Take the single best next action.
Files changed: agent_loop.py, memory.py, config.py
Research basis:
- CoALA (Princeton, TMLR 2024): modular memory — working, episodic, procedural
- Lost in the Middle (Stanford, 2023): keep prompt fill below 20% of context
- ReSum (Alibaba, 2025): structured goal-oriented summaries for indefinite exploration
- BrowserUse + Mem0: procedural memory snapshots for 98% task completion, 41% cost reduction
The progress dict now tracks failures and dead ends, not just successes:
progress = {
"pages_visited": [], # URLs successfully loaded
"fields_found": [], # data snippets extracted
"artifacts": [], # screenshots taken
"failed_urls": [], # URLs that errored (404, timeout, auth)
"exhausted_pages": [], # pages where data was already extracted
"blocked_selectors": [], # selectors that failed 2+ times
"dead_ends": [], # actions repeated 3+ times with no progress
}

This state is injected into every prompt under "## Run state" so the agent knows what to skip:
## Run state
Pages visited: github.com/torvalds, github.com/gvanrossum
FAILED URLs (skip these): github.com/deleted-user-404
BROKEN selectors (don't retry): div.old-layout-sidebar
DEAD ENDS (tried, didn't work): click on github.com/microsoft/vscode/graphs
Exhausted pages (all data taken): github.com/torvalds
Selector failure tracking: after a selector fails 2 times, it's flagged as "blocked" so the agent stops retrying it.
Page exhaustion: when save_progress successfully extracts new data from a page, that URL is marked as exhausted.
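Selector blocking and page exhaustion can be sketched as a small tracker; `RunState` and its method names are hypothetical here, chosen only to mirror the fields described above:

```python
from collections import Counter

class RunState:
    """Tracks failures and exhaustion so prompts can tell the agent what to skip."""

    def __init__(self, block_after: int = 2):
        self.block_after = block_after
        self.selector_failures: Counter = Counter()
        self.blocked_selectors: set[str] = set()
        self.exhausted_pages: set[str] = set()

    def record_selector_failure(self, selector: str) -> None:
        self.selector_failures[selector] += 1
        if self.selector_failures[selector] >= self.block_after:
            self.blocked_selectors.add(selector)  # agent stops retrying it

    def mark_exhausted(self, url: str) -> None:
        """Called when save_progress captured new data from this page."""
        self.exhausted_pages.add(url)
```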
MemoryStore is now run-scoped — stored inside evidence/run_XXXX/memory/. Each run builds its own memory from scratch. Samples within a run learn from each other, but different runs cannot interfere.
| Type | File | Learned from | Contains |
|---|---|---|---|
| Procedural patterns | evidence/run_XXXX/memory/patterns.json | done samples in this run | action sequences, tips, things to avoid |
| Episodic warnings | evidence/run_XXXX/memory/failures.json | failed / partial_success samples in this run | dead URLs, broken selectors, dead ends, failure reason |
Why run-scoped? A commit audit pattern ("click PR link") would be harmful if injected into a LinkedIn enrichment task on the same domain. Run-scoped memory guarantees samples within the same task help each other, while different tasks stay isolated.
Failure signals are stored and injected into prompts for later samples in the same run:
## Known issues on github.com (from past failures)
Dead URLs (skip): github.com/deleted-user-404
Broken selectors: div.old-layout-sidebar
Previous failure: Expected 5 items but collected 3
learn_failures() is called from:
- Smart termination (timeout, network circuit breaker)
- Agent calling fail()
- done with partial_success status
Before: Step summaries were generic prose:
Steps 1-10: Navigated to contributors page, clicked into torvalds profile.
Extracted name and followers. Took screenshot.
After: Summaries follow a structured FOUND/GAPS/NEXT format:
Steps 1-10:
FOUND: Extracted name, company, followers for torvalds. Screenshot taken.
GAPS: 2 of 3 contributors still not visited. No bio data collected yet.
NEXT: Navigate back to contributors list and click the next profile.
This is directly inspired by ReSum (Alibaba, 2025) which showed that structured summaries with "verified evidence + information gaps + next-step directions" outperform prose summaries by 4.5% on long-horizon web tasks.
PROMPT_TOKEN_BUDGET = min(24_000, max(8_000, int(LLM_CONTEXT_WINDOW * 0.08)))

- 8% of context window: stays well below the "lost in the middle" degradation zone (~20% fill)
- Floor 8K: Even a small model gets a usable budget
- Cap 24K: A 1M-context model doesn't waste 80K of prompt on a single browser step
- Overridable: set LLM_CONTEXT_WINDOW in .env for non-standard models
| Model | Context | Budget | History (30%) |
|---|---|---|---|
| Claude Sonnet (200K) | 200K | 16,000 | 4,800 |
| Claude Haiku (200K) | 200K | 16,000 | 4,800 |
| Future 1M model | 1M | 24,000 (capped) | 7,200 |
| Small 32K model | 32K | 8,000 (floor) | 2,400 |
Files changed: agent_loop.py, models/actions.py, config.py
Research basis:
- browser-use: structured self-evaluation, escalating recovery, budget pressure, multi-action batching
- Decision hygiene: explicit reflection fields instead of incidental text
Every action now includes three optional reflection fields:
class AgentAction(BaseModel):
    # ... existing fields ...
    evaluation_previous_step: str | None = None  # "Did my last action work?"
    memory_update: str | None = None             # "What to remember going forward"
    next_goal: str | None = None                 # "What I'll do next and why"

- No extra LLM call — reflection is part of the tool call response
- Truncated to 160 chars to prevent token bloat
- Stored in StepRecord → appears in action_log.json audit trail
- memory_update and next_goal shown in history for context continuity
- Controlled by REFLECTION_MODE: "light" (default: slimmer tool schema + prompt) or "full" (reflection in tools and history annotations)
Replaces the old flat watchdog with a 3-level escalation using page signature hashing:
Page signature = MD5 of (normalized URL + first 2K of DOM text). Same signature + no new data → stagnation count increases.
| Level | Trigger | Response |
|---|---|---|
| 1 (gentle) | 3 stagnant steps | "Try a different approach" |
| 2 (forceful) | 5 stagnant steps | "CHANGE YOUR STRATEGY NOW" + checkpoint saved |
| 3 (critical) | 8 stagnant steps | "MUST call done or fail" |
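The signature scheme and the escalation thresholds from the table can be sketched as follows (function names are illustrative; the URL normalization shown is an assumption):

```python
import hashlib

def page_signature(url: str, dom_text: str) -> str:
    """MD5 of normalized URL + first 2K of DOM text, per the scheme above."""
    normalized = url.rstrip("/").lower()
    payload = normalized + dom_text[:2048]
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def escalation_level(stagnant_steps: int) -> int:
    """0 = fine, 1 = gentle nudge, 2 = forceful, 3 = critical."""
    if stagnant_steps >= 8:
        return 3
    if stagnant_steps >= 5:
        return 2
    if stagnant_steps >= 3:
        return 1
    return 0
```

Same signature with no new data means the stagnation counter increments; a changed signature or fresh data resets it.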
Unified into _build_recovery_notice() which also catches:
- Action spam (4+ identical actions on same URL)
- Consecutive failure recovery (3+ failures → visible element list)
One-time notices injected at step-budget thresholds:
| Threshold | Message |
|---|---|
| 75% used | "Start consolidating results — call save_progress, then finalize with done" |
| 90% used | "URGENT — save any unsaved data NOW, then call done immediately" |
| Final step | Tools restricted to done and fail only (via _get_terminal_tools()) |
Each fires exactly once — no spam.
_attempt_final_consolidation() — one last LLM call when:
- max_steps exhausted with accumulated data
- LLM retries all fail with accumulated data
The consolidation call receives accumulated data + task schema and produces best-effort structured output. Fits the existing partial_success model.
When the primary model just failed, prefers the fallback model (if ENABLE_FALLBACK_LLM=true).
When ENABLE_FALLBACK_LLM=true (default: false):
- Primary model retried 3x with exponential backoff
- One attempt on FALLBACK_LLM_MODEL (default: Claude Haiku)
- If fallback succeeds, continues the run
- Model switch is explicitly logged
When ENABLE_MULTI_ACTIONS=true (default: false):
- LLM can return multiple tool calls per step (max MAX_ACTIONS_PER_STEP, default 3)
- Sub-actions execute sequentially with safety guards:
  - Batch-breaking actions (goto, done, fail, save_progress) abort remaining
  - URL change → abort (stale element map)
  - DOM stability check: >20% shift in interactive element count → abort
  - Any failure → abort
- Fresh DOM refresh before each sub-action
- Full per-sub-action logging in action_log.json
- Best for: form fills, repetitive extraction. Not for navigation-heavy flows.
REFLECTION_MODE=light # default; use "full" for long-horizon
ENABLE_MEMORY_DISTILLATION=true # false = heuristic patterns only (no post-run Haiku)
FINALIZE_ON_FAILURE=true # best-effort consolidation on exhaustion/failure
ENABLE_FALLBACK_LLM=false # try fallback model on primary failure
FALLBACK_LLM_MODEL=claude-haiku-4-5
ENABLE_MULTI_ACTIONS=false # experimental multi-action batching
MAX_ACTIONS_PER_STEP=3 # max sub-actions per batch
Before long-horizon:
history = last 5 actions (everything else forgotten)
termination = max_steps only
output = all-or-nothing (done or failed)
After long-horizon + browser-use improvements:
reflection:
evaluation = structured self-assessment per step
memory = explicit working scratchpad per step
next_goal = declared intent before acting
memory:
working = dynamic budget-fitted window (5-25 items, importance-scored)
summaries = structured FOUND/GAPS/NEXT every 10 steps (Haiku, <$0.001)
run state = pages visited, failed URLs, blocked selectors, dead ends, exhausted pages
procedural = domain-keyed patterns from successful runs (patterns.json)
episodic = failure warnings from failed runs (failures.json)
recovery:
escalating = gentle nudge → forceful demand → forced consolidation
budget = warnings at 75% and 90%, final-step done|fail only
consolidation = one last LLM call on exhaustion/failure with accumulated data
fallback = optional secondary model on primary failure
termination = 8+ conditions checked every step
output = done | partial_success | failed | needs_review
checkpoint = live file updated every 5 steps
pagination = auto-detected, bonus steps granted
batching = optional multi-action per step (experimental)
token budget = min(24K, max(8K, context_window * 8%))