
Long-Horizon Task Support

Status: merged into main


The Problem

Standard tasks (profile extraction, single-page audit) complete in 2-10 steps. Long-horizon tasks — multi-page audits, cross-link navigation chains, paginated data collection — need 30-50+ steps.

Before these changes, the agent had:

  • Hard max_steps ceiling (the only termination condition)
  • Fixed 5-action history (forgets everything older)
  • All-or-nothing output (done or failed, no partial saves)
  • No awareness of how much work is left

Part 1 — Incremental Checkpointing

Files changed: agent_loop.py, models/actions.py, tools/output.py

save_progress — the 10th agent action:

The agent can now call save_progress(extracted, note) to checkpoint partial data without stopping. Data is deep-merged across calls — arrays append, dicts recurse.

Step 8:  save_progress({ "prs": [{ "title": "Fix editor", "author": "alice" }] }, "PR #1 done")
Step 16: save_progress({ "prs": [{ "title": "Refactor sync", "author": "bob" }] }, "PR #2 done")
Step 22: done({ "total_audited": 2 })
         → accumulated data merged with final extraction → result.json

If the agent crashes at step 20, checkpoint.json has all data from steps 8 and 16.
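
A minimal sketch of the merge semantics (the helper name deep_merge is an assumption; the real logic lives in tools/output.py):

def deep_merge(base: dict, update: dict) -> dict:
    # Merge `update` into `base`: lists append, dicts recurse, scalars overwrite.
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        elif isinstance(value, list) and isinstance(base.get(key), list):
            base[key].extend(value)
        else:
            base[key] = value
    return base

Under these semantics, the two save_progress calls above yield a single prs array containing both entries.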

Live checkpoint.json:

Written to the sample's evidence folder every 5 steps and on every save_progress call:

{
  "sample_id": "pr_chain_audit",
  "status": "in_progress",
  "step": 16,
  "max_steps": 50,
  "accumulated_data": { "prs": [{ "title": "Fix editor", "author": "alice" }, ...] },
  "progress_notes": ["PR #1 done", "PR #2 done"],
  "artifacts_so_far": [{ "filename": "01_pr_overview.png", "sha256": "..." }],
  "steps_logged": 16,
  "updated_at": "2026-03-27T18:30:00Z"
}

action_log.json is also flushed at checkpoint time, so long-running tasks keep a live step trace on disk instead of only writing it at final completion.
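
A sketch of the write path (function and argument names are assumptions; the real writer also handles the action_log.json flush):

import json
import time
from pathlib import Path

CHECKPOINT_EVERY = 5  # steps between automatic checkpoints

def maybe_checkpoint(step: int, action_type: str, evidence_dir: Path, state: dict) -> None:
    # Checkpoint every 5 steps and on every save_progress call.
    if step % CHECKPOINT_EVERY == 0 or action_type == "save_progress":
        state["updated_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        (evidence_dir / "checkpoint.json").write_text(json.dumps(state, indent=2))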

Watch it live while the agent runs:

watch -n 2 cat evidence/run_XXXX/sample_id/checkpoint.json

Step budget awareness:

Every prompt now includes: Step 16 of 40 (24 remaining) — the agent always knows how many steps it has left.

Accumulated data in prompt:

The full accumulated dict (from all save_progress calls) is shown in the prompt as a JSON block under "Data collected so far." This means when the agent is on PR #5, it can see what it collected from PRs #1-4.


Part 2 — Smarter Memory

Files changed: agent_loop.py

LLM-powered step summaries:

Every 10 steps, Claude Haiku (a fast, cheap model) summarizes older history:

Steps 1-10: Navigated to the merged PR list, clicked into PR #305569.
Extracted title, author, reviewer. Took screenshot and saved progress.
Navigated back to the list.

This replaces the raw step list for old history. Summaries now use a structured FOUND/GAPS/NEXT format (see Part 5). The agent sees:

  • Structured LLM summaries of earlier work (findings, gaps, next actions)
  • Dynamic budget-fitted recent actions (5-25 items, importance-scored)
  • Full accumulated data (what was collected)
  • Structured run state (failures, dead ends, blocked selectors)

Cost: < $0.001 per summary. Falls back to mechanical concatenation if the LLM call fails.
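
A sketch of the summarization call with its fallback (llm_call stands in for whatever wrapper the project uses around the Haiku API; the name is an assumption):

def summarize_history(llm_call, steps: list[str]) -> str:
    # Ask the cheap model for a FOUND/GAPS/NEXT summary of older steps.
    prompt = "Summarize these agent steps as FOUND / GAPS / NEXT:\n" + "\n".join(steps)
    try:
        return llm_call(prompt)
    except Exception:
        # Mechanical fallback: concatenate and truncate to keep the prompt small.
        return " | ".join(steps)[:1000]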

Extract → accumulated buffer:

Every extract action result is now stored in accumulated["extracted_texts"]. This means if the agent extracts text from a page, that text persists in memory even after the recent-action window slides past it.
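
Illustrative shape of that buffer update (variable names are assumptions):

# Persist every extract result so it survives the sliding history window.
accumulated.setdefault("extracted_texts", []).append(
    {"step": step, "url": page_url, "text": extracted_text}
)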


Part 3 — Continuous Operation

Files changed: agent_loop.py, task_planner.py, main.py

Auto-pagination:

When the agent clicks "Next", "Load more", "Page 2", etc., the system detects it via keyword matching and grants +3 bonus steps. Pagination doesn't eat the task's working budget.

Detection keywords: next, next page, load more, show more, older, newer, »

Step 15 | click("Next page") → Pagination detected → +3 bonus (effective_max=43)
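
A sketch of the detection, assuming a simple substring match over the clicked element's label (constants are illustrative):

PAGINATION_KEYWORDS = ("next page", "load more", "show more", "next", "older", "newer", "»")
PAGINATION_BONUS = 3

def pagination_bonus(click_label: str) -> int:
    # Grant bonus steps when the clicked element looks like pagination.
    label = click_label.strip().lower()
    return PAGINATION_BONUS if any(kw in label for kw in PAGINATION_KEYWORDS) else 0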

Watchdog (stall detection):

If 5 consecutive steps produce no new data (no save_progress call with genuinely new data and no successful extract), the watchdog injects:

WARNING: You have not produced new data in 5 steps.
You have 12 steps left. Either extract/save_progress with data,
or call done with what you have, or call fail.

Also writes a checkpoint so no accumulated data is lost if the stall continues.

Batch chunking:

task_planner.py now has plan_chunked() that detects large-scale tasks (10+ items with individual URLs). It generates a discovery spec alongside the execution spec. main.py auto-runs discovery → collects URLs → distributes as parallel samples.


Part 4 — Smart Termination

Files changed: agent_loop.py, models/task.py, models/actions.py, main.py

Before every step, _check_termination() evaluates multiple signals:

| Trigger | Status | When |
|---|---|---|
| done + all requirements met | done | Agent satisfied all fields + artifacts |
| done + array count < expected | partial_success | Got some items but not all |
| Wall-clock timeout | partial_success / failed | max_time_seconds exceeded |
| Network circuit breaker | partial_success / failed | 5 consecutive infra errors |
| Watchdog stall | warning injected | 5 steps, no new data |
| max_steps exhausted | failed | Hard ceiling (accumulated data saved) |
| LLM API error | failed | Claude unreachable |
| fail(reason) | failed | Agent gives up intentionally |

Infrastructure error classification:

_is_infra_error() distinguishes network/browser failures from logic errors:

  • Infra: timeout, DNS, connection refused, page crashed, SSL, browser closed
  • Logic: element not found, click failed, selector mismatch

Only infra errors count toward the circuit breaker. A wrong CSS selector does NOT trigger early termination.
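
A sketch of the classifier under those assumptions (the marker list mirrors the bullets above; the real _is_infra_error() may differ):

INFRA_MARKERS = (
    "timeout", "dns", "connection refused",
    "page crashed", "ssl", "browser closed",
)

def is_infra_error(message: str) -> bool:
    # Network/browser failures count toward the circuit breaker;
    # logic errors (element not found, selector mismatch) do not.
    msg = message.lower()
    return any(marker in msg for marker in INFRA_MARKERS)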

New partial_success status:

When the agent collected some data but couldn't finish (site down, timeout, incomplete items). The accumulated data is preserved in result.json. Better than binary done/failed.

New TaskSpec fields:

{
  "max_steps": 50,
  "max_time_seconds": 300,
  "expected_items": 5,
  "max_consecutive_network_errors": 5
}

New Files Created

| File | Purpose |
|---|---|
| tasks/github_pr_audit_chain.json | PR audit task spec (50 steps, judgment, save_progress) |
| tasks/github_contributor_deep_audit.json | Contributor deep audit (40 steps, cross-page navigation) |
| tasks/inputs/github_pr_chain.csv | Input CSV for PR audit |
| tasks/inputs/github_pr_audit.csv | Alternative PR input |
| tasks/inputs/github_contributors.csv | Input CSV for contributor audit |

How to Test

Test 1: Contributor Deep Audit (recommended first)

cd playwright_agent
python main.py --task tasks/github_contributor_deep_audit.json \
  --input tasks/inputs/github_contributors.csv --no-headless

What happens: Agent visits vscode contributors page → clicks top 3 contributor profiles → extracts name/company/location/followers from each → screenshots each → save_progress after each → navigates back → repeats.

What to watch for:

  • checkpoint.json appearing and growing after each contributor
  • Agent navigating back to the contributors list between profiles
  • Console showing save_progress #1, save_progress #2, save_progress #3
  • Final result.json with all 3 contributors merged

Monitor checkpoint live (in another terminal):

watch -n 2 cat playwright_agent/evidence/run_*/vscode_contributors/checkpoint.json

Test 2: PR Audit Chain (harder, multi-page)

cd playwright_agent
python main.py --task tasks/github_pr_audit_chain.json \
  --input tasks/inputs/github_pr_chain.csv --no-headless

What happens: Agent navigates merged PR list → clicks into each PR → extracts title/author/reviewers/CI status → clicks "Files changed" tab → extracts file count → screenshots → checkpoints → navigates back → repeats for 3-5 PRs. Includes judgment at the end.

Test 3: Natural Language Long-Horizon

cd playwright_agent
python main.py --prompt "Go to the GitHub repository microsoft/vscode. \
  Visit the top 3 contributors' profiles. For each, extract their name, \
  company, and followers count. Take a screenshot of each profile. \
  Use save_progress after each contributor." --no-headless

What to watch for: Planner generates a task spec with max_steps=40 and instructs the agent to use save_progress. Execution flow should mirror Test 1.


What Each Prompt Looks Like (agent's perspective)

At step 16 of 40, the agent sees:

## Current page state
URL: https://github.com/user123
Title: user123 (John Doe)
[0] [heading] "John Doe"
[1] [text] "Software Engineer at Google"
[2] [text] "1.2k followers"
...

**Step 16 of 40** (24 remaining)

## Data collected so far (via save_progress)
{
  "contributors": [
    { "username": "torvalds", "name": "Linus Torvalds", "company": null, "followers": "293k" },
    { "username": "gvanrossum", "name": "Guido van Rossum", "company": null, "followers": "25.9k" }
  ]
}

## Earlier steps (condensed)
Steps 1-10:
FOUND: Extracted name, company, followers for torvalds and gvanrossum. Screenshots taken.
GAPS: 1 of 3 contributors still not visited. user123 profile not started.
NEXT: Navigate to user123's profile and extract the same fields.

## Run state
Pages visited: github.com/microsoft/vscode/graphs/contributors, github.com/torvalds, github.com/gvanrossum, github.com/user123
Screenshots taken: contributors_list, profile_torvalds, profile_gvanrossum
Exhausted pages (all data taken): github.com/torvalds, github.com/gvanrossum

## Action history (recent items, budget-fitted)
Step 12: goto → Navigated to https://github.com/user123
Step 13: screenshot(profile_user123) → Screenshot saved: 04_profile_user123.png
Step 14: extract(1) → Extracted 45 chars
Step 15: extract(2) → Extracted 12 chars
Step 16: save_progress → Progress saved (2/3 items). 24 steps remaining.

## Goal
Audit the top 3 contributors: visit each profile, extract details...

## Output schema
{ "contributors": "array", "total_audited": "number", "repo_name": "string" }

Take the single best next action.


Part 5 — Research-Backed Memory Architecture

Files changed: agent_loop.py, memory.py, config.py

Research basis:

  • CoALA (Princeton, TMLR 2024): modular memory — working, episodic, procedural
  • Lost in the Middle (Stanford, 2023): keep prompt fill below 20% of context
  • ReSum (Alibaba, 2025): structured goal-oriented summaries for indefinite exploration
  • BrowserUse + Mem0: procedural memory snapshots for 98% task completion, 41% cost reduction

Upgrade 1: Structured Run State

The progress dict now tracks failures and dead ends, not just successes:

progress = {
    "pages_visited": [],         # URLs successfully loaded
    "fields_found": [],          # data snippets extracted
    "artifacts": [],             # screenshots taken
    "failed_urls": [],           # URLs that errored (404, timeout, auth)
    "exhausted_pages": [],       # pages where data was already extracted
    "blocked_selectors": [],     # selectors that failed 2+ times
    "dead_ends": [],             # actions repeated 3+ times with no progress
}

This state is injected into every prompt under "## Run state" so the agent knows what to skip:

## Run state
Pages visited: github.com/torvalds, github.com/gvanrossum
FAILED URLs (skip these): github.com/deleted-user-404
BROKEN selectors (don't retry): div.old-layout-sidebar
DEAD ENDS (tried, didn't work): click on github.com/microsoft/vscode/graphs
Exhausted pages (all data taken): github.com/torvalds

Selector failure tracking: after a selector fails 2 times, it's flagged as "blocked" so the agent stops retrying it.

Page exhaustion: when save_progress records new data collected from a page, that URL is marked as exhausted.
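
A sketch of both rules against the progress dict above (helper names are assumptions):

from collections import Counter

selector_failures: Counter = Counter()

def record_selector_failure(progress: dict, selector: str) -> None:
    # Flag a selector as blocked after its second failure.
    selector_failures[selector] += 1
    if selector_failures[selector] >= 2 and selector not in progress["blocked_selectors"]:
        progress["blocked_selectors"].append(selector)

def mark_exhausted(progress: dict, url: str) -> None:
    # Record that all wanted data has been taken from this page.
    if url not in progress["exhausted_pages"]:
        progress["exhausted_pages"].append(url)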


Upgrade 2: Episodic Failure Memory

MemoryStore is now run-scoped — stored inside evidence/run_XXXX/memory/. Each run builds its own memory from scratch. Samples within a run learn from each other, but different runs cannot interfere.

| Type | File | Learned from | Contains |
|---|---|---|---|
| Procedural patterns | evidence/run_XXXX/memory/patterns.json | done samples in this run | action sequences, tips, things to avoid |
| Episodic warnings | evidence/run_XXXX/memory/failures.json | failed / partial_success samples in this run | dead URLs, broken selectors, dead ends, failure reason |

Why run-scoped? A pattern learned for a commit audit ("click PR link") would be harmful if injected into an unrelated enrichment task that touches the same domain. Run-scoped memory guarantees that samples within the same task help each other, while different tasks stay isolated.

Failure signals are stored and injected into prompts for later samples in the same run:

## Known issues on github.com (from past failures)
Dead URLs (skip): github.com/deleted-user-404
Broken selectors: div.old-layout-sidebar
Previous failure: Expected 5 items but collected 3

learn_failures() is called from:

  • Smart termination (timeout, network circuit breaker)
  • Agent calling fail()
  • done with partial_success status

Upgrade 3: Structured Summary Schema (ReSum-inspired)

Before: Step summaries were generic prose:

Steps 1-10: Navigated to contributors page, clicked into torvalds profile.
Extracted name and followers. Took screenshot.

After: Summaries follow a structured FOUND/GAPS/NEXT format:

Steps 1-10:
FOUND: Extracted name, company, followers for torvalds. Screenshot taken.
GAPS: 2 of 3 contributors still not visited. No bio data collected yet.
NEXT: Navigate back to contributors list and click the next profile.

This is directly inspired by ReSum (Alibaba, 2025) which showed that structured summaries with "verified evidence + information gaps + next-step directions" outperform prose summaries by 4.5% on long-horizon web tasks.


Upgrade 4: Token Budget Formula

PROMPT_TOKEN_BUDGET = min(24_000, max(8_000, int(LLM_CONTEXT_WINDOW * 0.08)))
  • 8% of context window: Stays well below the "lost in the middle" degradation zone (~20% fill)
  • Floor 8K: Even a small model gets a usable budget
  • Cap 24K: A 1M-context model doesn't waste 80K of prompt on a single browser step
  • Overridable: Set LLM_CONTEXT_WINDOW in .env for non-standard models

| Model | Context | Budget | History (30%) |
|---|---|---|---|
| Claude Sonnet (200K) | 200K | 16,000 | 4,800 |
| Claude Haiku (200K) | 200K | 16,000 | 4,800 |
| Future 1M model | 1M | 24,000 (capped) | 7,200 |
| Small 32K model | 32K | 8,000 (floor) | 2,400 |

Part 6 — Browser-Use Inspired Improvements

Files changed: agent_loop.py, models/actions.py, config.py

Research basis:

  • browser-use: structured self-evaluation, escalating recovery, budget pressure, multi-action batching
  • Decision hygiene: explicit reflection fields instead of incidental text

Upgrade 1: Structured Self-Evaluation

Every action now includes three optional reflection fields:

class AgentAction(BaseModel):
    # ... existing fields ...
    evaluation_previous_step: str | None = None  # "Did my last action work?"
    memory_update: str | None = None             # "What to remember going forward"
    next_goal: str | None = None                 # "What I'll do next and why"
  • No extra LLM call — reflection is part of the tool call response
  • Truncated to 160 chars to prevent token bloat
  • Stored in StepRecord → appears in action_log.json audit trail
  • memory_update and next_goal shown in history for context continuity
  • Controlled by REFLECTION_MODE: "light" (default: slimmer tool schema + prompt) or "full" (reflection in tools and history annotations)

Upgrade 2: Escalating Loop/Stagnation Detection

Replaces the old flat watchdog with a 3-level escalation using page signature hashing:

Page signature = MD5 of (normalized URL + first 2K of DOM text). Same signature + no new data → stagnation count increases.

| Level | Trigger | Response |
|---|---|---|
| 1 (gentle) | 3 stagnant steps | "Try a different approach" |
| 2 (forceful) | 5 stagnant steps | "CHANGE YOUR STRATEGY NOW" + checkpoint saved |
| 3 (critical) | 8 stagnant steps | "MUST call done or fail" |

Unified into _build_recovery_notice() which also catches:

  • Action spam (4+ identical actions on same URL)
  • Consecutive failure recovery (3+ failures → visible element list)
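
A sketch of the signature and the escalation lookup (names are assumptions; the real logic lives in agent_loop.py):

import hashlib

ESCALATION = {
    3: "Level 1: Try a different approach.",
    5: "Level 2: CHANGE YOUR STRATEGY NOW.",  # a checkpoint is also saved here
    8: "Level 3: You MUST call done or fail.",
}

def page_signature(url: str, dom_text: str) -> str:
    # MD5 of the normalized URL plus the first 2K of DOM text.
    normalized = url.split("#")[0].rstrip("/").lower()
    return hashlib.md5((normalized + dom_text[:2000]).encode("utf-8")).hexdigest()

def stagnation_notice(stagnant_steps: int) -> str | None:
    # Same signature + no new data increments `stagnant_steps`; escalate at 3/5/8.
    return ESCALATION.get(stagnant_steps)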

Upgrade 3: Budget Pressure Warnings

One-time notices injected at step-budget thresholds:

| Threshold | Message |
|---|---|
| 75% used | "Start consolidating results — call save_progress, then finalize with done" |
| 90% used | "URGENT — save any unsaved data NOW, then call done immediately" |
| Final step | Tools restricted to done and fail only (via _get_terminal_tools()) |

Each fires exactly once — no spam.
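
A sketch of the one-shot threshold check (the fired set persists across steps; names are assumptions):

def budget_notice(step: int, max_steps: int, fired: set) -> str | None:
    # One-time warnings at 75% and 90% of the step budget.
    used = step / max_steps
    if used >= 0.90 and "90" not in fired:
        fired.add("90")
        return "URGENT — save any unsaved data NOW, then call done immediately"
    if used >= 0.75 and "75" not in fired:
        fired.add("75")
        return "Start consolidating results — call save_progress, then finalize with done"
    return None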


Upgrade 4: Final-Response-After-Failure

_attempt_final_consolidation() — one last LLM call when:

  • max_steps exhausted with accumulated data
  • LLM retries all fail with accumulated data

The consolidation call receives accumulated data + task schema and produces best-effort structured output. Fits the existing partial_success model.

When the primary model has just failed, the consolidation call prefers the fallback model (if ENABLE_FALLBACK_LLM=true).


Upgrade 5: Fallback LLM

When ENABLE_FALLBACK_LLM=true (default: false):

  • Primary model retried 3x with exponential backoff
  • One attempt on FALLBACK_LLM_MODEL (default: Claude Haiku)
  • If fallback succeeds, continues the run
  • Model switch is explicitly logged

Upgrade 6: Multi-Action Batching (Experimental)

When ENABLE_MULTI_ACTIONS=true (default: false):

  • LLM can return multiple tool calls per step (max MAX_ACTIONS_PER_STEP, default 3)
  • Sub-actions execute sequentially with safety guards:
    • Batch-breaking actions (goto, done, fail, save_progress) abort remaining
    • URL change → abort (stale element map)
    • DOM stability check: >20% shift in interactive element count → abort
    • Any failure → abort
  • Fresh DOM refresh before each sub-action
  • Full per-sub-action logging in action_log.json
  • Best for: form fills, repetitive extraction. Not for navigation-heavy flows.
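
A sketch of the guarded batch loop (execute and snapshot are stand-ins: execute runs one sub-action after a fresh DOM refresh, snapshot returns the current URL and interactive element count):

BATCH_BREAKERS = {"goto", "done", "fail", "save_progress"}
MAX_DOM_SHIFT = 0.20  # >20% change in interactive element count aborts

def run_batch(actions: list, execute, snapshot) -> None:
    start_url, start_count = snapshot()
    for action in actions:
        if not execute(action):
            break  # any failure aborts the remainder of the batch
        if action.type in BATCH_BREAKERS:
            break  # navigation/terminal actions end the batch
        url, count = snapshot()
        if url != start_url:
            break  # URL changed → element indices are stale
        if start_count and abs(count - start_count) / start_count > MAX_DOM_SHIFT:
            break  # DOM shifted too much → abort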

New Config Flags

REFLECTION_MODE=light             # default; use "full" for long-horizon
ENABLE_MEMORY_DISTILLATION=true   # false = heuristic patterns only (no post-run Haiku)
FINALIZE_ON_FAILURE=true          # best-effort consolidation on exhaustion/failure
ENABLE_FALLBACK_LLM=false         # try fallback model on primary failure
FALLBACK_LLM_MODEL=claude-haiku-4-5
ENABLE_MULTI_ACTIONS=false        # experimental multi-action batching
MAX_ACTIONS_PER_STEP=3            # max sub-actions per batch

Architecture Summary

Before long-horizon:
  history = last 5 actions (everything else forgotten)
  termination = max_steps only
  output = all-or-nothing (done or failed)

After long-horizon + browser-use improvements:
  reflection:
    evaluation  = structured self-assessment per step
    memory      = explicit working scratchpad per step
    next_goal   = declared intent before acting
  memory:
    working   = dynamic budget-fitted window (5-25 items, importance-scored)
    summaries = structured FOUND/GAPS/NEXT every 10 steps (Haiku, <$0.001)
    run state = pages visited, failed URLs, blocked selectors, dead ends, exhausted pages
    procedural = domain-keyed patterns from successful runs (patterns.json)
    episodic  = failure warnings from failed runs (failures.json)
  recovery:
    escalating = gentle nudge → forceful demand → forced consolidation
    budget     = warnings at 75% and 90%, final-step done|fail only
    consolidation = one last LLM call on exhaustion/failure with accumulated data
    fallback   = optional secondary model on primary failure
  termination = 8+ conditions checked every step
  output = done | partial_success | failed | needs_review
  checkpoint = live file updated every 5 steps
  pagination = auto-detected, bonus steps granted
  batching = optional multi-action per step (experimental)
  token budget = min(24K, max(8K, context_window * 8%))