@NicolasAG (Collaborator) commented on Nov 6, 2025

Introduces the webarena_verified benchmark.

  • Tasks are registered under the template webarena_verified.{intent_template_id}.{task_id}.
  • A new WebArenaVerifiedTask class overrides the setup() method of GenericWebArenaTask to:
    • use the webarena_verified evaluator
    • load extra HTTP headers when PW_EXTRA_HEADERS is set; these carry the secret keys needed to access self-hosted WebArena instances
    • append a hint to the goal so the model returns its response in the format the webarena_verified evaluator expects
  • A new WebArenaVerifiedEvaluator class calls webarena_verified.api.WebArenaVerifiedEvaluator from platform-labs-webarena-verified.
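
For readers skimming the description, here is a rough sketch of three of the steps listed above: building the task name, reading the optional headers, and appending the format hint. It is not the code from this PR. The helper names (make_task_id, load_extra_headers, append_format_hint) are hypothetical, and the sketch assumes PW_EXTRA_HEADERS holds a JSON object mapping header names to values, which the description does not confirm. The hint wording is likewise left to the caller.

```python
import json
import os


def make_task_id(intent_template_id: int, task_id: int) -> str:
    """Build a task name following the template given in the description."""
    return f"webarena_verified.{intent_template_id}.{task_id}"


def load_extra_headers(env_var: str = "PW_EXTRA_HEADERS") -> dict[str, str]:
    """Return extra HTTP headers from the environment, or {} if unset.

    Assumption: the variable contains a JSON object such as
    '{"X-Api-Key": "..."}'. The real format may differ.
    """
    raw = os.environ.get(env_var)
    if not raw:
        return {}
    headers = json.loads(raw)
    if not isinstance(headers, dict):
        raise ValueError(f"{env_var} must contain a JSON object")
    # Normalize keys and values to strings, as HTTP headers require.
    return {str(name): str(value) for name, value in headers.items()}


def append_format_hint(goal: str, hint: str) -> str:
    """Append a response-format hint to the task goal (hypothetical helper)."""
    return f"{goal.rstrip()}\n\n{hint}"
```

A dictionary like the one returned by load_extra_headers could then be handed to Playwright, for example through the extra_http_headers argument of browser.new_context(). Whether the PR wires it in this way is not stated in the description.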

@amanjaiswal73892 (Collaborator) left a comment

LGTM, as discussed.

@amanjaiswal73892 merged commit f0e4275 into main on Jan 20, 2026 (13 checks passed)
@amanjaiswal73892 deleted the wa_verified branch on January 20, 2026 at 19:41
3 participants