Conversation
…agents Implement all 7 tasks from the subscription-auth spec: 1. Add createAuthCacheArgs(), resolveAuthCachePath(), hasAuthCache() to src/utils/docker.ts 2. Add tryAnyEnv() to src/utils/env.ts; make API keys optional in claude/gemini/codex builders 3. Apply auth mounts in docker-strategy.ts for both v1 and v2 execution paths 4. Apply auth mounts in v2-container.ts (replace hardcoded claudeMount) 5. Implement --setup-auth <agent> CLI command for interactive Docker login 6. Update environment.md with subscription auth documentation 7. Add unit tests for new functions and subscription auth behavior Auth priority: API key > auth cache volume > error with --setup-auth suggestion. GitHub Actions unaffected (always uses API key secrets). Closes #59
…duplicate agent list, remove unused import
Gemini CLI does not have a 'login' sub-command — it authenticates interactively on first launch. Add AUTH_LOGIN_ARGS mapping so --setup-auth gemini runs 'gemini' (no args) while claude/codex still use '<agent> login'.
- Log which auth method is used (API key vs subscription) in all 3 agent builders so users can see which path was taken. - Add missing await on expect(...).rejects.toThrow() in subscription auth tests (Claude, Gemini, Codex) so assertions are actually checked.
…subcommand) None of the supported CLIs (Claude, Gemini, Codex) have a dedicated 'login' sub-command. They all authenticate interactively when launched for the first time. Update AUTH_LOGIN_ARGS to use empty arrays for all agents and update spec docs accordingly.
Codex CLI requires 'codex login --device-auth' for headless Device-Code flow authentication, unlike Claude and Gemini which authenticate interactively on first launch with no extra args.
- SETUP_AUTH_IMAGE constant points to node:lts (lightweight, no build required) - Mount local scripts/run-agent.sh into the container (read-only) - run-agent.sh handles CLI installation via npm automatically - Users no longer need to build ts-bench-container just for authentication
…o exit Claude and Gemini CLIs enter interactive chat mode after successful login, so the user must Ctrl-C to exit — producing a non-zero exit code. Previously the sentinel was only written on exit code 0. Now we also check for the agent-specific credential file (e.g. .credentials.json for Claude, oauth_creds.json for Gemini) and write the sentinel if credentials are found, regardless of exit code.
Two fixes: 1. Use `return await` instead of bare `return` in runV2Docker's try/finally. Without `await`, the finally block runs immediately (destroying the container) while runV2Task is still executing. This is a known JS gotcha where `return promise` in an async try/finally lets the finally block run before the promise resolves. 2. Set IS_SANDBOX=1 for Claude when running in Docker. Claude Code refuses `--dangerously-skip-permissions` when running as root for safety, but IS_SANDBOX=1 signals we are inside an isolated container environment.
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
There was a problem hiding this comment.
Devin Review found 1 potential issue.
⚠️ 1 issue in files not directly in the diff
⚠️ Missing unit test for new IS_SANDBOX env var violates constitution rule III (src/agents/builders/__tests__/claude.test.ts:71-84)
The project constitution (.specify/memory/constitution.md, principle III) requires: "Changes to runner, agents, or benchmarks must include or update unit tests under src/**/__tests__/ where applicable." The PR adds IS_SANDBOX=1 to the Claude agent's environment when useDocker is true (src/agents/builders/claude.ts:70-72), but the existing test at src/agents/builders/__tests__/claude.test.ts:71-84 — which already creates a builder with useDocker: true and checks the command — is not updated to verify cmd.env?.IS_SANDBOX === '1'. This is directly testable and the test infrastructure is already in place.
View 2 additional findings in Devin Review.
|
Addressed Devin Review finding in 22aeb7d — added two unit tests for
|
…th-spec' into devin/1775145866-v2-container-fixes
Summary
Two critical fixes for v2 (SWE-Lancer) Docker benchmark execution, plus a merge of the subscription-based authentication feature (PR #90).
1. Container destroyed mid-execution (
exercise.ts)runV2Docker()usedreturn this.runV2Task(...)inside atry/finallyblock. In async functions, barereturn promiselets thefinallyblock execute before the promise resolves — socontainer.destroy()killed the Docker container while the agent was still running. Fix:return await.Empirically verified:
Confirmed in a live v2 benchmark run: Codex agent executed for 44.9s inside the container (previously would die immediately with SIGKILL/signal=9).
2. Claude
--dangerously-skip-permissionsrejected as root (claude.ts)Claude Code refuses
--dangerously-skip-permissionswhen running as root (the default in Docker). SettingIS_SANDBOX=1signals to Claude Code that it's inside an isolated container, allowing the flag.3. Subscription-based authentication (merged from PR #90)
This branch now includes the full subscription auth implementation, enabling agents to run without API keys when the user has a valid local login session. See PR #90 for the detailed review of that feature.
Updates since last revision
claude.test.ts— one verifyingIS_SANDBOX=1is set whenuseDocker: true, another verifying it's absent without Docker.Review & Testing Checklist for Human
return awaitfix by running a v2 benchmark:bun src/index.ts --dataset v2 --docker --agent codex --model gpt-5.4-mini --task 16912_4 --verbose— the agent should execute inside the container instead of immediately failing with "container is not running"bun src/index.ts --dataset v2 --docker --agent claude --model haiku --task 16912_4 --verbose— should not error with "dangerously-skip-permissions cannot be used with root/sudo"IS_SANDBOX=1is appropriate for all Docker runs (v1 + v2), not just v2Notes
returnvsreturn awaitgotcha only matters insidetry/finally(ortry/catch) blocks in async functions. TherunOneCommitGroupmethod inbenchmark/runner.tsis not affected because it uses a loop withawaitbefore reaching the finally block.IS_SANDBOX=1is a Claude Code environment variable that bypasses the root safety check. This is set for all Docker runs since containers are inherently sandboxed.gpt-4.5-mini(which doesn't exist as an OpenAI model) — the container fix was confirmed working because the agent ran for 44.9s before hitting the model-not-found error, whereas previously the container would be killed immediately. A full successful run requires a valid model name (e.g.gpt-5.4-mini).Link to Devin session: https://app.devin.ai/sessions/514b15c4c58a4d94a8a03a366ff4f4f5
Requested by: @laiso