fix: v2 container killed prematurely + IS_SANDBOX for Claude Docker + subscription auth merge by devin-ai-integration[bot] · Pull Request #92 · laiso/ts-bench

devin-ai-integration · 2026-04-02T16:06:13Z

Summary

Two critical fixes for v2 (SWE-Lancer) Docker benchmark execution, plus a merge of the subscription-based authentication feature (PR #90).

1. Container destroyed mid-execution (exercise.ts)

runV2Docker() used return this.runV2Task(...) inside a try/finally block. In async functions, bare return promise lets the finally block execute before the promise resolves — so container.destroy() killed the Docker container while the agent was still running. Fix: return await.

Empirically verified:

return (no await): inner-start -> finally-runs -> finally-done -> inner-end
return await:      inner-start -> inner-end -> finally-runs -> finally-done

Confirmed in a live v2 benchmark run: Codex agent executed for 44.9s inside the container (previously would die immediately with SIGKILL/signal=9).

2. Claude --dangerously-skip-permissions rejected as root (claude.ts)

Claude Code refuses --dangerously-skip-permissions when running as root (the default in Docker). Setting IS_SANDBOX=1 signals to Claude Code that it's inside an isolated container, allowing the flag.

3. Subscription-based authentication (merged from PR #90)

This branch now includes the full subscription auth implementation, enabling agents to run without API keys when the user has a valid local login session. See PR #90 for the detailed review of that feature.

Updates since last revision

IS_SANDBOX unit tests (addressing Devin Review feedback): Added two tests in claude.test.ts — one verifying IS_SANDBOX=1 is set when useDocker: true, another verifying it's absent without Docker.
Merged subscription auth branch (PR feat: implement subscription-based (API key-less) authentication for agents #90) into this branch so that both the container fixes and subscription auth are available together for end-to-end benchmarking.

Review & Testing Checklist for Human

Verify the return await fix by running a v2 benchmark: bun src/index.ts --dataset v2 --docker --agent codex --model gpt-5.4-mini --task 16912_4 --verbose — the agent should execute inside the container instead of immediately failing with "container is not running"
Verify Claude agent works in Docker: bun src/index.ts --dataset v2 --docker --agent claude --model haiku --task 16912_4 --verbose — should not error with "dangerously-skip-permissions cannot be used with root/sudo"
Review the merge of PR feat: implement subscription-based (API key-less) authentication for agents #90 — ensure no conflicts or regressions were introduced by the merge. The subscription auth changes were already reviewed in PR feat: implement subscription-based (API key-less) authentication for agents #90, but verify they integrate cleanly with the two fixes here.
Confirm IS_SANDBOX=1 is appropriate for all Docker runs (v1 + v2), not just v2

Notes

The return vs return await gotcha only matters inside try/finally (or try/catch) blocks in async functions. The runOneCommitGroup method in benchmark/runner.ts is not affected because it uses a loop with await before reaching the finally block.
IS_SANDBOX=1 is a Claude Code environment variable that bypasses the root safety check. This is set for all Docker runs since containers are inherently sandboxed.
The Codex v2 benchmark was run with gpt-4.5-mini (which doesn't exist as an OpenAI model) — the container fix was confirmed working because the agent ran for 44.9s before hitting the model-not-found error, whereas previously the container would be killed immediately. A full successful run requires a valid model name (e.g. gpt-5.4-mini).
This branch now includes all changes from PR feat: implement subscription-based (API key-less) authentication for agents #90 (subscription auth). If PR feat: implement subscription-based (API key-less) authentication for agents #90 is merged to main first, this PR's diff will shrink to just the two bug fixes + IS_SANDBOX tests.

Link to Devin session: https://app.devin.ai/sessions/514b15c4c58a4d94a8a03a366ff4f4f5
Requested by: @laiso

…agents Implement all 7 tasks from the subscription-auth spec: 1. Add createAuthCacheArgs(), resolveAuthCachePath(), hasAuthCache() to src/utils/docker.ts 2. Add tryAnyEnv() to src/utils/env.ts; make API keys optional in claude/gemini/codex builders 3. Apply auth mounts in docker-strategy.ts for both v1 and v2 execution paths 4. Apply auth mounts in v2-container.ts (replace hardcoded claudeMount) 5. Implement --setup-auth <agent> CLI command for interactive Docker login 6. Update environment.md with subscription auth documentation 7. Add unit tests for new functions and subscription auth behavior Auth priority: API key > auth cache volume > error with --setup-auth suggestion. GitHub Actions unaffected (always uses API key secrets). Closes #59

…duplicate agent list, remove unused import

…unner

…log files

Gemini CLI does not have a 'login' sub-command — it authenticates interactively on first launch. Add AUTH_LOGIN_ARGS mapping so --setup-auth gemini runs 'gemini' (no args) while claude/codex still use '<agent> login'.

- Log which auth method is used (API key vs subscription) in all 3 agent builders so users can see which path was taken. - Add missing await on expect(...).rejects.toThrow() in subscription auth tests (Claude, Gemini, Codex) so assertions are actually checked.

…subcommand) None of the supported CLIs (Claude, Gemini, Codex) have a dedicated 'login' sub-command. They all authenticate interactively when launched for the first time. Update AUTH_LOGIN_ARGS to use empty arrays for all agents and update spec docs accordingly.

Codex CLI requires 'codex login --device-auth' for headless Device-Code flow authentication, unlike Claude and Gemini which authenticate interactively on first launch with no extra args.

- SETUP_AUTH_IMAGE constant points to node:lts (lightweight, no build required) - Mount local scripts/run-agent.sh into the container (read-only) - run-agent.sh handles CLI installation via npm automatically - Users no longer need to build ts-bench-container just for authentication

…o exit Claude and Gemini CLIs enter interactive chat mode after successful login, so the user must Ctrl-C to exit — producing a non-zero exit code. Previously the sentinel was only written on exit code 0. Now we also check for the agent-specific credential file (e.g. .credentials.json for Claude, oauth_creds.json for Gemini) and write the sentinel if credentials are found, regardless of exit code.

Two fixes: 1. Use `return await` instead of bare `return` in runV2Docker's try/finally. Without `await`, the finally block runs immediately (destroying the container) while runV2Task is still executing. This is a known JS gotcha where `return promise` in an async try/finally lets the finally block run before the promise resolves. 2. Set IS_SANDBOX=1 for Claude when running in Docker. Claude Code refuses `--dangerously-skip-permissions` when running as root for safety, but IS_SANDBOX=1 signals we are inside an isolated container environment.

devin-ai-integration · 2026-04-02T16:06:16Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

devin-ai-integration

Devin Review found 1 potential issue.

⚠️ 1 issue in files not directly in the diff

⚠️ Missing unit test for new IS_SANDBOX env var violates constitution rule III (`src/agents/builders/tests/claude.test.ts:71-84`)

The project constitution (.specify/memory/constitution.md, principle III) requires: "Changes to runner, agents, or benchmarks must include or update unit tests under src/**/__tests__/ where applicable." The PR adds IS_SANDBOX=1 to the Claude agent's environment when useDocker is true (src/agents/builders/claude.ts:70-72), but the existing test at src/agents/builders/__tests__/claude.test.ts:71-84 — which already creates a builder with useDocker: true and checks the command — is not updated to verify cmd.env?.IS_SANDBOX === '1'. This is directly testable and the test infrastructure is already in place.

View 2 additional findings in Devin Review.

devin-ai-integration · 2026-04-02T16:12:09Z

Addressed Devin Review finding in 22aeb7d — added two unit tests for IS_SANDBOX:

sets IS_SANDBOX=1 for Docker runs to allow skip-permissions as root — verifies cmd.env?.IS_SANDBOX === '1' when useDocker: true
does not set IS_SANDBOX when not using Docker — verifies IS_SANDBOX is undefined without Docker

…th-spec' into devin/1775145866-v2-container-fixes

devin-ai-integration bot added 12 commits April 2, 2026 07:36

fix: address review issues — pass agentName to V2ContainerManager, de…

2018aec

…duplicate agent list, remove unused import

fix: guard hasAuthCache with useDocker check, pass agentName in TestR…

3563b38

…unner

fix: update ClaudeLogCollector to use auth cache path in Docker mode

aa2fdd2

fix: use sentinel file in hasAuthCache to avoid false positives from …

1e2fa9d

…log files

fix: use agent-specific login commands in --setup-auth

2b742f6

Gemini CLI does not have a 'login' sub-command — it authenticates interactively on first launch. Add AUTH_LOGIN_ARGS mapping so --setup-auth gemini runs 'gemini' (no args) while claude/codex still use '<agent> login'.

fix: codex uses 'codex login --device-auth' for setup-auth

573ea5c

Codex CLI requires 'codex login --device-auth' for headless Device-Code flow authentication, unlike Claude and Gemini which authenticate interactively on first launch with no extra args.

devin-ai-integration bot assigned laiso Apr 2, 2026

devin-ai-integration bot commented Apr 2, 2026

View reviewed changes

test: add IS_SANDBOX unit tests for Claude Docker runs

22aeb7d

Merge remote-tracking branch 'origin/devin/1775114880-subscription-au…

6cc8e0d

…th-spec' into devin/1775145866-v2-container-fixes

devin-ai-integration bot changed the title ~~fix: v2 container killed prematurely due to missing await in try/finally~~ fix: v2 container killed prematurely + IS_SANDBOX for Claude Docker + subscription auth merge Apr 2, 2026

laiso merged commit 40d2bf6 into main Apr 2, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: v2 container killed prematurely + IS_SANDBOX for Claude Docker + subscription auth merge#92

fix: v2 container killed prematurely + IS_SANDBOX for Claude Docker + subscription auth merge#92
laiso merged 14 commits intomainfrom
devin/1775145866-v2-container-fixes

devin-ai-integration bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

devin-ai-integration bot commented Apr 2, 2026

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

devin-ai-integration bot commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

devin-ai-integration bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Updates since last revision

Review & Testing Checklist for Human

Notes

Uh oh!

devin-ai-integration bot commented Apr 2, 2026

🤖 Devin AI Engineer

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

⚠️ Missing unit test for new IS_SANDBOX env var violates constitution rule III (src/agents/builders/__tests__/claude.test.ts:71-84)

Uh oh!

devin-ai-integration bot commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

devin-ai-integration bot commented Apr 2, 2026 •

edited

Loading

⚠️ Missing unit test for new IS_SANDBOX env var violates constitution rule III (`src/agents/builders/tests/claude.test.ts:71-84`)