Skip to content

fix: v2 container killed prematurely + IS_SANDBOX for Claude Docker + subscription auth merge#92

Merged
laiso merged 14 commits intomainfrom
devin/1775145866-v2-container-fixes
Apr 2, 2026
Merged

fix: v2 container killed prematurely + IS_SANDBOX for Claude Docker + subscription auth merge#92
laiso merged 14 commits intomainfrom
devin/1775145866-v2-container-fixes

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot commented Apr 2, 2026

Summary

Two critical fixes for v2 (SWE-Lancer) Docker benchmark execution, plus a merge of the subscription-based authentication feature (PR #90).

1. Container destroyed mid-execution (exercise.ts)

runV2Docker() used return this.runV2Task(...) inside a try/finally block. In async functions, bare return promise lets the finally block execute before the promise resolves — so container.destroy() killed the Docker container while the agent was still running. Fix: return await.

Empirically verified:

return (no await): inner-start -> finally-runs -> finally-done -> inner-end
return await:      inner-start -> inner-end -> finally-runs -> finally-done

Confirmed in a live v2 benchmark run: Codex agent executed for 44.9s inside the container (previously would die immediately with SIGKILL/signal=9).

2. Claude --dangerously-skip-permissions rejected as root (claude.ts)

Claude Code refuses --dangerously-skip-permissions when running as root (the default in Docker). Setting IS_SANDBOX=1 signals to Claude Code that it's inside an isolated container, allowing the flag.

3. Subscription-based authentication (merged from PR #90)

This branch now includes the full subscription auth implementation, enabling agents to run without API keys when the user has a valid local login session. See PR #90 for the detailed review of that feature.

Updates since last revision

  • IS_SANDBOX unit tests (addressing Devin Review feedback): Added two tests in claude.test.ts — one verifying IS_SANDBOX=1 is set when useDocker: true, another verifying it's absent without Docker.
  • Merged subscription auth branch (PR feat: implement subscription-based (API key-less) authentication for agents #90) into this branch so that both the container fixes and subscription auth are available together for end-to-end benchmarking.

Review & Testing Checklist for Human

  • Verify the return await fix by running a v2 benchmark: bun src/index.ts --dataset v2 --docker --agent codex --model gpt-5.4-mini --task 16912_4 --verbose — the agent should execute inside the container instead of immediately failing with "container is not running"
  • Verify Claude agent works in Docker: bun src/index.ts --dataset v2 --docker --agent claude --model haiku --task 16912_4 --verbose — should not error with "dangerously-skip-permissions cannot be used with root/sudo"
  • Review the merge of PR feat: implement subscription-based (API key-less) authentication for agents #90 — ensure no conflicts or regressions were introduced by the merge. The subscription auth changes were already reviewed in PR feat: implement subscription-based (API key-less) authentication for agents #90, but verify they integrate cleanly with the two fixes here.
  • Confirm IS_SANDBOX=1 is appropriate for all Docker runs (v1 + v2), not just v2

Notes

  • The return vs return await gotcha only matters inside try/finally (or try/catch) blocks in async functions. The runOneCommitGroup method in benchmark/runner.ts is not affected because it uses a loop with await before reaching the finally block.
  • IS_SANDBOX=1 is a Claude Code environment variable that bypasses the root safety check. This is set for all Docker runs since containers are inherently sandboxed.
  • The Codex v2 benchmark was run with gpt-4.5-mini (which doesn't exist as an OpenAI model) — the container fix was confirmed working because the agent ran for 44.9s before hitting the model-not-found error, whereas previously the container would be killed immediately. A full successful run requires a valid model name (e.g. gpt-5.4-mini).
  • This branch now includes all changes from PR feat: implement subscription-based (API key-less) authentication for agents #90 (subscription auth). If PR feat: implement subscription-based (API key-less) authentication for agents #90 is merged to main first, this PR's diff will shrink to just the two bug fixes + IS_SANDBOX tests.

Link to Devin session: https://app.devin.ai/sessions/514b15c4c58a4d94a8a03a366ff4f4f5
Requested by: @laiso


Open with Devin

…agents

Implement all 7 tasks from the subscription-auth spec:

1. Add createAuthCacheArgs(), resolveAuthCachePath(), hasAuthCache() to src/utils/docker.ts
2. Add tryAnyEnv() to src/utils/env.ts; make API keys optional in claude/gemini/codex builders
3. Apply auth mounts in docker-strategy.ts for both v1 and v2 execution paths
4. Apply auth mounts in v2-container.ts (replace hardcoded claudeMount)
5. Implement --setup-auth <agent> CLI command for interactive Docker login
6. Update environment.md with subscription auth documentation
7. Add unit tests for new functions and subscription auth behavior

Auth priority: API key > auth cache volume > error with --setup-auth suggestion.
GitHub Actions unaffected (always uses API key secrets).

Closes #59
Gemini CLI does not have a 'login' sub-command — it authenticates
interactively on first launch.  Add AUTH_LOGIN_ARGS mapping so
--setup-auth gemini runs 'gemini' (no args) while claude/codex
still use '<agent> login'.
- Log which auth method is used (API key vs subscription) in all 3
  agent builders so users can see which path was taken.
- Add missing await on expect(...).rejects.toThrow() in subscription
  auth tests (Claude, Gemini, Codex) so assertions are actually checked.
…subcommand)

None of the supported CLIs (Claude, Gemini, Codex) have a dedicated
'login' sub-command.  They all authenticate interactively when launched
for the first time.  Update AUTH_LOGIN_ARGS to use empty arrays for
all agents and update spec docs accordingly.
Codex CLI requires 'codex login --device-auth' for headless
Device-Code flow authentication, unlike Claude and Gemini which
authenticate interactively on first launch with no extra args.
- SETUP_AUTH_IMAGE constant points to node:lts (lightweight, no build required)
- Mount local scripts/run-agent.sh into the container (read-only)
- run-agent.sh handles CLI installation via npm automatically
- Users no longer need to build ts-bench-container just for authentication
…o exit

Claude and Gemini CLIs enter interactive chat mode after successful
login, so the user must Ctrl-C to exit — producing a non-zero exit
code.  Previously the sentinel was only written on exit code 0.

Now we also check for the agent-specific credential file (e.g.
.credentials.json for Claude, oauth_creds.json for Gemini) and write
the sentinel if credentials are found, regardless of exit code.
Two fixes:

1. Use `return await` instead of bare `return` in runV2Docker's try/finally.
   Without `await`, the finally block runs immediately (destroying the
   container) while runV2Task is still executing. This is a known JS gotcha
   where `return promise` in an async try/finally lets the finally block
   run before the promise resolves.

2. Set IS_SANDBOX=1 for Claude when running in Docker. Claude Code refuses
   `--dangerously-skip-permissions` when running as root for safety, but
   IS_SANDBOX=1 signals we are inside an isolated container environment.
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link
Copy Markdown
Contributor Author

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

⚠️ 1 issue in files not directly in the diff

⚠️ Missing unit test for new IS_SANDBOX env var violates constitution rule III (src/agents/builders/__tests__/claude.test.ts:71-84)

The project constitution (.specify/memory/constitution.md, principle III) requires: "Changes to runner, agents, or benchmarks must include or update unit tests under src/**/__tests__/ where applicable." The PR adds IS_SANDBOX=1 to the Claude agent's environment when useDocker is true (src/agents/builders/claude.ts:70-72), but the existing test at src/agents/builders/__tests__/claude.test.ts:71-84 — which already creates a builder with useDocker: true and checks the command — is not updated to verify cmd.env?.IS_SANDBOX === '1'. This is directly testable and the test infrastructure is already in place.

View 2 additional findings in Devin Review.

Open in Devin Review

@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Addressed Devin Review finding in 22aeb7d — added two unit tests for IS_SANDBOX:

  • sets IS_SANDBOX=1 for Docker runs to allow skip-permissions as root — verifies cmd.env?.IS_SANDBOX === '1' when useDocker: true
  • does not set IS_SANDBOX when not using Docker — verifies IS_SANDBOX is undefined without Docker

…th-spec' into devin/1775145866-v2-container-fixes
@devin-ai-integration devin-ai-integration bot changed the title fix: v2 container killed prematurely due to missing await in try/finally fix: v2 container killed prematurely + IS_SANDBOX for Claude Docker + subscription auth merge Apr 2, 2026
@laiso laiso merged commit 40d2bf6 into main Apr 2, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant