Skip to content

Fix process-tree termination for background kills#388

Open
kchro3 wants to merge 8 commits intocloudflare:mainfrom
justwordsai:fix/process-tree-kill-regression
Open

Fix process-tree termination for background kills#388
kchro3 wants to merge 8 commits intocloudflare:mainfrom
justwordsai:fix/process-tree-kill-regression

Conversation

@kchro3
Copy link

@kchro3 kchro3 commented Feb 13, 2026

Summary

  • Fix background process kill to terminate full process subtree (shell + children) to prevent orphaned processes.
  • Add process-tree-aware termination with SIGTERM then SIGKILL fallback plus exit verification in Session.killCommand.
  • Add focused E2E regression test that kills a parent background command and verifies the spawned child process is gone.

Motivation

Reports in #338 showed killProcess() returning success while descendant processes continued running.

We're trying to run claude in the sandbox with execStream, but we're finding that it gets into a bad state where exiting the execStream doesn't kill the claude process correctly.

Testing

  • Added E2E regression test in tests/e2e/process-lifecycle-workflow.test.ts.

@changeset-bot
Copy link

changeset-bot bot commented Feb 13, 2026

🦋 Changeset detected

Latest commit: a8a5a70

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@whoiskatrin
Copy link
Collaborator

/bonk review this PR

@ask-bonk
Copy link

ask-bonk bot commented Feb 13, 2026

Review of PR #388: Fix process-tree termination for background kills

Good problem to solve — orphaned child processes after killProcess() is a real user-facing issue (#338). The overall approach (bottom-up tree termination, SIGTERM then SIGKILL escalation, /proc-based child discovery) is sound. I left inline comments on a few items:

Bug: waitForExit=false path returns early, leaking command handles

The unconditional liveness check at lines 732-744 runs even when waitForExit=false (the destroy() path). Since signals are async and there's no wait, processes are almost certainly still alive at that point, causing killCommand to return false before reaching runningCommands.delete(). See inline comment for suggested fix.

Test does not exercise the actual regression scenario

The E2E test writes $$ (the script's own PID), but the old single-SIGTERM code already killed that PID. To validate tree-kill, the script should spawn a grandchild process (e.g., sleep 120 &) and capture $! — that's the process that would have been orphaned before this fix. See inline suggestion.

Design consideration: process group kill

The tree-walk approach has an inherent TOCTOU race with /proc/.../children. Process group kill (kill(-pgid, signal)) would be atomic and simpler if the background subshell has its own pgid. Worth investigating as an alternative or complement. See inline comment.

Minor

  • JSDoc indentation is off by one space on the @param waitForExit line.
  • SessionManager.killCommand does not thread through the waitForExit parameter — always uses the default true. This is correct for the normal path but noted for future reference.

Changeset

The changeset content and format look good. Correctly targets @cloudflare/sandbox (not internal packages), patch level is appropriate for beta, and the description is user-focused.

@ask-bonk
Copy link

ask-bonk bot commented Feb 13, 2026

Review complete. I posted:

  1. Top-level summary comment covering all findings
  2. Inline comment on line 744 (session.ts) — Bug: the waitForExit=false path (used by destroy()) returns false before cleaning up runningCommands because the liveness check runs immediately after SIGTERM with no wait. Includes a suggestion block to gate it on waitForExit.
  3. Inline comment on line 698 (session.ts) — Design suggestion to consider process group kill (kill(-pgid, signal)) as a simpler, race-free alternative to the /proc tree-walk.
  4. Inline comment on line 659 (session.ts) — Nit: JSDoc indentation off by one space, with suggestion fix.
  5. Inline comment on line 154 (process-lifecycle-workflow.test.ts) — The test captures $$ (the script's own PID) rather than a grandchild PID, so it doesn't actually test the tree-kill regression. Suggested rewriting to background sleep and capture $!.

github run

@kchro3
Copy link
Author

kchro3 commented Feb 13, 2026

/bonk review this PR

whoiskatrin

This comment was marked as outdated.

Copy link
Collaborator

@whoiskatrin whoiskatrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix overall — process-group-first termination + fallback traversal and the new regression coverage are both solid.

A few nits:

  1. package-lock.json seems unrelated in this PR (only "peer": true markers in platform entries). Can we drop it unless it was intentional?

  2. In Session.destroy(), we now call killCommand(commandId, false). With waitForExit=false, we send SIGTERM and immediately remove tracking. If a process ignores TERM, this may leave survivors during destroy. Could we still do a best-effort escalation to SIGKILL in the non-wait path (or keep wait enabled in destroy)?

  3. In the new E2E test, can we assert the response for /api/file/write setup call (expect(status).toBe(200)) so setup failures are explicit?

Not blocking, but would be great to tighten these up.

@kchro3
Copy link
Author

kchro3 commented Feb 24, 2026

@whoiskatrin , ok, is there a good way to do end-to-end testing? e.g. deploying the docker image to staging tag?

@kchro3 kchro3 requested a review from whoiskatrin February 25, 2026 17:56
@kchro3
Copy link
Author

kchro3 commented Feb 26, 2026

cc: @whoiskatrin , let me know how to proceed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants