Skip to content

emdawnwebgpu: Add runtimeKeepalive for device.lost handler#57

Open
fs-eire wants to merge 1 commit intogoogle:mainfrom
fs-eire:fix-device-lost-keepalive
Open

emdawnwebgpu: Add runtimeKeepalive for device.lost handler#57
fs-eire wants to merge 1 commit intogoogle:mainfrom
fs-eire:fix-device-lost-keepalive

Conversation

@fs-eire
Copy link
Contributor

@fs-eire fs-eire commented Mar 13, 2026

Problem

The \device.lost\ handler in \�mwgpuAdapterRequestDevice\ is wrapped in \callUserCallback(), which calls \maybeExit()\ after the callback executes. For WASM modules built with --no-entry\ and \EXIT_RUNTIME=0\ (common for library-style usage),
untimeKeepaliveCounter\ is 0 at baseline. When \device.lost\ fires (e.g., after destroying a device/session), \maybeExit()\ sees counter == 0 and triggers _exit(0), which sets \ABORT = true.

Once \ABORT\ is true, all subsequent \callUserCallback()\ invocations silently drop their callbacks:
\\js
var callUserCallback = func => {
if (ABORT) {
err('user callback triggered after runtime exited or application aborted. Ignoring.');
return; // callback silently dropped!
}
// ...
};
\\

This breaks any WebGPU operations that follow, such as
equestAdapter\ for creating a new session.

Root Cause

The previous version of this code used
untimeKeepalivePush()\ at device creation time to keep the counter >= 1, preventing \maybeExit()\ from ever triggering exit. The current version removed this with the comment // Don't keepalive here, because this isn't guaranteed to ever happen., but this is incorrect for the \callUserCallback\ wrapping pattern — without keepalive protection, the \maybeExit()\ inside \callUserCallback\ will trigger premature runtime exit.

Fix

Add
untimeKeepalivePush()\ before the \device.lost\ promise setup, and
untimeKeepalivePop()\ after \callUserCallback()\ returns inside the .then()\ handler.

The pop is intentionally placed after \callUserCallback\ (not before) because \callUserCallback\ internally calls \maybeExit()\ — if the pop happened first, the counter would be 0 during \maybeExit, still triggering the premature exit.

Impact

This bug was discovered in ONNX Runtime's WebGPU execution provider (microsoft/onnxruntime#27427), where destroying a session and creating a new one would fail with \Failed to get a WebGPU adapter\ because the
equestAdapter\ callback was silently dropped due to \ABORT = true.

@google-cla
Copy link

google-cla bot commented Mar 13, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@kainino0x
Copy link
Member

@fs-eire We'll need the CLA check to pass before reviewing (review happens on Gerrit, but the change doesn't get imported to Gerrit until that passes). Let me know if you'll be able to sign it.

#if ASSERTIONS
assert(deviceLostFutureId);
#endif
// Don't keepalive here, because this isn't guaranteed to ever happen.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second thought, I'm not sure this fix is right. This removed comment was about preventing the runtime from being kept alive forever just because a device hasn't been lost.

I might need to consult with Emscripten folks to figure out how to do this correctly. It's somehow interacting poorly with EXIT_RUNTIME=0 but I don't understand Emscripten would set ABORT = true in the way you described.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be honest I am not 100% sure about this part. The fix works in my specific usage (onnxruntime-web, which defines EXIT_RUNTIME=0), but not tested in other use case.

@fs-eire
Copy link
Contributor Author

fs-eire commented Mar 16, 2026

I have already signed CLA, but seems the co-author (Copilot) didn't. Let me try to figure out. I will create another PR that I am the only author if I cannot fix it.

The device.lost handler in emwgpuAdapterRequestDevice is wrapped in
callUserCallback(), which calls maybeExit() after the callback executes.
When runtimeKeepaliveCounter is 0 (common for --no-entry WASM modules),
maybeExit() triggers _exit(0), setting ABORT=true. This causes all
subsequent callUserCallback() invocations to silently drop their
callbacks, breaking any WebGPU operations that follow (e.g.,
requestAdapter for a new session).

Fix by adding runtimeKeepalivePush() before the device.lost promise
setup, and runtimeKeepalivePop() after callUserCallback() returns
inside the .then() handler. This ensures the runtime stays alive
during the callback and maybeExit() sees counter >= 1.

The pop is placed after callUserCallback (not before) because
callUserCallback internally calls maybeExit() - if the pop happened
first, the counter would be 0 during maybeExit, still triggering the
premature exit.

Bug: onnxruntime#27427
@fs-eire fs-eire force-pushed the fix-device-lost-keepalive branch from 2e13f6b to 8e16c57 Compare March 16, 2026 19:05
@github-actions
Copy link

👋 Thanks for your contribution! Your PR has been imported to Gerrit.
Please visit https://dawn-review.googlesource.com/c/dawn/+/297517 to see it and CC yourself on the change.
After iterating on feedback, please comment on the Gerrit review to notify reviewers.
All reviews are handled within Gerrit, any comments on the GitHub PR may be missed.
You can continue to upload commits to this PR, and they will be automatically imported
into Gerrit.

fs-eire added a commit to microsoft/onnxruntime that referenced this pull request Mar 16, 2026
### Description

Fix 2 bugs in emdawnwebgpu.

1. Fix incorrect handling for device lost. See also:
- issue: [Unexpected exit on device lost handler [492350387] -
Chromium](https://issues.chromium.org/issues/492350387)
- PR: [emdawnwebgpu: Add runtimeKeepalive for device.lost handler by
fs-eire · Pull Request #57 ·
google/d…](google/dawn#57)
(but dawn does not accept PR with copilot as co-author, so just for
reference)

2. Fix wrong call to WGPUBufferImpl constructor. See also:
- issue: [Incorrect WGPUBufferImpl constructor called from
importJsBuffer](https://issues.chromium.org/issues/492539247)

---------

Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants