emdawnwebgpu: Add runtimeKeepalive for device.lost handler#57
emdawnwebgpu: Add runtimeKeepalive for device.lost handler#57fs-eire wants to merge 1 commit intogoogle:mainfrom
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
|
@fs-eire We'll need the CLA check to pass before reviewing (review happens on Gerrit, but the change doesn't get imported to Gerrit until that passes). Let me know if you'll be able to sign it. |
| #if ASSERTIONS | ||
| assert(deviceLostFutureId); | ||
| #endif | ||
| // Don't keepalive here, because this isn't guaranteed to ever happen. |
There was a problem hiding this comment.
Actually on second thought, I'm not sure this fix is right. This removed comment was about preventing the runtime from being kept alive forever just because a device hasn't been lost.
I might need to consult with Emscripten folks to figure out how to do this correctly. It's somehow interacting poorly with EXIT_RUNTIME=0 but I don't understand Emscripten would set ABORT = true in the way you described.
There was a problem hiding this comment.
to be honest I am not 100% sure about this part. The fix works in my specific usage (onnxruntime-web, which defines EXIT_RUNTIME=0), but not tested in other use case.
|
I have already signed CLA, but seems the co-author (Copilot) didn't. Let me try to figure out. I will create another PR that I am the only author if I cannot fix it. |
The device.lost handler in emwgpuAdapterRequestDevice is wrapped in callUserCallback(), which calls maybeExit() after the callback executes. When runtimeKeepaliveCounter is 0 (common for --no-entry WASM modules), maybeExit() triggers _exit(0), setting ABORT=true. This causes all subsequent callUserCallback() invocations to silently drop their callbacks, breaking any WebGPU operations that follow (e.g., requestAdapter for a new session). Fix by adding runtimeKeepalivePush() before the device.lost promise setup, and runtimeKeepalivePop() after callUserCallback() returns inside the .then() handler. This ensures the runtime stays alive during the callback and maybeExit() sees counter >= 1. The pop is placed after callUserCallback (not before) because callUserCallback internally calls maybeExit() - if the pop happened first, the counter would be 0 during maybeExit, still triggering the premature exit. Bug: onnxruntime#27427
2e13f6b to
8e16c57
Compare
|
👋 Thanks for your contribution! Your PR has been imported to Gerrit. |
### Description Fix 2 bugs in emdawnwebgpu. 1. Fix incorrect handling for device lost. See also: - issue: [Unexpected exit on device lost handler [492350387] - Chromium](https://issues.chromium.org/issues/492350387) - PR: [emdawnwebgpu: Add runtimeKeepalive for device.lost handler by fs-eire · Pull Request #57 · google/d…](google/dawn#57) (but dawn does not accept PR with copilot as co-author, so just for reference) 2. Fix wrong call to WGPUBufferImpl constructor. See also: - issue: [Incorrect WGPUBufferImpl constructor called from importJsBuffer](https://issues.chromium.org/issues/492539247) --------- Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Problem
The \device.lost\ handler in \�mwgpuAdapterRequestDevice\ is wrapped in \callUserCallback(), which calls \maybeExit()\ after the callback executes. For WASM modules built with --no-entry\ and \EXIT_RUNTIME=0\ (common for library-style usage),
untimeKeepaliveCounter\ is 0 at baseline. When \device.lost\ fires (e.g., after destroying a device/session), \maybeExit()\ sees counter == 0 and triggers _exit(0), which sets \ABORT = true.
Once \ABORT\ is true, all subsequent \callUserCallback()\ invocations silently drop their callbacks:
\\js
var callUserCallback = func => {
if (ABORT) {
err('user callback triggered after runtime exited or application aborted. Ignoring.');
return; // callback silently dropped!
}
// ...
};
\\
This breaks any WebGPU operations that follow, such as
equestAdapter\ for creating a new session.
Root Cause
The previous version of this code used
untimeKeepalivePush()\ at device creation time to keep the counter >= 1, preventing \maybeExit()\ from ever triggering exit. The current version removed this with the comment // Don't keepalive here, because this isn't guaranteed to ever happen., but this is incorrect for the \callUserCallback\ wrapping pattern — without keepalive protection, the \maybeExit()\ inside \callUserCallback\ will trigger premature runtime exit.
Fix
Add
untimeKeepalivePush()\ before the \device.lost\ promise setup, and
untimeKeepalivePop()\ after \callUserCallback()\ returns inside the .then()\ handler.
The pop is intentionally placed after \callUserCallback\ (not before) because \callUserCallback\ internally calls \maybeExit()\ — if the pop happened first, the counter would be 0 during \maybeExit, still triggering the premature exit.
Impact
This bug was discovered in ONNX Runtime's WebGPU execution provider (microsoft/onnxruntime#27427), where destroying a session and creating a new one would fail with \Failed to get a WebGPU adapter\ because the
equestAdapter\ callback was silently dropped due to \ABORT = true.