[Bugfix] Fix structured output crash on CPU due to pin_memory=True #37706
wjhrdy wants to merge 3 commits into vllm-project:main
Conversation
On CPU-only deployments, `apply_grammar_bitmask()` crashes with `RuntimeError: pin_memory=True requires a CUDA or other accelerator backend` when handling mixed batches of structured and non-structured requests.

Two issues:
1. `pin_memory=True` is hardcoded in the `torch.tensor()` call for `out_indices` — this requires CUDA and fails on CPU.
2. The xgrammar CPU kernel (`apply_token_bitmask_inplace_cpu`) expects `Sequence[int]` for the `indices` argument, not a tensor.

Note: the existing CPU float32 workaround added in vllm-project#31901 was never reachable because the `pin_memory=True` crash occurs first.

Fix: on CPU, pass `out_indices` as a plain Python list. The GPU path with pinned memory is preserved.

Fixes vllm-project#37705

Signed-off-by: Willy Hardy <[email protected]>
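The dispatch described above can be sketched in isolation as follows. This is a minimal illustration, not the actual patch: the helper name `prepare_indices` is hypothetical, and in the real change the conditional lives inline in `apply_grammar_bitmask()`.

```python
import torch


def prepare_indices(out_indices: list[int], device: torch.device):
    """Sketch of the fixed dispatch: keep a plain list on CPU, use a
    pinned staging tensor only when an accelerator is the target."""
    if device.type == "cpu":
        # pin_memory=True raises RuntimeError on CPU-only builds, and
        # the xgrammar CPU kernel expects Sequence[int] anyway.
        return out_indices
    staging = torch.tensor(
        out_indices, dtype=torch.int32, device="cpu", pin_memory=True
    )
    return staging.to(device, non_blocking=True)


# CPU path: the list passes through untouched.
indices = prepare_indices([0, 2, 5], torch.device("cpu"))
print(indices)  # [0, 2, 5]
```

On an accelerator target the pinned staging tensor allows the asynchronous (`non_blocking=True`) host-to-device copy that the original code was written for.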
Code Review
This pull request resolves a RuntimeError that occurred on CPU-only deployments because pin_memory=True was hardcoded in the torch.tensor call. The change introduces conditional logic that applies pin_memory=True only when an accelerator device is the target, and it satisfies the xgrammar CPU kernel's expectation of a Python list by passing out_indices directly on CPU, improving correctness and stability in mixed-batch scenarios. The updated type hint for indices also improves clarity.
vllm/v1/structured_output/utils.py
Outdated
if logits.device.type == "cpu":
    # On CPU, pass indices as a plain list — pin_memory requires CUDA,
    # and the xgrammar CPU kernel expects Sequence[int], not a tensor.
    indices = out_indices
else:
    indices = torch.tensor(
        out_indices, dtype=torch.int32, device="cpu", pin_memory=True,
    )
    indices = indices.to(logits.device, non_blocking=True)
This conditional logic is a critical fix. By checking logits.device.type, the code now correctly avoids setting pin_memory=True on CPU, which was causing a RuntimeError. Additionally, passing out_indices as a plain Python list for CPU devices directly addresses the xgrammar CPU kernel's expectation for a Sequence[int], preventing potential issues with type mismatches.
dougbtv
left a comment
Looks excellent -- do we need any validation on the testing side?
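On the testing question, a minimal regression check could assert that the CPU branch hands the kernel a plain list and never attempts a pinned allocation. This is a sketch only; the test name and its placement in the vLLM test suite are assumptions, not part of this PR.

```python
import torch


def test_cpu_indices_remain_plain_list():
    # Mirrors the fixed branch in apply_grammar_bitmask(): on CPU the
    # indices must stay a Python list (Sequence[int]) and no pinned
    # staging tensor may be created.
    out_indices = [1, 3, 4]
    logits = torch.zeros(8, 16)  # CPU tensor, as in a CPU-only deployment
    if logits.device.type == "cpu":
        indices = out_indices
    else:
        indices = torch.tensor(
            out_indices, dtype=torch.int32, device="cpu", pin_memory=True
        )
        indices = indices.to(logits.device, non_blocking=True)
    assert isinstance(indices, list)
    assert indices == out_indices


test_cpu_indices_remain_plain_list()
```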
mgoin
left a comment
Seems reasonable to me, thanks!
Hi @wjhrdy, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Willy Hardy <[email protected]>
Force-pushed from 030f141 to e97ab92
Purpose
Fix `RuntimeError: pin_memory=True requires a CUDA or other accelerator backend` crash when using structured output (guided decoding) on CPU-only deployments. Fixes #37705
Problem

`apply_grammar_bitmask()` in `vllm/v1/structured_output/utils.py` crashes on CPU when handling mixed batches (concurrent structured + non-structured requests):

1. `pin_memory=True` is hardcoded — `torch.tensor(out_indices, ..., pin_memory=True)` requires CUDA and fails on CPU-only systems.
2. The xgrammar CPU kernel expects `Sequence[int]`, not `torch.Tensor` — `apply_token_bitmask_inplace_cpu()` only accepts a Python list for the `indices` argument.

Note: the existing CPU float32 workaround (added in #31901) was never reachable because the `pin_memory=True` crash occurs first.

Fix

On CPU, pass `out_indices` as a plain Python list directly instead of converting it to a pinned tensor. The GPU path with pinned memory is preserved.

Test Plan
Tested by starting vLLM on CPU with `ibm-granite/granite-3.2-2b-instruct`, then sending concurrent plain and structured output (`response_format: json_schema`) requests. Without the fix, both requests return 500 and the EngineCore dies. With the fix, both succeed and the server stays healthy.