[Bugfix] Fix structured output crash on CPU due to pin_memory=True #37706
wjhrdy wants to merge 3 commits into vllm-project:main
Conversation
On CPU-only deployments, `apply_grammar_bitmask()` crashes with `RuntimeError: pin_memory=True requires a CUDA or other accelerator backend` when handling mixed batches of structured and non-structured requests.

Two issues:
1. `pin_memory=True` is hardcoded in the `torch.tensor()` call for `out_indices` — this requires CUDA and fails on CPU.
2. The xgrammar CPU kernel (`apply_token_bitmask_inplace_cpu`) expects `Sequence[int]` for the `indices` argument, not a tensor.

Note: the existing CPU float32 workaround added in vllm-project#31901 was never reachable because the `pin_memory=True` crash occurs first.

Fix: on CPU, pass `out_indices` as a plain Python list. The GPU path with pinned memory is preserved.

Fixes vllm-project#37705

Signed-off-by: Willy Hardy <[email protected]>
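The dispatch described above can be sketched in isolation as follows. This is a minimal illustration, not the actual patch: the helper name `prepare_indices` is hypothetical, and in the real change the conditional lives inline in `apply_grammar_bitmask()`.

```python
import torch


def prepare_indices(out_indices: list[int], device: torch.device):
    """Sketch of the fixed dispatch: keep a plain list on CPU, use a
    pinned staging tensor only when an accelerator is the target."""
    if device.type == "cpu":
        # pin_memory=True raises RuntimeError on CPU-only builds, and
        # the xgrammar CPU kernel expects Sequence[int] anyway.
        return out_indices
    staging = torch.tensor(
        out_indices, dtype=torch.int32, device="cpu", pin_memory=True
    )
    return staging.to(device, non_blocking=True)


# CPU path: the list passes through untouched.
indices = prepare_indices([0, 2, 5], torch.device("cpu"))
print(indices)  # [0, 2, 5]
```

On an accelerator target the pinned staging tensor allows the asynchronous (`non_blocking=True`) host-to-device copy that the original code was written for.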
Code Review
This pull request resolves a RuntimeError that occurred on CPU-only deployments because pin_memory=True was hardcoded in the torch.tensor call. The change introduces conditional logic that applies pin_memory=True only when an accelerator device is the target, and it satisfies the xgrammar CPU kernel's expectation of a Python list by passing out_indices directly on CPU, improving correctness and stability in mixed-batch scenarios. The updated type hint for indices also improves clarity.
vllm/v1/structured_output/utils.py
Outdated
if logits.device.type == "cpu":
    # On CPU, pass indices as a plain list — pin_memory requires CUDA,
    # and the xgrammar CPU kernel expects Sequence[int], not a tensor.
    indices = out_indices
else:
    indices = torch.tensor(
        out_indices, dtype=torch.int32, device="cpu", pin_memory=True,
    )
    indices = indices.to(logits.device, non_blocking=True)
This conditional logic is a critical fix. By checking logits.device.type, the code now correctly avoids setting pin_memory=True on CPU, which was causing a RuntimeError. Additionally, passing out_indices as a plain Python list for CPU devices directly addresses the xgrammar CPU kernel's expectation for a Sequence[int], preventing potential issues with type mismatches.
dougbtv
left a comment
Looks excellent -- do we need any validation on the testing side?
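On the testing question, a minimal regression check could assert that the CPU branch hands the kernel a plain list and never attempts a pinned allocation. This is a sketch only; the test name and its placement in the vLLM test suite are assumptions, not part of this PR.

```python
import torch


def test_cpu_indices_remain_plain_list():
    # Mirrors the fixed branch in apply_grammar_bitmask(): on CPU the
    # indices must stay a Python list (Sequence[int]) and no pinned
    # staging tensor may be created.
    out_indices = [1, 3, 4]
    logits = torch.zeros(8, 16)  # CPU tensor, as in a CPU-only deployment
    if logits.device.type == "cpu":
        indices = out_indices
    else:
        indices = torch.tensor(
            out_indices, dtype=torch.int32, device="cpu", pin_memory=True
        )
        indices = indices.to(logits.device, non_blocking=True)
    assert isinstance(indices, list)
    assert indices == out_indices


test_cpu_indices_remain_plain_list()
```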
mgoin
left a comment
Seems reasonable to me, thanks!
Hi @wjhrdy, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Willy Hardy <[email protected]>
Force-pushed from 030f141 to e97ab92
Purpose
Fix `RuntimeError: pin_memory=True requires a CUDA or other accelerator backend` crash when using structured output (guided decoding) on CPU-only deployments. Fixes #37705
Problem

`apply_grammar_bitmask()` in `vllm/v1/structured_output/utils.py` crashes on CPU when handling mixed batches (concurrent structured + non-structured requests):

1. `pin_memory=True` is hardcoded — `torch.tensor(out_indices, ..., pin_memory=True)` requires CUDA and fails on CPU-only systems.
2. The xgrammar CPU kernel expects `Sequence[int]`, not `torch.Tensor` — `apply_token_bitmask_inplace_cpu()` only accepts a Python list for the `indices` argument.

Note: the existing CPU float32 workaround (added in #31901) was never reachable because the `pin_memory=True` crash occurs first.

Fix

On CPU, pass `out_indices` as a plain Python list directly instead of converting it to a pinned tensor. The GPU path with pinned memory is preserved.

Test Plan
Tested by starting vLLM on CPU with `ibm-granite/granite-3.2-2b-instruct`, then sending concurrent plain and structured output (`response_format: json_schema`) requests. Without the fix, both requests return 500 and the EngineCore dies. With the fix, both succeed and the server stays healthy.