
fix: use OpenAI chat-completion field names in /chat/completions usage #1009

Open
chilang wants to merge 1 commit into Blaizzy:main from chilang:fix/chat-completions-usage-field-names

Conversation

@chilang chilang commented Apr 10, 2026

Summary

UsageStats (used for /v1/chat/completions responses) inherits from OpenAIUsage, which models the OpenAI Responses API (/v1/responses) — that spec uses input_tokens / output_tokens. But /v1/chat/completions is a different spec: the usage object must contain prompt_tokens, completion_tokens, and total_tokens.

The net effect: any OpenAI-compatible client hitting mlx-vlm's /v1/chat/completions fails to read the usage payload because the field names don't match. I hit this reproducing Gemma 4 benchmarks with llama-benchy, which errors out during warmup with Warmup failed: 'prompt_tokens'.
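For illustration, a minimal sketch of the failure mode: a strict chat-completions client indexes the `usage` object with the spec'd field names, which raises `KeyError` when the server emits Responses-API names instead (token counts here are made up):

```python
# Usage payload as mlx-vlm previously emitted it for /v1/chat/completions,
# with Responses-API field names (values are illustrative).
responses_style_usage = {"input_tokens": 12, "output_tokens": 34, "total_tokens": 46}

# A chat-completions client reads the Chat Completions field names directly...
try:
    prompt_tokens = responses_style_usage["prompt_tokens"]
except KeyError as exc:
    # ...which reproduces the llama-benchy error shown above.
    print(f"Warmup failed: {exc}")  # → Warmup failed: 'prompt_tokens'
```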

Fix

  • Stop inheriting UsageStats from OpenAIUsage. Keep OpenAIUsage as-is for the Responses API (/v1/responses) where the spec is correct.
  • Declare the chat-completion fields directly on UsageStats:
    • prompt_tokens: int
    • completion_tokens: int
    • total_tokens: int
    • (existing prompt_tps / generation_tps / peak_memory extras preserved)
  • Update the two call sites in chat_completions_endpoint (streaming SSE chunk + non-streaming final response) to build UsageStats with the new field names.
  • /v1/responses is untouched — it keeps using OpenAIUsage with input_tokens / output_tokens per OpenAI's Responses API spec.
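The shape of the split can be sketched as follows. This is a plain-dataclass approximation, not the actual mlx-vlm models (which may be pydantic classes); the field names come from the lists above, and the extras' types are assumed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OpenAIUsage:
    # Responses API (/v1/responses) spec: unchanged by this PR.
    input_tokens: int
    output_tokens: int
    total_tokens: int

@dataclass
class UsageStats:
    # Chat Completions (/v1/chat/completions) spec: fields declared
    # directly, no longer inherited from OpenAIUsage.
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    # Existing mlx-vlm extras, preserved alongside the spec'd fields.
    prompt_tps: Optional[float] = None
    generation_tps: Optional[float] = None
    peak_memory: Optional[float] = None
```

With inheritance removed, serializing a `UsageStats` can no longer leak `input_tokens` / `output_tokens` into the chat-completions response body.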

Test plan

  • Added regression test test_chat_completions_response_uses_openai_usage_field_names that mocks generate() and asserts the JSON response body contains usage.prompt_tokens, usage.completion_tokens, usage.total_tokens, and does not contain the Responses-API field names.
  • python -m pytest mlx_vlm/tests/test_server.py — 10 passed.
  • Manually verified with curl against /v1/chat/completions on a locally running server — the usage object now matches the OpenAI Chat Completions spec.
  • Verified with llama-benchy against mlx-community/gemma-4-E4B-it-4bit and mlx-community/gemma-4-26b-a4b-it-4bit — warmup now succeeds and benchmarks complete.
  • black --check and isort --profile=black --check pass on changed files.
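The regression test's assertions can be sketched like this (a standalone approximation; the helper name and payload are hypothetical, and the real test mocks `generate()` and reads the actual response body):

```python
import json

def check_usage_fields(body: str) -> None:
    """Assert a /v1/chat/completions JSON body uses chat-completion usage keys."""
    usage = json.loads(body)["usage"]
    # Spec'd Chat Completions field names must be present...
    for key in ("prompt_tokens", "completion_tokens", "total_tokens"):
        assert key in usage, f"missing {key}"
    # ...and Responses-API field names must be absent.
    for key in ("input_tokens", "output_tokens"):
        assert key not in usage, f"unexpected Responses-API key {key}"

# Example payload shaped like a fixed response (values illustrative).
check_usage_fields(json.dumps(
    {"usage": {"prompt_tokens": 5, "completion_tokens": 7, "total_tokens": 12}}
))
```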

Commit message

`UsageStats` previously inherited from `OpenAIUsage`, which models the
`/v1/responses` endpoint spec (`input_tokens` / `output_tokens`). The
`/v1/chat/completions` endpoint is a different spec and requires
`prompt_tokens` / `completion_tokens` / `total_tokens` in the `usage`
object. OpenAI-compatible clients (tested with llama-benchy) fail to
parse the response because `prompt_tokens` is missing.

Split the two: keep `OpenAIUsage` for the Responses API, and give
`UsageStats` the chat-completion field names directly. Update both
streaming and non-streaming code paths. Add a regression test.