fix: use OpenAI chat-completion field names in /chat/completions usage #1009
Open
chilang wants to merge 1 commit into Blaizzy:main
Conversation
`UsageStats` previously inherited from `OpenAIUsage`, which models the `/v1/responses` endpoint spec (`input_tokens` / `output_tokens`). The `/v1/chat/completions` endpoint is a different spec and requires `prompt_tokens` / `completion_tokens` / `total_tokens` in the `usage` object. OpenAI-compatible clients (tested with llama-benchy) fail to parse the response because `prompt_tokens` is missing. Split the two: keep `OpenAIUsage` for the Responses API, and give `UsageStats` the chat-completion field names directly. Update both streaming and non-streaming code paths. Add a regression test.
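To make the mismatch concrete, here is a minimal sketch using dataclasses (the real models may well be pydantic classes; the class and field names come from this PR, while the token values and payload construction are mocked for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass
class OpenAIUsage:
    # Responses API (/v1/responses) field names -- unchanged by this PR
    input_tokens: int
    output_tokens: int
    total_tokens: int

@dataclass
class UsageStats:
    # Chat Completions (/v1/chat/completions) field names,
    # no longer inherited from OpenAIUsage
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

# Before the fix, the chat-completions usage payload carried the
# Responses-API names, so a spec-compliant client's lookup failed:
old_usage = asdict(OpenAIUsage(input_tokens=12, output_tokens=34, total_tokens=46))
try:
    old_usage["prompt_tokens"]
except KeyError as exc:
    print(f"Warmup failed: {exc}")  # -> Warmup failed: 'prompt_tokens'

# After the fix, the payload matches the Chat Completions spec:
new_usage = asdict(UsageStats(prompt_tokens=12, completion_tokens=34, total_tokens=46))
assert "prompt_tokens" in new_usage and "input_tokens" not in new_usage
```

The printed message matches the `Warmup failed: 'prompt_tokens'` error reported by llama-benchy: the client indexes the `usage` dict by the spec's field name and gets a `KeyError`.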
Summary
`UsageStats` (used for `/v1/chat/completions` responses) inherits from `OpenAIUsage`, which models the OpenAI Responses API (`/v1/responses`); that spec uses `input_tokens`/`output_tokens`. But `/v1/chat/completions` is a different spec: the `usage` object must contain `prompt_tokens`, `completion_tokens`, and `total_tokens`.

The net effect: any OpenAI-compatible client hitting mlx-vlm's `/v1/chat/completions` fails to read the `usage` payload because the field names don't match. I hit this reproducing Gemma 4 benchmarks with llama-benchy, which errors out during warmup with `Warmup failed: 'prompt_tokens'`.

Fix
- Stop `UsageStats` inheriting from `OpenAIUsage`. Keep `OpenAIUsage` as-is for the Responses API (`/v1/responses`), where the spec is correct.
- Give `UsageStats` the chat-completion field names directly: `prompt_tokens: int`, `completion_tokens: int`, `total_tokens: int` (with the `prompt_tps`/`generation_tps`/`peak_memory` extras preserved).
- Update `chat_completions_endpoint` (streaming SSE chunk + non-streaming final response) to build `UsageStats` with the new field names.
- `/v1/responses` is untouched: it keeps using `OpenAIUsage` with `input_tokens`/`output_tokens` per OpenAI's Responses API spec.

Test plan
- Added `test_chat_completions_response_uses_openai_usage_field_names`, which mocks `generate()` and asserts the JSON response body contains `usage.prompt_tokens`, `usage.completion_tokens`, and `usage.total_tokens`, and does not contain the Responses-API field names.
- `python -m pytest mlx_vlm/tests/test_server.py`: 10 passed.
- `curl` against `/v1/chat/completions` on a locally running server: the `usage` object now matches the OpenAI Chat Completions spec.
- Ran llama-benchy against `mlx-community/gemma-4-E4B-it-4bit` and `mlx-community/gemma-4-26b-a4b-it-4bit`: warmup now succeeds and benchmarks complete.
- `black --check` and `isort --profile=black --check` pass on the changed files.

References
- OpenAI Chat Completions API: `usage` object fields `prompt_tokens`/`completion_tokens`/`total_tokens`
- OpenAI Responses API: `usage` object fields `input_tokens`/`output_tokens`
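For reference, the assertion logic of the regression test described in the test plan can be sketched as follows (the helper name and the mocked response body are illustrative; the actual test mocks `generate()` and exercises the endpoint itself):

```python
def assert_chat_usage_fields(body: dict) -> None:
    """Assert a /v1/chat/completions body uses Chat Completions usage names."""
    usage = body["usage"]
    for field in ("prompt_tokens", "completion_tokens", "total_tokens"):
        assert field in usage, f"missing Chat Completions field: {field}"
    for field in ("input_tokens", "output_tokens"):
        assert field not in usage, f"Responses-API field present: {field}"

# Mocked response body (illustrative only, not actual server output)
body = {
    "choices": [{"message": {"role": "assistant", "content": "ok"}}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 7, "total_tokens": 12},
}
assert_chat_usage_fields(body)  # passes silently

# A body using the Responses-API names fails the check:
bad = {"usage": {"input_tokens": 5, "output_tokens": 7, "total_tokens": 12}}
try:
    assert_chat_usage_fields(bad)
except AssertionError as exc:
    print(exc)  # -> missing Chat Completions field: prompt_tokens
```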