
feat: add submodule of worker-vllm, updated fastapi endpoints #6

Open
velaraptor-runpod wants to merge 11 commits into main from feat/update-vllm

Conversation

@velaraptor-runpod

No description provided.


@TimPietruskyRunPod TimPietruskyRunPod left a comment


A few items to address before merge — mostly small fixes. The core rewrite using vLLM's native serving classes is solid.

handler_lb.py Outdated
):
from vllm.entrypoints.openai.protocol import ResponsesResponse
from vllm.entrypoints.openai.engine.protocol import ErrorResponse


retrieve_responses and cancel_responses import ResponsesResponse from vllm.entrypoints.openai.protocol, but create_responses imports it from vllm.entrypoints.openai.responses.protocol. These should be consistent — which is the correct path in vLLM 0.16.0?
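One hedged way to keep the three handlers on a single path is to resolve the protocol module once at module level and reuse it everywhere. This is only a sketch: the two candidate paths below are the ones that appear in this diff, and which of them actually exists in vLLM 0.16.0 is exactly the open question of this comment.

```python
import importlib


def resolve_protocol_module(candidate_paths):
    """Return the first importable module from candidate_paths, else None.

    Resolving the module once and reusing it in create_responses,
    retrieve_responses, and cancel_responses means the three handlers
    cannot drift onto different import paths again.
    """
    for path in candidate_paths:
        try:
            return importlib.import_module(path)
        except ImportError:
            continue
    return None


# The two candidate paths seen in this diff (preference order is a guess):
# protocol = resolve_protocol_module([
#     "vllm.entrypoints.openai.protocol",
#     "vllm.entrypoints.openai.responses.protocol",
# ])
# ResponsesResponse = protocol.ResponsesResponse
```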


if not body.get("stream"):
return JSONResponse(response.model_dump())


The chat and completion handlers check body.get("stream") on the raw dict to decide streaming vs non-streaming. The responses and messages handlers instead check the response type (isinstance). Consider making these consistent — checking the response type is more robust since it follows what vLLM actually returned.
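A minimal sketch of the isinstance-based dispatch being suggested. The ChatCompletionResponse class here is a hypothetical stand-in for vLLM's actual response types, and the returned strings stand in for the handler's two code paths:

```python
from dataclasses import dataclass


@dataclass
class ChatCompletionResponse:
    # Hypothetical placeholder for vLLM's non-streaming response class.
    id: str


def dispatch(response):
    """Branch on what vLLM actually returned, not on the raw request dict.

    A concrete response object means non-streaming; anything else (an
    async generator in the real handler) is treated as a stream.
    """
    if isinstance(response, ChatCompletionResponse):
        return "json"
    return "stream"
```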

@@ -0,0 +1,64 @@
{

This file is named tests_json — should it be tests.json?

curl -X POST "https://your-endpoint-id.api.runpod.ai/v1/completions" \
-H "Authorization: Bearer YOUR_RUNPOD_API_KEY" \
curl -X POST "https://<endpoint-id>.api.runpod.ai/v1/chat/completions" \
-H "Authorization: Bearer $RUNPOD_API_KEY" \

Is this a real endpoint ID? If so it should probably be replaced with a placeholder like <endpoint-id> to match the examples above. If there's a reason to keep it (e.g. a public demo endpoint), happy to leave it.

### Core (from worker-vllm)

| Variable | Required | Description | Default |
|----------|----------|-------------|---------|

Typo: `RunP[d` should be `RunPod`.


| Path | Method | Description |
|------|--------|-------------|

MAX_CONCURRENCY default here is 300, but hub.json sets it to 10 with a note about keeping it lower for load balancing. These should be consistent.
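One way to avoid this kind of drift is to read the value in exactly one place with one shared default, and let the docs cite that function. A sketch only: the default of 10 follows hub.json's load-balancing note, and the variable name is taken from the docs under review.

```python
import os


def get_max_concurrency(env=None):
    """Read MAX_CONCURRENCY once, with a single shared default.

    Keeping the default here, and documenting it from this function,
    prevents the README and hub.json from disagreeing about the value.
    """
    env = os.environ if env is None else env
    return int(env.get("MAX_CONCURRENCY", "10"))
```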
