
Commit a1544ea

Merge pull request #277 from runpod-workers/feat/lmcache
feat: uv installer, LMCache support, add /v1/responses and /v1/messages endpoints
2 parents 9d16869 + 3403889 commit a1544ea

7 files changed

Lines changed: 366 additions & 32 deletions


.runpod/README.md

Lines changed: 34 additions & 2 deletions
````diff
@@ -1,4 +1,4 @@
-![vLLM worker banner](https://cpjrphpz3t5wbwfe.public.blob.vercel-storage.com/worker-vllm_banner.jpeg)
+![vLLM worker banner](https://image.runpod.ai/preview/vllm/vllm-banner.png)
 
 Run LLMs using [vLLM](https://docs.vllm.ai) with an OpenAI-compatible API
 
@@ -32,6 +32,9 @@ All behaviour is controlled through environment variables:
 
 For complete configuration options, see the [full configuration documentation](https://github.com/runpod-workers/worker-vllm/blob/main/docs/configuration.md).
 
+### Specify Transformers Version
+To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want. Note that this may break the handler, so use it for development purposes only.
+
 ## API Usage
 
 This worker supports two API formats: **RunPod native** and **OpenAI-compatible**.
@@ -157,6 +160,35 @@ For external clients and SDKs, use the `/openai/v1` path prefix with your RunPod
 {}
 ```
 
+#### OpenAI Responses API
+
+**Path:** `/openai/v1/responses`
+
+Supports the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) format. Note: this route bypasses the RunPod queue and is served directly, so use `/openai/`-prefixed paths rather than the RunPod job queue for these endpoints.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "input": "Tell me a joke."
+}
+```
+
+#### Anthropic Messages API
+
+**Path:** `/openai/v1/messages`
+
+Supports the [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) format. Served directly, bypassing the RunPod queue.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "max_tokens": 256,
+  "messages": [
+    {"role": "user", "content": "Hello!"}
+  ]
+}
+```
+
 #### Response Format
 
 Both APIs return the same response format:
@@ -190,7 +222,7 @@ Minimal Python example using the official `openai` SDK:
 from openai import OpenAI
 import os
 
-# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
+# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
 client = OpenAI(
     api_key=os.getenv("RUNPOD_API_KEY"),
     base_url=f"https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
````
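For the new `/openai/v1/responses` route, the same client setup can be reused. A minimal sketch with the official `openai` Python SDK, assuming the route accepts the standard Responses API request shape as the docs above state (the endpoint ID and model name placeholders are yours to fill in):

```python
import os

from openai import OpenAI  # Responses API support requires a recent openai SDK

# Same client setup as the chat-completions example in the diff above.
client = OpenAI(
    api_key=os.getenv("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

# POSTs to <base_url>/responses, i.e. the worker's /openai/v1/responses route.
response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # your deployed model repo/name
    input="Tell me a joke.",
)

# output_text concatenates the text parts of the model output.
print(response.output_text)
```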

Dockerfile

Lines changed: 13 additions & 11 deletions
````diff
@@ -1,20 +1,21 @@
 FROM nvidia/cuda:12.9.1-base-ubuntu22.04
 
 RUN apt-get update -y \
-    && apt-get install -y python3-pip
+    && apt-get install -y python3-pip curl \
+    && curl -LsSf https://astral.sh/uv/0.10.9/install.sh | sh
 
-RUN ldconfig /usr/local/cuda-12.9/compat/
-
-# Install vLLM with FlashInfer - use CUDA 12.8 PyTorch wheels (compatible with vLLM 0.15.1)
-RUN python3 -m pip install --upgrade pip && \
-    python3 -m pip install "vllm[flashinfer]==0.16.0" --extra-index-url https://download.pytorch.org/whl/cu129
+ENV PATH="/root/.local/bin:$PATH"
 
+RUN ldconfig /usr/local/cuda-12.9/compat/
 
+# Install vLLM with FlashInfer - use CUDA 12.9 PyTorch wheels
+RUN uv pip install --system "packaging>=24.2" && \
+    uv pip install --system "vllm[flashinfer]==0.16.0" --extra-index-url https://download.pytorch.org/whl/cu129
 
 # Install additional Python dependencies (after vLLM to avoid PyTorch version conflicts)
 COPY builder/requirements.txt /requirements.txt
-RUN --mount=type=cache,target=/root/.cache/pip \
-    python3 -m pip install --upgrade -r /requirements.txt
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system -r /requirements.txt
 
 # Setup for Option 2: Building the Image with the Model included
 ARG MODEL_NAME=""
@@ -46,12 +47,13 @@ ENV MODEL_NAME=$MODEL_NAME \
 ENV PYTHONPATH="/:/vllm-workspace"
 
 RUN if [ "${VLLM_NIGHTLY}" = "true" ]; then \
-    pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly && \
+    uv pip install --system -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly && \
     apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/* && \
-    pip install git+https://github.com/huggingface/transformers.git; \
+    uv pip install --system git+https://github.com/huggingface/transformers.git; \
     fi
 
 COPY src /src
+RUN chmod +x /src/start.sh
 RUN --mount=type=secret,id=HF_TOKEN,required=false \
     if [ -f /run/secrets/HF_TOKEN ]; then \
     export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
@@ -61,4 +63,4 @@ RUN --mount=type=secret,id=HF_TOKEN,required=false \
     fi
 
 # Start the handler
-CMD ["python3", "/src/handler.py"]
+CMD ["/bin/bash", "/src/start.sh"]
````

README.md

Lines changed: 84 additions & 16 deletions
````diff
@@ -2,10 +2,16 @@
 
 # OpenAI-Compatible vLLM Serverless Endpoint Worker
 
-Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https://github.com/vllm-project/vllm) Inference Engine on RunPod Serverless with just a few clicks.
+Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https://github.com/vllm-project/vllm) Inference Engine on Runpod Serverless with just a few clicks.
 
 </div>
 
+![vLLM worker banner](https://image.runpod.ai/preview/vllm/vllm-banner.png)
+
+Current vLLM version: [0.16.0](https://github.com/vllm-project/vllm/releases/tag/v0.16.0)
+
+> Check out our Load Balancer implementation here: [vLLM Load Balancer](https://github.com/runpod-workers/vllm-loadbalancer-ep)
+
 ## Table of Contents
 
 - [Setting up the Serverless Worker](#setting-up-the-serverless-worker)
@@ -21,9 +27,11 @@ Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https:
 - [Modifying your OpenAI Codebase to use your deployed vLLM Worker](#modifying-your-openai-codebase-to-use-your-deployed-vllm-worker)
 - [OpenAI Request Input Parameters](#openai-request-input-parameters)
 - [Chat Completions [RECOMMENDED]](#chat-completions-recommended)
-- [Examples: Using your RunPod endpoint with OpenAI](#examples-using-your-runpod-endpoint-with-openai)
+- [Examples: Using your Runpod endpoint with OpenAI](#examples-using-your-runpod-endpoint-with-openai)
 - [Chat Completions](#chat-completions)
 - [Getting a list of names for available models](#getting-a-list-of-names-for-available-models)
+- [OpenAI Responses API](#openai-responses-api)
+- [Anthropic Messages API](#anthropic-messages-api)
 - [Usage: Standard (Non-OpenAI)](#usage-standard-non-openai)
 - [Request Input Parameters](#request-input-parameters)
 - [Sampling Parameters](#sampling-parameters)
@@ -33,7 +41,7 @@ Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https:
 
 ## Option 1: Deploy Any Model Using Pre-Built Docker Image [Recommended]
 
-**🚀 Deploy Guide**: Follow our [step-by-step deployment guide](https://docs.runpod.io/serverless/vllm/get-started) to deploy using the RunPod Console.
+**🚀 Deploy Guide**: Follow our [step-by-step deployment guide](https://docs.runpod.io/serverless/vllm/get-started) to deploy using the Runpod Console.
 
 **📦 Docker Image**: `runpod/worker-v1-vllm:<version>`
 
@@ -71,6 +79,10 @@ Any env var whose name matches a valid `AsyncEngineArgs` field (uppercased) is a
 
 For the complete list of all available environment variables, examples, and detailed descriptions: **[Configuration](docs/configuration.md)**
 
+### Specify Transformers Version
+To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want. Note that this may break the handler, so use it for development purposes only.
+
+
 ## Option 2: Build Docker Image with Model Inside
 
 To build an image with the model baked in, you must specify the following docker arguments when building the image.
@@ -142,13 +154,13 @@ You can deploy **any model on Hugging Face** that is supported by vLLM. For the
 
 # Usage: OpenAI Compatibility
 
-The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI Codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins> and <ins>Models</ins> - with both streaming and non-streaming.
+The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI Codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins>, <ins>Models</ins>, <ins>Responses</ins>, and <ins>Messages</ins> - with both streaming and non-streaming.
 
 ## Modifying your OpenAI Codebase to use your deployed vLLM Worker
 
 **Python** (similar to Node.js, etc.):
 
-1. When initializing the OpenAI Client in your code, change the `api_key` to your RunPod API Key and the `base_url` to your RunPod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`, filling in your deployed endpoint ID. For example, if your Endpoint ID is `abc1234`, the URL would be `https://api.runpod.ai/v2/abc1234/openai/v1`.
+1. When initializing the OpenAI Client in your code, change the `api_key` to your Runpod API Key and the `base_url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`, filling in your deployed endpoint ID. For example, if your Endpoint ID is `abc1234`, the URL would be `https://api.runpod.ai/v2/abc1234/openai/v1`.
 
    - Before:
 
@@ -174,7 +186,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
   ```python
   response = client.chat.completions.create(
       model="gpt-3.5-turbo",
-      messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+      messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
       temperature=0,
       max_tokens=100,
   )
@@ -183,15 +195,15 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
   ```python
   response = client.chat.completions.create(
       model="<YOUR DEPLOYED MODEL REPO/NAME>",
-      messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+      messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
       temperature=0,
       max_tokens=100,
   )
   ```
 
 **Using http requests**:
 
-1. Change the `Authorization` header to your RunPod API Key and the `url` to your RunPod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`
+1. Change the `Authorization` header to your Runpod API Key and the `url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`
    - Before:
   ```bash
   curl https://api.openai.com/v1/chat/completions \
@@ -202,7 +214,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
       "messages": [
         {
           "role": "user",
-          "content": "Why is RunPod the best platform?"
+          "content": "Why is Runpod the best platform?"
         }
       ],
       "temperature": 0,
@@ -219,7 +231,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
      "messages": [
        {
          "role": "user",
-         "content": "Why is RunPod the best platform?"
+         "content": "Why is Runpod the best platform?"
        }
      ],
      "temperature": 0,
@@ -239,7 +251,7 @@ When using the chat completion feature of the vLLM Serverless Endpoint Worker, y
 | Parameter | Type | Default Value | Description |
 | --------- | ---- | ------------- | ----------- |
 | `messages` | Union[str, List[Dict[str, str]]] | | List of messages, where each message is a dictionary with a `role` and `content`. The model's chat template will be applied to the messages automatically, so the model must have one or it should be specified as `CUSTOM_CHAT_TEMPLATE` env var. |
-| `model` | str | | The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your RunPod endpoint with OpenAI** section |
+| `model` | str | | The model repo that you've deployed on your Runpod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your Runpod endpoint with OpenAI** section |
 | `temperature` | Optional[float] | 0.7 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
 | `top_p` | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
 | `n` | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
@@ -269,15 +281,15 @@ Additional parameters supported by vLLM:
 
 </details>
 
-### Examples: Using your RunPod endpoint with OpenAI
+### Examples: Using your Runpod endpoint with OpenAI
 
-First, initialize the OpenAI Client with your RunPod API Key and Endpoint URL:
+First, initialize the OpenAI Client with your Runpod API Key and Endpoint URL:
 
 ```python
 from openai import OpenAI
 import os
 
-# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
+# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
 client = OpenAI(
     api_key=os.environ.get("RUNPOD_API_KEY"),
     base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
@@ -293,7 +305,7 @@ This is the format used for GPT-4 and focused on instruction-following and chat.
 # Create a chat completion stream
 response_stream = client.chat.completions.create(
     model="<YOUR DEPLOYED MODEL REPO/NAME>",
-    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
     temperature=0,
     max_tokens=100,
     stream=True,
@@ -307,7 +319,7 @@ This is the format used for GPT-4 and focused on instruction-following and chat.
 # Create a chat completion
 response = client.chat.completions.create(
     model="<YOUR DEPLOYED MODEL REPO/NAME>",
-    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
     temperature=0,
     max_tokens=100,
 )
@@ -325,6 +337,62 @@ list_of_models = [model.id for model in models_response]
 print(list_of_models)
 ```
 
+### OpenAI Responses API
+
+**Path:** `/openai/v1/responses` (full URL: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/responses`)
+
+Supports the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) request shape. Like other `/openai/` routes, this is served directly; use the `/openai/` prefix rather than the RunPod native job queue for these calls.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "input": "Tell me a joke."
+}
+```
+
+**Using HTTP requests:**
+
+```bash
+curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/responses \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
+  -d '{
+    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
+    "input": "Tell me a joke."
+  }'
+```
+
+### Anthropic Messages API
+
+**Path:** `/openai/v1/messages` (full URL: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/messages`)
+
+Supports the [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) format. Served directly, bypassing the RunPod queue.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "max_tokens": 256,
+  "messages": [
+    {"role": "user", "content": "Hello!"}
+  ]
+}
+```
+
+**Using HTTP requests:**
+
+```bash
+curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/messages \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
+  -d '{
+    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
+    "max_tokens": 256,
+    "messages": [
+      {"role": "user", "content": "Hello!"}
+    ]
+  }'
+```
+
 # Usage: Standard (Non-OpenAI)
 
 ## Request Input Parameters
````
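Since the new `/openai/v1/messages` route follows the Anthropic Messages shape, the official `anthropic` Python SDK can plausibly be pointed at the worker as well. A minimal sketch, with two assumptions not covered by this diff: the route honors the standard `Authorization: Bearer` header (as the curl example added above does), which is what the SDK's `auth_token` parameter sends, and the SDK appends `/v1/messages` to `base_url` itself, which is why the URL below stops at `/openai`:

```python
import os

from anthropic import Anthropic  # pip install anthropic

# base_url deliberately omits the /v1 suffix: the SDK appends /v1/messages,
# so requests land on the worker's /openai/v1/messages route.
client = Anthropic(
    auth_token=os.environ.get("RUNPOD_API_KEY"),  # sent as "Authorization: Bearer ..."
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai",
)

message = client.messages.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
)

# Anthropic responses carry a list of content blocks; text blocks expose .text.
print(message.content[0].text)
```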

builder/requirements.txt

Lines changed: 3 additions & 2 deletions
````diff
@@ -3,12 +3,13 @@ pandas
 pyarrow
 runpod
 huggingface-hub
-packaging
+lmcache==0.4.1
+packaging>=24.2
 typing-extensions>=4.8.0
 pydantic
 pydantic-settings
 hf-transfer
-transformers>=4.57.0
+transformers>=4.57.0,<5
 bitsandbytes>=0.45.0
 kernels
 torch-c-dlpack-ext
````
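The new `lmcache==0.4.1` pin backs the LMCache support named in the commit title; the diff only adds the dependency, so the handler-side wiring isn't visible here. For orientation, a sketch of the usual LMCache-to-vLLM hookup per the LMCache docs, which is an assumption about this worker's internals rather than something taken from this commit: `LMCacheConnectorV1` is the LMCache connector for vLLM's v1 engine, and `LMCACHE_MAX_LOCAL_CPU_SIZE` is LMCache's own env var for the CPU-side cache budget in GiB.

```python
import os

from vllm import LLM
from vllm.config import KVTransferConfig

# LMCache env var: how many GiB of CPU RAM to use for offloaded KV blocks.
os.environ.setdefault("LMCACHE_MAX_LOCAL_CPU_SIZE", "5")

# Route vLLM's KV cache through LMCache so shared prefixes can be reused
# across requests instead of being recomputed.
kv_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",  # LMCache connector for the vLLM v1 engine
    kv_role="kv_both",                  # this process both stores and loads KV blocks
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=kv_config,
)

print(llm.generate("Tell me a joke.")[0].outputs[0].text)
```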
