Run LLMs using [vLLM](https://docs.vllm.ai) with an OpenAI-compatible API

All behaviour is controlled through environment variables:

For complete configuration options, see the [full configuration documentation](https://github.com/runpod-workers/worker-vllm/blob/main/docs/configuration.md).
### Specify Transformers Version
To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want to use. Note that this may break the handler, so use it for development purposes only.
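
For example, when running the worker image locally for development (the version string `4.43.0` is illustrative, and `<your-worker-image>` stands in for whatever image you deploy):

```bash
# Illustrative only: pin Transformers to a specific release for a dev run.
# Any released Transformers version string can be used here.
docker run --gpus all \
  -e TRANSFORMERS_VERSION=4.43.0 \
  <your-worker-image>
```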
## API Usage
This worker supports two API formats: **Runpod native** and **OpenAI-compatible**.

For external clients and SDKs, use the `/openai/v1` path prefix with your Runpod endpoint URL.
#### OpenAI Responses API
**Path:** `/openai/v1/responses`

Supports the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) format. Note: this route bypasses the Runpod queue and is served directly, so use the `/openai/` prefixed paths rather than the Runpod job queue for these endpoints.

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "input": "Tell me a joke."
}
```
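
For example, with curl (the endpoint ID `abc1234` is the illustrative ID used elsewhere in this README):

```bash
# Call the Responses route directly; replace abc1234 with your endpoint ID.
curl https://api.runpod.ai/v2/abc1234/openai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "input": "Tell me a joke."}'
```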
#### Anthropic Messages API
**Path:** `/openai/v1/messages`

Supports the [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) format. This route is also served directly, bypassing the Runpod queue.

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "max_tokens": 256,
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}
```
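
A minimal sketch of the same request from Python using `requests` (the endpoint ID `abc1234` is illustrative):

```python
import requests

# POST an Anthropic-style body to the worker's /openai/v1/messages route.
resp = requests.post(
    "https://api.runpod.ai/v2/abc1234/openai/v1/messages",
    headers={"Authorization": "Bearer <YOUR RUNPOD API KEY>"},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```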
#### Response Format
Both APIs return the same response format:

Minimal Python example using the official `openai` SDK:

```python
from openai import OpenAI
import os

# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),  # your Runpod API key (env var name illustrative)
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)
```

Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https://github.com/vllm-project/vllm) Inference Engine on Runpod Serverless with just a few clicks.

Any env var whose name matches a valid `AsyncEngineArgs` field (uppercased) is applied automatically.

For the complete list of all available environment variables, examples, and detailed descriptions: **[Configuration](docs/configuration.md)**
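
As a sketch, assuming `max_model_len` and `gpu_memory_utilization` are valid `AsyncEngineArgs` fields in your vLLM version, the uppercased forms would be picked up like this:

```bash
# Each variable maps to the AsyncEngineArgs field of the same lowercased name,
# e.g. MAX_MODEL_LEN -> max_model_len (field names assumed; check your vLLM version).
MAX_MODEL_LEN=8192
GPU_MEMORY_UTILIZATION=0.95
```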
### Specify Transformers Version
To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want to use. Note that this may break the handler, so use it for development purposes only.
## Option 2: Build Docker Image with Model Inside
To build an image with the model baked in, you must specify the following Docker build arguments when building the image.
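
As an illustrative sketch only (the `MODEL_NAME` build argument is an assumption here, not confirmed by this excerpt; use the actual arguments documented for this image):

```bash
# Hypothetical sketch: MODEL_NAME is an assumed build-argument name.
docker build -t <your-username>/worker-vllm:custom \
  --build-arg MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
  .
```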

You can deploy **any model on Hugging Face** that is supported by vLLM. For the full list of supported models and architectures, see the vLLM documentation.
# Usage: OpenAI Compatibility

The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins>, <ins>Models</ins>, <ins>Responses</ins>, and <ins>Messages</ins>, with both streaming and non-streaming support.
## Modifying your OpenAI Codebase to use your deployed vLLM Worker
**Python** (similar to Node.js, etc.):

1. When initializing the OpenAI Client in your code, change the `api_key` to your Runpod API Key and the `base_url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`, filling in your deployed endpoint ID. For example, if your Endpoint ID is `abc1234`, the URL would be `https://api.runpod.ai/v2/abc1234/openai/v1`.

- Before:

```python
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
    temperature=0,
    max_tokens=100,
)
```

- After:

```python
response = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
    temperature=0,
    max_tokens=100,
)
```
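
Since the worker supports streaming on these routes, the standard `openai` SDK streaming pattern should also work; a minimal sketch:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```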

**Using HTTP requests**:

1. Change the `Authorization` header to your Runpod API Key and the `url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`

- Before:

```bash
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Why is Runpod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
```

- After:

```bash
curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
  -d '{
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "messages": [
      {
        "role": "user",
        "content": "Why is Runpod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
```

When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can use the following parameters:

| Parameter | Type | Default Value | Description |
|-----------|------|---------------|-------------|
| `messages` | Union[str, List[Dict[str, str]]] | | List of messages, where each message is a dictionary with a `role` and `content`. The model's chat template will be applied to the messages automatically, so the model must have one, or one should be specified via the `CUSTOM_CHAT_TEMPLATE` env var. |
| `model` | str | | The model repo that you've deployed on your Runpod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your Runpod endpoint with OpenAI** section. |
| `temperature` | Optional[float] | 0.7 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
| `top_p` | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| `n` | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
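
For example, combining several of the parameters above in one request (values are illustrative, using a client configured as shown in the examples in this README):

```python
# Ask for three candidate completions with moderate sampling randomness.
response = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Suggest a tagline for a GPU cloud."}],
    temperature=0.7,
    top_p=0.9,
    n=3,
    max_tokens=64,
)
for choice in response.choices:
    print(choice.message.content)
```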

Additional parameters supported by vLLM:

### Examples: Using your Runpod endpoint with OpenAI

First, initialize the OpenAI Client with your Runpod API Key and Endpoint URL:

```python
from openai import OpenAI
import os

# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),  # your Runpod API key (env var name illustrative)
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)
```
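
Since the <ins>Models</ins> route is supported, you can list what the worker serves to find the exact `model` name to pass (useful when the model is baked into the image); a minimal sketch:

```python
# List the models served by your endpoint and print their IDs.
models = client.models.list()
for model in models.data:
    print(model.id)
```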