
Commit a1544ea

Merge pull request #277 from runpod-workers/feat/lmcache
feat: uv installer, LMCache support, add /v1/responses and /v1/messages endpoints
2 parents 9d16869 + 3403889 commit a1544ea

7 files changed

Lines changed: 366 additions & 32 deletions


.runpod/README.md

Lines changed: 34 additions & 2 deletions
````diff
@@ -1,4 +1,4 @@
-![vLLM worker banner](https://cpjrphpz3t5wbwfe.public.blob.vercel-storage.com/worker-vllm_banner.jpeg)
+![vLLM worker banner](https://image.runpod.ai/preview/vllm/vllm-banner.png)
 
 Run LLMs using [vLLM](https://docs.vllm.ai) with an OpenAI-compatible API
 
@@ -32,6 +32,9 @@ All behaviour is controlled through environment variables:
 
 For complete configuration options, see the [full configuration documentation](https://github.com/runpod-workers/worker-vllm/blob/main/docs/configuration.md).
 
+### Specify Transformers Version
+To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want. Note that this may break the handler, so use it for development purposes only.
+
 ## API Usage
 
 This worker supports two API formats: **RunPod native** and **OpenAI-compatible**.
@@ -157,6 +160,35 @@ For external clients and SDKs, use the `/openai/v1` path prefix with your RunPod
 {}
 ```
 
+#### OpenAI Responses API
+
+**Path:** `/openai/v1/responses`
+
+Supports the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) format. Note: this route bypasses the RunPod queue and is served directly, so use `/openai/`-prefixed paths rather than the RunPod job queue for these endpoints.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "input": "Tell me a joke."
+}
+```
+
+#### Anthropic Messages API
+
+**Path:** `/openai/v1/messages`
+
+Supports the [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) format. Served directly, bypassing the RunPod queue.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "max_tokens": 256,
+  "messages": [
+    {"role": "user", "content": "Hello!"}
+  ]
+}
+```
+
 #### Response Format
 
 Both APIs return the same response format:
@@ -190,7 +222,7 @@ Minimal Python example using the official `openai` SDK:
 from openai import OpenAI
 import os
 
-# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
+# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
 client = OpenAI(
     api_key=os.getenv("RUNPOD_API_KEY"),
     base_url=f"https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
````
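For the new `/openai/v1/responses` route, the same client setup can be reused. A minimal sketch with the official `openai` Python SDK, assuming the route accepts the standard Responses API request shape as the docs above state (the endpoint ID and model name placeholders are yours to fill in):

```python
import os

from openai import OpenAI  # Responses API support requires a recent openai SDK

# Same client setup as the chat-completions example in the diff above.
client = OpenAI(
    api_key=os.getenv("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

# POSTs to <base_url>/responses, i.e. the worker's /openai/v1/responses route.
response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # your deployed model repo/name
    input="Tell me a joke.",
)

# output_text concatenates the text parts of the model output.
print(response.output_text)
```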

Dockerfile

Lines changed: 13 additions & 11 deletions
````diff
@@ -1,20 +1,21 @@
 FROM nvidia/cuda:12.9.1-base-ubuntu22.04
 
 RUN apt-get update -y \
-    && apt-get install -y python3-pip
+    && apt-get install -y python3-pip curl \
+    && curl -LsSf https://astral.sh/uv/0.10.9/install.sh | sh
 
-RUN ldconfig /usr/local/cuda-12.9/compat/
-
-# Install vLLM with FlashInfer - use CUDA 12.8 PyTorch wheels (compatible with vLLM 0.15.1)
-RUN python3 -m pip install --upgrade pip && \
-    python3 -m pip install "vllm[flashinfer]==0.16.0" --extra-index-url https://download.pytorch.org/whl/cu129
+ENV PATH="/root/.local/bin:$PATH"
 
+RUN ldconfig /usr/local/cuda-12.9/compat/
 
+# Install vLLM with FlashInfer - use CUDA 12.9 PyTorch wheels
+RUN uv pip install --system "packaging>=24.2" && \
+    uv pip install --system "vllm[flashinfer]==0.16.0" --extra-index-url https://download.pytorch.org/whl/cu129
 
 # Install additional Python dependencies (after vLLM to avoid PyTorch version conflicts)
 COPY builder/requirements.txt /requirements.txt
-RUN --mount=type=cache,target=/root/.cache/pip \
-    python3 -m pip install --upgrade -r /requirements.txt
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system -r /requirements.txt
 
 # Setup for Option 2: Building the Image with the Model included
 ARG MODEL_NAME=""
@@ -46,12 +47,13 @@ ENV MODEL_NAME=$MODEL_NAME \
 ENV PYTHONPATH="/:/vllm-workspace"
 
 RUN if [ "${VLLM_NIGHTLY}" = "true" ]; then \
-    pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly && \
+    uv pip install --system -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly && \
     apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/* && \
-    pip install git+https://github.com/huggingface/transformers.git; \
+    uv pip install --system git+https://github.com/huggingface/transformers.git; \
     fi
 
 COPY src /src
+RUN chmod +x /src/start.sh
 RUN --mount=type=secret,id=HF_TOKEN,required=false \
     if [ -f /run/secrets/HF_TOKEN ]; then \
     export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
@@ -61,4 +63,4 @@ RUN --mount=type=secret,id=HF_TOKEN,required=false \
     fi
 
 # Start the handler
-CMD ["python3", "/src/handler.py"]
+CMD ["/bin/bash", "/src/start.sh"]
````

README.md

Lines changed: 84 additions & 16 deletions
````diff
@@ -2,10 +2,16 @@
 
 # OpenAI-Compatible vLLM Serverless Endpoint Worker
 
-Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https://github.com/vllm-project/vllm) Inference Engine on RunPod Serverless with just a few clicks.
+Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https://github.com/vllm-project/vllm) Inference Engine on Runpod Serverless with just a few clicks.
 
 </div>
 
+![vLLM worker banner](https://image.runpod.ai/preview/vllm/vllm-banner.png)
+
+Current vLLM version: [0.16.0](https://github.com/vllm-project/vllm/releases/tag/v0.16.0)
+
+> Check out our Load Balancer implementation here: [vLLM Load Balancer](https://github.com/runpod-workers/vllm-loadbalancer-ep)
+
 ## Table of Contents
 
 - [Setting up the Serverless Worker](#setting-up-the-serverless-worker)
@@ -21,9 +27,11 @@ Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https:
 - [Modifying your OpenAI Codebase to use your deployed vLLM Worker](#modifying-your-openai-codebase-to-use-your-deployed-vllm-worker)
 - [OpenAI Request Input Parameters](#openai-request-input-parameters)
 - [Chat Completions [RECOMMENDED]](#chat-completions-recommended)
-- [Examples: Using your RunPod endpoint with OpenAI](#examples-using-your-runpod-endpoint-with-openai)
+- [Examples: Using your Runpod endpoint with OpenAI](#examples-using-your-runpod-endpoint-with-openai)
 - [Chat Completions](#chat-completions)
 - [Getting a list of names for available models](#getting-a-list-of-names-for-available-models)
+- [OpenAI Responses API](#openai-responses-api)
+- [Anthropic Messages API](#anthropic-messages-api)
 - [Usage: Standard (Non-OpenAI)](#usage-standard-non-openai)
 - [Request Input Parameters](#request-input-parameters)
 - [Sampling Parameters](#sampling-parameters)
@@ -33,7 +41,7 @@ Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the [vLLM](https:
 
 ## Option 1: Deploy Any Model Using Pre-Built Docker Image [Recommended]
 
-**🚀 Deploy Guide**: Follow our [step-by-step deployment guide](https://docs.runpod.io/serverless/vllm/get-started) to deploy using the RunPod Console.
+**🚀 Deploy Guide**: Follow our [step-by-step deployment guide](https://docs.runpod.io/serverless/vllm/get-started) to deploy using the Runpod Console.
 
 **📦 Docker Image**: `runpod/worker-v1-vllm:<version>`
 
@@ -71,6 +79,10 @@ Any env var whose name matches a valid `AsyncEngineArgs` field (uppercased) is a
 
 For the complete list of all available environment variables, examples, and detailed descriptions: **[Configuration](docs/configuration.md)**
 
+### Specify Transformers Version
+To change the version of the [Transformers library](https://github.com/huggingface/transformers), set the `TRANSFORMERS_VERSION` environment variable to the version you want. Note that this may break the handler, so use it for development purposes only.
+
+
 ## Option 2: Build Docker Image with Model Inside
 
 To build an image with the model baked in, you must specify the following docker arguments when building the image.
@@ -142,13 +154,13 @@ You can deploy **any model on Hugging Face** that is supported by vLLM. For the
 
 # Usage: OpenAI Compatibility
 
-The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI Codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins> and <ins>Models</ins> - with both streaming and non-streaming.
+The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI Codebase by changing only 3 lines in total. The supported routes are <ins>Chat Completions</ins>, <ins>Models</ins>, <ins>Responses</ins>, and <ins>Messages</ins> - with both streaming and non-streaming.
 
 ## Modifying your OpenAI Codebase to use your deployed vLLM Worker
 
 **Python** (similar to Node.js, etc.):
 
-1. When initializing the OpenAI Client in your code, change the `api_key` to your RunPod API Key and the `base_url` to your RunPod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`, filling in your deployed endpoint ID. For example, if your Endpoint ID is `abc1234`, the URL would be `https://api.runpod.ai/v2/abc1234/openai/v1`.
+1. When initializing the OpenAI Client in your code, change the `api_key` to your Runpod API Key and the `base_url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`, filling in your deployed endpoint ID. For example, if your Endpoint ID is `abc1234`, the URL would be `https://api.runpod.ai/v2/abc1234/openai/v1`.
 
    - Before:
 
@@ -174,7 +186,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
   ```python
   response = client.chat.completions.create(
       model="gpt-3.5-turbo",
-      messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+      messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
       temperature=0,
       max_tokens=100,
   )
@@ -183,15 +195,15 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
   ```python
   response = client.chat.completions.create(
       model="<YOUR DEPLOYED MODEL REPO/NAME>",
-      messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+      messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
       temperature=0,
       max_tokens=100,
   )
   ```
 
 **Using http requests**:
 
-1. Change the `Authorization` header to your RunPod API Key and the `url` to your RunPod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`
+1. Change the `Authorization` header to your Runpod API Key and the `url` to your Runpod Serverless Endpoint URL in the following format: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1`
    - Before:
   ```bash
   curl https://api.openai.com/v1/chat/completions \
@@ -202,7 +214,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
       "messages": [
         {
           "role": "user",
-          "content": "Why is RunPod the best platform?"
+          "content": "Why is Runpod the best platform?"
         }
       ],
       "temperature": 0,
@@ -219,7 +231,7 @@ The vLLM Worker is fully compatible with OpenAI's API, and you can use it with a
      "messages": [
        {
          "role": "user",
-         "content": "Why is RunPod the best platform?"
+         "content": "Why is Runpod the best platform?"
        }
      ],
      "temperature": 0,
@@ -239,7 +251,7 @@ When using the chat completion feature of the vLLM Serverless Endpoint Worker, y
 | Parameter | Type | Default Value | Description |
 | --------- | ---- | ------------- | ----------- |
 | `messages` | Union[str, List[Dict[str, str]]] | | List of messages, where each message is a dictionary with a `role` and `content`. The model's chat template will be applied to the messages automatically, so the model must have one or it should be specified as `CUSTOM_CHAT_TEMPLATE` env var. |
-| `model` | str | | The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your RunPod endpoint with OpenAI** section |
+| `model` | str | | The model repo that you've deployed on your Runpod Serverless Endpoint. If you are unsure what the name is or are baking the model in, use the guide to get the list of available models in the **Examples: Using your Runpod endpoint with OpenAI** section |
 | `temperature` | Optional[float] | 0.7 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. |
 | `top_p` | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
 | `n` | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
@@ -269,15 +281,15 @@ Additional parameters supported by vLLM:
 
 </details>
 
-### Examples: Using your RunPod endpoint with OpenAI
+### Examples: Using your Runpod endpoint with OpenAI
 
-First, initialize the OpenAI Client with your RunPod API Key and Endpoint URL:
+First, initialize the OpenAI Client with your Runpod API Key and Endpoint URL:
 
 ```python
 from openai import OpenAI
 import os
 
-# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
+# Initialize the OpenAI Client with your Runpod API Key and Endpoint URL
 client = OpenAI(
     api_key=os.environ.get("RUNPOD_API_KEY"),
     base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
@@ -293,7 +305,7 @@ This is the format used for GPT-4 and focused on instruction-following and chat.
 # Create a chat completion stream
 response_stream = client.chat.completions.create(
     model="<YOUR DEPLOYED MODEL REPO/NAME>",
-    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
     temperature=0,
     max_tokens=100,
     stream=True,
@@ -307,7 +319,7 @@ This is the format used for GPT-4 and focused on instruction-following and chat.
 # Create a chat completion
 response = client.chat.completions.create(
     model="<YOUR DEPLOYED MODEL REPO/NAME>",
-    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
+    messages=[{"role": "user", "content": "Why is Runpod the best platform?"}],
     temperature=0,
     max_tokens=100,
 )
@@ -325,6 +337,62 @@ list_of_models = [model.id for model in models_response]
 print(list_of_models)
 ```
 
+### OpenAI Responses API
+
+**Path:** `/openai/v1/responses` (full URL: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/responses`)
+
+Supports the [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) request shape. Like other `/openai/` routes, this is served directly; use the `/openai/` prefix rather than the RunPod native job queue for these calls.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "input": "Tell me a joke."
+}
+```
+
+**Using HTTP requests:**
+
+```bash
+curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/responses \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
+  -d '{
+    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
+    "input": "Tell me a joke."
+  }'
+```
+
+### Anthropic Messages API
+
+**Path:** `/openai/v1/messages` (full URL: `https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/messages`)
+
+Supports the [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) format. Served directly, bypassing the RunPod queue.
+
+```json
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "max_tokens": 256,
+  "messages": [
+    {"role": "user", "content": "Hello!"}
+  ]
+}
+```
+
+**Using HTTP requests:**
+
+```bash
+curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/messages \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
+  -d '{
+    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
+    "max_tokens": 256,
+    "messages": [
+      {"role": "user", "content": "Hello!"}
+    ]
+  }'
+```
+
 # Usage: Standard (Non-OpenAI)
 
 ## Request Input Parameters
````
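Since the new `/openai/v1/messages` route follows the Anthropic Messages shape, the official `anthropic` Python SDK can plausibly be pointed at the worker as well. A minimal sketch, with two assumptions not covered by this diff: the route honors the standard `Authorization: Bearer` header (as the curl example added above does), which is what the SDK's `auth_token` parameter sends, and the SDK appends `/v1/messages` to `base_url` itself, which is why the URL below stops at `/openai`:

```python
import os

from anthropic import Anthropic  # pip install anthropic

# base_url deliberately omits the /v1 suffix: the SDK appends /v1/messages,
# so requests land on the worker's /openai/v1/messages route.
client = Anthropic(
    auth_token=os.environ.get("RUNPOD_API_KEY"),  # sent as "Authorization: Bearer ..."
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai",
)

message = client.messages.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
)

# Anthropic responses carry a list of content blocks; text blocks expose .text.
print(message.content[0].text)
```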

builder/requirements.txt

Lines changed: 3 additions & 2 deletions
````diff
@@ -3,12 +3,13 @@ pandas
 pyarrow
 runpod
 huggingface-hub
-packaging
+lmcache==0.4.1
+packaging>=24.2
 typing-extensions>=4.8.0
 pydantic
 pydantic-settings
 hf-transfer
-transformers>=4.57.0
+transformers>=4.57.0,<5
 bitsandbytes>=0.45.0
 kernels
 torch-c-dlpack-ext
````
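The new `lmcache==0.4.1` pin backs the LMCache support named in the commit title; the diff only adds the dependency, so the handler-side wiring isn't visible here. For orientation, a sketch of the usual LMCache-to-vLLM hookup per the LMCache docs, which is an assumption about this worker's internals rather than something taken from this commit: `LMCacheConnectorV1` is the LMCache connector for vLLM's v1 engine, and `LMCACHE_MAX_LOCAL_CPU_SIZE` is LMCache's own env var for the CPU-side cache budget in GiB.

```python
import os

from vllm import LLM
from vllm.config import KVTransferConfig

# LMCache env var: how many GiB of CPU RAM to use for offloaded KV blocks.
os.environ.setdefault("LMCACHE_MAX_LOCAL_CPU_SIZE", "5")

# Route vLLM's KV cache through LMCache so shared prefixes can be reused
# across requests instead of being recomputed.
kv_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",  # LMCache connector for the vLLM v1 engine
    kv_role="kv_both",                  # this process both stores and loads KV blocks
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=kv_config,
)

print(llm.generate("Tell me a joke.")[0].outputs[0].text)
```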
