
Commit 4a65586

gc-fuliu-shaojun and liu-shaojun authored
Release b8 (#301)
* fix
* enable b8
* Add
* delete
* fix

---------

Co-authored-by: liu-shaojun <shaojun.liu@intel.com>
1 parent: 35a14cb · commit: 4a65586

3 files changed

Lines changed: 6299 additions & 9387 deletions


vllm/README.md

Lines changed: 12 additions & 0 deletions
@@ -1193,6 +1193,18 @@ export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
 
 ## 5. Performance tuning
 
+### 5.1 Avoid Memory Fragmentation
+
+To avoid GPU memory fragmentation (which can lead to out-of-memory errors even when sufficient memory appears available), enable PyTorch's expandable segments feature:
+
+```bash
+export PYTORCH_ALLOC_CONF="expandable_segments:True"
+```
+
+Set this environment variable **before** launching the vLLM service. This allows PyTorch's memory allocator to use expandable segments instead of fixed-size blocks, significantly reducing fragmentation over long-running sessions.
+
+### 5.2 CPU Affinity (NUMA Binding)
+
 To improve performance, you can optimize CPU affinity based on the GPU–NUMA topology.
 
 For example, if your process uses two GPUs that are both connected to NUMA node 0, you can use lscpu to identify the CPU cores associated with that NUMA node:

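As a companion to the two new tuning subsections above, here is a minimal launch sketch that applies both settings together. It is illustrative only: the core range `0-31` and the model name are placeholders, and the actual cores for NUMA node 0 come from your own `lscpu` output.

```bash
# Enable expandable segments before the service starts (section 5.1).
export PYTORCH_ALLOC_CONF="expandable_segments:True"

# Identify the CPU cores attached to NUMA node 0 (section 5.2).
lscpu | grep "NUMA node0"

# Bind the vLLM process to those cores and to NUMA node 0's memory.
# "0-31" is a placeholder; replace it with the range reported above.
numactl --physcpubind=0-31 --membind=0 \
    vllm serve meta-llama/Llama-3.1-8B-Instruct
```
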
vllm/docker/Dockerfile

Lines changed: 10 additions & 22 deletions
@@ -2,7 +2,7 @@
 # SPDX-License-Identifier: Apache-2.0
 
 # ======== Base Stage ========
-FROM intel/llm-scaler-platform:26.5.6.1 AS vllm-base
+FROM intel/llm-scaler-platform:26.13.7.1 AS vllm-base
 
 ARG https_proxy
 ARG http_proxy
@@ -54,7 +54,7 @@ RUN python3 -m pip config set global.break-system-packages true
 
 # Clone + patch vllm
 RUN --mount=type=cache,target=/root/.cache/pip \
-    git clone -b v0.11.1 https://github.com/vllm-project/vllm.git && \
+    git clone -b v0.14.0 https://github.com/vllm-project/vllm.git && \
     cd vllm && \
     git apply /tmp/vllm_for_multi_arc.patch && \
     pip install -r requirements/xpu.txt && \
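
The hunk above only bumps the vLLM tag that gets cloned and patched (v0.11.1 → v0.14.0). A quick, hedged sanity check inside the built image, assuming the build installs the wheel produced from this checkout:

```bash
# Expect the version matching the cloned tag, e.g. 0.14.0.
python3 -c "import vllm; print(vllm.__version__)"
```
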
@@ -100,9 +100,14 @@ RUN --mount=type=cache,target=/root/.cache/pip \
 
 # Pin transformers version to avoid conflict in vLLM
 RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install "transformers==4.57.3" && \
+    # pip install "transformers==4.57.3" && \
     pip install librosa soundfile decord
 
+# FIX triton
+RUN --mount=type=cache,target=/root/.cache/pip \
+    pip uninstall triton triton-xpu -y && \
+    pip install triton-xpu==3.6.0 --extra-index-url=https://download.pytorch.org/whl/test/xpu
+
 
 # Set additional environment for production usage
 ENV VLLM_QUANTIZE_Q40_LIB="/usr/local/lib/python3.12/dist-packages/vllm_int4_for_multi_arc.so"
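
Likewise, after the "FIX triton" step above, one hedged way to confirm which Triton build is active, assuming the triton-xpu wheel still installs under the `triton` import name:

```bash
# Expect the pinned XPU build, i.e. 3.6.0.
python3 -c "import triton; print(triton.__version__)"
```
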
@@ -116,26 +121,9 @@ RUN cd /llm/vllm && \
 
 RUN pip uninstall oneccl oneccl-devel -y
 
-ENV TBBROOT=/opt/intel/oneapi/tbb/2022.2/env/.. \
-    CCL_ROOT=/opt/intel/oneapi/ccl/2021.15.7-down.1 \
-    CMPLR_ROOT=/opt/intel/oneapi/compiler/2025.2 \
-    MKLROOT=/opt/intel/oneapi/mkl/2025.2 \
-    DPL_ROOT=/opt/intel/oneapi/dpl/2022.9 \
-    DNNLROOT=/opt/intel/oneapi/dnnl/2025.2 \
-    I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.16
-
-
-ENV PKG_CONFIG_PATH=/opt/intel/oneapi/tbb/2022.2/env/../lib/pkgconfig:/opt/intel/oneapi/mpi/2021.16/lib/pkgconfig:/opt/intel/oneapi/mkl/2025.2/lib/pkgconfig:/opt/intel/oneapi/dpl/2022.9/lib/pkgconfig:/opt/intel/oneapi/dnnl/2025.2/lib/pkgconfig:/opt/intel/oneapi/compiler/2025.2/lib/pkgconfig:/opt/intel/oneapi/ccl/2021.15.7-down.1/lib/pkgconfig/
-
-ENV CMAKE_PREFIX_PATH=/opt/intel/oneapi/tbb/2022.2/env/..:/opt/intel/oneapi/pti/0.13/lib/cmake/pti:/opt/intel/oneapi/mkl/2025.2/lib/cmake:/opt/intel/oneapi/dpl/2022.9/lib/cmake/oneDPL:/opt/intel/oneapi/dnnl/2025.2/lib/cmake:/opt/intel/oneapi/compiler/2025.2:/opt/intel/oneapi/ccl/2021.15.7-down.1/lib/cmake/oneCCL
-
-ENV LIBRARY_PATH=/opt/intel/oneapi/tcm/1.4/lib:/opt/intel/oneapi/umf/0.11/lib:/opt/intel/oneapi/tbb/2022.2/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/pti/0.13/lib:/opt/intel/oneapi/mpi/2021.16/lib:/opt/intel/oneapi/mkl/2025.2/lib:/opt/intel/oneapi/dnnl/2025.2/lib:/opt/intel/oneapi/compiler/2025.2/lib:/opt/intel/oneapi/ccl/2021.15.7-down.1/lib
-
-ENV LD_LIBRARY_PATH=/opt/intel/oneapi/tcm/1.4/lib:/opt/intel/oneapi/umf/0.11/lib:/opt/intel/oneapi/tbb/2022.2/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/pti/0.13/lib:/opt/intel/oneapi/mpi/2021.16/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.16/lib:/opt/intel/oneapi/mkl/2025.2/lib:/opt/intel/oneapi/dnnl/2025.2/lib:/opt/intel/oneapi/debugger/2025.2/opt/debugger/lib:/opt/intel/oneapi/compiler/2025.2/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.2/lib:/opt/intel/oneapi/ccl/2021.15.7-down.1/lib:/usr/local/lib/
+RUN rm /usr/lib/python3/dist-packages/PyJWT-2.7.0.dist-info/ -rf
 
-ENV CPLUS_INCLUDE_PATH=/opt/intel/oneapi/umf/0.11/include:/opt/intel/oneapi/tbb/2022.2/env/../include:/opt/intel/oneapi/pti/0.13/include:/opt/intel/oneapi/mpi/2021.16/include:/opt/intel/oneapi/mkl/2025.2/include:/opt/intel/oneapi/dpl/2022.9/include:/opt/intel/oneapi/dpcpp-ct/2025.2/include
-ENV CPATH=/opt/intel/oneapi/umf/0.11/include:/opt/intel/oneapi/mkl/2025.2/include:/opt/intel/oneapi/dnnl/2025.2/include:/opt/intel/oneapi/dev-utilities/2025.2/include:/opt/intel/oneapi/ccl/2021.15.7-down.1/include
+RUN echo "source /opt/intel/oneapi/setvars.sh --force" >> /root/.bashrc
 
-# ENTRYPOINT ["bash", "-c", "source /opt/intel/oneapi/setvars.sh --force && python3 -m vllm.entrypoints.openai.api_server"]
 ENTRYPOINT ["bash", "-c", "vllm serve"]
 
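Finally, a hedged sketch of how the updated Dockerfile might be built and smoke-tested; the image tag, build context, and device flags are assumptions for illustration, not part of this commit:

```bash
# Build the image (proxy build args are optional, only needed behind a proxy).
docker build \
    --build-arg https_proxy=$https_proxy \
    --build-arg http_proxy=$http_proxy \
    -t llm-scaler-vllm:b8 \
    -f vllm/docker/Dockerfile vllm/docker

# Open a shell in the container with Intel GPUs exposed, overriding the
# "vllm serve" entrypoint for inspection.
docker run --rm -it --device /dev/dri --entrypoint bash llm-scaler-vllm:b8
```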
