feat: custom IPEX-LLM Ollama Dockerfile, Intel GPU tuning, and VRAM/context docs #38
eSlider wants to merge 19 commits into eleiton:main
Conversation
Add ipex-ollama/Dockerfile that builds an IPEX-LLM Ollama image from scratch on Ubuntu 24.04 with Intel GPU compute runtimes (Level Zero, IGC, compute-runtime) and the portable ollama-ipex-llm bundle v2.3.0.

Update docker-compose.yml with:
- Intel GPU performance tuning (SYCL, SDP fusion, persistent cache)
- Refined device mapping (/dev/dri/renderD128 instead of full /dev/dri)
- Ollama memory/model management env vars
- Open-WebUI: explicit OLLAMA_BASE_URL, RAG web search, telemetry opt-out

Co-authored-by: Cursor <cursoragent@cursor.com>
- Restore standard device mapping (/dev/dri) and port (11434)
- Add configurable env vars with defaults: context length, KV cache type, flash attention, SDP fusion, NPU toggles, GPU layers, SYSMAN
- Re-enable OLLAMA_API and disable OPENAI_API in Open-WebUI
- Add comments documenting each environment variable

Co-authored-by: Cursor <cursoragent@cursor.com>
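For reference, a sketch of the kind of `.env` overrides these defaults enable. The authoritative names and defaults live in the commented docker-compose.yml; the values below are illustrative, and any variable name not confirmed elsewhere in this PR should be treated as an assumption.

```
# Illustrative .env overrides — defaults are defined in docker-compose.yml.
OLLAMA_CONTEXT_LENGTH=8192     # context length
OLLAMA_KV_CACHE_TYPE=q8_0      # KV cache quantisation (value illustrative)
OLLAMA_FLASH_ATTENTION=1       # flash attention toggle
SYCL_CACHE_PERSISTENT=1        # persistent SYCL JIT kernel cache
ZES_ENABLE_SYSMAN=1            # Level Zero SYSMAN for GPU memory reporting
```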
Dockerfile (ipex-ollama):
- Enable BuildKit syntax (# syntax=docker/dockerfile:1.4)
- Add ARG version pins for all Intel GPU runtime components
- Use --mount=type=cache for apt and wget downloads (faster rebuilds)
- Use wget -nc (no-clobber) for idempotent cached downloads
- Remove duplicate ENV blocks (USE_XETLA, ZES_ENABLE_SYSMAN)
- Remove unverified OLLAMA_USE_IPEX* env vars (no official docs)
- Fix NPU defaults (disabled by default, commented out)
- Fix comment syntax (#- → # ENV)
- Merge cleanup into fewer layers for smaller image

docs:
- Add docs/intel-arc-a770-context-limits.md with VRAM budget breakdown, context length vs KV cache trade-off tables, and recommended settings by model size for 16 GB Intel Arc GPUs

README.md:
- Add Documentation section linking to the new VRAM/context guide

Co-authored-by: Cursor <cursoragent@cursor.com>
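To illustrate the BuildKit pattern this commit describes, here is a simplified sketch — not the actual ipex-ollama/Dockerfile; the ARG name, pin value, and package list are placeholders.

```dockerfile
# syntax=docker/dockerfile:1.4
FROM ubuntu:24.04

# Pin runtime component versions in one place (placeholder value).
ARG LEVEL_ZERO_VERSION=1.0.0

# BuildKit cache mounts: apt package archives survive between builds,
# so rebuilds don't re-download every .deb.
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && \
    apt-get install -y --no-install-recommends wget ca-certificates

# Downloads use a cache mount plus `wget -nc` (no-clobber), so a file already
# present in the cache is not fetched again on rebuild.
```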
- Restore full open-webui service definition (was a placeholder comment)
- Merge settings from main with new additions:
  - OLLAMA_BASE_URL for inter-container connectivity
  - ENABLE_RAG_WEB_SEARCH for web search in RAG
  - Telemetry opt-out (SCARF_NO_ANALYTICS, DO_NOT_TRACK, ANONYMIZED_TELEMETRY)
- Fix missing $ in SYCL_CACHE_PERSISTENT env var reference

Co-authored-by: Cursor <cursoragent@cursor.com>
…links

- Add "Building a Custom Image" section to VRAM guide with version pin table and build instructions
- Add "Configuring via docker-compose" section with .env example
- Add links to Level Zero, IGC, and IPEX-LLM releases in Further Reading
- Expand README Documentation section with links to custom Dockerfile and docker-compose.yml with brief descriptions

Co-authored-by: Cursor <cursoragent@cursor.com>
Docker defaults /dev/shm to 64 MB which is too small for SYCL kernel caches, Level Zero buffers, and memory-mapped model loading. Setting shm_size to 16G sets the upper bound without pre-allocating memory. Also document the shared memory requirement in the VRAM guide tips. Co-authored-by: Cursor <cursoragent@cursor.com>
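In compose terms this is a one-line addition on the Ollama service — a sketch only, with the service name assumed from this PR's stack:

```yaml
services:
  ollama:
    # Upper bound for /dev/shm; Docker's 64 MB default is too small for SYCL
    # kernel caches, Level Zero buffers, and memory-mapped model loading.
    # This sets a ceiling — memory is not pre-allocated.
    shm_size: "16G"
```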
- Add docs/sycl-vs-vulkan.md with performance benchmarks (SYCL 40-100% faster than Vulkan on Intel Arc), three backend options (IPEX-LLM bundle, SYCL from source, upstream Vulkan), how the ggml-sycl source build and patch-sycl.py work, tested hardware table, and troubleshooting guide
- Add troubleshooting section to VRAM guide with OOM, slow first inference, Level-Zero kernel regression, and ABI mismatch fixes
- Add Tested Hardware table and SYCL vs Vulkan link to README

Co-authored-by: Cursor <cursoragent@cursor.com>
Expand the SYCL-from-source section in sycl-vs-vulkan.md with:
- Links to tmp/Dockerfile and tmp/patch-sycl.py
- Exact ggml commit (a5bb8ba4) and oneAPI version (2025.1.1)
- Detailed explanation of both patches and why they're needed
- How Stage 1 collects oneAPI runtime deps into /sycl-runner
- How Stage 2 assembles the final minimal image
- Step-by-step guide for updating to a new Ollama version

Co-authored-by: Cursor <cursoragent@cursor.com>
Show the key Dockerfile snippets from tmp/Dockerfile inline in the sycl-vs-vulkan doc: sparse-checkout of ggml-sycl, patch step, cmake build with icpx, runtime dependency collection, and Stage 2 assembly with official ollama binary + SYCL runner drop-in. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add tmp/ directory to repo (previously untracked):
- tmp/Dockerfile: multi-stage SYCL build with oneAPI (Ollama v0.15.6)
- tmp/patch-sycl.py: patches ggml-sycl for Ollama API compatibility
- tmp/docker-compose.yml: compose file for the SYCL source build
- tmp/start-ollama.sh: legacy entrypoint script
- tmp/test-glm-ocr.sh: vision model test script

Add Project Structure section to README showing the full repo layout.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Rename tmp/ to ollama-sycl/ to follow <service>/Dockerfile convention
- Move docker-compose.yml to docker-compose.ollama-sycl.yml at root (matches docker-compose.comfyui.yml, docker-compose.sdnext.yml, etc.)
- Refactor compose file: env-var driven defaults, shared volume names, consistent service naming (ollama-sycl, open-webui-sycl)
- Remove tmp/README.md (content lives in docs/sycl-vs-vulkan.md)
- Update all tmp/ references in docs and README
- Add SYCL build command to Setup section in README

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Prevents corporate/system HTTP proxies from intercepting container-to-container traffic (Open WebUI → Ollama). Both lowercase and uppercase variants are set since different libraries check different casing. Values include localhost, loopback, and Docker service names. Configurable via .env. Co-authored-by: Cursor <cursoragent@cursor.com>
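A sketch of how this looks on a service in the compose file — the variable name `NO_PROXY_HOSTS` and the default value are illustrative, not necessarily what the PR uses:

```yaml
services:
  open-webui:
    environment:
      # Both casings, since different HTTP libraries check different variants.
      - no_proxy=${NO_PROXY_HOSTS:-localhost,127.0.0.1,ollama,open-webui}
      - NO_PROXY=${NO_PROXY_HOSTS:-localhost,127.0.0.1,ollama,open-webui}
```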
- Update OLLAMA_VERSION from 0.15.6 to 0.16.1 in Dockerfile and compose
- Update GGML_COMMIT to ec98e200 (llama.cpp tag b7437) matching v0.16.1
- Rewrite patch-sycl.py: no patches needed since v0.16.1 (APIs converged)
  - graph_compute now has batch_size in both upstream and ollama
  - GGML_TENSOR_FLAG_COMPUTE removed from both
  - Script exits cleanly (code 0) when no patches are needed
  - Retains backward compatibility for older ollama versions
- Update all docs and README version references

Co-authored-by: Cursor <cursoragent@cursor.com>
… SYCL validate output

- Fix duplicate nested markdown link in Whisper service description
- Update patch-sycl.py description in project structure to reflect no-op since v0.16.1
- Add SYCL-from-source validate output showing v0.16.1 and SYCL0 device detection
- Fix outdated Vulkan "advantage" text (SYCL build is now also v0.16.1)
- Correct patch-sycl.py exit behavior description (exits cleanly, not with error)

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add Image Size Comparison table to sycl-vs-vulkan.md
- Update image sizes to measured values: ipex-ollama 1.03 GB, sycl-ollama 1.27 GB
- Update sycl-ollama build time to ~90s (measured)
- Both custom images are 4-15x smaller than alternatives

Co-authored-by: Cursor <cursoragent@cursor.com>
Impressive work, thank you @eSlider! I would love to test this PR. How do I best go about testing all of these changes so that I can report back here?
This is incredible work @eSlider! I was able to use your branch as-is with my Arc B580 (12 GB VRAM) and have qwen2.5-coder:7b running very smoothly. I'm still quite new to all of this, but I wanted to say thank you.
Thanks for the PR @eSlider, happy to receive contributions to the codebase. |
Can this file be removed? I believe it's not needed for running the project.
Can this file be removed? I believe it's not needed for running the project.
```python
original = src
applied = []

# 1. Fix graph_compute signature: add 'int batch_size' parameter
```
I believe this patch is needed still, and will probably be needed moving forward for a while.
I tested with Ollama 0.16.3 and GGML_COMMIT=ef83fb8601229ff650d952985be47e82d644bfaa
| if "int batch_size" in src: | ||
| applied.append("batch_size parameter") | ||
|
|
||
| # 2. Remove GGML_TENSOR_FLAG_COMPUTE skip-check entirely. |
This one on the other hand, I believe is not needed, as you suggest in the documentation.
So I think ideally this can either be removed, or the documentation updated to mention that one patch is needed for all versions and the other only for older versions?
```dockerfile
RUN mkdir -p /sycl-runner && \
    cp build/lib/ollama/libggml-sycl.so /sycl-runner/ && \
    # SYCL / DPC++ runtime
    for f in libsycl.so*; do true; done && \
```

```dockerfile
# =============================================================================
FROM ubuntu:24.04
ENV DEBIAN_FRONTEND=noninteractive \
    TZ=America/Los_Angeles
```
```dockerfile
RUN python3 /tmp/patch-sycl.py ml/backend/ggml/ggml/src/ggml-sycl/ggml-sycl.cpp
```

As of v0.16.1, the upstream ggml and Ollama APIs have converged — **no patches are needed**. The script detects this and exits cleanly.
Ollama v0.16.3 requires the 'int batch_size' parameter that upstream llama.cpp at commit ef83fb86 doesn't have. The patch script detects this and applies the fix.
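For readers following the thread, a minimal sketch of the kind of check being discussed — this is illustrative only, not the actual sycl-ollama/patch-sycl.py; the function name and edit are assumptions:

```python
# Illustrative sketch (not the real patch-sycl.py): apply the batch_size patch
# only when the upstream ggml-sycl source doesn't already declare it, and exit
# cleanly otherwise.
import re
import sys
from pathlib import Path

src_path = Path(sys.argv[1])          # e.g. .../ggml-sycl/ggml-sycl.cpp
src = src_path.read_text()

if "int batch_size" in src:
    # Upstream already carries the parameter (the v0.16.1 situation) — nothing to do.
    print("graph_compute already has batch_size; no patch needed")
    sys.exit(0)

# Older combinations (e.g. Ollama 0.16.3 against a ggml commit without the
# parameter, as reported above) need the signature extended. The function name
# and regex below are hypothetical.
patched = re.sub(r"graph_compute\(([^)]*)\)", r"graph_compute(\1, int batch_size)", src, count=1)
src_path.write_text(patched)
print("applied: batch_size parameter")
```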
Thank you all for your reviews, especially @eleiton!
The purpose of this research is to find the fastest calculation approach (inference method) by means of comparison. How would you suggest dividing up the competitors in order to create a benchmark? I'll leave the PR as WIP for now.
I would like to use an existing directory containing previously downloaded Ollama models to avoid having to download models yet again. I'm running more than one machine and copy models between machines to avoid download delays on a 10 Mb link (yeah, I'm in the boonies). To do so, a new environment variable "OLLAMA_MODELS" (or something better) could be used, defaulting to "{}". Then change the snippet from:

```yaml
volumes:
  ollama-volume: {}
  open-webui-volume: {}
```

to:

```yaml
volumes:
  ollama-volume: ${OLLAMA_MODELS}
  open-webui-volume: {}
```

I'm not a Docker expert by any means, so I could be totally off base, but it's a suggestion. I hope you get the idea.
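For what it's worth, a common way to achieve this is to bind-mount the existing models directory into the container instead of parameterising the named-volume definition. This is a sketch only, not part of the PR; the variable name and default path are made up for illustration:

```yaml
services:
  ollama:
    volumes:
      # Point the container's model store at an existing host directory.
      # OLLAMA_MODELS_DIR and its default are hypothetical.
      - ${OLLAMA_MODELS_DIR:-/mnt/shared/ollama-models}:/root/.ollama
```

Compose substitutes the variable at parse time, so pointing it at a directory copied between machines avoids re-downloading models.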
Thank you for your hard work. I was able to get this up and running, but I am having some trouble with my dual-GPU setup (Intel Arc A580 and A380). I am trying to troubleshoot the error messages right now with Claude.
Looking through the logs I found the following entries:
Is this a limitation of Ollama, a limitation of SYCL, or just how this new project was built?
I spent several hours with Claude troubleshooting the problem and rebuilding the Dockerfile at least a dozen times. I am not a developer, so the process quickly went above my head, but I asked Claude to summarize our steps in the hope that it helps others. Please let me know if you have any questions or any new builds that you would like me to try that support multiple GPUs.

# Multi-GPU Issue: Intel Arc A580 + A380 with deepseek-r1:14b

## Hardware

## What Works

## The Problem

Larger models (tested: deepseek-r1:14b at 8.37 GiB) crash at inference time with a "built for 1 devices" error. The model loads successfully — all layers are assigned and KV cache allocated across both GPUs — but it crashes on the first inference call when the A380 tries to execute a RMS_NORM kernel.

## Root Cause Analysis

The error "built for 1 devices" is a runtime SYCL program cache issue. ggml-sycl compiles GPU kernels (as OpenCL/SYCL programs) keyed to the first device it encounters (the A580). When the A380 tries to execute those same cached programs, it fails because they were compiled in the context of a single device. The RMS_NORM operation is simply the first kernel that hits this issue.

## Fix Attempts

### Attempt 1: CMAKE_CXX_FLAGS with spir64_gen + AOT

Added AOT compilation targets for both GPU architectures at cmake configure time:

```sh
-DCMAKE_CXX_FLAGS="-fsycl-targets=spir64_gen,spir64 -Xs \"-device acm-g10,acm-g11\""
```

Result: Failed.

### Attempt 2: Install ocloc in builder stage

Added Intel compute-runtime (ocloc) to the builder stage.

Result: Failed — ocloc version mismatch with oneAPI 2025.1.1. The oneAPI 2025.1.1 compiler passes a flag that compute-runtime 26.05.37020.3's ocloc does not support.

### Attempt 3: Drop spir64_gen, use portable spir64 only

Removed the ocloc requirement entirely and switched to portable SPIR-V IR:

```sh
-DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=spir64"
```

The idea: each GPU JITs from portable IR independently, so there is no shared compiled program.

Result: Build succeeded, but the crash persisted.

### Attempt 4: Patch CMakeLists.txt directly

Used:

```sh
sed -i \
  '/target_link_libraries(ggml-sycl/i target_compile_options(ggml-sycl PRIVATE -fsycl -fsycl-targets=spir64)\ntarget_link_options(ggml-sycl PRIVATE -fsycl -fsycl-targets=spir64)' \
  ml/backend/ggml/ggml/src/ggml-sycl/CMakeLists.txt
```

Result: Build succeeded with the flags applied at the right level, but the crash still persisted. Confirmed the "built for 1 devices" error is a runtime program cache issue, not a compile-time flag issue.

### Attempt 5: Clear SYCL runtime cache

Searched for and cleared any persistent JIT kernel cache:

```sh
find /root -name '*.bin' -path '*/sycl*' -delete
find /tmp -name '*.bin' -path '*/sycl*' -delete
```

Also set the SYCL cache environment variables to disable caching.

Result: No change. Cache was not the issue.

## Conclusion

The "built for 1 devices" error originates inside ggml-sycl's runtime program building logic.

## Workaround

Limit to a single GPU (the A580) via environment variables. This allows larger models to run on the A580's 8 GiB alone. The A380 is unused in this configuration.

## Environment
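The exact environment variables used for the single-GPU workaround above were not preserved in the paste. For illustration, one common way to pin the SYCL runtime to a single Level Zero device — an assumption, not necessarily what was used here — is:

```sh
# Hypothetical illustration: restrict the container to the first Level Zero
# device (the A580) so the A380 is ignored entirely.
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
```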
As a follow-up to my last post:

# [SOLVED] Intel Arc SYCL Multi-GPU Deduplication (SameBackendDevice) & Tensor Splitting Fix

A community fix developed through a joint collaboration between Antigravity (Gemini 3.1 Pro), Claude (Sonnet 4.6), and Codex (GPT-5.4).

## 1. The Symptoms

If you are running Ollama with heterogeneous Intel Arc GPUs (for example, combining an A580 and an A380) natively over SYCL (Level Zero) via a Docker container, you may have encountered a silent multi-GPU failure.

## 2. The Root Cause (Silent Deduplication)

Through extensive source code tracing, our multi-agent AI framework located the root cause in the Go-layer abstractions, not the Intel drivers. The flaw exists in the Go device deduplicator, which compares multiple devices. Because SYCL0 and SYCL1 share the same library string but both report their identity as "", Go assumes they are clones of the same physical hardware wrapper. This triggers a silent deduplication internally, erasing SYCL1 from the active array.

## 3. The Solution

To unify the VRAM (13.2 GiB on our A580+A380) and enable heterogeneous tensor splitting, we applied a surgical C++ identity injection hook and accompanying environment tweaks.

### Part 1: The C++ Identity Hook (patch-sycl-v2.py)

When compiling the SYCL backend, the patch provides each device with an explicit identity:

```cpp
dev_ctx->id = "sycl:" + std::to_string(i);
dev_ctx->library = "SYCL";
dev_ctx->device_id = "sycl:" + std::to_string(i);
props->id = ctx->id.c_str();
props->device_id = ctx->device_id.c_str();
props->library = ctx->library.c_str();
```

We also included a diagnostic hook.

### Part 2: OOM Safeguards & Context Isolation (.env)

While the identity patch enables the discovery of both cards, stability on disparate silicon requires environmental isolation to prevent queue stalls and out-of-memory (OOM) failures:

```
# JIT / Level Zero cache separation
SYCL_CACHE_PERSISTENT=0
SYCL_CACHE_IN_MEM=0
SYCL_ENABLE_DEFAULT_CONTEXTS=0
SYCL_RT_WARNING_LEVEL=1
ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
ZE_FLAT_DEVICE_HIERARCHY=FLAT

# Conservative stability constraints (working production config)
OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_CONTEXT_LENGTH=8192
```

(Note: Flash Attention was logged as disabled during the successful run on this SYCL architecture.)

## 4. The Results

Upon rebooting, the Go engine accepted BOTH indices as unique devices. We demonstrated a successful dual-GPU inference run using Qwen 2.5 14B, spilling 3.7 GB onto the A580 and the remaining 2.3 GB directly onto the A380. The KV layer map correctly assigned compute layers across both cards, achieving stable generation where it previously failed.
Unfortunately, a few weeks ago I burned out the iGPU on my machine. I'd appreciate it if someone could take on the task of adding or using these improvements in a PR. My plans regarding whether to buy an Intel Arc GPU or an iGPU are still up in the air.
I am currently working on getting this to run with the new Ollama releases. I am having trouble getting qwen3.5:9b to run. I am happy to share my findings and files, but I've never managed a GitHub repository. What would be the best way to hand them off?
I don't want to admit how many hours and tokens I have spent across multiple providers trying to get this to work, but long story short, my AI finally sent me the following:
It then went on to tell me to sell my Intel Arc GPUs and purchase two new Nvidia GPUs. LOL. I saw in a recent Unraid Uncast show that Spaceinvaderone created a new Unraid app specific to Ollama with Arc GPUs. I am going to give this a whirl, but based on the issues already posted in the GitHub repository, it doesn't look promising.

Closes #37
Summary

Project Structure

Refactored into a clean `<service>/Dockerfile` + `docker-compose.<service>.yml` convention.

Custom Dockerfiles

- `ipex-ollama/Dockerfile` — IPEX-LLM bundle-based image (Ollama v0.9.3):
  - `# syntax=docker/dockerfile:1.4` with `--mount=type=cache` for apt and download caching
  - `ARG` version pins for all Intel GPU runtime components (bump in one place)
- `sycl-ollama/Dockerfile` — SYCL-from-source image (Ollama v0.16.1):
  - Stage 1 builds `ggml-sycl` with Intel oneAPI `icpx`, Stage 2 is a minimal runtime
- `sycl-ollama/patch-sycl.py` — backward-compatible API patching (no patches needed since v0.16.1 — APIs converged)
  - ggml commit `ec98e200` (llama.cpp tag b7437) matching Ollama v0.16.1
  - ships `libggml-sycl.so` + stripped oneAPI runtime libs alongside the official Ollama binary
- `test-glm-ocr.sh` vision model test script

Docker Compose

- `docker-compose.yml` (main stack):
  - `${VAR:-default}` syntax (override with `.env` or shell)
  - `shm_size: "16G"` for SYCL/Level Zero shared memory (Docker defaults to 64 MB)
  - `no_proxy`/`NO_PROXY` on all services — prevents corporate/system HTTP proxies from intercepting container-to-container traffic
  - `open-webui` service with `OLLAMA_BASE_URL`, RAG web search, telemetry opt-out
- `docker-compose.sycl-ollama.yml` (alternative stack):
  - shared env-var defaults, `no_proxy`, and Open WebUI config
  - reuses `ollama-volume` so models persist when switching between stacks

Documentation

- `docs/sycl-vs-vulkan.md` — SYCL vs Vulkan comparison, including how `patch-sycl.py` works (and why it's a no-op since v0.16.1)
- `docs/intel-arc-a770-context-limits.md` — VRAM & context guide
- `README.md` — Documentation, Tested Hardware, and Project Structure sections

Build & test verification (2026-02-16)

IPEX-LLM stack (ipex-ollama)

- `docker build -t ipex-ollama:latest ./ipex-ollama/` — all layers build successfully
- model inference verified (using Intel GPU)
- Ollama listening on `0.0.0.0:11434`

SYCL-from-source stack (sycl-ollama)

- `docker compose -f docker-compose.sycl-ollama.yml build` — all stages pass (~87s)
- `patch-sycl.py` exits with code 0 — no patches needed (APIs converged in v0.16.1)
- `ggml-sycl` compiled successfully → `libggml-sycl.so` built and stripped
- `ollama list` returns models from the shared volume (7 models loaded)
- `ollama run llama3.2:1b` responded correctly using the `SYCL0 compute buffer` (1074 MiB allocated)
- `docker compose down`

Test plan (remaining manual checks)

- `docker compose -f docker-compose.sycl-ollama.yml up --build`
- `patch-sycl.py` exits with code 0 during build
- `docker compose up -d`
- context length override works (`OLLAMA_CONTEXT_LENGTH=8192` in `.env`)
- `no_proxy` prevents proxy interference on corporate networks
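For anyone picking up the remaining manual checks, the commands are the ones already listed above; editing `.env` is shown here as one possible way to set the context override:

```sh
# SYCL-from-source stack: build + run, watching for patch-sycl.py exiting with code 0 in the build log.
docker compose -f docker-compose.sycl-ollama.yml up --build

# Main (IPEX-LLM) stack.
docker compose up -d

# Context length override — set in .env (or export it), then recreate the stack.
echo "OLLAMA_CONTEXT_LENGTH=8192" >> .env
docker compose up -d
```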