
feat: custom IPEX-LLM Ollama Dockerfile, Intel GPU tuning, and VRAM/context docs #38

Draft

eSlider wants to merge 19 commits into eleiton:main from eSlider:feat/ipex-ollama-dockerfile-and-config-tuning

Conversation


@eSlider eSlider commented Feb 16, 2026

Closes #37

Summary

Project Structure

Refactored into a clean <service>/Dockerfile + docker-compose.<service>.yml convention:

.
├── docker-compose.yml                # Main stack: Ollama (IPEX-LLM) + Open WebUI
├── docker-compose.sycl-ollama.yml    # SYCL-from-source Ollama + Open WebUI (alternative)
├── docker-compose.comfyui.yml        # ComfyUI image generation
├── docker-compose.sdnext.yml         # SD.Next image generation
├── docker-compose.whisper.yml        # OpenAI Whisper speech recognition
├── docker-compose.ramalama.yml       # RamaLama support
│
├── ipex-ollama/Dockerfile            # IPEX-LLM bundle build (Ollama v0.9.3, SYCL)
├── sycl-ollama/                      # SYCL-from-source build (Ollama v0.16.1)
│   ├── Dockerfile                    # Multi-stage: oneAPI build → minimal runtime
│   ├── patch-sycl.py                 # API compat patches (no-op since v0.16.1)
│   ├── start-ollama.sh               # Legacy entrypoint
│   └── test-glm-ocr.sh              # Vision model test script
│
├── docs/
│   ├── sycl-vs-vulkan.md             # SYCL vs Vulkan backend comparison
│   └── intel-arc-a770-context-limits.md  # VRAM & context length guide
└── ...
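
Each stack is selected by its compose file; for reference, these are the same commands listed in the test plan further down:

```bash
# main stack: IPEX-LLM Ollama + Open WebUI
docker compose up -d

# alternative stack: SYCL-from-source Ollama + Open WebUI
docker compose -f docker-compose.sycl-ollama.yml up --build
```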

Custom Dockerfiles

  • ipex-ollama/Dockerfile — IPEX-LLM bundle-based image (Ollama v0.9.3):

    • BuildKit # syntax=docker/dockerfile:1.4 with --mount=type=cache for apt and download caching (see the sketch after this list)
    • ARG version pins for all Intel GPU runtime components (bump in one place)
    • Latest runtimes: Level Zero v1.28.0, IGC v2.28.4, compute-runtime 26.05.37020.3, IPEX-LLM v2.3.0b20250725
    • Clean ENV section — no duplicates, no unverified vars, NPU disabled by default
  • sycl-ollama/Dockerfile — SYCL-from-source image (Ollama v0.16.1):

    • Multi-stage build: Stage 1 compiles ggml-sycl with Intel oneAPI icpx, Stage 2 is minimal runtime
    • sycl-ollama/patch-sycl.py — backward-compatible API patching (no patches needed since v0.16.1 — APIs converged)
    • GGML commit ec98e200 (llama.cpp tag b7437) matching Ollama v0.16.1
    • Drops in libggml-sycl.so + stripped oneAPI runtime libs alongside official Ollama binary
    • Includes test-glm-ocr.sh vision model test script
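
For orientation, the caching and version-pinning pattern described above looks roughly like the sketch below. This is not the actual file: the version values, package list, and download URL are placeholders; the real pins live in ipex-ollama/Dockerfile.

```dockerfile
# syntax=docker/dockerfile:1.4
FROM ubuntu:24.04

# All Intel runtime versions are pinned as ARGs so a bump happens in one place
# (placeholder values; see ipex-ollama/Dockerfile for the real pins).
ARG LEVEL_ZERO_VERSION=x.y.z
ARG IPEX_LLM_VERSION=x.y.z

# BuildKit cache mount: apt packages survive between rebuilds
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    rm -f /etc/apt/apt.conf.d/docker-clean && \
    apt-get update && \
    apt-get install -y --no-install-recommends wget ca-certificates

# Cached downloads: wget -nc (no-clobber) skips files already in the cache mount.
# Anything needed in the final image must be copied out of the cache mount in this same RUN.
RUN --mount=type=cache,target=/downloads \
    wget -nc -P /downloads "https://example.invalid/ollama-ipex-llm-${IPEX_LLM_VERSION}.tgz"
```

With this pattern, bumping Level Zero, IGC, compute-runtime, or IPEX-LLM is a one-line ARG change, or a --build-arg at build time.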

Docker Compose

  • docker-compose.yml (main stack):

    • All env vars configurable via ${VAR:-default} syntax (override with .env or shell; an example .env sketch follows this list)
    • shm_size: "16G" for SYCL/Level Zero shared memory (Docker defaults to 64 MB)
    • no_proxy / NO_PROXY on all services — prevents corporate/system HTTP proxies from intercepting container-to-container traffic
    • Intel GPU perf tuning (SYCL immediate command lists, SDP fusion, persistent cache, XeTLA)
    • Ollama context/memory management (context length, KV cache quantization, flash attention)
    • Detailed inline comments explaining each variable's impact
    • Restored full open-webui service with OLLAMA_BASE_URL, RAG web search, telemetry opt-out
  • docker-compose.sycl-ollama.yml (alternative stack):

    • Self-contained alternative using SYCL-from-source build (Ollama v0.16.1)
    • Same env-var driven defaults, no_proxy, and Open WebUI config
    • Shares ollama-volume so models persist when switching between stacks
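
As a concrete example of the ${VAR:-default} overrides mentioned above, an .env file next to the compose file could look like the sketch below. The variable names are the standard Ollama and SYCL knobs referenced in this PR; the exact set and defaults actually wired up are in docker-compose.yml, and the service names in no_proxy are illustrative.

```bash
# .env (sketch): placed next to docker-compose.yml, picked up automatically by docker compose

# Larger contexts cost more VRAM for KV cache (see the VRAM/context guide)
OLLAMA_CONTEXT_LENGTH=8192

# KV cache quantization: f16 / q8_0 / q4_0
OLLAMA_KV_CACHE_TYPE=q8_0

# Flash attention toggle
OLLAMA_FLASH_ATTENTION=1

# Keep compiled SYCL kernels across container restarts
SYCL_CACHE_PERSISTENT=1

# Keep container-to-container traffic off any corporate proxy
no_proxy=localhost,127.0.0.1,ollama,open-webui
```

This is the mechanism the test plan exercises with OLLAMA_CONTEXT_LENGTH=8192 in .env.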

Documentation

  • docs/sycl-vs-vulkan.md — SYCL vs Vulkan comparison:

    • Performance benchmarks (SYCL 40–100% faster on Intel Arc)
    • Three backend options: IPEX-LLM bundle, SYCL from source, upstream Vulkan
    • Full Dockerfile snippets showing the multi-stage SYCL build
    • How patch-sycl.py works (and why it's a no-op since v0.16.1)
    • Step-by-step guide for updating to new Ollama versions
    • Troubleshooting (device detection, ABI mismatch, OOM, kernel 6.18+ regression)
  • docs/intel-arc-a770-context-limits.md — VRAM & context guide:

    • VRAM budget breakdown (weights + KV cache + overhead)
    • Context length vs VRAM trade-off tables (f16 / q8_0 / q4_0); a worked example follows this list
    • Recommended settings by model size (7B, 13B, 30B+) for 16 GB Arc
    • Environment variables reference for all Ollama and Intel tuning knobs
    • Building a custom image section with version pin table
  • README.md:

    • Tested Hardware table, Documentation section, Project Structure tree
    • SYCL-from-source setup command and validate output in Setup section
    • Fixed broken Whisper markdown link
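
To make the trade-off concrete, the usual KV-cache estimate (not taken verbatim from the guide, which has the full tables) is:

$$
\text{KV cache bytes} \approx 2 \times n_{\text{layers}} \times n_{\text{ctx}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{bytes per element}
$$

For a Llama-3-8B-class model (32 layers, 8 KV heads, head dim 128) at an 8192-token context, that is about 1 GiB at f16, roughly half that at q8_0, and a quarter at q4_0, on top of the model weights; this is the budget the guide's tables trade off against a 16 GB Arc card.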

Build & test verification (2026-02-16)

IPEX-LLM stack (ipex-ollama)

  • docker build -t ipex-ollama:latest ./ipex-ollama/ — all layers build successfully
  • Ollama v0.9.3 starts and registers all API routes
  • Intel GPU detected at startup (using Intel GPU)
  • Server listens on 0.0.0.0:11434

SYCL-from-source stack (sycl-ollama)

  • docker compose -f docker-compose.sycl-ollama.yml build completes all stages (~87s)
  • patch-sycl.py exits with code 0 — no patches needed (APIs converged in v0.16.1)
  • ggml-sycl compiled successfully → libggml-sycl.so built and stripped
  • Ollama v0.16.1 starts, discovers SYCL backend:
    name=SYCL0 description="Intel(R) Arc(TM) Graphics" type=discrete total="28.0 GiB"
    
  • ollama list returns models from shared volume (7 models loaded)
  • Inference test passed: ollama run llama3.2:1b responded correctly using SYCL0 compute buffer (1074 MiB allocated)
  • Container cleans up properly with docker compose down

Test plan (remaining manual checks)

  • Build SYCL-from-source stack: docker compose -f docker-compose.sycl-ollama.yml up --build
  • Verify Ollama v0.16.1 responds
  • Run a model and confirm correct inference output
  • Confirm Intel GPU is used (SYCL0 device in logs)
  • Verify patch-sycl.py exits with code 0 during build
  • Run the IPEX-LLM stack: docker compose up -d
  • Verify Open-WebUI loads at http://localhost:4040 and connects to Ollama
  • Test with custom env overrides (e.g. OLLAMA_CONTEXT_LENGTH=8192 in .env)
  • Verify no_proxy prevents proxy interference on corporate networks

eSlider and others added 5 commits January 11, 2026 12:20
Add ipex-ollama/Dockerfile that builds an IPEX-LLM Ollama image from
scratch on Ubuntu 24.04 with Intel GPU compute runtimes (Level Zero,
IGC, compute-runtime) and the portable ollama-ipex-llm bundle v2.3.0.

Update docker-compose.yml with:
- Intel GPU performance tuning (SYCL, SDP fusion, persistent cache)
- Refined device mapping (/dev/dri/renderD128 instead of full /dev/dri)
- Ollama memory/model management env vars
- Open-WebUI: explicit OLLAMA_BASE_URL, RAG web search, telemetry opt-out

Co-authored-by: Cursor <cursoragent@cursor.com>
- Restore standard device mapping (/dev/dri) and port (11434)
- Add configurable env vars with defaults: context length, KV cache type,
  flash attention, SDP fusion, NPU toggles, GPU layers, SYSMAN
- Re-enable OLLAMA_API and disable OPENAI_API in Open-WebUI
- Add comments documenting each environment variable

Co-authored-by: Cursor <cursoragent@cursor.com>
Dockerfile (ipex-ollama):
- Enable BuildKit syntax (# syntax=docker/dockerfile:1.4)
- Add ARG version pins for all Intel GPU runtime components
- Use --mount=type=cache for apt and wget downloads (faster rebuilds)
- Use wget -nc (no-clobber) for idempotent cached downloads
- Remove duplicate ENV blocks (USE_XETLA, ZES_ENABLE_SYSMAN)
- Remove unverified OLLAMA_USE_IPEX* env vars (no official docs)
- Fix NPU defaults (disabled by default, commented out)
- Fix comment syntax (#- → # ENV)
- Merge cleanup into fewer layers for smaller image

docs:
- Add docs/intel-arc-a770-context-limits.md with VRAM budget
  breakdown, context length vs KV cache trade-off tables, and
  recommended settings by model size for 16 GB Intel Arc GPUs

README.md:
- Add Documentation section linking to the new VRAM/context guide

Co-authored-by: Cursor <cursoragent@cursor.com>
@eSlider eSlider changed the title feat: custom IPEX-LLM Ollama Dockerfile and Intel GPU config tuning feat: custom IPEX-LLM Ollama Dockerfile, Intel GPU tuning, and VRAM/context docs Feb 16, 2026
eSlider and others added 14 commits February 16, 2026 14:09
- Restore full open-webui service definition (was a placeholder comment)
- Merge settings from main with new additions:
  - OLLAMA_BASE_URL for inter-container connectivity
  - ENABLE_RAG_WEB_SEARCH for web search in RAG
  - Telemetry opt-out (SCARF_NO_ANALYTICS, DO_NOT_TRACK, ANONYMIZED_TELEMETRY)
- Fix missing $ in SYCL_CACHE_PERSISTENT env var reference

Co-authored-by: Cursor <cursoragent@cursor.com>
…links

- Add "Building a Custom Image" section to VRAM guide with version pin
  table and build instructions
- Add "Configuring via docker-compose" section with .env example
- Add links to Level Zero, IGC, and IPEX-LLM releases in Further Reading
- Expand README Documentation section with links to custom Dockerfile
  and docker-compose.yml with brief descriptions

Co-authored-by: Cursor <cursoragent@cursor.com>
Docker defaults /dev/shm to 64 MB which is too small for SYCL kernel
caches, Level Zero buffers, and memory-mapped model loading. Setting
shm_size to 16G sets the upper bound without pre-allocating memory.

Also document the shared memory requirement in the VRAM guide tips.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add docs/sycl-vs-vulkan.md with performance benchmarks (SYCL 40-100%
  faster than Vulkan on Intel Arc), three backend options (IPEX-LLM
  bundle, SYCL from source, upstream Vulkan), how the ggml-sycl source
  build and patch-sycl.py work, tested hardware table, and
  troubleshooting guide
- Add troubleshooting section to VRAM guide with OOM, slow first
  inference, Level-Zero kernel regression, and ABI mismatch fixes
- Add Tested Hardware table and SYCL vs Vulkan link to README

Co-authored-by: Cursor <cursoragent@cursor.com>
Expand the SYCL-from-source section in sycl-vs-vulkan.md with:
- Links to tmp/Dockerfile and tmp/patch-sycl.py
- Exact ggml commit (a5bb8ba4) and oneAPI version (2025.1.1)
- Detailed explanation of both patches and why they're needed
- How Stage 1 collects oneAPI runtime deps into /sycl-runner
- How Stage 2 assembles the final minimal image
- Step-by-step guide for updating to a new Ollama version

Co-authored-by: Cursor <cursoragent@cursor.com>
Show the key Dockerfile snippets from tmp/Dockerfile inline in the
sycl-vs-vulkan doc: sparse-checkout of ggml-sycl, patch step, cmake
build with icpx, runtime dependency collection, and Stage 2 assembly
with official ollama binary + SYCL runner drop-in.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add tmp/ directory to repo (previously untracked):
- tmp/Dockerfile: multi-stage SYCL build with oneAPI (Ollama v0.15.6)
- tmp/patch-sycl.py: patches ggml-sycl for Ollama API compatibility
- tmp/docker-compose.yml: compose file for the SYCL source build
- tmp/start-ollama.sh: legacy entrypoint script
- tmp/test-glm-ocr.sh: vision model test script

Add Project Structure section to README showing the full repo layout.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Rename tmp/ to ollama-sycl/ to follow <service>/Dockerfile convention
- Move docker-compose.yml to docker-compose.ollama-sycl.yml at root
  (matches docker-compose.comfyui.yml, docker-compose.sdnext.yml, etc.)
- Refactor compose file: env-var driven defaults, shared volume names,
  consistent service naming (ollama-sycl, open-webui-sycl)
- Remove tmp/README.md (content lives in docs/sycl-vs-vulkan.md)
- Update all tmp/ references in docs and README
- Add SYCL build command to Setup section in README

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Prevents corporate/system HTTP proxies from intercepting
container-to-container traffic (Open WebUI → Ollama). Both
lowercase and uppercase variants are set since different
libraries check different casing. Values include localhost,
loopback, and Docker service names. Configurable via .env.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Update OLLAMA_VERSION from 0.15.6 to 0.16.1 in Dockerfile and compose
- Update GGML_COMMIT to ec98e200 (llama.cpp tag b7437) matching v0.16.1
- Rewrite patch-sycl.py: no patches needed since v0.16.1 (APIs converged)
  - graph_compute now has batch_size in both upstream and ollama
  - GGML_TENSOR_FLAG_COMPUTE removed from both
  - Script exits cleanly (code 0) when no patches are needed
  - Retains backward compatibility for older ollama versions
- Update all docs and README version references

Co-authored-by: Cursor <cursoragent@cursor.com>
… SYCL validate output

- Fix duplicate nested markdown link in Whisper service description
- Update patch-sycl.py description in project structure to reflect no-op since v0.16.1
- Add SYCL-from-source validate output showing v0.16.1 and SYCL0 device detection
- Fix outdated Vulkan "advantage" text (SYCL build is now also v0.16.1)
- Correct patch-sycl.py exit behavior description (exits cleanly, not with error)

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add Image Size Comparison table to sycl-vs-vulkan.md
- Update image sizes to measured values: ipex-ollama 1.03 GB, sycl-ollama 1.27 GB
- Update sycl-ollama build time to ~90s (measured)
- Both custom images are 4-15x smaller than alternatives

Co-authored-by: Cursor <cursoragent@cursor.com>

eSlider commented Feb 16, 2026

Test of the patched Ollama v0.16.1 (SYCL) on an Intel Core Ultra 155H:

[screenshot]

@eSlider eSlider mentioned this pull request Feb 18, 2026
@larsblumberg

Impressive work, thank you @eSlider!

I would love to test this PR on an Ultra 7 255H / Arc 140T.

How do I best go about testing all of these changes so that I can report back here?

@harpsychord

This is incredible work @eSlider! I was able to use your branch as-is with my Arc B580 (12 GB VRAM) and have qwen2.5-coder:7b running very smoothly. I'm still quite new to all of this, but I wanted to say thank you.


eleiton commented Feb 22, 2026

Thanks for the PR @eSlider, happy to receive contributions to the codebase.
Can we split the PR into two: one for the new sycl-ollama proposal, and another for the ipex-ollama proposal? This should help keep things cleaner and make progress in smaller chunks.
Ideally we can keep this one for the sycl. I'll make some comments to the PR.

@eleiton eleiton self-assigned this Feb 22, 2026
eleiton (Owner) left a review comment:

Can this file be removed? I believe it's not needed for running the project.

eleiton (Owner) left a review comment:

Can this file be removed? I believe it's not needed for running the project.

Comment thread sycl-ollama/patch-sycl.py
original = src
applied = []

# 1. Fix graph_compute signature: add 'int batch_size' parameter
eleiton (Owner):

I believe this patch is still needed, and will probably be needed for a while going forward.
I tested with Ollama 0.16.3 and GGML_COMMIT=ef83fb8601229ff650d952985be47e82d644bfaa

Comment thread sycl-ollama/patch-sycl.py
if "int batch_size" in src:
applied.append("batch_size parameter")

# 2. Remove GGML_TENSOR_FLAG_COMPUTE skip-check entirely.
eleiton (Owner):

This one, on the other hand, I believe is not needed, as you suggest in the documentation.
So ideally this can either be removed, or the documentation updated to mention that one patch is needed for all versions and the other only for older versions?

Comment thread sycl-ollama/Dockerfile
RUN mkdir -p /sycl-runner && \
cp build/lib/ollama/libggml-sycl.so /sycl-runner/ && \
# SYCL / DPC++ runtime
for f in libsycl.so*; do true; done && \
eleiton (Owner):

What is this for loop used for?

Comment thread sycl-ollama/Dockerfile
# =============================================================================
FROM ubuntu:24.04
ENV DEBIAN_FRONTEND=noninteractive \
TZ=America/Los_Angeles
eleiton (Owner):

Any reason for this timezone?

eSlider (Author):

@eleiton, absolutely not.

Comment thread docs/sycl-vs-vulkan.md
RUN python3 /tmp/patch-sycl.py ml/backend/ggml/ggml/src/ggml-sycl/ggml-sycl.cpp
```

As of v0.16.1, the upstream ggml and Ollama APIs have converged — **no patches are needed**. The script detects this and exits cleanly.
eleiton (Owner):

Ollama v0.16.3 requires the 'int batch_size' parameter that upstream llama.cpp at commit ef83fb86 doesn't have. The patch script detects this and applies the fix.

@eSlider eSlider marked this pull request as draft February 23, 2026 10:25

eSlider commented Feb 23, 2026

Thank you all for your reviews, especially @eleiton!
I'll clean up the PR after work today!


eSlider commented Feb 23, 2026

> Thanks for the PR @eSlider, happy to receive contributions to the codebase. Can we split the PR into two: one for the new sycl-ollama proposal, and another for the ipex-ollama proposal? This should help keep things cleaner and make progress in smaller chunks. Ideally we can keep this one for the sycl. I'll make some comments to the PR.

The purpose of this research is to find the fastest inference approach by direct comparison. How would you suggest splitting up the candidates so that we can then benchmark them against each other?

I'll leave the PR as WIP for now.

@StevenIsaacs

I would like to use an existing directory containing previously downloaded Ollama models to avoid having to download them yet again. I'm running more than one machine and copy models between machines to avoid download delays on a 10 Mb link (yeah, I'm in the boonies). To do so, a new environment variable "OLLAMA_MODELS" (or something better) could be used, defaulting to "{}". Then change this snippet from docker-compose.sycl-ollama.yml:

volumes:
  ollama-volume: {}
  open-webui-volume: {}

to:

volumes:
  ollama-volume: ${OLLAMA_MODELS}
  open-webui-volume: {}

I'm not a docker expert by any means so could be totally off base. But a suggestion. I hope you get the idea.
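
One sketch of how this could work without touching the named-volume defaults: Compose treats a volume source that starts with / or ./ as a bind mount, so a single variable can switch between the existing named volume and a host directory of already-downloaded models. The service name ollama, the override file name, and the /root/.ollama target below are assumptions based on Ollama's default layout; adjust them to whatever docker-compose.sycl-ollama.yml actually uses.

```yaml
# docker-compose.models.yml (hypothetical override file, not part of this PR)
services:
  ollama:
    volumes:
      # OLLAMA_MODELS=/mnt/ollama-models  -> bind-mounts the existing host directory
      # OLLAMA_MODELS unset               -> falls back to the shared named volume
      - ${OLLAMA_MODELS:-ollama-volume}:/root/.ollama

volumes:
  ollama-volume: {}
```

Then something like `OLLAMA_MODELS=/path/to/models docker compose -f docker-compose.sycl-ollama.yml -f docker-compose.models.yml up -d` would reuse the models already on disk instead of re-downloading them.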

@rbrowning85

Thank you for your hard work. I was able to get this up and running, but I am having some trouble with my dual-GPU setup (Intel Arc A580 and A380).

I am trying to troubleshoot the error messages right now with Claude.

@rbrowning85

Looking through the logs I found the following entries:

The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A380 Graphics':
Exception caught at file:/ollama/ml/backend/ggml/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3957
Error OP RMS_NORM

Is this a limitation of Ollama, a limitation of SYCL, or just how this new project was built?

@rbrowning85

I spent several hours with Claude troubleshooting the problem and rebuilding the Dockerfile at least a dozen times. I am not a developer, so the process quickly went above my head, but I asked Claude to summarize our steps in the hopes that it would help others. Please let me know if you have any questions or any new builds that you would like me to try that support multiple GPUs.

Multi-GPU Issue: Intel Arc A580 + A380 with deepseek-r1:14b

Hardware

  • Intel Arc A580 (8 GiB VRAM) — SYCL0
  • Intel Arc A380 (6 GiB VRAM) — SYCL1
  • Unraid host, Docker container

What Works

  • Both GPUs are detected at startup (SYCL0, SYCL1)
  • Small models (≤~5GB e.g. llama3.2:1b, llama3.2:3b) run successfully across both GPUs
  • Layer distribution works correctly (layers split across SYCL0/SYCL1/CPU)
  • 16k context works fine for smaller models

The Problem

Larger models (tested: deepseek-r1:14b at 8.37 GiB) crash at inference time with:

The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A380 Graphics':
Exception caught at file:/ollama/ml/backend/ggml/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3957
Error OP RMS_NORM

The model loads successfully — all layers are assigned and KV cache allocated across both GPUs — but crashes on the first inference call when the A380 tries to execute a RMS_NORM kernel.

Root Cause Analysis

The error "built for 1 devices" is a runtime SYCL program cache issue. ggml-sycl compiles GPU kernels (as OpenCL/SYCL programs) keyed to the first device it encounters (A580). When the A380 tries to execute those same cached programs, it fails because they were compiled in the context of a single device. The RMS_NORM operation is simply the first kernel that hits this issue.

Fix Attempts

Attempt 1: CMAKE_CXX_FLAGS with spir64_gen + AOT

Added AOT compilation targets for both GPU architectures at cmake configure time:

-DCMAKE_CXX_FLAGS="-fsycl-targets=spir64_gen,spir64 -Xs \"-device acm-g10,acm-g11\""

Result: Failed — spir64_gen requires ocloc which wasn't present in the build stage.

Attempt 2: Install ocloc in builder stage

Added Intel compute-runtime (intel-ocloc, intel-igc-*) to the sycl-builder stage so AOT compilation would have the required tools.

Result: Failed — ocloc version mismatch with oneAPI 2025.1.1:

Invalid option (arg 9): -ze-intel-greater-than-4GB-buffer-required

The oneAPI 2025.1.1 compiler passes a flag that compute-runtime 26.05.37020.3's ocloc does not support.

Attempt 3: Drop spir64_gen, use portable spir64 only

Removed ocloc requirement entirely, switched to portable SPIR-V IR:

-DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=spir64"

The idea: each GPU JITs from portable IR independently, so no shared compiled program.

Result: Build succeeded, but crash persisted. CMAKE_CXX_FLAGS is overridden by ggml's internal target_compile_options on the ggml-sycl target.

Attempt 4: Patch CMakeLists.txt directly

Used sed to inject -fsycl-targets=spir64 at the target level in ggml-sycl's CMakeLists.txt before the build:

sed -i \
  '/target_link_libraries(ggml-sycl/i target_compile_options(ggml-sycl PRIVATE -fsycl -fsycl-targets=spir64)\ntarget_link_options(ggml-sycl PRIVATE -fsycl -fsycl-targets=spir64)' \
  ml/backend/ggml/ggml/src/ggml-sycl/CMakeLists.txt

Result: Build succeeded with flags applied at the right level, but crash still persisted. Confirmed the "built for 1 devices" error is a runtime program cache issue, not a compile-time flag issue.

Attempt 5: Clear SYCL runtime cache

Searched for and cleared any persistent JIT kernel cache:

find /root -name '*.bin' -path '*/sycl*' -delete
find /tmp -name '*.bin' -path '*/sycl*' -delete

Also set SYCL_CACHE_PERSISTENT=0 to disable persistent caching entirely.

Result: No change. Cache was not the issue.

Conclusion

The "built for 1 devices" error originates inside ggml-sycl's runtime program building logic in ggml-sycl.cpp at line 3957. The fix likely requires a source-level patch to ensure SYCL program objects are built for all active devices rather than just the first one. This is beyond what Dockerfile build flags can address.

Workaround

Limit to a single GPU (the A580) via environment variables:

ONEAPI_DEVICE_SELECTOR=level_zero:0
ZE_AFFINITY_MASK=0

This allows larger models to run on the A580's 8 GiB alone. The A380 is unused in this configuration.

Environment

  • Ollama: v0.16.1 (sycl-ollama build from this PR)
  • oneAPI basekit: 2025.1.1
  • Intel compute-runtime: 26.05.37020.3
  • Level Zero: 1.28.0
  • IGC: 2.28.4
  • GGML commit: ec98e200 (llama.cpp tag b7437)

@eleiton eleiton mentioned this pull request Mar 15, 2026
@rbrowning85

As a follow-up to my last post:

[SOLVED] Intel Arc SYCL Multi-GPU Deduplication (SameBackendDevice) & Tensor Splitting Fix

A community fix developed by a joint collaboration between Antigravity (Gemini 3.1 Pro), Claude (Sonnet 4.6), and Codex (GPT-5.4).


1. The Symptoms

If you are running Ollama with heterogeneous Intel Arc GPUs (for example, combining an A580 and an A380) natively over SYCL (Level Zero) via a Docker container, you may have encountered a silent multi-GPU failure:

  1. Setting ONEAPI_DEVICE_SELECTOR=level_zero:0,1 is not respected.
  2. The Ollama daemon logs detect BOTH cards initially (initial_count=2).
  3. But the secondary card (the A380) vanishes immediately from the inference pool with zero error logs, leaving you with restricted VRAM and failing to split larger models.

2. The Root Cause (Silent Deduplication)

Through extensive source code tracing, our multi-agent AI framework located the root cause in the Go-layer abstractions, not the Intel drivers.

The flaw exists in ml/device.go inside the ml.DeviceInfo.Compare() deduplication function.
When the backend discover/runner.go collects hardware lists via RPC, the upstream ggml-sycl.cpp implementation (specifically inside ggml_backend_sycl_device_get_props) completely fails to assign the props->id and props->device_id values, returning empty strings ("").

The Go deduplicator compares multiple devices. Because SYCL0 and SYCL1 share the same library string but both report their identity as "", Go assumes they are clones of the same physical hardware wrapper. This triggers a silent deduplication internally, erasing SYCL1 from the active array.

3. The Solution

To unify the VRAM (13.2 GiB on our A580+A380) and enable heterogeneous tensor splitting, we applied a surgical C++ identity injection hook and accompanying environment tweaks.

Part 1: The C++ Identity Hook (patch-sycl-v2.py)

When compiling the sycl-runner Docker image, we intercepted the build sequence with a custom Python script that manipulates ggml-sycl.cpp. By forcing dynamic struct definitions, we force the SYCL backend to generate unique "sycl:0" and "sycl:1" identity attributes.

This provides Compare() with enough identity data to return UniqueDevice status.

dev_ctx->id = "sycl:" + std::to_string(i);
dev_ctx->library = "SYCL";
dev_ctx->device_id = "sycl:" + std::to_string(i);
props->id = ctx->id.c_str();
props->device_id = ctx->device_id.c_str();
props->library = ctx->library.c_str();

We also included a diagnostic hook for ggml_sycl_has_mixed_device_topology(), which disables tensor-reorder tracking optimization to prevent secondary cross-device cache invalidations.

Part 2: OOM Safeguards & Context Isolation (.env)

While the identity patch enables the discovery of both cards, stability on disparate silicon requires environmental isolation to prevent queue stalls and out-of-memory (OOM) failures:

# JIT / Level Zero cache separation
SYCL_CACHE_PERSISTENT=0
SYCL_CACHE_IN_MEM=0
SYCL_ENABLE_DEFAULT_CONTEXTS=0
SYCL_RT_WARNING_LEVEL=1
ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
ZE_FLAT_DEVICE_HIERARCHY=FLAT

# Conservative stability constraints (Working Production Config)
OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_CONTEXT_LENGTH=8192

(Note: Flash Attention was logged as disabled during the successful run over this SYCL architecture).

4. The Results

Upon rebooting via docker compose up --build -d --force-recreate:

msg="inference compute" id=sycl:0 library=SYCL name=SYCL0 description="Intel(R) Arc(TM) A580 Graphics" available="7.5 GiB"
msg="inference compute" id=sycl:1 library=SYCL name=SYCL1 description="Intel(R) Arc(TM) A380 Graphics" available="5.6 GiB"
msg="vram-based default context" total_vram="13.2 GiB" 

The Go engine accepted BOTH indices as unique devices. We demonstrated a successful dual-GPU inference run using Qwen 2.5 14B, spilling 3.7GB onto the A580 and the remaining 2.3GB directly onto the A380. The KV layer map correctly assigned compute layers across both cards, achieving stable generation where it previously failed.


eSlider commented Mar 31, 2026

Unfortunately, a few weeks ago I burned out the iGPU on my
Intel Core Ultra 7 155H, most likely due to overheating, and the laptop stopped turning on.

I’d appreciate it if someone could take on the task of adding or using these improvements in a PR.

My plans regarding whether to buy an Intel Arc GPU or an iGPU are still up in the air.

@rbrowning85
Copy link
Copy Markdown

I am currently trying to get this working with the new Ollama releases. I am having trouble getting qwen3.5:9b to run.

I am happy to share my findings and files, but I've never managed a GitHub repository. What would be the best way to hand off?

@rbrowning85

rbrowning85 commented Apr 17, 2026

I don't want to admit how many hours and tokens I have spent across multiple providers trying to get this to work, but long story short, my AI finally sent me the following:

The experiment is officially over. That deterministic gibberish is the final piece of proof we needed.

By stripping all the python memory patches back to native malloc_device and strictly filtering out the bounds errors, we tested exactly what the upstream llama.cpp developers compiled at that commit block. If it still hallucinates after that, it undeniably proves the underlying SYCL scheduler in version 0.20.6 is mathematically broken for dual-GPU topologies. The ggml event tracker is launching graph arrays across the two Intel Arc cards without properly synchronizing the memory streams, meaning the neural network is literally doing math on empty memory.

We can't fix a broken OpenCL scheduling pipe from outside the C++ compiler. You have officially outgrown the Intel Arc ecosystem.

It then went on to tell me to sell my Intel Arc GPUs and purchase two new Nvidia GPUs. LOL

I saw in a recent Unraid Uncast show that Spaceinvaderone created a new Unraid app specifically for Ollama with Arc GPUs. I am going to give this a whirl, but based on the issues already posted in the GitHub repository, it doesn't look promising.

https://github.com/SpaceinvaderOne/ollama-intel-gpu
