
Adding sycl support #39

Merged

eleiton merged 1 commit into main from ollama-sycl on Mar 15, 2026
Conversation

@eleiton (Owner) commented Mar 15, 2026

Add Intel oneAPI SYCL backend for Intel Arc/iGPU

Summary

  • This PR implements the research on SYCL done by @eSlider in "feat: custom IPEX-LLM Ollama Dockerfile, Intel GPU tuning, and VRAM/context docs" (#38)
  • Adds a new ollama-sycl/ build that compiles Ollama's ggml-sycl backend from source using Intel oneAPI (DPC++/icpx), enabling SYCL-accelerated GPU inference on Intel Arc and Intel Core Ultra iGPUs via Level Zero.
  • The Dockerfile uses a two-stage build: stage 1 compiles libggml-sycl.so against Intel oneAPI and bundles the required runtime libs (oneMKL, TBB, Unified Runtime, Level Zero); stage 2 installs the official ollama binary and drops the compiled SYCL runner into place, keeping the final image lean by stripping the CUDA/MLX/Vulkan runners. A hedged sketch of this layout follows the list.
  • A patch-sycl.py script surgically patches API differences between upstream ggml-sycl (pinned llama.cpp commit) and Ollama's modified ggml backend (batch_size parameter), allowing a newer SYCL implementation to be swapped in without forking Ollama itself.
  • Adds docker-compose.ollama-sycl.yml with ONEAPI_DEVICE_SELECTOR=level_zero:0, a persistent SYCL kernel cache volume, and pre-configured Open WebUI integration (also sketched after the list).
  • Adds .env.example documenting all tunable Ollama environment variables (context length, VRAM cap, parallel slots, flash attention, etc.).
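
To make the two-stage layout concrete, here is a minimal sketch of how such a Dockerfile can be structured. The base image tag, repository ref, CMake flags, and copy paths below are illustrative assumptions, not the exact contents of ollama-sycl/Dockerfile:

```dockerfile
# Stage 1: compile the ggml SYCL backend with Intel oneAPI (DPC++/icpx).
# Image tag, repo ref, and flag set are illustrative assumptions.
FROM intel/oneapi-basekit:2025.0.2-0-devel-ubuntu24.04 AS build
SHELL ["/bin/bash", "-c"]
RUN apt-get update && apt-get install -y --no-install-recommends git cmake ninja-build
WORKDIR /src
RUN git clone https://github.com/ggml-org/llama.cpp.git .
# Build ggml backends as loadable modules so libggml-sycl.so is produced
# as a standalone runner, using Intel's icx/icpx compilers.
RUN source /opt/intel/oneapi/setvars.sh && \
    cmake -B build -G Ninja \
        -DGGML_SYCL=ON \
        -DGGML_BACKEND_DL=ON \
        -DCMAKE_C_COMPILER=icx \
        -DCMAKE_CXX_COMPILER=icpx && \
    cmake --build build

# Stage 2: official ollama binary plus only the SYCL runner and its runtime libs;
# the CUDA/MLX/Vulkan runners are simply never copied in.
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \
    curl -fsSL https://ollama.com/download/ollama-linux-amd64.tgz | tar -xz -C /usr
# Drop the compiled SYCL backend into ollama's runner directory and bundle
# the oneAPI runtime libraries it links against (oneMKL, TBB, Level Zero).
COPY --from=build /src/build/bin/libggml-sycl.so /usr/lib/ollama/
COPY --from=build /opt/intel/oneapi/mkl/latest/lib/ /usr/lib/ollama/
ENTRYPOINT ["/usr/bin/ollama"]
CMD ["serve"]
```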
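Likewise, a hedged sketch of the compose wiring; the service names, ports, and cache path are assumptions rather than the file as merged:

```yaml
# Sketch of docker-compose.ollama-sycl.yml; names and paths are illustrative.
services:
  ollama:
    build: ./ollama-sycl
    environment:
      # Pin inference to the first Level Zero (Intel GPU) device.
      - ONEAPI_DEVICE_SELECTOR=level_zero:0
    devices:
      - /dev/dri:/dev/dri          # expose the Intel GPU render node
    volumes:
      - ollama-models:/root/.ollama
      - sycl-cache:/root/.cache    # persist compiled SYCL kernels across restarts
    ports:
      - "11434:11434"

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:8080"
    depends_on:
      - ollama

volumes:
  ollama-models:
  sycl-cache:
```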

Motivation

The Vulkan backend on Intel Arc has no native fp16 compute support (requires GGML_VK_DISABLE_F16=1) and is limited to ~12–13 tok/s on llama3.2:3B. The SYCL/oneAPI path uses Intel's native GPU compiler and oneMKL BLAS, providing a more capable and performant backend for Intel iGPU hardware.

Known limitation: hybrid recurrent architectures

The ggml-sycl backend only implements standard transformer ops. Models with hybrid recurrent+attention architectures (e.g. Qwen3.5 "qwen3next" layers) have ops that are not yet implemented in ggml-sycl, causing them to fall back to CPU execution. Each unsupported op creates a CPU↔GPU boundary crossing, and since these layers appear every few blocks, performance degrades severely:

| Model | Architecture | Graph splits | Tokens/s |
| --- | --- | --- | --- |
| qwen3.5:0.6b | Hybrid recurrent + attention | ~112 | ~3 |
| llama3.2:1b | Pure transformer | few | ~19 |

This backend is best suited for pure transformer models (Llama, Mistral, Qwen2, Gemma, etc.). Hybrid recurrent models are currently better served by a CPU-only or Vulkan run (see the example below). This is an upstream ggml-sycl gap, not a setup issue.
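
For comparison runs, a hybrid model can be kept off the GPU without a separate build by disabling layer offload via Ollama's standard num_gpu option (a generic Ollama feature, not something added by this PR); the model name here is taken from the table above:

```sh
# Run a hybrid recurrent model CPU-only by offloading zero layers.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:0.6b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 0 }
}'
```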

Kudos

Thanks to @eSlider for the contribution.

@eleiton eleiton merged commit c18042e into main Mar 15, 2026
@eleiton eleiton deleted the ollama-sycl branch March 16, 2026 07:57