
Adding sycl support #39

Merged

eleiton merged 1 commit into main from ollama-sycl on Mar 15, 2026
Conversation

@eleiton (Owner) commented Mar 15, 2026

Add Intel oneAPI SYCL backend for Intel Arc/iGPU

Summary

  • This PR implements the research on SYCL done by @eSlider in "feat: custom IPEX-LLM Ollama Dockerfile, Intel GPU tuning, and VRAM/context docs" (#38)
  • Adds a new ollama-sycl/ build that compiles Ollama's ggml-sycl backend from source using Intel oneAPI (DPC++/icpx), enabling SYCL-accelerated GPU inference on Intel Arc and Intel Core Ultra iGPUs via Level Zero.
  • The Dockerfile uses a two-stage build: stage 1 compiles libggml-sycl.so against Intel oneAPI and bundles the required runtime libs (oneMKL, TBB, Unified Runtime, Level Zero); stage 2 installs the official ollama binary and drops the compiled SYCL runner into place, keeping the final image lean by stripping the CUDA/MLX/Vulkan runners. A hedged sketch of this layout follows the list.
  • A patch-sycl.py script surgically patches API differences between upstream ggml-sycl (pinned llama.cpp commit) and Ollama's modified ggml backend (batch_size parameter), allowing a newer SYCL implementation to be swapped in without forking Ollama itself.
  • Adds docker-compose.ollama-sycl.yml with ONEAPI_DEVICE_SELECTOR=level_zero:0, a persistent SYCL kernel cache volume, and pre-configured Open WebUI integration (also sketched after the list).
  • Adds .env.example documenting all tunable Ollama environment variables (context length, VRAM cap, parallel slots, flash attention, etc.).
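
To make the two-stage layout concrete, here is a minimal sketch of how such a Dockerfile can be structured. The base image tag, repository ref, CMake flags, and copy paths below are illustrative assumptions, not the exact contents of ollama-sycl/Dockerfile:

```dockerfile
# Stage 1: compile the ggml SYCL backend with Intel oneAPI (DPC++/icpx).
# Image tag, repo ref, and flag set are illustrative assumptions.
FROM intel/oneapi-basekit:2025.0.2-0-devel-ubuntu24.04 AS build
SHELL ["/bin/bash", "-c"]
RUN apt-get update && apt-get install -y --no-install-recommends git cmake ninja-build
WORKDIR /src
RUN git clone https://github.com/ggml-org/llama.cpp.git .
# Build ggml backends as loadable modules so libggml-sycl.so is produced
# as a standalone runner, using Intel's icx/icpx compilers.
RUN source /opt/intel/oneapi/setvars.sh && \
    cmake -B build -G Ninja \
        -DGGML_SYCL=ON \
        -DGGML_BACKEND_DL=ON \
        -DCMAKE_C_COMPILER=icx \
        -DCMAKE_CXX_COMPILER=icpx && \
    cmake --build build

# Stage 2: official ollama binary plus only the SYCL runner and its runtime libs;
# the CUDA/MLX/Vulkan runners are simply never copied in.
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \
    curl -fsSL https://ollama.com/download/ollama-linux-amd64.tgz | tar -xz -C /usr
# Drop the compiled SYCL backend into ollama's runner directory and bundle
# the oneAPI runtime libraries it links against (oneMKL, TBB, Level Zero).
COPY --from=build /src/build/bin/libggml-sycl.so /usr/lib/ollama/
COPY --from=build /opt/intel/oneapi/mkl/latest/lib/ /usr/lib/ollama/
ENTRYPOINT ["/usr/bin/ollama"]
CMD ["serve"]
```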
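Likewise, a hedged sketch of the compose wiring; the service names, ports, and cache path are assumptions rather than the file as merged:

```yaml
# Sketch of docker-compose.ollama-sycl.yml; names and paths are illustrative.
services:
  ollama:
    build: ./ollama-sycl
    environment:
      # Pin inference to the first Level Zero (Intel GPU) device.
      - ONEAPI_DEVICE_SELECTOR=level_zero:0
    devices:
      - /dev/dri:/dev/dri          # expose the Intel GPU render node
    volumes:
      - ollama-models:/root/.ollama
      - sycl-cache:/root/.cache    # persist compiled SYCL kernels across restarts
    ports:
      - "11434:11434"

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:8080"
    depends_on:
      - ollama

volumes:
  ollama-models:
  sycl-cache:
```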

Motivation

The Vulkan backend on Intel Arc has no native fp16 compute support (requires GGML_VK_DISABLE_F16=1) and is limited to ~12–13 tok/s on llama3.2:3B. The SYCL/oneAPI path uses Intel's native GPU compiler and oneMKL BLAS, providing a more capable and performant backend for Intel iGPU hardware.

Known limitation: hybrid recurrent architectures

The ggml-sycl backend only implements standard transformer ops. Models with hybrid recurrent+attention architectures (e.g. Qwen3.5 "qwen3next" layers) have ops that are not yet implemented in ggml-sycl, causing them to fall back to CPU execution. Each unsupported op creates a CPU↔GPU boundary crossing, and since these layers appear every few blocks, performance degrades severely:

| Model | Architecture | Graph splits | Tokens/s |
| --- | --- | --- | --- |
| qwen3.5:0.6b | Hybrid recurrent + attention | ~112 | ~3 |
| llama3.2:1b | Pure transformer | few | ~19 |

This backend is best suited for pure transformer models (Llama, Mistral, Qwen2, Gemma, etc.). Hybrid recurrent models are currently better served by a CPU-only or Vulkan run (see the example below). This is an upstream ggml-sycl gap, not a setup issue.
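
For comparison runs, a hybrid model can be kept off the GPU without a separate build by disabling layer offload via Ollama's standard num_gpu option (a generic Ollama feature, not something added by this PR); the model name here is taken from the table above:

```sh
# Run a hybrid recurrent model CPU-only by offloading zero layers.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:0.6b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 0 }
}'
```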

Kudos

Thanks to @eSlider for the contribution.

@eleiton eleiton merged commit c18042e into main Mar 15, 2026
@eleiton eleiton deleted the ollama-sycl branch March 16, 2026 07:57