This repository provides a Windows-first, Intel-optimized llama.cpp setup for Core Ultra systems using Intel ARC / XPU via SYCL.
Focused testing was performed on Intel Core Ultra 7 155H with 96GB RAM, with notes for Core Ultra 200 and 300 series systems.
> "This repo exists because Intel UMA deserves real tuning, not copy-pasted CUDA defaults."
| What this repo is | What this repo is not |
|---|---|
| Intel UMA–tuned | CUDA defaults copy-pasted |
| Thermal-aware | Max-fans benchmark chasing |
| Long-context focused | Short prompt demo rigs |
```
C:\llama.cpp
│
├─ configs\                     # Configuration files
│  ├─ ZZZ-Base-*.cfg            # Global base configurations
│  └─ *.cfg                     # Model-specific overrides
│
├─ sycl\                        # llama.cpp SYCL build (extracted release)
│  ├─ llama-server.exe
│  └─ llama-cli.exe
│
├─ docs\                        # Documentation
│  └─ guides\
│
├─ start-llama-server.ps1       # Main launcher script
├─ COMPACT.md                   # Context compaction guide
├─ llama.cpp model folder.lnk   # Model folder shortcut
├─ llama.ico                    # Icon
└─ README.md                    # This file
```
Clone this repo directly into `C:\`:

```powershell
cd C:\
git clone https://github.com/thephimart/llama.cpp_intel_uma.git llama.cpp
```

Get the latest release from: https://github.com/ggml-org/llama.cpp/releases

Extract it to:

```
C:\llama.cpp\sycl
```
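The extraction step can be scripted as well. Assuming the release asset is a zip archive (the exact file name varies by release, so substitute the SYCL Windows build you actually downloaded):

```powershell
# Hypothetical asset name -- replace with the SYCL Windows zip you downloaded
Expand-Archive -Path "$env:USERPROFILE\Downloads\llama-sycl-win.zip" -DestinationPath "C:\llama.cpp\sycl"
```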
Place GGUF models in:

```
%USERPROFILE%\AppData\Local\llama.cpp
```

Or let Hugging Face downloads populate it automatically.
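One way to let Hugging Face populate the folder is llama-server's `-hf` flag, which downloads a model on first launch and caches it locally. A sketch, using the example repository shown later in this README:

```powershell
C:\llama.cpp\sycl\llama-server.exe -hf unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF:Q8_0
```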
Run the interactive launcher:

```powershell
.\start-llama-server.ps1
```

Method 1: Desktop Shortcut
- Right-click `start-llama-server.ps1` in File Explorer
- Select Send to > Desktop (create shortcut)
- Right-click the new shortcut and select Properties
- Click Change Icon... and browse to `C:\llama.cpp\llama.ico`
- Rename the shortcut (e.g., "Llama Server")
Method 2: Start Menu Pinning
Option A - From Desktop:
- Create the desktop shortcut using Method 1 above
- Right-click the shortcut and select Pin to Start
Option B - Direct to Start Menu:
- Press `Win + R`, type `shell:Start Menu`, and press Enter
- Navigate to the `Programs` folder
- Right-click and select New > Shortcut
- Browse to `C:\llama.cpp\start-llama-server.ps1`
- Name it "Llama Server" and click Finish
- Right-click the shortcut > Properties > Change Icon...
- Browse to `C:\llama.cpp\llama.ico` and apply
Method 3: Taskbar Pinning
- First create a desktop shortcut using Method 1
- Right-click the shortcut and select Pin to taskbar
- Or drag the shortcut directly to the taskbar
Tip: The provided `llama.ico` file gives your shortcut a professional appearance in the Start Menu, taskbar, and desktop.
The launcher automatically sets Intel SYCL environment variables and guides you through:
- Backend - Shows the selected backend (SYCL for Intel GPU/XPU)
- WebUI Toggle
  - Default: Disabled (`--no-webui`)
  - Option [1]: Enable web interface
- Network Mode
  - [1] Local only (127.0.0.1) - Default
  - [2] Shared / LAN (0.0.0.0)
- Port Configuration
  - Default: 11434 (Ollama-compatible)
  - Custom port supported
- Model Selection
  - [0] No model (server-only mode)
  - [H] Hugging Face repository (e.g., `unsloth/Qwen3-VL-30B-A3B-Thinking-GGUF:Q8_0`)
  - [1-N] Local GGUF files from `%USERPROFILE%\AppData\Local\llama.cpp`
- Base Config Selection
  - [0] No base config (use llama.cpp defaults)
  - [1] ZZZ-Base-Config.cfg (default) - Standard chat/completion config
  - [2] ZZZ-Base-Config-Large-Context.cfg - Large-context chat/completion config
  - [3] ZZZ-Base-Embed-Config.cfg - Embedding-optimized config (no batch sizes)
- Runtime Config Override
  - [0] No extra config
  - [1-N] Other `*.cfg` files (excluding `ZZZ-Base` files)
- Final Configuration Review
  - Complete summary before launch
  - Option to restart the configuration or proceed

The launcher also:

- Automatically validates that `llama-server.exe` exists in the `sycl\` directory
- Verifies `--mmproj` files exist before launching vision models
- Shows the complete command line before execution
- Supports retry on configuration errors
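Once the server is running, llama-server's `/health` endpoint gives a quick liveness check (assuming the launcher's default port):

```powershell
Invoke-RestMethod http://127.0.0.1:11434/health
```

It should report an ok status once the model has finished loading.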
See docs/core-ultra-155h-base-config.md for detailed configuration explanations.
Current base configs are tuned aggressively for large-context and long-running workloads. Earlier documentation referenced smaller batch sizes; those values have been superseded.
```
configs\ZZZ-Base-*.cfg
```
Three base configurations are available:
1. ZZZ-Base-Config.cfg (default) - Standard configuration for chat/completion models:
- Threading - 20 threads
- Batch sizes - 2048 batch / 1024 micro-batch
- KV cache quantization - q8_0 for both K and V
- Parallelism - Single parallel request
- Cache behavior - Automatic RAM cache (`-1`)
- Optimized for large-context, long-running inference
2. ZZZ-Base-Config-Large-Context.cfg - Same as above, plus:
   - `--split-mode row`
   - `--no-context-shift`
   - Intended for extreme context growth and compaction workflows
3. ZZZ-Base-Embed-Config.cfg - Optimized for embedding models:
- Same as base but without batch size constraints
- Ideal for embedding and retrieval workloads
ℹ️ Note: Base configs are always applied before per-model overrides.
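The on-disk format of the `.cfg` files is defined by the launcher, but assuming they carry llama-server flags, the default base settings listed above map to roughly the following sketch (the RAM-cache entry is omitted here because its exact flag name is launcher-specific):

```text
--threads 20
--batch-size 2048
--ubatch-size 1024
--cache-type-k q8_0
--cache-type-v q8_0
--parallel 1
--n-gpu-layers 0
--flash-attn
```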
After selecting a model, you may apply one additional config file.
These are layered after the base config and are ideal for:
- Context size overrides - Custom context lengths
- Model-specific batch tuning - Per-model optimization
- Vision models - `--mmproj` configurations
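For illustration, a hypothetical vision-model override layered on top of the base config might look like this (the context size and file path are placeholders, not values from this repo):

```text
--ctx-size 32768
--mmproj %USERPROFILE%\AppData\Local\llama.cpp\mmproj-example.gguf
```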
The launcher automatically sets these environment variables:
```
SYCL_DEVICE_FILTER=level_zero:gpu
SYCL_UR_USE_LEVEL_ZERO_V2=1
GGML_SYCL_FORCE_CPU_KV=1
```

On Intel UMA systems, the KV cache is explicitly forced to CPU memory:

```
GGML_SYCL_FORCE_CPU_KV=1
```
This is intentional and recommended.
Why:
- Prevents iGPU memory pressure at large context sizes
- Reduces UMA contention
- Improves stability with large batches and long-running sessions
- Produces more predictable performance on Meteor Lake
The iGPU is still used opportunistically for compute, but KV cache remains CPU-resident.
This enables:
- ARC iGPU / XPU acceleration - Hardware acceleration
- Above 4GB Allocation - Large model support
- Flash Attention - Required for KV cache quantization (almost all models; a few exceptions exist)
- Efficient UMA memory sharing - Unified memory architecture
Important: `--n-gpu-layers 0` is required in the base config to enable flash attention.
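When launching `llama-server.exe` by hand instead of through the launcher, the same environment can be reproduced in PowerShell (the port here is the launcher's default):

```powershell
$env:SYCL_DEVICE_FILTER = "level_zero:gpu"
$env:SYCL_UR_USE_LEVEL_ZERO_V2 = "1"
$env:GGML_SYCL_FORCE_CPU_KV = "1"
C:\llama.cpp\sycl\llama-server.exe --port 11434
```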
Detailed configuration and tuning guides for Intel Core Ultra systems:
| Document | Description |
|---|---|
| Core Ultra 7 155H Base Config | Complete base configuration guide for Intel Core Ultra 7 155H (Meteor Lake) |
| Core Ultra 200 & 300 Series Notes | Tuning guidance for Core Ultra 200-series (Lunar Lake) and 300-series (Arrow Lake) |
| Qwen3-Coder-30B Testing Analysis | Batch scaling, CPU vs GPU offload, and UMA behavior for 30B MoE model |
| Qwen3-Coder-Next-80B Testing Analysis | Batch ceiling and throughput plateaus for 80B sparse-MoE model |
See COMPACT.md for the recommended context compaction prompt, designed for:
- Long coding sessions
- Large-context models
- Iterative refinement workflows
MIT License — use it, fork it, break it, improve it.
Happy inference! If you find this useful, consider giving it a ⭐!