Table of Contents
- Introduction
- Hardware Overview
- The Unified Memory Advantage
- System Prerequisites — Ubuntu 24.04
- Kernel Boot Parameters
- Installing ROCm 7.2
- Ollama — Setup, Vulkan, and ROCm
- LM Studio
- llama.cpp — The Recommended Path
- Performance Benchmarks
- Backend Comparison and Recommendations
- References
## 1. Introduction {#introduction}
The AMD Ryzen AI MAX+ 395 — codenamed Strix Halo — is a fundamentally different class of hardware for local AI inference. With 40 RDNA 3.5 GPU compute units, up to 128 GB of unified LPDDR5x-8000 memory, and 256 GB/s of memory bandwidth shared between CPU and GPU, it eliminates the most painful constraint of consumer GPU inference: VRAM limits.
A single machine with this chip can run 70B parameter models in full precision, fine-tune 12B models without quantization, and serve models that would previously require a data-center GPU. The question is no longer whether you can run large models locally — it is which software stack unlocks the hardware's full potential.
This guide covers three of the most popular open-source inference tools — Ollama, LM Studio, and llama.cpp — and evaluates both the Vulkan and ROCm compute backends on Ubuntu 24.04. All configuration, benchmark data, and observations in this article are drawn from hands-on testing on a system running the Ryzen AI MAX+ Pro 395 with 128 GB of unified memory, ROCm 7.2.0, and Ubuntu 24.04 with a 6.17 OEM kernel.
## 2. Hardware Overview {#hardware-overview}
| Specification | Value |
|---|---|
| Chip | AMD Ryzen AI MAX+ Pro 395 (Strix Halo) |
| GPU Architecture | RDNA 3.5 (gfx1151) |
| GPU Compute Units | 40 CUs |
| GPU Max Clock | 2.9 GHz |
| Peak FP16/BF16 | 59.39 TFLOPS (theoretical) |
| Memory Type | LPDDR5x-8000 |
| Memory Capacity | Up to 128 GB unified |
| Memory Bandwidth | 256 GB/s |
| ROCm LLVM Target | gfx1151 |
| Ubuntu 24.04 Kernel (tested) | 6.17.0 OEM |
The GPU is identified as Radeon Graphics (RADV GFX1151) by the Mesa Vulkan driver and as gfx1151 by ROCm. It is explicitly listed in Ollama's Linux GPU support table under the Ryzen AI family, alongside the Ryzen AI Max 390 and Ryzen AI Max 385.
The "integrated GPU" label is misleading. Unlike traditional laptop iGPUs with 512 MB to 8 GB of VRAM, Strix Halo's GPU accesses the entire unified memory pool — up to 128 GB — at full memory bus bandwidth. There is no PCIe bottleneck, no VRAM spill to system RAM, and no discrete GPU data copy overhead.
## 3. The Unified Memory Advantage {#the-unified-memory-advantage}
Traditional GPU inference is constrained by VRAM. A model that doesn't fit in VRAM either fails to load or falls back to slow CPU inference. Strix Halo collapses this boundary: both CPU and GPU threads address the same physical memory pool.
In practice this means:
- Llama 3.1 70B (Q4_K_M, ~40 GB) loads entirely on GPU with headroom to spare
- Mistral Large 123B (Q4, ~73 GB) fits with room for KV cache at long context
- Context windows of 128K tokens are practical for models up to ~30B parameters
- Fine-tuning of 12B models without quantization is possible (using ~115 GB)
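The 128K-context claim can be sanity-checked with KV-cache arithmetic. The layer and head counts below are illustrative of a ~30B dense model, not any specific checkpoint:

```shell
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
python3 - <<'EOF'
layers, kv_heads, head_dim, fp16 = 48, 8, 128, 2   # illustrative ~30B shape
per_token = 2 * layers * kv_heads * head_dim * fp16
total_gb = per_token * 131072 / 1024**3
print(f"{per_token} bytes/token, {total_gb:.1f} GiB at 128K context")
EOF
```

Roughly 24 GiB of KV cache on top of the weights — impractical on a 24 GB discrete GPU, comfortable inside a 128 GB unified pool.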
The bottleneck for inference is memory bandwidth, not compute. At 256 GB/s the chip delivers competitive throughput — and this has a direct implication for backend choice, as we'll cover in the benchmark section.
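A back-of-envelope sketch of why bandwidth dominates: each generated token must stream essentially the full set of active weights from memory, so bandwidth divided by model size bounds the generation rate. The figures below reuse the ~40 GB Q4_K_M 70B example from the list above:

```shell
# Rough upper bound on token generation rate:
# tg_max ≈ memory bandwidth / active model size
python3 - <<'EOF'
bandwidth_gb_s = 256.0   # Strix Halo unified memory bandwidth
model_size_gb = 40.0     # Llama 3.1 70B at Q4_K_M
print(f"{bandwidth_gb_s / model_size_gb:.1f} t/s upper bound")
EOF
```

Measured rates land below this ceiling, but the point stands: extra compute does not move it, which is why the 85W-vs-120W results in section 10.3 differ so little.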
## 4. System Prerequisites — Ubuntu 24.04 {#system-prerequisites}
ROCm 7.2.0 fully supports Ubuntu 24.04 (Noble Numbat) with glibc 2.39 and kernels from 6.8 (GA) through 6.14 (HWE).1 Kernels 6.15 and above (including the 6.17 OEM kernel) provide improved gfx1151 driver support.
Required userspace groups:
```shell
sudo usermod -aG render,video $USER
```
Log out and back in for group membership to take effect. The GPU device nodes at /dev/kfd and /dev/dri/renderD128 must be accessible to your user.
Verify device access:

```shell
ls -la /dev/kfd /dev/dri/renderD128
# Expected: crw-rw-rw- permissions with render group
```
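If access is denied, check that the group change took effect in your current session. A quick sketch (`id -nG` lists the session's effective groups):

```shell
# List session groups and filter for the two required ones
id -nG | tr ' ' '\n' | grep -Ex 'render|video'
```

If the command prints nothing, you are still in a pre-`usermod` session — log out and back in.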
## 5. Kernel Boot Parameters {#kernel-boot-parameters}
This is the most important and most commonly missed configuration step. By default, the kernel allocates only a small BIOS VRAM carveout (typically 512 MB) for the GPU GTT memory pool. Without explicit overrides, tools like rocm-smi report 94%+ VRAM usage on a pool that is only 512 MB — while the actual 128 GB unified pool is inaccessible to GPU compute.
Add the following to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
```
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
```
| Parameter | Value | Effect |
|---|---|---|
| `amd_iommu=off` | — | Disables IOMMU; ~6% memory bandwidth improvement for LLM workloads 2 |
| `amdgpu.gttsize=131072` | 131072 MiB = 128 GiB | Exposes the full unified memory pool to the GPU driver |
| `ttm.pages_limit=33554432` | 128 GiB in 4 KiB pages | Allows pinning the full pool for GPU operations |
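For reference, the edited line in /etc/default/grub might look like the following, merged with whatever flags your distribution already sets (`quiet splash` here is just the Ubuntu default):

```shell
# /etc/default/grub — existing flags preserved, new parameters appended
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
```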
Apply and reboot:
```shell
sudo update-grub
sudo reboot
```
After reboot, verify:
```shell
rocm-smi --showmeminfo vram
# Should show ~128 GiB total, not 512 MB
```
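It also helps to confirm the parameters actually reached the running kernel, independent of what the driver reports:

```shell
# Both patterns should print the values configured above
grep -o 'amdgpu.gttsize=[0-9]*' /proc/cmdline
grep -o 'ttm.pages_limit=[0-9]*' /proc/cmdline
```

No output means GRUB was not regenerated or the wrong boot entry was used.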
Note on `amdgpu.gttsize`: The Strix Halo Wiki notes this parameter may be deprecated in favour of TTM subsystem settings in future kernels. Monitor AMD's kernel documentation for updates.2
## 6. Installing ROCm 7.2 {#installing-rocm}
AMD ROCm 7.2.0 is the current production release and the recommended stack for Ubuntu 24.04.3 Install via the amdgpu-install utility:
```shell
wget https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb
sudo apt install ./amdgpu-install_7.2.70200-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo apt install rocm
sudo reboot
```
Verify ROCm detects the GPU:
```shell
rocminfo | grep -E "Name|gfx|Marketing"
```
Expected output includes gfx1151 and AMD Radeon Graphics.
```shell
rocm-smi
```
The device should appear with temperature, power, and the correct memory pool size.
Strix Halo status in ROCm: gfx1151 is not in ROCm's main compute compatibility matrix for production workloads — Ryzen APU support is marked Preview and scoped to PyTorch on Linux.1 For inference tooling (Ollama, llama.cpp), the ROCm libraries work but may require the workarounds described in this guide. For cutting-edge performance, the community recommends AMD's TheRock nightly builds, which ship native gfx1151-compiled binaries.4
## 7. Ollama — Setup, Vulkan, and ROCm {#ollama}

### 7.1 Installation

```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Ollama installs as a systemd service running under the ollama user. The Ryzen AI MAX+ 395 is officially listed in Ollama's Linux GPU support documentation.5
### 7.2 The Two Backends: Vulkan and ROCm
Ollama on Linux supports two GPU backends for AMD hardware:
- ROCm (HIP) — the primary AMD compute backend, using Ollama's bundled ROCm libraries
- Vulkan — an experimental compute backend using the system's Vulkan driver, enabled via `OLLAMA_VULKAN=1`
The critical architectural detail: Ollama ships its own bundled ROCm libraries in /usr/local/lib/ollama/rocm, separate from the system ROCm installation at /opt/rocm. The bundled libraries may not include native gfx1151 kernels, causing a GPU discovery timeout and silent CPU fallback. This is the root cause of many "Ollama won't use the GPU" reports on Strix Halo.
How to tell which backend is active:
```shell
# Run a model and check logs immediately
ollama run llama3.2:1b "hello" &
journalctl -u ollama -n 30 --no-pager | grep "inference compute"
```
| Log Output | Meaning |
|---|---|
| `library=Vulkan name=Vulkan0` | Vulkan GPU active ✓ |
| `library=ROCm compute=gfx1151` | ROCm GPU active ✓ |
| `library=cpu` | CPU fallback — GPU not detected ✗ |
### 7.3 Enabling Vulkan (Recommended for Ollama)
The Vulkan backend is the most reliable path for Strix Halo on Ollama. It uses the Mesa RADV Vulkan driver, which has strong gfx1151 support and correctly addresses the full unified memory pool.
Ollama is configured via a systemd drop-in file. Check for existing drop-ins:
```shell
ls /etc/systemd/system/ollama.service.d/
```
Create or update /etc/systemd/system/ollama.service.d/gpu.conf:
```shell
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/gpu.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HSA_XNACK=1"
Environment="GPU_MAX_HW_QUEUES=8"
Environment="ROCM_PATH=/opt/rocm"
Environment="HIP_PATH=/opt/rocm"
UnsetEnvironment=OLLAMA_LLM_LIBRARY
UnsetEnvironment=HIP_VISIBLE_DEVICES
UnsetEnvironment=ROCR_VISIBLE_DEVICES
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```
Why UnsetEnvironment=OLLAMA_LLM_LIBRARY matters: If this variable is set (by a previous drop-in or install script) to the bundled ROCm path, it forces the bundled library and prevents Vulkan from being selected. Unsetting it allows Ollama to use its auto-detection logic and fall through to Vulkan.
Verify Vulkan is active:
```shell
ollama run llama3.2:1b "write a poem" &
sleep 2 && ollama ps
```

```
NAME           ID              SIZE      PROCESSOR    CONTEXT
llama3.2:1b    baf6a787fdff    2.2 GB    100% GPU     8192
```
The `100% GPU` label confirms GPU inference. Ollama logs will show:

```
load_tensors: offloaded 17/17 layers to GPU
load_tensors: Vulkan0 model buffer size = 1252.41 MiB
llama_kv_cache: Vulkan0 KV buffer size = 256.00 MiB
```
### 7.4 ROCm Backend in Ollama
Getting Ollama's ROCm backend to use the GPU on Strix Halo is harder. The bundled ROCm libraries probe the GPU via HIP device enumeration, which behaves differently for unified-memory iGPUs than for discrete GPUs. The probe frequently times out:
```
failure during GPU discovery
extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]"
error="failed to finish discovery before timeout"
```
The ROCR_VISIBLE_DEVICES:0 is set internally by Ollama during the probe — not by user configuration — and can cause the iGPU to be invisible to the HIP runtime. This is a known limitation of Ollama's GPU detection code with iGPU topologies.
In our testing, the ROCm backend consistently fell back to CPU, confirmed by watching journal logs for library=cpu during inference. This is consistent with community guidance on the Strix Halo Wiki, which explicitly marks Ollama as not recommended for this hardware.2
If you want to attempt ROCm in Ollama, replace the drop-in with:
```shell
sudo tee /etc/systemd/system/ollama.service.d/gpu.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_VULKAN=0"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HSA_XNACK=1"
Environment="ROCBLAS_USE_HIPBLASLT=1"
Environment="GPU_MAX_HW_QUEUES=8"
Environment="ROCM_PATH=/opt/rocm"
Environment="HIP_PATH=/opt/rocm"
UnsetEnvironment=OLLAMA_LLM_LIBRARY
UnsetEnvironment=HIP_VISIBLE_DEVICES
UnsetEnvironment=ROCR_VISIBLE_DEVICES
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```
Check the logs immediately after restart to confirm whether the GPU was detected or if it fell back to CPU.
### 7.5 Context Length and Memory

With Vulkan and the full 128 GB pool accessible, Ollama can serve very large context windows. Set `OLLAMA_CONTEXT_LENGTH` to control the default:

```shell
# In the drop-in, for 32K context:
Environment="OLLAMA_CONTEXT_LENGTH=32768"
```

Or set it per model in a Modelfile:

```
PARAMETER num_ctx 32768
```
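To apply the per-model setting, write the parameter into a Modelfile and register it under a new tag. A sketch — the model name and tag here are illustrative:

```shell
# Hypothetical Modelfile pinning a 32K context for a single model tag
cat > Modelfile <<'EOF'
FROM llama3.2:1b
PARAMETER num_ctx 32768
EOF
# Register it under a new name (base model must already be pulled):
# ollama create llama3.2-32k -f Modelfile
```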
For large models like Mistral Large (123B), Ollama correctly loads the model across the full GPU pool:
```shell
ollama run mistral-large "Explain quantum entanglement"
```
## 8. LM Studio {#lm-studio}
LM Studio is a cross-platform GUI application for local LLM inference. It uses llama.cpp as its inference backend, inheriting llama.cpp's Vulkan and ROCm support.
### 8.1 Installation on Ubuntu 24.04

LM Studio is distributed as an AppImage:

```shell
# Download from https://lmstudio.ai (version 0.3.19+)
chmod +x LM_Studio-*.AppImage
./LM_Studio-*.AppImage
```
ROCm/Linux support was introduced in LM Studio 0.3.19 (July 2025) for AMD RX 9000 series GPUs.6 For Strix Halo, the most reliable path is the Vulkan backend, which flows through llama.cpp's Vulkan implementation.
### 8.2 GPU Backend Selection

In LM Studio settings, set the GPU offload to your preferred backend:

- Vulkan (recommended for Strix Halo): select it in the inference settings; ensure `AMD_VULKAN_ICD=RADV` is set in your environment before launching LM Studio
- ROCm: set `HSA_OVERRIDE_GFX_VERSION=11.5.1` before launching; results may vary on gfx1151

```shell
# Launch LM Studio with RADV and ROCm env vars
AMD_VULKAN_ICD=RADV HSA_OVERRIDE_GFX_VERSION=11.5.1 ./LM_Studio-*.AppImage
```
### 8.3 Practical Considerations
LM Studio does not expose ROCm-level environment variables through its GUI. Users needing fine-grained control over GTT allocation, hipBLASlt tuning, or Flash Attention flags should use llama.cpp directly. LM Studio is best suited for users who want a friendly interface for model management and chat, without needing to tune inference parameters.
The model library browser in LM Studio makes it easy to download and quantize models from Hugging Face, which pairs well with Strix Halo's ability to run large models that would be impractical on discrete GPUs with limited VRAM.
## 9. llama.cpp — The Recommended Path {#llamacpp}
For users who want maximum performance and control, llama.cpp built from source with either Vulkan or ROCm is the recommended approach on Strix Halo. This is what the community toolboxes built by @kyuz0 and the Strix Halo Wiki are based on.7
### 9.1 Vulkan Build

Prerequisites:

```shell
sudo apt install cmake ninja-build libvulkan-dev vulkan-tools glslc
```
Build:
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -S . -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
```
Environment (prefer RADV over AMDVLK):
```shell
export AMD_VULKAN_ICD=RADV
```
Run inference:
```shell
./build/bin/llama-cli \
  -m /path/to/model.gguf \
  -ngl 999 \
  --mmap 0 \
  -p "Your prompt here"
```
`--mmap 0` is important for models larger than half of system RAM. Without it, memory-mapped file access can cause contention between CPU and GPU buffer allocations on unified memory systems.2
### 9.2 ROCm Build

Prerequisites:

```shell
sudo apt install rocm cmake ninja-build
```
Build with rocWMMA Flash Attention support (hipBLASlt is enabled at runtime, below):
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON
cmake --build build --config Release -j$(nproc)
```
Warning on `-DGGML_HIP_ROCWMMA_FATTN=ON`: As of ROCm 7.0.2+, this flag degrades performance at extended context depths (beyond ~32K tokens) on Strix Halo and should be used with caution. It is beneficial at standard context lengths (4K–16K). Monitor the upstream llama.cpp issue tracker for updates.2
Runtime environment:
```shell
export ROCBLAS_USE_HIPBLASLT=1
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export ROCM_PATH=/opt/rocm
```
Run inference:
```shell
./build/bin/llama-cli \
  -m /path/to/model.gguf \
  -ngl 999 \
  --mmap 0 \
  -fa 1 \
  -p "Your prompt here"
```
The -fa 1 flag enables Flash Attention, which significantly improves throughput at longer context lengths when using the ROCm backend.
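The pp512/tg128 figures quoted in section 10 come from llama.cpp's bundled `llama-bench` tool. A sketch of an equivalent run — the model path is a placeholder, and defaults are used for everything not shown:

```shell
# pp512 = 512-token prompt processing, tg128 = 128-token generation
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 999 \
  -p 512 \
  -n 128
```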
### 9.3 Pre-Built Binaries
If you prefer not to build from source, two pre-built options are available for Strix Halo:
- Lemonade Server — an easy-to-use inference server with gfx1151-optimized builds, recommended for newcomers
- @kyuz0's AMD Strix Halo Toolboxes — container-based builds using TheRock nightly ROCm with full native gfx1151 support.7
## 10. Performance Benchmarks {#performance-benchmarks}

### 10.1 Ollama: Default vs ROCm-Enabled (Community Benchmarks)
The following benchmarks are from a GMKTec Evo-X2 system (Ryzen AI MAX+ 395, 128 GB LPDDR5x-8000, Ubuntu 24.04) — identical silicon and memory configuration to the hardware discussed in this guide.8
Token generation rate (tg, eval rate) — Ollama default vs ROCm-enabled:
| Model | Ollama (no GPU config) | Ollama + ROCm | Speedup |
|---|---|---|---|
| GPT-OSS 20B | 23.80 t/s | 46.19 t/s | 1.9× |
| GPT-OSS 120B | 14.77 t/s | 33.05 t/s | 2.2× |
| Qwen3 32B | 4.42 t/s | 9.43 t/s | 2.1× |
"No GPU config" reflects Ollama's default behaviour on gfx1151 hardware when neither ROCm nor Vulkan is explicitly configured — inference runs on the Zen 5 CPU cores. With ROCm enabled (via HSA_OVERRIDE_GFX_VERSION=11.5.1 as described in section 7.4), throughput roughly doubles across model sizes.
llama.cpp as a Vulkan benchmark proxy:
Ollama's Vulkan backend uses the same underlying llama.cpp Vulkan kernels. The llama.cpp benchmarks in section 10.2 are the best available reference for Vulkan throughput on this hardware.
Note on model compatibility: Community reports indicate Ollama's Vulkan backend can hang on certain larger models (Qwen3.5 35B+ and similar). For models above ~30B parameters, ROCm is the more stable Ollama backend on gfx1151 hardware.9
Takeaway: With proper ROCm configuration, Ollama delivers approximately 2× the token generation rate compared to its default (CPU) mode on Strix Halo. ROCm is required for reliable inference on models above 30B parameters.
### 10.2 llama.cpp: Vulkan vs ROCm (Community Data)
The following benchmarks are from the Strix Halo Wiki, testing Qwen3-30B-A3B-UD-Q4_K_XL on a Ryzen AI MAX+ 395 / 128 GB system.2
Standard context depth:
| Backend | Driver/Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Vulkan | AMDVLK | 741.60 | 81.79 |
| Vulkan | RADV | 755.14 | 85.11 |
| ROCm | Standard | 650.59 | 64.17 |
| ROCm | + hipBLASlt | 651.93 | 63.95 |
| ROCm | Tuned build | 659.07 | 67.66 |
Extended context (130,560 token depth):
| Backend | Driver/Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Vulkan | AMDVLK | 10.75 | 3.51 |
| Vulkan | RADV | 17.24 | 12.54 |
| ROCm | Standard | 40.58 | 4.98 |
| ROCm | Tuned | 51.12 | 13.32 |
| ROCm | hipBLASlt tuned | 51.05 | 13.33 |
Key observations:
- At standard context, Vulkan RADV outperforms ROCm across both prompt processing and token generation — counterintuitive but consistent with gfx1151's memory access characteristics
- At extended context (130K+), ROCm tuned builds achieve 3× better prompt processing (51 vs 17 t/s) and comparable token generation (13.32 vs 12.54 t/s)
- RADV consistently outperforms AMDVLK for Vulkan; always set `AMD_VULKAN_ICD=RADV`
### 10.3 Power vs Performance
From the Strix Halo Wiki power mode analysis:2
| TDP | vs 55W CPU gain | vs 55W GPU LLM gain |
|---|---|---|
| 55W | baseline | baseline |
| 85W | +17–19% | +8.7% |
| 120W | +25.5–30.8% | +10.7% |
For LLM workloads, memory bandwidth saturates before compute does — the gain from 85W to 120W is only 2% for GPU LLM inference. 85W is the recommended sweet spot, balancing thermal envelope and throughput.
## 11. Backend Comparison and Recommendations {#recommendations}

### Summary Matrix

| | Ollama (Vulkan) | Ollama (ROCm) | LM Studio | llama.cpp Vulkan | llama.cpp ROCm |
|---|---|---|---|---|---|
| Setup difficulty | Easy | Hard / unreliable | Easy | Medium | Hard |
| GPU detected reliably | ✓ | ✗ (times out) | ✓ | ✓ | ✓ |
| Standard context perf | Good | N/A | Good | Best | Moderate |
| Long context (128K+) perf | Good | N/A | Good | Good | Best |
| API server | ✓ | — | ✓ | ✓ (llama-server) | ✓ |
| GUI | ✗ | — | ✓ | ✗ | ✗ |
| Fine-grained control | Medium | — | Low | High | High |
| Recommended for | API/automation | — | Model browsing | Performance | Long context |
### Decision Guide

Use Ollama with Vulkan if:

- You want a simple `curl`- or SDK-compatible API server
- You're integrating with tools like Open WebUI, Continue.dev, or LangChain
- You want minimal-setup GPU inference on Strix Halo (a single systemd drop-in)
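As a sketch of the API workflow the first bullet refers to, assuming the default Ollama port 11434 and an already-pulled model:

```shell
# Non-streaming generation request against the local Ollama server
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```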
Use LM Studio if:
- You want a GUI for model management, download, and chat
- You're exploring models without writing code
- You need a quick demo environment
Use llama.cpp (Vulkan) if:
- You need maximum token generation speed at standard context lengths
- You want direct control over quantization, batch size, and thread count
- You're building a custom inference server
Use llama.cpp (ROCm) if:
- You need maximum throughput at extended context depths (64K–128K+)
- You're running a production inference server and can tolerate the build complexity
- You're using the @kyuz0 pre-built toolboxes or TheRock nightly builds
### A Note on ROCm Maturity for gfx1151
ROCm 7.2.0 marks gfx1151 APU support as Preview and limits official framework support to PyTorch on Linux.1 The community has moved faster than the official release: TheRock nightly builds ship native gfx1151-compiled ROCm libraries, and the kyuz0 toolbox containers demonstrate production-viable ROCm inference.4 Expect this to improve significantly through 2026 as Strix Halo adoption grows.
## 12. References {#references}
Benchmarks in this article are sourced from the community: strixhalo.wiki (Qwen3-30B-A3B-UD-Q4_K_XL, kernel 6.15+), nishtahir.com/GMKTec Evo-X2 (Ollama + ROCm on Ryzen AI MAX+ 395, 128 GB), llm-tracker.info, Level1Techs, and the Framework Community forums. Results may vary with different models, quantization levels, and software versions.
Have corrections or additional benchmark data? Open an issue or PR on the Hyperion Consulting GitHub.
© 2026 Hyperion Consulting · All rights reserved
## Footnotes

1. AMD ROCm 7.2.0 Compatibility Matrix — https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html
2. Strix Halo Community Wiki — https://strixhalo.wiki · AI Capabilities Overview · llama.cpp Performance · llama.cpp with ROCm · Power Modes
3. AMD ROCm Linux Quick Start — https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
4. kyuz0/amd-strix-halo-llm-finetuning — https://github.com/kyuz0/amd-strix-halo-llm-finetuning
5. Ollama GPU support documentation — https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
6. LM Studio Changelog — https://lmstudio.ai/changelog · ROCm/Linux support introduced in 0.3.19 (July 2025)
7. kyuz0/amd-strix-halo-toolboxes — https://github.com/kyuz0/amd-strix-halo-toolboxes
8. GMKTec Evo-X2 (Ryzen AI MAX+ 395, 128 GB) Benchmarks — https://nishtahir.com/gmktec-evo-x2-ryzen-ai-max-395-benchmarks/
9. Ollama Issue #14855 — gfx1151 ROCm Working Guide — https://github.com/ollama/ollama/issues/14855
