Table of Contents
- Introduction
- Hardware Overview
- The Unified Memory Advantage
- System Prerequisites — Ubuntu 24.04
- Kernel Boot Parameters
- Installing ROCm 7.2
- Ollama — Setup, Vulkan, and ROCm
- LM Studio
- llama.cpp — The Recommended Path
- Performance Benchmarks
- Backend Comparison and Recommendations
- References
## 1. Introduction {#introduction}
The AMD Ryzen AI MAX+ 395 — codenamed Strix Halo — is a fundamentally different class of hardware for local AI inference. With 40 RDNA 3.5 GPU compute units, up to 128 GB of unified LPDDR5x-8000 memory, and 256 GB/s of memory bandwidth shared between CPU and GPU, it eliminates the most painful constraint of consumer GPU inference: VRAM limits.
A single machine with this chip can run 70B parameter models in full precision, fine-tune 12B models without quantization, and serve models that would previously require a data-center GPU. The question is no longer whether you can run large models locally — it is which software stack unlocks the hardware's full potential.
This guide covers three of the most popular open-source inference tools — Ollama, LM Studio, and llama.cpp — and evaluates both the Vulkan and ROCm compute backends on Ubuntu 24.04. All configuration, benchmark data, and observations in this article are drawn from hands-on testing on a system running the Ryzen AI MAX+ Pro 395 with 128 GB of unified memory, ROCm 7.2.0, and Ubuntu 24.04 with a 6.17 OEM kernel.
## 2. Hardware Overview {#hardware-overview}
| Specification | Value |
|---|---|
| Chip | AMD Ryzen AI MAX+ Pro 395 (Strix Halo) |
| GPU Architecture | RDNA 3.5 (gfx1151) |
| GPU Compute Units | 40 CUs |
| GPU Max Clock | 2.9 GHz |
| Peak FP16/BF16 | 59.39 TFLOPS (theoretical) |
| Memory Type | LPDDR5x-8000 |
| Memory Capacity | Up to 128 GB unified |
| Memory Bandwidth | 256 GB/s |
| ROCm LLVM Target | gfx1151 |
| Ubuntu 24.04 Kernel (tested) | 6.17.0 OEM |
The GPU is identified as Radeon Graphics (RADV GFX1151) by the Mesa Vulkan driver and as gfx1151 by ROCm. It is explicitly listed in Ollama's Linux GPU support table under the Ryzen AI family, alongside the Ryzen AI Max 390 and Ryzen AI Max 385.
The "integrated GPU" label is misleading. Unlike traditional laptop iGPUs with 512 MB to 8 GB of VRAM, Strix Halo's GPU accesses the entire unified memory pool — up to 128 GB — at full memory bus bandwidth. There is no PCIe bottleneck, no VRAM spill to system RAM, and no discrete GPU data copy overhead.
## 3. The Unified Memory Advantage {#the-unified-memory-advantage}
Traditional GPU inference is constrained by VRAM. A model that doesn't fit in VRAM either fails to load or falls back to slow CPU inference. Strix Halo collapses this boundary: both CPU and GPU threads address the same physical memory pool.
In practice this means:
- Llama 3.1 70B (Q4_K_M, ~40 GB) loads entirely on GPU with headroom to spare
- Mistral Large 123B (Q4, ~73 GB) fits with room for KV cache at long context
- Context windows of 128K tokens are practical for models up to ~30B parameters
- Fine-tuning of 12B models without quantization is possible (using ~115 GB)
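The 128K-context claim can be sanity-checked with KV-cache arithmetic. The layer and head counts below are illustrative of a ~30B dense model, not any specific checkpoint:

```shell
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
python3 - <<'EOF'
layers, kv_heads, head_dim, fp16 = 48, 8, 128, 2   # illustrative ~30B shape
per_token = 2 * layers * kv_heads * head_dim * fp16
total_gb = per_token * 131072 / 1024**3
print(f"{per_token} bytes/token, {total_gb:.1f} GiB at 128K context")
EOF
```

Roughly 24 GiB of KV cache on top of the weights — impractical on a 24 GB discrete GPU, comfortable inside a 128 GB unified pool.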
The bottleneck for inference is memory bandwidth, not compute. At 256 GB/s the chip delivers competitive throughput — and this has a direct implication for backend choice, as we'll cover in the benchmark section.
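A back-of-envelope sketch of why bandwidth dominates: each generated token must stream essentially the full set of active weights from memory, so bandwidth divided by model size bounds the generation rate. The figures below reuse the ~40 GB Q4_K_M 70B example from the list above:

```shell
# Rough upper bound on token generation rate:
# tg_max ≈ memory bandwidth / active model size
python3 - <<'EOF'
bandwidth_gb_s = 256.0   # Strix Halo unified memory bandwidth
model_size_gb = 40.0     # Llama 3.1 70B at Q4_K_M
print(f"{bandwidth_gb_s / model_size_gb:.1f} t/s upper bound")
EOF
```

Measured rates land below this ceiling, but the point stands: extra compute does not move it, which is why the 85W-vs-120W results in section 10.3 differ so little.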
## 4. System Prerequisites — Ubuntu 24.04 {#system-prerequisites}
ROCm 7.2.0 fully supports Ubuntu 24.04 (Noble Numbat) with glibc 2.39 and kernels from 6.8 (GA) through 6.14 (HWE).1 Kernels 6.15 and above (including the 6.17 OEM kernel) provide improved gfx1151 driver support.
Required userspace groups:
```shell
sudo usermod -aG render,video $USER
```
Log out and back in for group membership to take effect. The GPU device nodes at /dev/kfd and /dev/dri/renderD128 must be accessible to your user.
Verify device access:

```shell
ls -la /dev/kfd /dev/dri/renderD128
# Expected: crw-rw-rw- permissions with render group
```
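If access is denied, check that the group change took effect in your current session. A quick sketch (`id -nG` lists the session's effective groups):

```shell
# List session groups and filter for the two required ones
id -nG | tr ' ' '\n' | grep -Ex 'render|video'
```

If the command prints nothing, you are still in a pre-`usermod` session — log out and back in.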
## 5. Kernel Boot Parameters {#kernel-boot-parameters}
This is the most important and most commonly missed configuration step. By default, the kernel allocates only a small BIOS VRAM carveout (typically 512 MB) for the GPU GTT memory pool. Without explicit overrides, tools like rocm-smi report 94%+ VRAM usage on a pool that is only 512 MB — while the actual 128 GB unified pool is inaccessible to GPU compute.
Add the following to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
```
amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432
```
| Parameter | Value | Effect |
|---|---|---|
| `amd_iommu=off` | — | Disables IOMMU; ~6% memory bandwidth improvement for LLM workloads 2 |
| `amdgpu.gttsize=131072` | 131072 MiB = 128 GiB | Exposes the full unified memory pool to the GPU driver |
| `ttm.pages_limit=33554432` | 128 GiB in 4 KiB pages | Allows pinning the full pool for GPU operations |
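For reference, the edited line in /etc/default/grub might look like the following, merged with whatever flags your distribution already sets (`quiet splash` here is just the Ubuntu default):

```shell
# /etc/default/grub — existing flags preserved, new parameters appended
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
```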
Apply and reboot:
```shell
sudo update-grub
sudo reboot
```
After reboot, verify:
```shell
rocm-smi --showmeminfo vram
# Should show ~128 GiB total, not 512 MB
```
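It also helps to confirm the parameters actually reached the running kernel, independent of what the driver reports:

```shell
# Both patterns should print the values configured above
grep -o 'amdgpu.gttsize=[0-9]*' /proc/cmdline
grep -o 'ttm.pages_limit=[0-9]*' /proc/cmdline
```

No output means GRUB was not regenerated or the wrong boot entry was used.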
Note on `amdgpu.gttsize`: The Strix Halo Wiki notes this parameter may be deprecated in favour of TTM subsystem settings in future kernels. Monitor AMD's kernel documentation for updates.2
## 6. Installing ROCm 7.2 {#installing-rocm}
AMD ROCm 7.2.0 is the current production release and the recommended stack for Ubuntu 24.04.3 Install via the amdgpu-install utility:
```shell
wget https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb
sudo apt install ./amdgpu-install_7.2.70200-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo apt install rocm
sudo reboot
```
Verify ROCm detects the GPU:
```shell
rocminfo | grep -E "Name|gfx|Marketing"
```
Expected output includes gfx1151 and AMD Radeon Graphics.
```shell
rocm-smi
```
The device should appear with temperature, power, and the correct memory pool size.
Strix Halo status in ROCm: gfx1151 is not in ROCm's main compute compatibility matrix for production workloads — Ryzen APU support is marked Preview and scoped to PyTorch on Linux.1 For inference tooling (Ollama, llama.cpp), the ROCm libraries work but may require the workarounds described in this guide. For cutting-edge performance, the community recommends AMD's TheRock nightly builds, which ship native gfx1151-compiled binaries.4
## 7. Ollama — Setup, Vulkan, and ROCm {#ollama}

### 7.1 Installation

```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Ollama installs as a systemd service running under the ollama user. The Ryzen AI MAX+ 395 is officially listed in Ollama's Linux GPU support documentation.5
### 7.2 The Two Backends: Vulkan and ROCm
Ollama on Linux supports two GPU backends for AMD hardware:
- ROCm (HIP) — the primary AMD compute backend, using Ollama's bundled ROCm libraries
- Vulkan — an experimental compute backend using the system's Vulkan driver, enabled via `OLLAMA_VULKAN=1`
The critical architectural detail: Ollama ships its own bundled ROCm libraries in /usr/local/lib/ollama/rocm, separate from the system ROCm installation at /opt/rocm. The bundled libraries may not include native gfx1151 kernels, causing a GPU discovery timeout and silent CPU fallback. This is the root cause of many "Ollama won't use the GPU" reports on Strix Halo.
How to tell which backend is active:
```shell
# Run a model and check logs immediately
ollama run llama3.2:1b "hello" &
journalctl -u ollama -n 30 --no-pager | grep "inference compute"
```
| Log Output | Meaning |
|---|---|
| `library=Vulkan name=Vulkan0` | Vulkan GPU active ✓ |
| `library=ROCm compute=gfx1151` | ROCm GPU active ✓ |
| `library=cpu` | CPU fallback — GPU not detected ✗ |
### 7.3 Enabling Vulkan (Recommended for Ollama)
The Vulkan backend is the most reliable path for Strix Halo on Ollama. It uses the Mesa RADV Vulkan driver, which has strong gfx1151 support and correctly addresses the full unified memory pool.
Ollama is configured via a systemd drop-in file. Check for existing drop-ins:
```shell
ls /etc/systemd/system/ollama.service.d/
```
Create or update /etc/systemd/system/ollama.service.d/gpu.conf:
```shell
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/gpu.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HSA_XNACK=1"
Environment="GPU_MAX_HW_QUEUES=8"
Environment="ROCM_PATH=/opt/rocm"
Environment="HIP_PATH=/opt/rocm"
UnsetEnvironment=OLLAMA_LLM_LIBRARY
UnsetEnvironment=HIP_VISIBLE_DEVICES
UnsetEnvironment=ROCR_VISIBLE_DEVICES
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```
Why UnsetEnvironment=OLLAMA_LLM_LIBRARY matters: If this variable is set (by a previous drop-in or install script) to the bundled ROCm path, it forces the bundled library and prevents Vulkan from being selected. Unsetting it allows Ollama to use its auto-detection logic and fall through to Vulkan.
Verify Vulkan is active:
```shell
ollama run llama3.2:1b "write a poem" &
sleep 2 && ollama ps
```

```
NAME           ID              SIZE      PROCESSOR    CONTEXT
llama3.2:1b    baf6a787fdff    2.2 GB    100% GPU     8192
```
The `100% GPU` label confirms GPU inference. Ollama logs will show:

```
load_tensors: offloaded 17/17 layers to GPU
load_tensors: Vulkan0 model buffer size = 1252.41 MiB
llama_kv_cache: Vulkan0 KV buffer size = 256.00 MiB
```
### 7.4 ROCm Backend in Ollama
Getting Ollama's ROCm backend to use the GPU on Strix Halo is harder. The bundled ROCm libraries probe the GPU via HIP device enumeration, which behaves differently for unified-memory iGPUs than for discrete GPUs. The probe frequently times out:
```
failure during GPU discovery
extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]"
error="failed to finish discovery before timeout"
```
The ROCR_VISIBLE_DEVICES:0 is set internally by Ollama during the probe — not by user configuration — and can cause the iGPU to be invisible to the HIP runtime. This is a known limitation of Ollama's GPU detection code with iGPU topologies.
In our testing, the ROCm backend consistently fell back to CPU, confirmed by watching journal logs for library=cpu during inference. This is consistent with community guidance on the Strix Halo Wiki, which explicitly marks Ollama as not recommended for this hardware.2
If you want to attempt ROCm in Ollama, replace the drop-in with:
```shell
sudo tee /etc/systemd/system/ollama.service.d/gpu.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_VULKAN=0"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HSA_XNACK=1"
Environment="ROCBLAS_USE_HIPBLASLT=1"
Environment="GPU_MAX_HW_QUEUES=8"
Environment="ROCM_PATH=/opt/rocm"
Environment="HIP_PATH=/opt/rocm"
UnsetEnvironment=OLLAMA_LLM_LIBRARY
UnsetEnvironment=HIP_VISIBLE_DEVICES
UnsetEnvironment=ROCR_VISIBLE_DEVICES
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```
Check the logs immediately after restart to confirm whether the GPU was detected or if it fell back to CPU.
### 7.5 Context Length and Memory

With Vulkan and the full 128 GB pool accessible, Ollama can serve very large context windows. Set `OLLAMA_CONTEXT_LENGTH` to control the default:

```shell
# In the drop-in, for 32K context:
Environment="OLLAMA_CONTEXT_LENGTH=32768"
```

Or set it per model in a Modelfile:

```
PARAMETER num_ctx 32768
```
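To apply the per-model setting, write the parameter into a Modelfile and register it under a new tag. A sketch — the model name and tag here are illustrative:

```shell
# Hypothetical Modelfile pinning a 32K context for a single model tag
cat > Modelfile <<'EOF'
FROM llama3.2:1b
PARAMETER num_ctx 32768
EOF
# Register it under a new name (base model must already be pulled):
# ollama create llama3.2-32k -f Modelfile
```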
For large models like Mistral Large (123B), Ollama correctly loads the model across the full GPU pool:
```shell
ollama run mistral-large "Explain quantum entanglement"
```
## 8. LM Studio {#lm-studio}
LM Studio is a cross-platform GUI application for local LLM inference. It uses llama.cpp as its inference backend, inheriting llama.cpp's Vulkan and ROCm support.
### 8.1 Installation on Ubuntu 24.04

LM Studio is distributed as an AppImage:

```shell
# Download from https://lmstudio.ai (version 0.3.19+)
chmod +x LM_Studio-*.AppImage
./LM_Studio-*.AppImage
```
ROCm/Linux support was introduced in LM Studio 0.3.19 (July 2025) for AMD RX 9000 series GPUs.6 For Strix Halo, the most reliable path is the Vulkan backend, which flows through llama.cpp's Vulkan implementation.
### 8.2 GPU Backend Selection

In LM Studio settings, set the GPU offload to your preferred backend:

- Vulkan (recommended for Strix Halo): select it in the inference settings; ensure `AMD_VULKAN_ICD=RADV` is set in your environment before launching LM Studio
- ROCm: set `HSA_OVERRIDE_GFX_VERSION=11.5.1` before launching; results may vary on gfx1151

```shell
# Launch LM Studio with RADV and ROCm env vars
AMD_VULKAN_ICD=RADV HSA_OVERRIDE_GFX_VERSION=11.5.1 ./LM_Studio-*.AppImage
```
### 8.3 Practical Considerations
LM Studio does not expose ROCm-level environment variables through its GUI. Users needing fine-grained control over GTT allocation, hipBLASlt tuning, or Flash Attention flags should use llama.cpp directly. LM Studio is best suited for users who want a friendly interface for model management and chat, without needing to tune inference parameters.
The model library browser in LM Studio makes it easy to download and quantize models from Hugging Face, which pairs well with Strix Halo's ability to run large models that would be impractical on discrete GPUs with limited VRAM.
## 9. llama.cpp — The Recommended Path {#llamacpp}
For users who want maximum performance and control, llama.cpp built from source with either Vulkan or ROCm is the recommended approach on Strix Halo. This is what the community toolboxes built by @kyuz0 and the Strix Halo Wiki are based on.7
### 9.1 Vulkan Build

Prerequisites:

```shell
sudo apt install cmake ninja-build libvulkan-dev vulkan-tools glslc
```
Build:
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -S . -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
```
Environment (prefer RADV over AMDVLK):
```shell
export AMD_VULKAN_ICD=RADV
```
Run inference:
```shell
./build/bin/llama-cli \
  -m /path/to/model.gguf \
  -ngl 999 \
  --mmap 0 \
  -p "Your prompt here"
```
`--mmap 0` is important for models larger than half of system RAM. Without it, memory-mapped file access can cause contention between CPU and GPU buffer allocations on unified memory systems.2
### 9.2 ROCm Build

Prerequisites:

```shell
sudo apt install rocm cmake ninja-build
```
Build with rocWMMA Flash Attention support (hipBLASlt is enabled at runtime, below):
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON
cmake --build build --config Release -j$(nproc)
```
Warning on `-DGGML_HIP_ROCWMMA_FATTN=ON`: As of ROCm 7.0.2+, this flag degrades performance at extended context depths (beyond ~32K tokens) on Strix Halo and should be used with caution. It is beneficial at standard context lengths (4K–16K). Monitor the upstream llama.cpp issue tracker for updates.2
Runtime environment:
```shell
export ROCBLAS_USE_HIPBLASLT=1
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export ROCM_PATH=/opt/rocm
```
Run inference:
```shell
./build/bin/llama-cli \
  -m /path/to/model.gguf \
  -ngl 999 \
  --mmap 0 \
  -fa 1 \
  -p "Your prompt here"
```
The -fa 1 flag enables Flash Attention, which significantly improves throughput at longer context lengths when using the ROCm backend.
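The pp512/tg128 figures quoted in section 10 come from llama.cpp's bundled `llama-bench` tool. A sketch of an equivalent run — the model path is a placeholder, and defaults are used for everything not shown:

```shell
# pp512 = 512-token prompt processing, tg128 = 128-token generation
./build/bin/llama-bench \
  -m /path/to/model.gguf \
  -ngl 999 \
  -p 512 \
  -n 128
```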
### 9.3 Pre-Built Binaries
If you prefer not to build from source, two pre-built options are available for Strix Halo:
- Lemonade Server — an easy-to-use inference server with gfx1151-optimized builds, recommended for newcomers
- @kyuz0's AMD Strix Halo Toolboxes — container-based builds using TheRock nightly ROCm with full native gfx1151 support.7
## 10. Performance Benchmarks {#performance-benchmarks}

### 10.1 Ollama: Default vs ROCm-Enabled (Community Benchmarks)
The following benchmarks are from a GMKTec Evo-X2 system (Ryzen AI MAX+ 395, 128 GB LPDDR5x-8000, Ubuntu 24.04) — identical silicon and memory configuration to the hardware discussed in this guide.8
Token generation rate (tg, eval rate) — Ollama default vs ROCm-enabled:
| Model | Ollama (no GPU config) | Ollama + ROCm | Speedup |
|---|---|---|---|
| GPT-OSS 20B | 23.80 t/s | 46.19 t/s | 1.9× |
| GPT-OSS 120B | 14.77 t/s | 33.05 t/s | 2.2× |
| Qwen3 32B | 4.42 t/s | 9.43 t/s | 2.1× |
"No GPU config" reflects Ollama's default behaviour on gfx1151 hardware when neither ROCm nor Vulkan is explicitly configured — inference runs on the Zen 5 CPU cores. With ROCm enabled (via HSA_OVERRIDE_GFX_VERSION=11.5.1 as described in section 7.4), throughput roughly doubles across model sizes.
llama.cpp as a Vulkan benchmark proxy:
Ollama's Vulkan backend uses the same underlying llama.cpp Vulkan kernels. The llama.cpp benchmarks in section 10.2 are the best available reference for Vulkan throughput on this hardware.
Note on model compatibility: Community reports indicate Ollama's Vulkan backend can hang on certain larger models (Qwen3.5 35B+ and similar). For models above ~30B parameters, ROCm is the more stable Ollama backend on gfx1151 hardware.9
Takeaway: With proper ROCm configuration, Ollama delivers approximately 2× the token generation rate compared to its default (CPU) mode on Strix Halo. ROCm is required for reliable inference on models above 30B parameters.
### 10.2 llama.cpp: Vulkan vs ROCm (Community Data)
The following benchmarks are from the Strix Halo Wiki, testing Qwen3-30B-A3B-UD-Q4_K_XL on a Ryzen AI MAX+ 395 / 128 GB system.2
Standard context depth:
| Backend | Driver/Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Vulkan | AMDVLK | 741.60 | 81.79 |
| Vulkan | RADV | 755.14 | 85.11 |
| ROCm | Standard | 650.59 | 64.17 |
| ROCm | + hipBLASlt | 651.93 | 63.95 |
| ROCm | Tuned build | 659.07 | 67.66 |
Extended context (130,560 token depth):
| Backend | Driver/Config | pp512 (t/s) | tg128 (t/s) |
|---|---|---|---|
| Vulkan | AMDVLK | 10.75 | 3.51 |
| Vulkan | RADV | 17.24 | 12.54 |
| ROCm | Standard | 40.58 | 4.98 |
| ROCm | Tuned | 51.12 | 13.32 |
| ROCm | hipBLASlt tuned | 51.05 | 13.33 |
Key observations:
- At standard context, Vulkan RADV outperforms ROCm across both prompt processing and token generation — counterintuitive but consistent with gfx1151's memory access characteristics
- At extended context (130K+), ROCm tuned builds achieve 3× better prompt processing (51 vs 17 t/s) and comparable token generation (13.32 vs 12.54 t/s)
- RADV consistently outperforms AMDVLK for Vulkan; always set `AMD_VULKAN_ICD=RADV`
### 10.3 Power vs Performance
From the Strix Halo Wiki power mode analysis:2
| TDP | vs 55W CPU gain | vs 55W GPU LLM gain |
|---|---|---|
| 55W | baseline | baseline |
| 85W | +17–19% | +8.7% |
| 120W | +25.5–30.8% | +10.7% |
For LLM workloads, memory bandwidth saturates before compute does — the gain from 85W to 120W is only 2% for GPU LLM inference. 85W is the recommended sweet spot, balancing thermal envelope and throughput.
## 11. Backend Comparison and Recommendations {#recommendations}

### Summary Matrix

| | Ollama (Vulkan) | Ollama (ROCm) | LM Studio | llama.cpp Vulkan | llama.cpp ROCm |
|---|---|---|---|---|---|
| Setup difficulty | Easy | Hard / unreliable | Easy | Medium | Hard |
| GPU detected reliably | ✓ | ✗ (times out) | ✓ | ✓ | ✓ |
| Standard context perf | Good | N/A | Good | Best | Moderate |
| Long context (128K+) perf | Good | N/A | Good | Good | Best |
| API server | ✓ | — | ✓ | ✓ (llama-server) | ✓ |
| GUI | ✗ | — | ✓ | ✗ | ✗ |
| Fine-grained control | Medium | — | Low | High | High |
| Recommended for | API/automation | — | Model browsing | Performance | Long context |
### Decision Guide

Use Ollama with Vulkan if:

- You want a simple `curl`- or SDK-compatible API server
- You're integrating with tools like Open WebUI, Continue.dev, or LangChain
- You want minimal-setup GPU inference on Strix Halo (a single systemd drop-in)
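As a sketch of the API workflow the first bullet refers to, assuming the default Ollama port 11434 and an already-pulled model:

```shell
# Non-streaming generation request against the local Ollama server
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```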
Use LM Studio if:
- You want a GUI for model management, download, and chat
- You're exploring models without writing code
- You need a quick demo environment
Use llama.cpp (Vulkan) if:
- You need maximum token generation speed at standard context lengths
- You want direct control over quantization, batch size, and thread count
- You're building a custom inference server
Use llama.cpp (ROCm) if:
- You need maximum throughput at extended context depths (64K–128K+)
- You're running a production inference server and can tolerate the build complexity
- You're using the @kyuz0 pre-built toolboxes or TheRock nightly builds
### A Note on ROCm Maturity for gfx1151
ROCm 7.2.0 marks gfx1151 APU support as Preview and limits official framework support to PyTorch on Linux.1 The community has moved faster than the official release: TheRock nightly builds ship native gfx1151-compiled ROCm libraries, and the kyuz0 toolbox containers demonstrate production-viable ROCm inference.4 Expect this to improve significantly through 2026 as Strix Halo adoption grows.
## 12. References {#references}
Benchmarks in this article are sourced from the community: strixhalo.wiki (Qwen3-30B-A3B-UD-Q4_K_XL, kernel 6.15+), nishtahir.com/GMKTec Evo-X2 (Ollama + ROCm on Ryzen AI MAX+ 395, 128 GB), llm-tracker.info, Level1Techs, and the Framework Community forums. Results may vary with different models, quantization levels, and software versions.
Have corrections or additional benchmark data? Open an issue or PR on the Hyperion Consulting GitHub.
© 2026 Hyperion Consulting · All rights reserved
## Footnotes

1. AMD ROCm 7.2.0 Compatibility Matrix — https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html
2. Strix Halo Community Wiki — https://strixhalo.wiki · AI Capabilities Overview · llama.cpp Performance · llama.cpp with ROCm · Power Modes
3. AMD ROCm Linux Quick Start — https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
4. kyuz0/amd-strix-halo-llm-finetuning — https://github.com/kyuz0/amd-strix-halo-llm-finetuning
5. Ollama GPU support documentation — https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
6. LM Studio Changelog — https://lmstudio.ai/changelog · ROCm/Linux support introduced in 0.3.19 (July 2025)
7. kyuz0/amd-strix-halo-toolboxes — https://github.com/kyuz0/amd-strix-halo-toolboxes
8. GMKTec Evo-X2 (Ryzen AI MAX+ 395, 128 GB) Benchmarks — https://nishtahir.com/gmktec-evo-x2-ryzen-ai-max-395-benchmarks/
9. Ollama Issue #14855 — gfx1151 ROCm Working Guide — https://github.com/ollama/ollama/issues/14855
