What is Ollama and Why It's Transforming Enterprise AI
Ollama has become the de facto standard for running large language models locally. With over 90,000 GitHub stars and a rapidly growing ecosystem, it has earned that position for good reason: it abstracts away the complexity of model quantization, GPU memory management, and inference server configuration into a single binary that just works.
At its core, Ollama lets you run any supported model with a single command:
ollama run mistral-nemo
That one command downloads the model, loads it into GPU memory (or CPU if no GPU is available), and opens an interactive chat session. Behind the scenes, Ollama manages model quantization via llama.cpp, GPU layer allocation, context window sizing, and KV cache management.
Why Enterprise Teams Are Adopting Ollama
Data sovereignty and GDPR compliance. When you run models through Ollama, your data never leaves your infrastructure. Every prompt, every response, every document you process stays on your hardware. For European enterprises subject to GDPR, the Digital Services Act, and the EU AI Act, this is not just a nice-to-have — it is a regulatory requirement for many use cases involving personal data or sensitive business information.
OpenAI-compatible API. Ollama exposes an API endpoint at /v1/chat/completions that is wire-compatible with the OpenAI API specification. This means any application, library, or framework built for the OpenAI API can be pointed at Ollama with a single configuration change — swap the base URL from https://api.openai.com/v1 to http://localhost:11434/v1 and you are running locally. No code changes required.
Model diversity. Ollama supports over 100 models from every major open-source provider: Meta's Llama 3.3 (70B) and Llama 3.1 (8B), Mistral's Nemo and Large, Google's Gemma 2 and 3, Microsoft's Phi-4, Alibaba's Qwen 2.5 series, DeepSeek's V3 and R1, and many more. New models are typically available within days of release.
Use cases driving adoption:
- Development and testing — Run models locally during development without API costs or rate limits
- Air-gapped production — Deploy in environments with no internet access (defense, finance, healthcare)
- GDPR-compliant processing — Process personal data without third-party data transfers
- Edge deployment — Run inference on edge devices, factory floors, or retail locations
- Cost optimization — Eliminate per-token API costs for high-volume workloads
- Latency-sensitive applications — Sub-100ms first-token latency on local hardware
Current state (March 2026). Ollama is at version 0.5.x, with the latest releases bringing improved multi-GPU support, expanded model compatibility, better memory management, and native support for structured outputs. The project is maintained by a dedicated team and receives frequent updates, typically multiple releases per month.
Installation Guide — Every Platform
macOS
Ollama runs natively on macOS with full Apple Silicon support. Installation takes under a minute.
Via direct download (recommended):
Download the macOS app from ollama.com/download. The app installs the ollama CLI and runs the Ollama server as a background service accessible from the menu bar.
Via Homebrew:
brew install ollama
Note that the curl install script (curl -fsSL https://ollama.com/install.sh | sh) targets Linux; on macOS, use the app or Homebrew.
After installation, verify it works:
ollama --version
# ollama version is 0.5.x
Linux (Ubuntu/Debian/RHEL)
Linux is the primary platform for production Ollama deployments. The install script handles everything including systemd service setup.
curl -fsSL https://ollama.com/install.sh | sh
This script will:
- Detect your Linux distribution and architecture
- Download the appropriate binary
- Create the ollama system user and group
- Install and enable the ollama systemd service
- Detect NVIDIA or AMD GPUs and configure drivers if needed
Verify the installation:
ollama --version
systemctl status ollama
You should see the ollama service running and active. If it is not running, start it:
sudo systemctl enable ollama
sudo systemctl start ollama
Manual installation (for air-gapped environments):
# Download the binary on a connected machine
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
# Transfer to air-gapped server, then:
chmod +x ollama
sudo mv ollama /usr/local/bin/
# Create system user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
# Create systemd service (see Ollama docs for full unit file)
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
Windows
Ollama supports Windows natively with GPU acceleration:
- Download the installer from ollama.com/download
- Run the installer — it sets up Ollama as a background service
- Open a terminal (PowerShell or Command Prompt) and run:
ollama --version
ollama run phi4-mini
For NVIDIA GPU support on Windows, ensure you have the latest NVIDIA drivers installed (Game Ready or Studio drivers both work). Ollama will auto-detect the GPU.
For WSL2 users, you can also install the Linux version inside WSL2, which provides GPU passthrough from Windows NVIDIA drivers automatically.
Run Your First Model
With Ollama installed, pull and run a model:
# Small model — runs on any hardware (3.8B parameters)
ollama run phi4-mini
# Medium model — needs 5GB+ VRAM (7B parameters)
ollama run qwen2.5:7b
# Large model — needs 8GB+ VRAM (12B parameters)
ollama run mistral-nemo
# Extra large — needs 40GB+ VRAM (70B parameters)
ollama run llama3.3:70b
The first run downloads the model (this may take several minutes depending on your connection speed). Subsequent runs load the model from the local cache in seconds.
GPU Setup — NVIDIA, AMD, Apple Silicon
GPU acceleration is what makes Ollama practical for production workloads. Without a GPU, even a 7B model generates text at just 3-8 tokens per second. With a modern GPU, that same model produces 50-100+ tokens per second.
NVIDIA (CUDA)
NVIDIA GPUs are the best-supported option for Ollama. If you have NVIDIA drivers and CUDA installed, Ollama detects and uses your GPU automatically.
Verify your GPU is detected:
nvidia-smi
You should see your GPU model, driver version, and CUDA version. Ollama requires CUDA compute capability 5.0 or higher (Maxwell architecture and newer — basically any GPU from 2014 onward).
VRAM requirements by model size:
| Model Size | Quantization | VRAM Required | Recommended GPU |
|---|---|---|---|
| 3.8B (Phi-4-mini) | Q4_K_M | 3 GB | Any recent GPU |
| 7B (Qwen 2.5, Llama 3.1) | Q4_K_M | 5 GB | RTX 3080, RTX 4070 |
| 14B (Qwen 2.5) | Q4_K_M | 9 GB | RTX 3090, RTX 4080 |
| 32B (Qwen 2.5 Coder) | Q4_K_M | 20 GB | RTX 3090 24GB |
| 70B (Llama 3.3) | Q4_K_M | 40 GB | 2x RTX 3090 or A100 40GB |
| 70B (Llama 3.3) | Q8_0 | 70+ GB | H100 80GB |
Multi-GPU support:
Ollama automatically splits models across multiple GPUs when a single GPU does not have enough VRAM. For a 70B model on two RTX 3090 cards (24GB each = 48GB total), Ollama will distribute the model layers across both GPUs.
# Verify multi-GPU detection
nvidia-smi -L
# GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxx...)
# GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-yyy...)
# Ollama uses both automatically
ollama run llama3.3:70b
To restrict Ollama to specific GPUs:
CUDA_VISIBLE_DEVICES=0,1 ollama serve
AMD (ROCm)
AMD GPU support requires ROCm (Radeon Open Compute) on Linux. Supported GPUs include the Radeon RX 7900 XTX, RX 7900 XT, Radeon PRO W7900, and Instinct MI250/MI300 series.
Install ROCm:
# Ubuntu 22.04 "jammy" (adjust the path for your release; check repo.radeon.com for the current installer version)
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo apt install ./amdgpu-install*.deb
sudo amdgpu-install --usecase=rocm
# Add user to render and video groups
sudo usermod -aG render,video $USER
# Reboot
sudo reboot
After ROCm is installed and you have rebooted, Ollama will automatically detect AMD GPUs on Linux. Verify with:
rocminfo | grep "Name:"
Performance on AMD GPUs is generally 70-85% of equivalent NVIDIA GPUs due to the maturity difference between ROCm and CUDA, but the gap has been closing steadily.
Apple Silicon (Metal Performance Shaders)
Ollama has first-class Apple Silicon support. On any Mac with an M1, M2, M3, or M4 chip, Ollama uses Metal Performance Shaders (MPS) for GPU acceleration with zero additional configuration.
The key advantage of Apple Silicon is unified memory — the CPU and GPU share the same memory pool. This means a MacBook Pro M3 Max with 128GB of unified memory can run a 70B model comfortably, something that would require multiple discrete GPUs on a traditional PC.
Approximate performance on Apple Silicon:
| Chip | RAM | Model | Tokens/sec |
|---|---|---|---|
| M1 Pro (16GB) | 16 GB | Phi-4-mini 3.8B | 25-35 |
| M2 Max (32GB) | 32 GB | Mistral Nemo 12B | 20-30 |
| M3 Max (64GB) | 64 GB | Llama 3.3 70B Q4 | 12-18 |
| M3 Max (128GB) | 128 GB | Llama 3.3 70B Q4 | 18-25 |
| M4 Max (128GB) | 128 GB | Llama 3.3 70B Q4 | 22-30 |
No additional drivers, frameworks, or configuration needed — just install Ollama and run.
CPU-Only Deployments
Ollama works without any GPU, using CPU inference via the llama.cpp backend. This is viable for smaller models and lower-throughput use cases.
Best models for CPU-only deployment:
- Phi-4-mini (3.8B) — Best quality-to-size ratio, 3-8 tokens/sec on modern CPU
- SmolLM2 (1.7B) — Ultra-lightweight, fast on CPU
- Qwen 2.5 (1.5B) — Surprisingly capable for its size
- Gemma 2 (2B) — Good for simple classification and extraction tasks
For CPU inference, having fast RAM (DDR5) and a high core count (12+ cores) makes a meaningful difference. AVX-512 instruction support (Intel Ice Lake+, AMD Zen 4+) provides a noticeable speedup.
Ollama Model Library — What to Use When
Choosing the right model is critical. The Ollama model library hosts hundreds of models, but for enterprise use, you want to select based on your specific requirements: task type, hardware constraints, licensing, and language support.
Recommended Models by Use Case
| Use Case | Recommended Model | Size | Why |
|---|---|---|---|
| General enterprise assistant | llama3.3:70b | 40 GB | Best overall open-source quality, strong reasoning |
| EU sovereign / multilingual | mistral-nemo | 8 GB | Apache 2.0 license, excellent European language support |
| Code generation & review | qwen2.5-coder:32b | 20 GB | Best open-source code model, supports 90+ languages |
| Code generation (low VRAM) | qwen2.5-coder:7b | 5 GB | Strong code capabilities in smaller footprint |
| Reasoning & math | deepseek-r1:70b | 40 GB | Chain-of-thought reasoning, strong at complex tasks |
| Reasoning (low VRAM) | deepseek-r1:14b | 9 GB | Good reasoning in a more accessible size |
| Low VRAM / edge deployment | phi4-mini | 3 GB | Microsoft's best small model, excellent quality/size ratio |
| Text embeddings | nomic-embed-text | 274 MB | Fast, high quality, 8192 token context |
| Vision / multimodal | llava:13b | 8 GB | Image understanding and description |
| Summarization | mistral-nemo | 8 GB | Strong at condensing long documents |
| Classification & extraction | phi4-mini | 3 GB | Fast inference, good structured output |
| Translation | qwen2.5:14b | 9 GB | Excellent multilingual capabilities |
Model Naming Conventions
Ollama uses a name:tag format:
- mistral-nemo — Default quantization (usually Q4_K_M)
- llama3.3:70b — Specific size variant
- llama3.3:70b-instruct-q4_K_M — Specific quantization
- llama3.3:70b-instruct-q8_0 — Higher-quality quantization (more VRAM)
Lower quantization (Q4) uses less VRAM but slightly reduces quality. For most enterprise use cases, Q4_K_M provides the best balance of quality and resource usage. Use Q8 only when you have ample VRAM and need maximum quality.
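As a sanity check when sizing hardware, the VRAM figures quoted throughout this guide can be approximated with a back-of-the-envelope formula. The 4.5 bits/weight average for Q4_K_M and the 20% runtime overhead factor below are rough assumptions for illustration, not Ollama internals:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float,
                            overhead: float = 1.2) -> float:
    """Rule of thumb: parameter count x bits per weight / 8 bytes,
    plus ~20% headroom for runtime buffers and a modest KV cache."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# Q4_K_M averages roughly 4.5 bits per weight; Q8_0 roughly 8.5
print(estimate_weight_vram_gb(12, 4.5))   # Mistral Nemo 12B at Q4: ~8 GB
print(estimate_weight_vram_gb(70, 4.5))   # Llama 3.3 70B at Q4: ~47 GB
```

Actual usage varies with context window size and concurrency, so treat the result as a lower bound when provisioning.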
Licensing Considerations
For enterprise deployment, model licensing matters:
- Apache 2.0 (fully permissive): Mistral Nemo, Qwen 2.5 (most sizes)
- MIT: Phi-4
- Gemma Terms of Use: Gemma 2 and 3 (permissive, but with use restrictions)
- Llama 3.3 Community License: Free for companies under 700M monthly active users, requires attribution
- DeepSeek License: Open for research and commercial use with some restrictions
Always review the specific license terms for your use case, especially if you are building a product that directly exposes model outputs to end users.
The Ollama REST API — Complete Reference
Ollama exposes a REST API on port 11434 by default. This is the primary interface for integrating Ollama into your applications.
List Available Models
curl http://localhost:11434/api/tags
Response:
{
"models": [
{
"name": "mistral-nemo:latest",
"model": "mistral-nemo:latest",
"modified_at": "2026-03-15T10:30:00Z",
"size": 7365960935,
"digest": "sha256:abc123...",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"parameter_size": "12B",
"quantization_level": "Q4_K_M"
}
}
]
}
Generate Completion (Native API)
Non-streaming:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Explain GDPR Article 5 in simple terms.",
"stream": false
}'
Streaming (default):
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Explain GDPR Article 5 in simple terms."
}'
Streaming returns newline-delimited JSON objects, one per token. This is essential for real-time chat interfaces.
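A client consuming this stream just concatenates the `response` fragments until it sees `"done": true`. A minimal sketch of that reassembly logic (in a real client you would iterate over the HTTP response line by line rather than buffer the whole body):

```python
import json

def join_stream(ndjson_text: str) -> str:
    """Reassemble the full completion from Ollama's newline-delimited
    JSON stream: each line carries a 'response' fragment, and the
    final line is marked with 'done': true."""
    out = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Two captured chunks in the documented shape
sample = '{"response": "Hello", "done": false}\n{"response": " world", "done": true}\n'
print(join_stream(sample))  # Hello world
```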
Chat Completions (OpenAI-Compatible)
This is the endpoint most enterprise applications will use, as it is compatible with any OpenAI SDK or library:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-nemo",
"messages": [
{"role": "system", "content": "You are a helpful compliance assistant specializing in EU regulations."},
{"role": "user", "content": "What are the main GDPR obligations for AI systems processing personal data?"}
],
"temperature": 0.3,
"max_tokens": 2048
}'
Response follows the standard OpenAI format:
{
"id": "chatcmpl-xxx",
"object": "chat.completion",
"created": 1711000000,
"model": "mistral-nemo",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Under the GDPR, AI systems processing personal data must..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 512,
"total_tokens": 554
}
}
Using the OpenAI Python SDK with Ollama
This is the recommended approach for Python applications — it gives you full compatibility with OpenAI's SDK while running everything locally:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required by SDK but not validated by Ollama
)
response = client.chat.completions.create(
model="mistral-nemo",
messages=[
{"role": "system", "content": "You are a contract analysis assistant."},
{"role": "user", "content": "Analyze this contract clause for potential risks: ..."}
],
temperature=0.2,
max_tokens=4096
)
print(response.choices[0].message.content)
The same pattern works with the OpenAI client libraries for Node.js, Go, Java, and other languages.
Embeddings
Generate vector embeddings for RAG (Retrieval-Augmented Generation) and semantic search:
curl http://localhost:11434/api/embeddings \
-d '{
"model": "nomic-embed-text",
"prompt": "This document discusses GDPR compliance requirements for AI systems."
}'
Response:
{
"embedding": [0.0123, -0.0456, 0.0789, ...]
}
The nomic-embed-text model produces 768-dimensional vectors and supports up to 8192 tokens of input. For enterprise RAG systems, this is often the first model you should deploy.
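Once you have embeddings, semantic search is just vector comparison. A minimal sketch using only the standard library (the endpoint URL matches the default local server; with no server running, only the pure `cosine_similarity` helper is usable):

```python
import json
import math
import urllib.request

OLLAMA_EMBED_URL = "http://localhost:11434/api/embeddings"  # default local endpoint

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch one embedding vector from a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(OLLAMA_EMBED_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity; nomic-embed-text vectors are 768-dim."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With a server running:
#   q = embed("EU data protection")
#   best = max(corpus, key=lambda d: cosine_similarity(q, embed(d)))
```

Production RAG systems delegate this to a vector database, but the underlying ranking operation is exactly this comparison.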
Model Management
# Download a model
ollama pull llama3.3:70b
# List installed models
ollama list
# Show model details (license, parameters, template)
ollama show llama3.3:70b
# Copy a model (create an alias)
ollama cp mistral-nemo my-company-assistant
# Remove a model
ollama rm llama3.3:70b
Structured Output (JSON Mode)
Force the model to respond in valid JSON, which is essential for enterprise integrations:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Extract the key entities from this text: The European Commission published the EU AI Act on March 13, 2024.",
"format": "json",
"stream": false
}'
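Even with format set to json, robust integrations validate the output before trusting it. A defensive parsing sketch (the fallback that salvages the outermost object is a pragmatic assumption, not documented Ollama behavior):

```python
import json

def parse_model_json(raw: str) -> dict:
    """Validate a JSON-mode response. format=json constrains the model
    to emit valid JSON, but defensive parsing still pays off."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: salvage the outermost {...} span if stray text slipped in
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise
        data = json.loads(raw[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data

print(parse_model_json('{"entities": ["European Commission", "EU AI Act"]}'))
```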
Docker Deployment — Development to Production
Docker is the standard deployment method for Ollama in production environments. It provides isolation, reproducibility, and easy integration with container orchestration platforms.
Single Container (CPU Only)
# docker-compose.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
restart: unless-stopped
    healthcheck:
      # The ollama/ollama image does not ship curl; probe with the CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
ollama_data:
Start it:
docker compose up -d
With NVIDIA GPU
Pre-requisites: Install the NVIDIA Container Toolkit on the host.
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Then use this Docker Compose configuration:
# docker-compose.gpu.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
environment:
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_FLASH_ATTENTION=1
restart: unless-stopped
    healthcheck:
      # The ollama/ollama image does not ship curl; probe with the CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
ollama_data:
docker compose -f docker-compose.gpu.yml up -d
Auto-Pull Models on Startup
Create a custom Dockerfile that pre-loads your required models:
FROM ollama/ollama:latest
# Pull models at build time
RUN ollama serve & sleep 5 && \
    ollama pull mistral-nemo && \
    ollama pull nomic-embed-text && \
    ollama pull phi4-mini && \
    kill $!
Build and use it:
docker build -t ollama-enterprise:latest .
Then reference ollama-enterprise:latest in your Docker Compose file instead of ollama/ollama:latest. This ensures models are available immediately on container start, which is critical for production deployments and air-gapped environments.
Docker Compose with Application Stack
A typical enterprise setup pairs Ollama with an application server and a vector database:
# docker-compose.production.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_HOST=0.0.0.0:11434
    healthcheck:
      # Required because the app service waits on condition: service_healthy
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
networks:
- ai-network
app:
build: .
ports:
- "8000:8000"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- CHROMADB_URL=http://chromadb:8000
depends_on:
ollama:
condition: service_healthy
restart: unless-stopped
networks:
- ai-network
chromadb:
image: chromadb/chroma:latest
volumes:
- chroma_data:/chroma/chroma
restart: unless-stopped
networks:
- ai-network
volumes:
ollama_data:
chroma_data:
networks:
ai-network:
driver: bridge
Kubernetes Deployment
For organizations running Kubernetes, deploying Ollama requires GPU-aware scheduling and persistent storage for model files.
Prerequisites
- NVIDIA GPU Operator installed in your cluster (for NVIDIA GPUs)
- Persistent storage provisioner (for model caching)
- GPU node pool with appropriate taints and labels
Deployment Manifest
# ollama-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: ai-services
---
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ollama-pvc
namespace: ai-services
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: fast-ssd
---
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: ai-services
labels:
app: ollama
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- containerPort: 11434
name: http
env:
- name: OLLAMA_NUM_PARALLEL
value: "4"
- name: OLLAMA_FLASH_ATTENTION
value: "1"
resources:
requests:
memory: "8Gi"
cpu: "4"
nvidia.com/gpu: "1"
limits:
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "1"
volumeMounts:
- name: ollama-data
mountPath: /root/.ollama
readinessProbe:
httpGet:
path: /api/tags
port: 11434
initialDelaySeconds: 10
periodSeconds: 15
livenessProbe:
httpGet:
path: /api/tags
port: 11434
initialDelaySeconds: 30
periodSeconds: 30
volumes:
- name: ollama-data
persistentVolumeClaim:
claimName: ollama-pvc
nodeSelector:
nvidia.com/gpu.present: "true"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: ai-services
spec:
selector:
app: ollama
ports:
- port: 11434
targetPort: 11434
name: http
type: ClusterIP
Apply:
kubectl apply -f ollama-namespace.yaml
kubectl apply -f ollama-pvc.yaml
kubectl apply -f ollama-deployment.yaml
Horizontal Scaling with Multiple Replicas
For high-throughput workloads, deploy multiple Ollama pods behind a load balancer. Each pod needs its own GPU:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-pool
namespace: ai-services
spec:
replicas: 3
selector:
matchLabels:
app: ollama-pool
template:
metadata:
labels:
app: ollama-pool
spec:
containers:
- name: ollama
image: ollama-enterprise:latest # Pre-loaded models
ports:
- containerPort: 11434
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
This requires 3 GPU nodes. Use the Kubernetes Horizontal Pod Autoscaler (HPA) with custom metrics (e.g., request queue depth) for dynamic scaling.
Init Container for Model Loading
If you do not want to build a custom image, use an init container to pull models before the main container starts:
initContainers:
- name: model-loader
image: ollama/ollama:latest
command: ["/bin/sh", "-c"]
args:
- |
ollama serve &
sleep 5
ollama pull mistral-nemo
ollama pull nomic-embed-text
    kill $!
volumeMounts:
- name: ollama-data
mountPath: /root/.ollama
Security Hardening for Enterprise and Air-Gapped Deployments
Ollama does not include built-in authentication or authorization. For enterprise deployments, you must layer security on top of the Ollama service.
Network Isolation
By default, Ollama listens on 127.0.0.1:11434 (localhost only). This is secure for single-server deployments. If you need to expose it to other hosts:
# Bind to all interfaces (use only behind a reverse proxy)
OLLAMA_HOST=0.0.0.0:11434 ollama serve
For systemd service configuration:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Nginx Reverse Proxy with Authentication
Place Ollama behind nginx with TLS and authentication:
upstream ollama_backend {
server 127.0.0.1:11434;
keepalive 32;
}
server {
listen 443 ssl http2;
server_name ai.internal.company.com;
ssl_certificate /etc/nginx/certs/internal.crt;
ssl_certificate_key /etc/nginx/certs/internal.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Basic authentication
auth_basic "Ollama AI Service";
auth_basic_user_file /etc/nginx/.ollama_htpasswd;
# Rate limiting
limit_req zone=ollama burst=20 nodelay;
# Request size limit (important for large prompts)
client_max_body_size 50m;
location / {
proxy_pass http://ollama_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Disable buffering for streaming responses
proxy_buffering off;
proxy_cache off;
# Long timeout for generation requests
proxy_read_timeout 600s;
proxy_send_timeout 600s;
}
}
Rate limiting configuration:
# In http block
limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=ollama_generate:10m rate=2r/s;
# In server block
location /api/generate {
limit_req zone=ollama_generate burst=5 nodelay;
proxy_pass http://ollama_backend;
}
location /v1/chat/completions {
limit_req zone=ollama_generate burst=5 nodelay;
proxy_pass http://ollama_backend;
}
location /api/ {
limit_req zone=ollama burst=20 nodelay;
proxy_pass http://ollama_backend;
}
Air-Gapped Deployment
For environments with no internet access (defense, critical infrastructure, sensitive financial systems), follow this checklist:
1. Prepare models on a connected machine:
# Download all required models
ollama pull mistral-nemo
ollama pull nomic-embed-text
ollama pull phi4-mini
# Models are stored in ~/.ollama/models/
ls -la ~/.ollama/models/
2. Package the Ollama binary:
# Download the binary
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama-binary
chmod +x ollama-binary
3. Transfer to the air-gapped environment:
# Copy models
rsync -av ~/.ollama/models/ /media/transfer-drive/ollama-models/
# Copy binary
cp ollama-binary /media/transfer-drive/
# On the air-gapped server:
cp /media/transfer-drive/ollama-binary /usr/local/bin/ollama
mkdir -p ~/.ollama/models/
rsync -av /media/transfer-drive/ollama-models/ ~/.ollama/models/
4. Configure for air-gapped operation:
# Bind to localhost only; Ollama needs no outbound access once models are local
OLLAMA_HOST=127.0.0.1:11434 ollama serve
5. Verify models are available:
ollama list
# Should show all pre-loaded models
Audit Logging
Ollama itself does not produce detailed audit logs. For enterprise compliance, implement logging at the reverse proxy layer:
log_format ollama_audit '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'rt=$request_time';
access_log /var/log/nginx/ollama_audit.log ollama_audit;
For more granular logging (capturing prompt content for compliance), implement an API gateway layer between your application and Ollama.
Performance Optimization
Optimizing Ollama for production workloads involves tuning concurrency, context windows, memory management, and hardware utilization.
Concurrent Request Handling
By default, Ollama processes one request at a time per model. For multi-user environments, enable parallel processing:
# Allow 4 concurrent requests per model
OLLAMA_NUM_PARALLEL=4 ollama serve
More parallel requests mean more VRAM usage — each concurrent request maintains its own KV cache. Monitor VRAM usage and reduce OLLAMA_NUM_PARALLEL if you see out-of-memory errors.
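On the client side, fanning out requests is straightforward with a thread pool. A stdlib-only sketch against the OpenAI-compatible endpoint (assumes a local server with OLLAMA_NUM_PARALLEL at least as large as the worker count; excess requests simply queue server-side):

```python
import json
import concurrent.futures
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"  # default local port

def build_payload(prompt: str, model: str = "mistral-nemo") -> bytes:
    """Minimal OpenAI-style chat request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

def ask(prompt: str) -> str:
    req = urllib.request.Request(OLLAMA_CHAT_URL, data=build_payload(prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def ask_many(prompts: list[str], workers: int = 4) -> list[str]:
    # Keep workers <= OLLAMA_NUM_PARALLEL to avoid pure queueing
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))
```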
Context Window Tuning
The default context window varies by model (often 2048 or 4096 tokens). For enterprise workloads involving long documents, you will often need more:
# Run with extended context (32K tokens); set num_ctx from inside the session
ollama run mistral-nemo
>>> /set parameter num_ctx 32768
Via the API:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Summarize this long document: ...",
"options": {
"num_ctx": 32768
}
}'
Larger context windows consume proportionally more VRAM. A 7B model at 4K context uses roughly 5GB VRAM; the same model at 32K context may use 8-10GB.
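The growth comes almost entirely from the KV cache, which scales linearly with context length. A rough estimator (the default layer count, KV-head count, and head dimension below are assumed values for a Mistral-Nemo-class 12B model, with fp16 cache elements):

```python
def kv_cache_gib(n_ctx: int, n_layers: int = 40, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size: two tensors (K and V) per layer, one
    (n_kv_heads x head_dim) vector per token, fp16 elements by default."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return round(per_token * n_ctx / 2**30, 2)

print(kv_cache_gib(4096))    # ~0.6 GiB at 4K context
print(kv_cache_gib(32768))   # ~5 GiB at 32K context
```

That several-GiB delta between 4K and 32K context is consistent with the 3-5 GB increase noted above.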
Flash Attention
Flash Attention significantly reduces memory usage and improves speed for long context windows:
OLLAMA_FLASH_ATTENTION=1 ollama serve
This is enabled by default on supported hardware (most modern NVIDIA and Apple Silicon GPUs). Flash Attention is particularly beneficial for context windows above 8K tokens.
Model Preloading and Keep-Alive
By default, Ollama unloads models from memory after 5 minutes of inactivity. For production services, keep models loaded:
# Keep model loaded for 1 hour after last request
curl http://localhost:11434/api/generate \
-d '{"model": "mistral-nemo", "prompt": "", "keep_alive": "1h"}'
# Keep model loaded indefinitely
curl http://localhost:11434/api/generate \
-d '{"model": "mistral-nemo", "prompt": "", "keep_alive": -1}'
This eliminates the cold-start latency (which can be several seconds for large models) on the first request after idle.
Environment Variable Reference
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address and port |
| OLLAMA_NUM_PARALLEL | 1 | Max concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | 1 | Max models loaded simultaneously |
| OLLAMA_FLASH_ATTENTION | 1 | Enable Flash Attention |
| OLLAMA_KEEP_ALIVE | 5m | Default model keep-alive duration |
| OLLAMA_MAX_QUEUE | 512 | Max queued requests |
| OLLAMA_MODELS | ~/.ollama/models | Model storage path |
| CUDA_VISIBLE_DEVICES | all | GPU selection for NVIDIA |
Benchmark Results: Tokens Per Second by Hardware
| Hardware | Model | Quantization | Tokens/sec (generation) |
|---|---|---|---|
| MacBook Pro M3 Max 128GB | Llama 3.3 70B | Q4_K_M | 18-25 |
| MacBook Pro M4 Max 128GB | Llama 3.3 70B | Q4_K_M | 22-30 |
| NVIDIA RTX 4090 24GB | Mistral Nemo 12B | Q4_K_M | 65-85 |
| NVIDIA RTX 4090 24GB | Qwen 2.5 7B | Q4_K_M | 90-120 |
| NVIDIA A100 80GB | Llama 3.3 70B | Q4_K_M | 35-50 |
| 2x NVIDIA A100 80GB NVLink | Llama 3.3 70B | Q4_K_M | 60-90 |
| NVIDIA H100 80GB | Llama 3.3 70B | Q4_K_M | 55-80 |
| CPU only (AMD Ryzen 9 7950X) | Phi-4-mini 3.8B | Q4_K_M | 3-8 |
| CPU only (Intel i9-14900K) | Phi-4-mini 3.8B | Q4_K_M | 4-10 |
These are approximate generation speeds (output tokens). Prompt processing (input tokens) is typically 2-5x faster than generation.
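These figures translate directly into end-to-end latency budgets. A small estimator based on the table (the 3x prefill multiplier is an assumed midpoint of the 2-5x range above):

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       gen_tps: float, prefill_multiplier: float = 3.0) -> float:
    """End-to-end latency estimate: prompt prefill runs roughly 2-5x
    faster than generation (3x assumed here), then tokens stream out
    at the generation rate."""
    prefill = prompt_tokens / (gen_tps * prefill_multiplier)
    generate = output_tokens / gen_tps
    return round(prefill + generate, 1)

# Llama 3.3 70B on a single A100 at ~40 output tokens/sec
print(estimate_latency_s(prompt_tokens=2000, output_tokens=500, gen_tps=40))  # ~29 s
```

Useful when deciding whether a given model/GPU pairing can meet an SLA before buying hardware.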
Modelfile — Custom Models and System Prompts
Ollama's Modelfile system lets you create custom model configurations with persistent system prompts, parameter settings, and templates. This is how enterprise teams standardize model behavior.
Basic Modelfile
# Modelfile for an EU compliance assistant
FROM mistral-nemo
SYSTEM """You are an EU AI Act compliance expert for Hyperion Consulting.
You only provide information relevant to the EU AI Act, GDPR, and ISO 42001.
Always cite specific articles and clauses when referencing regulations.
Never provide legal advice — always recommend consulting qualified legal counsel.
Respond in the language of the user's query."""
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
PARAMETER top_p 0.9
Build and run:
ollama create compliance-assistant -f Modelfile
ollama run compliance-assistant
Advanced Modelfile: Code Review Assistant
FROM qwen2.5-coder:32b
SYSTEM """You are a senior code reviewer at a fintech company.
Review code for:
1. Security vulnerabilities (OWASP Top 10)
2. Performance issues
3. Code style and maintainability
4. Potential bugs and edge cases
Always provide specific line references and suggest concrete fixes.
Rate severity as: CRITICAL, HIGH, MEDIUM, LOW, INFO."""
PARAMETER temperature 0.1
PARAMETER num_ctx 32768
Modelfile for Structured Data Extraction
FROM phi4-mini
SYSTEM """You are a data extraction system. Given any input text, extract structured information and return it as valid JSON. Never include explanations outside the JSON structure. If a field cannot be determined from the input, use null."""
PARAMETER temperature 0.0
PARAMETER num_ctx 8192
Managing Custom Models
# List all models (including custom ones)
ollama list
# Show the Modelfile for an existing model
ollama show compliance-assistant --modelfile
# Remove a custom model
ollama rm compliance-assistant
# Copy/rename a model
ollama cp compliance-assistant compliance-v2
Custom models share the base model weights — creating a custom model from mistral-nemo only stores the Modelfile configuration, not a second copy of the weights.
Integration with LangChain and LlamaIndex
Ollama integrates natively with the two most popular LLM application frameworks. This makes it straightforward to build RAG systems, agents, and complex AI workflows running entirely on local hardware.
LangChain Integration
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Initialize LLM
llm = ChatOllama(
model="mistral-nemo",
base_url="http://localhost:11434",
temperature=0.3
)
# Initialize embeddings
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://localhost:11434"
)
# Build a simple chain
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant that answers questions about {topic}."),
("user", "{question}")
])
chain = prompt | llm | StrOutputParser()
result = chain.invoke({
"topic": "EU AI Act compliance",
"question": "What are the requirements for high-risk AI systems?"
})
print(result)
RAG with LangChain and Ollama:
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# Load and chunk documents (PyPDFLoader requires the pypdf package;
# llm and embeddings are reused from the previous snippet)
loader = PyPDFLoader("eu-ai-act.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# Create vector store with Ollama embeddings
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)
result = qa_chain.invoke("What are the penalties for non-compliance with the EU AI Act?")
print(result["result"])
LlamaIndex Integration
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
# Configure Ollama as the default LLM and embedding model
Settings.llm = Ollama(
model="mistral-nemo",
request_timeout=120.0,
base_url="http://localhost:11434"
)
Settings.embed_model = OllamaEmbedding(
model_name="nomic-embed-text",
base_url="http://localhost:11434"
)
# Load documents and create index
documents = SimpleDirectoryReader("./compliance-docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query(
"What documentation is required for high-risk AI systems under the EU AI Act?"
)
print(response)
Direct OpenAI SDK Integration
For simpler use cases that do not need a framework, the OpenAI SDK works directly:
from openai import OpenAI
import json
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama"
)
# Streaming response
stream = client.chat.completions.create(
model="mistral-nemo",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the EU AI Act risk categories."}
],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Monitoring Ollama in Production
Production deployments need observability. While Ollama does not natively export Prometheus metrics, there are effective approaches to monitoring.
Health Checks
The simplest health check queries the model list endpoint:
# Returns 200 if Ollama is running and responsive
curl -sf http://localhost:11434/api/tags > /dev/null && echo "healthy" || echo "unhealthy"
For Docker and Kubernetes, use this as your health check endpoint. A more thorough check verifies the model can actually generate:
curl -sf http://localhost:11434/api/generate \
-d '{"model": "mistral-nemo", "prompt": "ping", "stream": false}' \
| jq -r '.response' > /dev/null && echo "model healthy" || echo "model unhealthy"
Systemd Logs
On Linux with systemd:
# Follow Ollama logs in real time
journalctl -u ollama -f
# View logs from the last hour
journalctl -u ollama --since "1 hour ago"
# View only errors
journalctl -u ollama -p err
GPU Monitoring
For NVIDIA GPUs, monitor VRAM usage and GPU utilization:
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Log GPU stats to file
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu --format=csv -l 5 >> /var/log/gpu-stats.csv
Prometheus and Grafana
While Ollama does not expose a native /metrics endpoint, you can use community exporters or build a lightweight custom exporter:
# ollama_exporter.py — Simple Prometheus exporter
from prometheus_client import start_http_server, Gauge
import requests
import time

# /api/tags reports installed models; use /api/ps if you need currently loaded ones
ollama_models_available = Gauge('ollama_models_available', 'Number of models reported by /api/tags')
ollama_health = Gauge('ollama_health', 'Ollama health status (1=healthy, 0=unhealthy)')

def collect_metrics():
    try:
        resp = requests.get('http://localhost:11434/api/tags', timeout=5)
        if resp.status_code == 200:
            models = resp.json().get('models', [])
            ollama_models_available.set(len(models))
            ollama_health.set(1)
        else:
            ollama_health.set(0)
    except Exception:
        ollama_health.set(0)

if __name__ == '__main__':
    start_http_server(9091)
    while True:
        collect_metrics()
        time.sleep(15)
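With the exporter listening on port 9091, the matching Prometheus scrape job is a few lines of configuration (job name and interval here are illustrative, adjust to your setup):

```yaml
scrape_configs:
  - job_name: "ollama"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9091"]
```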
Alerting
Combine health checks with your existing alerting infrastructure. A simple cron-based approach:
#!/bin/bash
# /usr/local/bin/ollama-health-check.sh
if ! curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
echo "Ollama is down at $(date)" | mail -s "ALERT: Ollama Down" [email protected]
# Or send to Slack/Teams/Telegram
fi
Add to crontab:
*/5 * * * * /usr/local/bin/ollama-health-check.sh
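If your team lives in Slack rather than email, the same check can post to an incoming webhook instead. A stdlib-only sketch (the webhook URL is a placeholder you obtain from Slack's app configuration):

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_alert_payload(message: str) -> dict:
    """Build the minimal Slack incoming-webhook payload."""
    return {"text": message}

def send_slack_alert(message: str) -> None:
    """POST the alert to the Slack webhook (no third-party dependencies)."""
    data = json.dumps(build_alert_payload(message)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```

Call `send_slack_alert(f"Ollama is down at {...}")` from the health-check script in place of the `mail` command.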
Ollama vs. Alternatives: When to Use What
Ollama is excellent for most enterprise local LLM use cases, but it is not the only option. Understanding when to use alternatives helps you make the right architecture decision.
| Tool | Best For | Not Ideal For |
|---|---|---|
| Ollama | General-purpose local LLM, development, small-to-medium production | Ultra-high-throughput (>100 concurrent users) |
| vLLM | High-throughput production serving, continuous batching | Quick setup, development |
| llama.cpp server | Maximum control, custom quantization | Ease of use, model management |
| TGI (Text Generation Inference) | HuggingFace model ecosystem, production serving | Simple local development |
| LocalAI | OpenAI API compatibility with multiple backends | Single-model performance |
Ollama wins on developer experience and operational simplicity. If you need to serve hundreds of concurrent users with maximum throughput, consider vLLM. For most enterprise use cases — development, testing, internal tools, and moderate-traffic production services — Ollama is the right choice.
Frequently Asked Questions
1. How does Ollama model quality compare to GPT-4 or Claude?
Open-source models running through Ollama have improved dramatically. Llama 3.3 70B and Qwen 2.5 72B are competitive with GPT-4-turbo on many benchmarks, particularly for structured tasks like summarization, extraction, and code generation. For complex reasoning, multi-step analysis, and creative writing, proprietary models like GPT-4o and Claude still hold an edge. The practical approach is to use Ollama for tasks where open-source models perform well (80%+ of enterprise use cases) and reserve API calls for the remaining complex tasks.
2. How much VRAM do I actually need?
The rule of thumb: a Q4-quantized model requires approximately 0.6 GB of VRAM per billion parameters, plus overhead for the KV cache. So a 7B model needs about 5 GB, a 14B model about 9 GB, and a 70B model about 42 GB. If your GPU does not have enough VRAM, Ollama automatically offloads some layers to CPU, which works but is significantly slower.
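This rule of thumb translates directly into a back-of-the-envelope calculator (the fixed 1 GB overhead term is an assumption covering KV cache and runtime buffers; real usage varies with context length and concurrency):

```python
def estimate_vram_gb(params_billion: float, overhead_gb: float = 1.0) -> float:
    """Rough Q4 VRAM estimate: ~0.6 GB per billion parameters plus overhead."""
    return round(0.6 * params_billion + overhead_gb, 1)

for size in (7, 14, 70):
    print(f"{size}B model: ~{estimate_vram_gb(size)} GB VRAM")
```

Compare the result against your GPU's total VRAM before pulling a model, leaving headroom if you plan to serve concurrent requests.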
3. Can Ollama use multiple GPUs?
Yes. Ollama automatically distributes model layers across multiple NVIDIA GPUs. If you have two RTX 3090s (24 GB each = 48 GB total), Ollama can run a 70B Q4 model by splitting layers between them. NVLink is not required but provides better performance for multi-GPU setups. For AMD multi-GPU, support depends on the ROCm version.
4. How do I update models to newer versions?
Pull the model again to get the latest version:
ollama pull mistral-nemo
Ollama uses a Docker-like layer system — if you already have most layers, only the diff is downloaded. You can automate this with a cron job or a weekly maintenance script.
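A maintenance script needs the list of installed models, which `ollama list` prints as a table whose first column is the model name (the exact column layout assumed here matches current CLI output but may change between versions). A sketch that parses it and re-pulls every model:

```python
import subprocess

def parse_model_names(ollama_list_output: str) -> list[str]:
    """Extract model names (first column) from `ollama list` table output."""
    lines = ollama_list_output.strip().splitlines()
    # Skip the header row (NAME  ID  SIZE  MODIFIED)
    return [line.split()[0] for line in lines[1:] if line.strip()]

def update_all_models() -> None:
    """Re-pull every installed model; only changed layers are downloaded."""
    out = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    ).stdout
    for name in parse_model_names(out):
        subprocess.run(["ollama", "pull", name], check=True)
```

Run `update_all_models()` from a weekly cron or systemd timer to keep the model cache current.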
5. Ollama vs. vLLM — which should I choose?
Use Ollama for: development, testing, simple production deployments, air-gapped environments, teams that want simplicity. Use vLLM for: high-throughput production serving with continuous batching, environments already running Python-heavy ML infrastructure, when you need maximum tokens-per-second per GPU. Ollama is simpler to operate; vLLM extracts more performance from the same hardware at higher concurrency.
6. How many concurrent users can Ollama handle?
With OLLAMA_NUM_PARALLEL=4, a single Ollama instance can handle 4 simultaneous generation requests. Each concurrent request adds VRAM overhead for the KV cache. For a 7B model on an RTX 4090, you can comfortably serve 4-8 concurrent users. For higher concurrency, deploy multiple Ollama instances behind a load balancer, each with its own GPU.
7. Can multiple users share the same Ollama instance?
Yes. Ollama's API is stateless — multiple clients can send requests to the same instance. Use OLLAMA_NUM_PARALLEL to control concurrency, and put nginx in front for authentication and rate limiting. Model weights are shared in memory across all concurrent requests; only the KV cache is per-request.
8. How do I create custom Modelfiles for my organization?
Create a Modelfile with your system prompt, parameters, and base model. See the Modelfile section above for detailed examples. Custom Modelfiles are the recommended way to standardize model behavior across your organization — version-control your Modelfiles alongside your application code.
9. Does Ollama work with Windows WSL2?
Yes. Install the Linux version of Ollama inside WSL2. NVIDIA GPU passthrough works automatically if you have the Windows NVIDIA drivers installed (WSL2-specific CUDA drivers are no longer needed with recent driver versions). Performance is nearly identical to native Linux. This is actually the recommended approach for Windows development environments.
10. How do I add authentication to the Ollama API?
Ollama does not include built-in authentication. The recommended approach is to place Ollama behind a reverse proxy (nginx, Caddy, or Traefik) with authentication. Options include: HTTP Basic Auth for simple setups, OAuth2 Proxy for SSO integration, mutual TLS (mTLS) for service-to-service authentication, or an API gateway like Kong for enterprise-grade access control. See the Security Hardening section above for nginx configuration examples.
Conclusion: Your Ollama Enterprise Deployment Checklist
Ollama has matured into a production-ready platform for running open-source LLMs on your own infrastructure. Here is a summary checklist for enterprise deployment:
Planning:
- Identify your use cases and select appropriate models
- Calculate VRAM requirements based on model sizes and concurrency needs
- Review model licenses for your commercial use case
- Decide deployment topology: single server, Docker, or Kubernetes
Infrastructure:
- Provision GPU hardware (NVIDIA recommended for production)
- Install Ollama and verify GPU detection
- Pre-pull all required models
- Configure persistent storage for model cache
Security:
- Bind Ollama to localhost (never expose directly to the network)
- Deploy nginx reverse proxy with TLS and authentication
- Implement rate limiting at the proxy layer
- Enable audit logging for compliance
- For air-gapped: prepare offline model packages
Performance:
- Enable Flash Attention
- Set OLLAMA_NUM_PARALLEL based on your GPU capacity
- Configure model keep-alive for production services
- Tune context window size for your workload
Operations:
- Set up health checks and monitoring
- Configure GPU utilization alerting
- Document model update procedures
- Establish backup procedures for custom Modelfiles
Ollama eliminates the complexity of local LLM deployment while giving you full control over your AI infrastructure — no data leaves your premises, no per-token costs, and no vendor lock-in. For European enterprises navigating GDPR, the EU AI Act, and data sovereignty requirements, it is the most practical path to production AI.
