What is Ollama and Why It's Transforming Enterprise AI
Ollama has become the de facto standard for running large language models locally. With over 90,000 GitHub stars and a rapidly growing ecosystem, it has earned that position for good reason: it abstracts away the complexity of model quantization, GPU memory management, and inference server configuration into a single binary that just works.
At its core, Ollama lets you run any supported model with a single command:
ollama run mistral-nemo
That one command downloads the model, loads it into GPU memory (or CPU if no GPU is available), and opens an interactive chat session. Behind the scenes, Ollama manages model quantization via llama.cpp, GPU layer allocation, context window sizing, and KV cache management.
Why Enterprise Teams Are Adopting Ollama
Data sovereignty and GDPR compliance. When you run models through Ollama, your data never leaves your infrastructure. Every prompt, every response, every document you process stays on your hardware. For European enterprises subject to GDPR, the Digital Services Act, and the EU AI Act, this is not just a nice-to-have — it is a regulatory requirement for many use cases involving personal data or sensitive business information.
OpenAI-compatible API. Ollama exposes an API endpoint at /v1/chat/completions that is wire-compatible with the OpenAI API specification. This means any application, library, or framework built for the OpenAI API can be pointed at Ollama with a single configuration change — swap the base URL from https://api.openai.com/v1 to http://localhost:11434/v1 and you are running locally. No code changes required.
Model diversity. Ollama supports over 100 models from every major open-source provider: Meta's Llama 3.3 (70B) and Llama 3.1 (8B), Mistral's Nemo and Large, Google's Gemma 2 and 3, Microsoft's Phi-4, Alibaba's Qwen 2.5 series, DeepSeek's V3 and R1, and many more. New models are typically available within days of release.
Use cases driving adoption:
- Development and testing — Run models locally during development without API costs or rate limits
- Air-gapped production — Deploy in environments with no internet access (defense, finance, healthcare)
- GDPR-compliant processing — Process personal data without third-party data transfers
- Edge deployment — Run inference on edge devices, factory floors, or retail locations
- Cost optimization — Eliminate per-token API costs for high-volume workloads
- Latency-sensitive applications — Sub-100ms first-token latency on local hardware
Current state (March 2026). Ollama is at version 0.5.x, with the latest releases bringing improved multi-GPU support, expanded model compatibility, better memory management, and native support for structured outputs. The project is maintained by a dedicated team and receives frequent updates, typically multiple releases per month.
Installation Guide — Every Platform
macOS
Ollama runs natively on macOS with full Apple Silicon support. Installation takes under a minute.
Via direct download (recommended):
Download the macOS app from ollama.com/download. The app installs the ollama CLI and runs the Ollama server as a background service accessible from the menu bar.
Via Homebrew:
brew install ollama
Note that the curl install script (curl -fsSL https://ollama.com/install.sh | sh) targets Linux; on macOS, use the app or Homebrew.
After installation, verify it works:
ollama --version
# ollama version is 0.5.x
Linux (Ubuntu/Debian/RHEL)
Linux is the primary platform for production Ollama deployments. The install script handles everything including systemd service setup.
curl -fsSL https://ollama.com/install.sh | sh
This script will:
- Detect your Linux distribution and architecture
- Download the appropriate binary
- Create the ollama system user and group
- Install and enable the ollama systemd service
- Detect NVIDIA or AMD GPUs and configure drivers if needed
Verify the installation:
ollama --version
systemctl status ollama
You should see the ollama service running and active. If it is not running, start it:
sudo systemctl enable ollama
sudo systemctl start ollama
Manual installation (for air-gapped environments):
# Download the binary on a connected machine
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
# Transfer to air-gapped server, then:
chmod +x ollama
sudo mv ollama /usr/local/bin/
# Create system user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
# Create systemd service (see Ollama docs for full unit file)
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
Windows
Ollama supports Windows natively with GPU acceleration:
- Download the installer from ollama.com/download
- Run the installer — it sets up Ollama as a background service
- Open a terminal (PowerShell or Command Prompt) and run:
ollama --version
ollama run phi4-mini
For NVIDIA GPU support on Windows, ensure you have the latest NVIDIA drivers installed (Game Ready or Studio drivers both work). Ollama will auto-detect the GPU.
For WSL2 users, you can also install the Linux version inside WSL2, which provides GPU passthrough from Windows NVIDIA drivers automatically.
Run Your First Model
With Ollama installed, pull and run a model:
# Small model — runs on any hardware (3.8B parameters)
ollama run phi4-mini
# Medium model — needs 5GB+ VRAM (7B parameters)
ollama run qwen2.5:7b
# Large model — needs 8GB+ VRAM (12B parameters)
ollama run mistral-nemo
# Extra large — needs 40GB+ VRAM (70B parameters)
ollama run llama3.3:70b
The first run downloads the model (this may take several minutes depending on your connection speed). Subsequent runs load the model from the local cache in seconds.
GPU Setup — NVIDIA, AMD, Apple Silicon
GPU acceleration is what makes Ollama practical for production workloads. Without a GPU, even a 7B model generates text at just 3-8 tokens per second. With a modern GPU, that same model produces 50-100+ tokens per second.
NVIDIA (CUDA)
NVIDIA GPUs are the best-supported option for Ollama. If you have NVIDIA drivers and CUDA installed, Ollama detects and uses your GPU automatically.
Verify your GPU is detected:
nvidia-smi
You should see your GPU model, driver version, and CUDA version. Ollama requires CUDA compute capability 5.0 or higher (Maxwell architecture and newer — basically any GPU from 2014 onward).
VRAM requirements by model size:
| Model Size | Quantization | VRAM Required | Recommended GPU |
|---|---|---|---|
| 3.8B (Phi-4-mini) | Q4_K_M | 3 GB | Any recent GPU |
| 7B (Qwen 2.5, Llama 3.1) | Q4_K_M | 5 GB | RTX 3080, RTX 4070 |
| 14B (Qwen 2.5) | Q4_K_M | 9 GB | RTX 3090, RTX 4080 |
| 32B (Qwen 2.5 Coder) | Q4_K_M | 20 GB | RTX 3090 24GB |
| 70B (Llama 3.3) | Q4_K_M | 40 GB | 2x RTX 3090 or A100 40GB |
| 70B (Llama 3.3) | Q8_0 | 70+ GB | H100 80GB |
Multi-GPU support:
Ollama automatically splits models across multiple GPUs when a single GPU does not have enough VRAM. For a 70B model on two RTX 3090 cards (24GB each = 48GB total), Ollama will distribute the model layers across both GPUs.
# Verify multi-GPU detection
nvidia-smi -L
# GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxx...)
# GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-yyy...)
# Ollama uses both automatically
ollama run llama3.3:70b
To restrict Ollama to specific GPUs:
CUDA_VISIBLE_DEVICES=0,1 ollama serve
AMD (ROCm)
AMD GPU support requires ROCm (Radeon Open Compute) on Linux. Supported GPUs include the Radeon RX 7900 XTX, RX 7900 XT, Radeon PRO W7900, and Instinct MI250/MI300 series.
Install ROCm:
# Ubuntu 22.04 "jammy" (adjust the path for your release; check repo.radeon.com for the current installer version)
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo apt install ./amdgpu-install*.deb
sudo amdgpu-install --usecase=rocm
# Add user to render and video groups
sudo usermod -aG render,video $USER
# Reboot
sudo reboot
After ROCm is installed and you have rebooted, Ollama will automatically detect AMD GPUs on Linux. Verify with:
rocminfo | grep "Name:"
Performance on AMD GPUs is generally 70-85% of equivalent NVIDIA GPUs due to the maturity difference between ROCm and CUDA, but the gap has been closing steadily.
Apple Silicon (Metal Performance Shaders)
Ollama has first-class Apple Silicon support. On any Mac with an M1, M2, M3, or M4 chip, Ollama uses Metal Performance Shaders (MPS) for GPU acceleration with zero additional configuration.
The key advantage of Apple Silicon is unified memory — the CPU and GPU share the same memory pool. This means a MacBook Pro M3 Max with 128GB of unified memory can run a 70B model comfortably, something that would require multiple discrete GPUs on a traditional PC.
Approximate performance on Apple Silicon:
| Chip | RAM | Model | Tokens/sec |
|---|---|---|---|
| M1 Pro (16GB) | 16 GB | Phi-4-mini 3.8B | 25-35 |
| M2 Max (32GB) | 32 GB | Mistral Nemo 12B | 20-30 |
| M3 Max (64GB) | 64 GB | Llama 3.3 70B Q4 | 12-18 |
| M3 Max (128GB) | 128 GB | Llama 3.3 70B Q4 | 18-25 |
| M4 Max (128GB) | 128 GB | Llama 3.3 70B Q4 | 22-30 |
No additional drivers, frameworks, or configuration needed — just install Ollama and run.
CPU-Only Deployments
Ollama works without any GPU, using CPU inference via the llama.cpp backend. This is viable for smaller models and lower-throughput use cases.
Best models for CPU-only deployment:
- Phi-4-mini (3.8B) — Best quality-to-size ratio, 3-8 tokens/sec on modern CPU
- SmolLM2 (1.7B) — Ultra-lightweight, fast on CPU
- Qwen 2.5 (1.5B) — Surprisingly capable for its size
- Gemma 2 (2B) — Good for simple classification and extraction tasks
For CPU inference, having fast RAM (DDR5) and a high core count (12+ cores) makes a meaningful difference. AVX-512 instruction support (Intel Ice Lake+, AMD Zen 4+) provides a noticeable speedup.
Ollama Model Library — What to Use When
Choosing the right model is critical. The Ollama model library hosts hundreds of models, but for enterprise use, you want to select based on your specific requirements: task type, hardware constraints, licensing, and language support.
Recommended Models by Use Case
| Use Case | Recommended Model | Size | Why |
|---|---|---|---|
| General enterprise assistant | llama3.3:70b | 40 GB | Best overall open-source quality, strong reasoning |
| EU sovereign / multilingual | mistral-nemo | 8 GB | Apache 2.0 license, excellent European language support |
| Code generation & review | qwen2.5-coder:32b | 20 GB | Best open-source code model, supports 90+ languages |
| Code generation (low VRAM) | qwen2.5-coder:7b | 5 GB | Strong code capabilities in smaller footprint |
| Reasoning & math | deepseek-r1:70b | 40 GB | Chain-of-thought reasoning, strong at complex tasks |
| Reasoning (low VRAM) | deepseek-r1:14b | 9 GB | Good reasoning in a more accessible size |
| Low VRAM / edge deployment | phi4-mini | 3 GB | Microsoft's best small model, excellent quality/size ratio |
| Text embeddings | nomic-embed-text | 274 MB | Fast, high quality, 8192 token context |
| Vision / multimodal | llava:13b | 8 GB | Image understanding and description |
| Summarization | mistral-nemo | 8 GB | Strong at condensing long documents |
| Classification & extraction | phi4-mini | 3 GB | Fast inference, good structured output |
| Translation | qwen2.5:14b | 9 GB | Excellent multilingual capabilities |
Model Naming Conventions
Ollama uses a name:tag format:
- mistral-nemo — Default quantization (usually Q4_K_M)
- llama3.3:70b — Specific size variant
- llama3.3:70b-instruct-q4_K_M — Specific quantization
- llama3.3:70b-instruct-q8_0 — Higher-quality quantization (more VRAM)
Lower quantization (Q4) uses less VRAM but slightly reduces quality. For most enterprise use cases, Q4_K_M provides the best balance of quality and resource usage. Use Q8 only when you have ample VRAM and need maximum quality.
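As a sanity check when sizing hardware, the VRAM figures quoted throughout this guide can be approximated with a back-of-the-envelope formula. The 4.5 bits/weight average for Q4_K_M and the 20% runtime overhead factor below are rough assumptions for illustration, not Ollama internals:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float,
                            overhead: float = 1.2) -> float:
    """Rule of thumb: parameter count x bits per weight / 8 bytes,
    plus ~20% headroom for runtime buffers and a modest KV cache."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# Q4_K_M averages roughly 4.5 bits per weight; Q8_0 roughly 8.5
print(estimate_weight_vram_gb(12, 4.5))   # Mistral Nemo 12B at Q4: ~8 GB
print(estimate_weight_vram_gb(70, 4.5))   # Llama 3.3 70B at Q4: ~47 GB
```

Actual usage varies with context window size and concurrency, so treat the result as a lower bound when provisioning.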
Licensing Considerations
For enterprise deployment, model licensing matters:
- Apache 2.0 (fully permissive): Mistral Nemo, Qwen 2.5 (most sizes)
- MIT: Phi-4
- Gemma Terms of Use: Gemma 2 and 3 (permissive, but with use restrictions)
- Llama 3.3 Community License: Free for companies under 700M monthly active users, requires attribution
- DeepSeek License: Open for research and commercial use with some restrictions
Always review the specific license terms for your use case, especially if you are building a product that directly exposes model outputs to end users.
The Ollama REST API — Complete Reference
Ollama exposes a REST API on port 11434 by default. This is the primary interface for integrating Ollama into your applications.
List Available Models
curl http://localhost:11434/api/tags
Response:
{
"models": [
{
"name": "mistral-nemo:latest",
"model": "mistral-nemo:latest",
"modified_at": "2026-03-15T10:30:00Z",
"size": 7365960935,
"digest": "sha256:abc123...",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"parameter_size": "12B",
"quantization_level": "Q4_K_M"
}
}
]
}
Generate Completion (Native API)
Non-streaming:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Explain GDPR Article 5 in simple terms.",
"stream": false
}'
Streaming (default):
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Explain GDPR Article 5 in simple terms."
}'
Streaming returns newline-delimited JSON objects, one per token. This is essential for real-time chat interfaces.
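A client consuming this stream just concatenates the `response` fragments until it sees `"done": true`. A minimal sketch of that reassembly logic (in a real client you would iterate over the HTTP response line by line rather than buffer the whole body):

```python
import json

def join_stream(ndjson_text: str) -> str:
    """Reassemble the full completion from Ollama's newline-delimited
    JSON stream: each line carries a 'response' fragment, and the
    final line is marked with 'done': true."""
    out = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Two captured chunks in the documented shape
sample = '{"response": "Hello", "done": false}\n{"response": " world", "done": true}\n'
print(join_stream(sample))  # Hello world
```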
Chat Completions (OpenAI-Compatible)
This is the endpoint most enterprise applications will use, as it is compatible with any OpenAI SDK or library:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-nemo",
"messages": [
{"role": "system", "content": "You are a helpful compliance assistant specializing in EU regulations."},
{"role": "user", "content": "What are the main GDPR obligations for AI systems processing personal data?"}
],
"temperature": 0.3,
"max_tokens": 2048
}'
Response follows the standard OpenAI format:
{
"id": "chatcmpl-xxx",
"object": "chat.completion",
"created": 1711000000,
"model": "mistral-nemo",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Under the GDPR, AI systems processing personal data must..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 512,
"total_tokens": 554
}
}
Using the OpenAI Python SDK with Ollama
This is the recommended approach for Python applications — it gives you full compatibility with OpenAI's SDK while running everything locally:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required by SDK but not validated by Ollama
)
response = client.chat.completions.create(
model="mistral-nemo",
messages=[
{"role": "system", "content": "You are a contract analysis assistant."},
{"role": "user", "content": "Analyze this contract clause for potential risks: ..."}
],
temperature=0.2,
max_tokens=4096
)
print(response.choices[0].message.content)
The same pattern works with the OpenAI client libraries for Node.js, Go, Java, and other languages.
Embeddings
Generate vector embeddings for RAG (Retrieval-Augmented Generation) and semantic search:
curl http://localhost:11434/api/embeddings \
-d '{
"model": "nomic-embed-text",
"prompt": "This document discusses GDPR compliance requirements for AI systems."
}'
Response:
{
"embedding": [0.0123, -0.0456, 0.0789, ...]
}
The nomic-embed-text model produces 768-dimensional vectors and supports up to 8192 tokens of input. For enterprise RAG systems, this is often the first model you should deploy.
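Once you have embeddings, semantic search is just vector comparison. A minimal sketch using only the standard library (the endpoint URL matches the default local server; with no server running, only the pure `cosine_similarity` helper is usable):

```python
import json
import math
import urllib.request

OLLAMA_EMBED_URL = "http://localhost:11434/api/embeddings"  # default local endpoint

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch one embedding vector from a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(OLLAMA_EMBED_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity; nomic-embed-text vectors are 768-dim."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With a server running:
#   q = embed("EU data protection")
#   best = max(corpus, key=lambda d: cosine_similarity(q, embed(d)))
```

Production RAG systems delegate this to a vector database, but the underlying ranking operation is exactly this comparison.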
Model Management
# Download a model
ollama pull llama3.3:70b
# List installed models
ollama list
# Show model details (license, parameters, template)
ollama show llama3.3:70b
# Copy a model (create an alias)
ollama cp mistral-nemo my-company-assistant
# Remove a model
ollama rm llama3.3:70b
Structured Output (JSON Mode)
Force the model to respond in valid JSON, which is essential for enterprise integrations:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Extract the key entities from this text: The European Commission published the EU AI Act on March 13, 2024.",
"format": "json",
"stream": false
}'
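Even with format set to json, robust integrations validate the output before trusting it. A defensive parsing sketch (the fallback that salvages the outermost object is a pragmatic assumption, not documented Ollama behavior):

```python
import json

def parse_model_json(raw: str) -> dict:
    """Validate a JSON-mode response. format=json constrains the model
    to emit valid JSON, but defensive parsing still pays off."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: salvage the outermost {...} span if stray text slipped in
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise
        data = json.loads(raw[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data

print(parse_model_json('{"entities": ["European Commission", "EU AI Act"]}'))
```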
Docker Deployment — Development to Production
Docker is the standard deployment method for Ollama in production environments. It provides isolation, reproducibility, and easy integration with container orchestration platforms.
Single Container (CPU Only)
# docker-compose.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
restart: unless-stopped
    healthcheck:
      # The ollama/ollama image does not ship curl; probe with the CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
ollama_data:
Start it:
docker compose up -d
With NVIDIA GPU
Pre-requisites: Install the NVIDIA Container Toolkit on the host.
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Then use this Docker Compose configuration:
# docker-compose.gpu.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
environment:
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_FLASH_ATTENTION=1
restart: unless-stopped
    healthcheck:
      # The ollama/ollama image does not ship curl; probe with the CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
ollama_data:
docker compose -f docker-compose.gpu.yml up -d
Auto-Pull Models on Startup
Create a custom Dockerfile that pre-loads your required models:
FROM ollama/ollama:latest
# Pull models at build time
RUN ollama serve & sleep 5 && \
    ollama pull mistral-nemo && \
    ollama pull nomic-embed-text && \
    ollama pull phi4-mini && \
    kill $!
Build and use it:
docker build -t ollama-enterprise:latest .
Then reference ollama-enterprise:latest in your Docker Compose file instead of ollama/ollama:latest. This ensures models are available immediately on container start, which is critical for production deployments and air-gapped environments.
Docker Compose with Application Stack
A typical enterprise setup pairs Ollama with an application server and a vector database:
# docker-compose.production.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_HOST=0.0.0.0:11434
    healthcheck:
      # Required because the app service waits on condition: service_healthy
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
networks:
- ai-network
app:
build: .
ports:
- "8000:8000"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- CHROMADB_URL=http://chromadb:8000
depends_on:
ollama:
condition: service_healthy
restart: unless-stopped
networks:
- ai-network
chromadb:
image: chromadb/chroma:latest
volumes:
- chroma_data:/chroma/chroma
restart: unless-stopped
networks:
- ai-network
volumes:
ollama_data:
chroma_data:
networks:
ai-network:
driver: bridge
Kubernetes Deployment
For organizations running Kubernetes, deploying Ollama requires GPU-aware scheduling and persistent storage for model files.
Prerequisites
- NVIDIA GPU Operator installed in your cluster (for NVIDIA GPUs)
- Persistent storage provisioner (for model caching)
- GPU node pool with appropriate taints and labels
Deployment Manifest
# ollama-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: ai-services
---
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ollama-pvc
namespace: ai-services
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: fast-ssd
---
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: ai-services
labels:
app: ollama
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- containerPort: 11434
name: http
env:
- name: OLLAMA_NUM_PARALLEL
value: "4"
- name: OLLAMA_FLASH_ATTENTION
value: "1"
resources:
requests:
memory: "8Gi"
cpu: "4"
nvidia.com/gpu: "1"
limits:
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "1"
volumeMounts:
- name: ollama-data
mountPath: /root/.ollama
readinessProbe:
httpGet:
path: /api/tags
port: 11434
initialDelaySeconds: 10
periodSeconds: 15
livenessProbe:
httpGet:
path: /api/tags
port: 11434
initialDelaySeconds: 30
periodSeconds: 30
volumes:
- name: ollama-data
persistentVolumeClaim:
claimName: ollama-pvc
nodeSelector:
nvidia.com/gpu.present: "true"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: ai-services
spec:
selector:
app: ollama
ports:
- port: 11434
targetPort: 11434
name: http
type: ClusterIP
Apply:
kubectl apply -f ollama-namespace.yaml
kubectl apply -f ollama-pvc.yaml
kubectl apply -f ollama-deployment.yaml
Horizontal Scaling with Multiple Replicas
For high-throughput workloads, deploy multiple Ollama pods behind a load balancer. Each pod needs its own GPU:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-pool
namespace: ai-services
spec:
replicas: 3
selector:
matchLabels:
app: ollama-pool
template:
metadata:
labels:
app: ollama-pool
spec:
containers:
- name: ollama
image: ollama-enterprise:latest # Pre-loaded models
ports:
- containerPort: 11434
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
This requires 3 GPU nodes. Use the Kubernetes Horizontal Pod Autoscaler (HPA) with custom metrics (e.g., request queue depth) for dynamic scaling.
Init Container for Model Loading
If you do not want to build a custom image, use an init container to pull models before the main container starts:
initContainers:
- name: model-loader
image: ollama/ollama:latest
command: ["/bin/sh", "-c"]
args:
- |
ollama serve &
sleep 5
ollama pull mistral-nemo
ollama pull nomic-embed-text
    kill $!
volumeMounts:
- name: ollama-data
mountPath: /root/.ollama
Security Hardening for Enterprise and Air-Gapped Deployments
Ollama does not include built-in authentication or authorization. For enterprise deployments, you must layer security on top of the Ollama service.
Network Isolation
By default, Ollama listens on 127.0.0.1:11434 (localhost only). This is secure for single-server deployments. If you need to expose it to other hosts:
# Bind to all interfaces (use only behind a reverse proxy)
OLLAMA_HOST=0.0.0.0:11434 ollama serve
For systemd service configuration:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Nginx Reverse Proxy with Authentication
Place Ollama behind nginx with TLS and authentication:
upstream ollama_backend {
server 127.0.0.1:11434;
keepalive 32;
}
server {
listen 443 ssl http2;
server_name ai.internal.company.com;
ssl_certificate /etc/nginx/certs/internal.crt;
ssl_certificate_key /etc/nginx/certs/internal.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Basic authentication
auth_basic "Ollama AI Service";
auth_basic_user_file /etc/nginx/.ollama_htpasswd;
# Rate limiting
limit_req zone=ollama burst=20 nodelay;
# Request size limit (important for large prompts)
client_max_body_size 50m;
location / {
proxy_pass http://ollama_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Disable buffering for streaming responses
proxy_buffering off;
proxy_cache off;
# Long timeout for generation requests
proxy_read_timeout 600s;
proxy_send_timeout 600s;
}
}
Rate limiting configuration:
# In http block
limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=ollama_generate:10m rate=2r/s;
# In server block
location /api/generate {
limit_req zone=ollama_generate burst=5 nodelay;
proxy_pass http://ollama_backend;
}
location /v1/chat/completions {
limit_req zone=ollama_generate burst=5 nodelay;
proxy_pass http://ollama_backend;
}
location /api/ {
limit_req zone=ollama burst=20 nodelay;
proxy_pass http://ollama_backend;
}
Air-Gapped Deployment
For environments with no internet access (defense, critical infrastructure, sensitive financial systems), follow this checklist:
1. Prepare models on a connected machine:
# Download all required models
ollama pull mistral-nemo
ollama pull nomic-embed-text
ollama pull phi4-mini
# Models are stored in ~/.ollama/models/
ls -la ~/.ollama/models/
2. Package the Ollama binary:
# Download the binary
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama-binary
chmod +x ollama-binary
3. Transfer to the air-gapped environment:
# Copy models
rsync -av ~/.ollama/models/ /media/transfer-drive/ollama-models/
# Copy binary
cp ollama-binary /media/transfer-drive/
# On the air-gapped server:
cp /media/transfer-drive/ollama-binary /usr/local/bin/ollama
mkdir -p ~/.ollama/models/
rsync -av /media/transfer-drive/ollama-models/ ~/.ollama/models/
4. Configure for air-gapped operation:
# Bind to localhost only; Ollama needs no outbound access once models are local
OLLAMA_HOST=127.0.0.1:11434 ollama serve
5. Verify models are available:
ollama list
# Should show all pre-loaded models
Audit Logging
Ollama itself does not produce detailed audit logs. For enterprise compliance, implement logging at the reverse proxy layer:
log_format ollama_audit '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'rt=$request_time';
access_log /var/log/nginx/ollama_audit.log ollama_audit;
For more granular logging (capturing prompt content for compliance), implement an API gateway layer between your application and Ollama.
Performance Optimization
Optimizing Ollama for production workloads involves tuning concurrency, context windows, memory management, and hardware utilization.
Concurrent Request Handling
By default, Ollama processes one request at a time per model. For multi-user environments, enable parallel processing:
# Allow 4 concurrent requests per model
OLLAMA_NUM_PARALLEL=4 ollama serve
More parallel requests mean more VRAM usage — each concurrent request maintains its own KV cache. Monitor VRAM usage and reduce OLLAMA_NUM_PARALLEL if you see out-of-memory errors.
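On the client side, fanning out requests is straightforward with a thread pool. A stdlib-only sketch against the OpenAI-compatible endpoint (assumes a local server with OLLAMA_NUM_PARALLEL at least as large as the worker count; excess requests simply queue server-side):

```python
import json
import concurrent.futures
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"  # default local port

def build_payload(prompt: str, model: str = "mistral-nemo") -> bytes:
    """Minimal OpenAI-style chat request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

def ask(prompt: str) -> str:
    req = urllib.request.Request(OLLAMA_CHAT_URL, data=build_payload(prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def ask_many(prompts: list[str], workers: int = 4) -> list[str]:
    # Keep workers <= OLLAMA_NUM_PARALLEL to avoid pure queueing
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))
```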
Context Window Tuning
The default context window varies by model (often 2048 or 4096 tokens). For enterprise workloads involving long documents, you will often need more:
# Run with extended context (32K tokens); set num_ctx from inside the session
ollama run mistral-nemo
>>> /set parameter num_ctx 32768
Via the API:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral-nemo",
"prompt": "Summarize this long document: ...",
"options": {
"num_ctx": 32768
}
}'
Larger context windows consume proportionally more VRAM. A 7B model at 4K context uses roughly 5GB VRAM; the same model at 32K context may use 8-10GB.
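The growth comes almost entirely from the KV cache, which scales linearly with context length. A rough estimator (the default layer count, KV-head count, and head dimension below are assumed values for a Mistral-Nemo-class 12B model, with fp16 cache elements):

```python
def kv_cache_gib(n_ctx: int, n_layers: int = 40, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size: two tensors (K and V) per layer, one
    (n_kv_heads x head_dim) vector per token, fp16 elements by default."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return round(per_token * n_ctx / 2**30, 2)

print(kv_cache_gib(4096))    # ~0.6 GiB at 4K context
print(kv_cache_gib(32768))   # ~5 GiB at 32K context
```

That several-GiB delta between 4K and 32K context is consistent with the 3-5 GB increase noted above.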
Flash Attention
Flash Attention significantly reduces memory usage and improves speed for long context windows:
OLLAMA_FLASH_ATTENTION=1 ollama serve
This is enabled by default on supported hardware (most modern NVIDIA and Apple Silicon GPUs). Flash Attention is particularly beneficial for context windows above 8K tokens.
Model Preloading and Keep-Alive
By default, Ollama unloads models from memory after 5 minutes of inactivity. For production services, keep models loaded:
# Keep model loaded for 1 hour after last request
curl http://localhost:11434/api/generate \
-d '{"model": "mistral-nemo", "prompt": "", "keep_alive": "1h"}'
# Keep model loaded indefinitely
curl http://localhost:11434/api/generate \
-d '{"model": "mistral-nemo", "prompt": "", "keep_alive": -1}'
This eliminates the cold-start latency (which can be several seconds for large models) on the first request after idle.
Environment Variable Reference
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address and port |
| OLLAMA_NUM_PARALLEL | 1 | Max concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | 1 | Max models loaded simultaneously |
| OLLAMA_FLASH_ATTENTION | 1 | Enable Flash Attention |
| OLLAMA_KEEP_ALIVE | 5m | Default model keep-alive duration |
| OLLAMA_MAX_QUEUE | 512 | Max queued requests |
| OLLAMA_MODELS | ~/.ollama/models | Model storage path |
| CUDA_VISIBLE_DEVICES | all | GPU selection for NVIDIA |
Benchmark Results: Tokens Per Second by Hardware
| Hardware | Model | Quantization | Tokens/sec (generation) |
|---|---|---|---|
| MacBook Pro M3 Max 128GB | Llama 3.3 70B | Q4_K_M | 18-25 |
| MacBook Pro M4 Max 128GB | Llama 3.3 70B | Q4_K_M | 22-30 |
| NVIDIA RTX 4090 24GB | Mistral Nemo 12B | Q4_K_M | 65-85 |
| NVIDIA RTX 4090 24GB | Qwen 2.5 7B | Q4_K_M | 90-120 |
| NVIDIA A100 80GB | Llama 3.3 70B | Q4_K_M | 35-50 |
| 2x NVIDIA A100 80GB NVLink | Llama 3.3 70B | Q4_K_M | 60-90 |
| NVIDIA H100 80GB | Llama 3.3 70B | Q4_K_M | 55-80 |
| CPU only (AMD Ryzen 9 7950X) | Phi-4-mini 3.8B | Q4_K_M | 3-8 |
| CPU only (Intel i9-14900K) | Phi-4-mini 3.8B | Q4_K_M | 4-10 |
These are approximate generation speeds (output tokens). Prompt processing (input tokens) is typically 2-5x faster than generation.
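These figures translate directly into end-to-end latency budgets. A small estimator based on the table (the 3x prefill multiplier is an assumed midpoint of the 2-5x range above):

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       gen_tps: float, prefill_multiplier: float = 3.0) -> float:
    """End-to-end latency estimate: prompt prefill runs roughly 2-5x
    faster than generation (3x assumed here), then tokens stream out
    at the generation rate."""
    prefill = prompt_tokens / (gen_tps * prefill_multiplier)
    generate = output_tokens / gen_tps
    return round(prefill + generate, 1)

# Llama 3.3 70B on a single A100 at ~40 output tokens/sec
print(estimate_latency_s(prompt_tokens=2000, output_tokens=500, gen_tps=40))  # ~29 s
```

Useful when deciding whether a given model/GPU pairing can meet an SLA before buying hardware.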
Modelfile — Custom Models and System Prompts
Ollama's Modelfile system lets you create custom model configurations with persistent system prompts, parameter settings, and templates. This is how enterprise teams standardize model behavior.
Basic Modelfile
# Modelfile for an EU compliance assistant
FROM mistral-nemo
SYSTEM """You are an EU AI Act compliance expert for Hyperion Consulting.
You only provide information relevant to the EU AI Act, GDPR, and ISO 42001.
Always cite specific articles and clauses when referencing regulations.
Never provide legal advice — always recommend consulting qualified legal counsel.
Respond in the language of the user's query."""
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
PARAMETER top_p 0.9
Build and run:
ollama create compliance-assistant -f Modelfile
ollama run compliance-assistant
Advanced Modelfile: Code Review Assistant
FROM qwen2.5-coder:32b
SYSTEM """You are a senior code reviewer at a fintech company.
Review code for:
1. Security vulnerabilities (OWASP Top 10)
2. Performance issues
3. Code style and maintainability
4. Potential bugs and edge cases
Always provide specific line references and suggest concrete fixes.
Rate severity as: CRITICAL, HIGH, MEDIUM, LOW, INFO."""
PARAMETER temperature 0.1
PARAMETER num_ctx 32768
Modelfile for Structured Data Extraction
FROM phi4-mini
SYSTEM """You are a data extraction system. Given any input text, extract structured information and return it as valid JSON. Never include explanations outside the JSON structure. If a field cannot be determined from the input, use null."""
PARAMETER temperature 0.0
PARAMETER num_ctx 8192
Managing Custom Models
# List all models (including custom ones)
ollama list
# Show the Modelfile for an existing model
ollama show compliance-assistant --modelfile
# Remove a custom model
ollama rm compliance-assistant
# Copy/rename a model
ollama cp compliance-assistant compliance-v2
Custom models share the base model weights — creating a custom model from mistral-nemo only stores the Modelfile configuration, not a second copy of the weights.
Integration with LangChain and LlamaIndex
Ollama integrates natively with the two most popular LLM application frameworks. This makes it straightforward to build RAG systems, agents, and complex AI workflows running entirely on local hardware.
LangChain Integration
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Initialize LLM
llm = ChatOllama(
model="mistral-nemo",
base_url="http://localhost:11434",
temperature=0.3
)
# Initialize embeddings
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://localhost:11434"
)
# Build a simple chain
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant that answers questions about {topic}."),
("user", "{question}")
])
chain = prompt | llm | StrOutputParser()
result = chain.invoke({
"topic": "EU AI Act compliance",
"question": "What are the requirements for high-risk AI systems?"
})
print(result)
RAG with LangChain and Ollama:
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# Load and chunk documents (PyPDFLoader requires the pypdf package;
# llm and embeddings are reused from the previous snippet)
loader = PyPDFLoader("eu-ai-act.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# Create vector store with Ollama embeddings
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)
result = qa_chain.invoke("What are the penalties for non-compliance with the EU AI Act?")
print(result["result"])
LlamaIndex Integration
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
# Configure Ollama as the default LLM and embedding model
Settings.llm = Ollama(
model="mistral-nemo",
request_timeout=120.0,
base_url="http://localhost:11434"
)
Settings.embed_model = OllamaEmbedding(
model_name="nomic-embed-text",
base_url="http://localhost:11434"
)
# Load documents and create index
documents = SimpleDirectoryReader("./compliance-docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query(
"What documentation is required for high-risk AI systems under the EU AI Act?"
)
print(response)
Direct OpenAI SDK Integration
For simpler use cases that do not need a framework, the OpenAI SDK works directly:
from openai import OpenAI
import json
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama"
)
# Streaming response
stream = client.chat.completions.create(
model="mistral-nemo",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the EU AI Act risk categories."}
],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Monitoring Ollama in Production
Production deployments need observability. While Ollama does not natively export Prometheus metrics, there are effective approaches to monitoring.
Health Checks
The simplest health check queries the model list endpoint:
# Returns 200 if Ollama is running and responsive
curl -sf http://localhost:11434/api/tags > /dev/null && echo "healthy" || echo "unhealthy"
For Docker and Kubernetes, use this as your health check endpoint. A more thorough check verifies the model can actually generate:
curl -sf http://localhost:11434/api/generate \
-d '{"model": "mistral-nemo", "prompt": "ping", "stream": false}' \
| jq -r '.response' > /dev/null && echo "model healthy" || echo "model unhealthy"
Systemd Logs
On Linux with systemd:
# Follow Ollama logs in real time
journalctl -u ollama -f
# View logs from the last hour
journalctl -u ollama --since "1 hour ago"
# View only errors
journalctl -u ollama -p err
GPU Monitoring
For NVIDIA GPUs, monitor VRAM usage and GPU utilization:
# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Log GPU stats to file
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu --format=csv -l 5 >> /var/log/gpu-stats.csv
Prometheus and Grafana
While Ollama does not expose a native /metrics endpoint, you can use community exporters or build a lightweight custom exporter:
# ollama_exporter.py — Simple Prometheus exporter
from prometheus_client import start_http_server, Gauge
import requests
import time

# /api/tags reports installed models; use /api/ps if you need currently loaded ones
ollama_models_available = Gauge('ollama_models_available', 'Number of models reported by /api/tags')
ollama_health = Gauge('ollama_health', 'Ollama health status (1=healthy, 0=unhealthy)')

def collect_metrics():
    try:
        resp = requests.get('http://localhost:11434/api/tags', timeout=5)
        if resp.status_code == 200:
            models = resp.json().get('models', [])
            ollama_models_available.set(len(models))
            ollama_health.set(1)
        else:
            ollama_health.set(0)
    except Exception:
        ollama_health.set(0)

if __name__ == '__main__':
    start_http_server(9091)
    while True:
        collect_metrics()
        time.sleep(15)
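With the exporter listening on port 9091, the matching Prometheus scrape job is a few lines of configuration (job name and interval here are illustrative, adjust to your setup):

```yaml
scrape_configs:
  - job_name: "ollama"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9091"]
```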
Alerting
Combine health checks with your existing alerting infrastructure. A simple cron-based approach:
#!/bin/bash
# /usr/local/bin/ollama-health-check.sh
if ! curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
echo "Ollama is down at $(date)" | mail -s "ALERT: Ollama Down" [email protected]
# Or send to Slack/Teams/Telegram
fi
Add to crontab:
*/5 * * * * /usr/local/bin/ollama-health-check.sh
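If your team lives in Slack rather than email, the same check can post to an incoming webhook instead. A stdlib-only sketch (the webhook URL is a placeholder you obtain from Slack's app configuration):

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def build_alert_payload(message: str) -> dict:
    """Build the minimal Slack incoming-webhook payload."""
    return {"text": message}

def send_slack_alert(message: str) -> None:
    """POST the alert to the Slack webhook (no third-party dependencies)."""
    data = json.dumps(build_alert_payload(message)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```

Call `send_slack_alert(f"Ollama is down at {...}")` from the health-check script in place of the `mail` command.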
Ollama vs. Alternatives: When to Use What
Ollama is excellent for most enterprise local LLM use cases, but it is not the only option. Understanding when to use alternatives helps you make the right architecture decision.
| Tool | Best For | Not Ideal For |
|---|---|---|
| Ollama | General-purpose local LLM, development, small-to-medium production | Ultra-high-throughput (>100 concurrent users) |
| vLLM | High-throughput production serving, continuous batching | Quick setup, development |
| llama.cpp server | Maximum control, custom quantization | Ease of use, model management |
| TGI (Text Generation Inference) | HuggingFace model ecosystem, production serving | Simple local development |
| LocalAI | OpenAI API compatibility with multiple backends | Single-model performance |
Ollama wins on developer experience and operational simplicity. If you need to serve hundreds of concurrent users with maximum throughput, consider vLLM. For most enterprise use cases — development, testing, internal tools, and moderate-traffic production services — Ollama is the right choice.
Frequently Asked Questions
1. How does Ollama model quality compare to GPT-4 or Claude?
Open-source models running through Ollama have improved dramatically. Llama 3.3 70B and Qwen 2.5 72B are competitive with GPT-4-turbo on many benchmarks, particularly for structured tasks like summarization, extraction, and code generation. For complex reasoning, multi-step analysis, and creative writing, proprietary models like GPT-4o and Claude still hold an edge. The practical approach is to use Ollama for tasks where open-source models perform well (80%+ of enterprise use cases) and reserve API calls for the remaining complex tasks.
2. How much VRAM do I actually need?
The rule of thumb: a Q4-quantized model requires approximately 0.6 GB of VRAM per billion parameters, plus overhead for the KV cache. So a 7B model needs about 5 GB, a 14B model about 9 GB, and a 70B model about 42 GB. If your GPU does not have enough VRAM, Ollama automatically offloads some layers to CPU, which works but is significantly slower.
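This rule of thumb translates directly into a back-of-the-envelope calculator (the fixed 1 GB overhead term is an assumption covering KV cache and runtime buffers; real usage varies with context length and concurrency):

```python
def estimate_vram_gb(params_billion: float, overhead_gb: float = 1.0) -> float:
    """Rough Q4 VRAM estimate: ~0.6 GB per billion parameters plus overhead."""
    return round(0.6 * params_billion + overhead_gb, 1)

for size in (7, 14, 70):
    print(f"{size}B model: ~{estimate_vram_gb(size)} GB VRAM")
```

Compare the result against your GPU's total VRAM before pulling a model, leaving headroom if you plan to serve concurrent requests.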
3. Can Ollama use multiple GPUs?
Yes. Ollama automatically distributes model layers across multiple NVIDIA GPUs. If you have two RTX 3090s (24 GB each = 48 GB total), Ollama can run a 70B Q4 model by splitting layers between them. NVLink is not required but provides better performance for multi-GPU setups. For AMD multi-GPU, support depends on the ROCm version.
4. How do I update models to newer versions?
Pull the model again to get the latest version:
ollama pull mistral-nemo
Ollama uses a Docker-like layer system — if you already have most layers, only the diff is downloaded. You can automate this with a cron job or a weekly maintenance script.
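A maintenance script needs the list of installed models, which `ollama list` prints as a table whose first column is the model name (the exact column layout assumed here matches current CLI output but may change between versions). A sketch that parses it and re-pulls every model:

```python
import subprocess

def parse_model_names(ollama_list_output: str) -> list[str]:
    """Extract model names (first column) from `ollama list` table output."""
    lines = ollama_list_output.strip().splitlines()
    # Skip the header row (NAME  ID  SIZE  MODIFIED)
    return [line.split()[0] for line in lines[1:] if line.strip()]

def update_all_models() -> None:
    """Re-pull every installed model; only changed layers are downloaded."""
    out = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    ).stdout
    for name in parse_model_names(out):
        subprocess.run(["ollama", "pull", name], check=True)
```

Run `update_all_models()` from a weekly cron or systemd timer to keep the model cache current.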
5. Ollama vs. vLLM — which should I choose?
Use Ollama for: development, testing, simple production deployments, air-gapped environments, teams that want simplicity. Use vLLM for: high-throughput production serving with continuous batching, environments already running Python-heavy ML infrastructure, when you need maximum tokens-per-second per GPU. Ollama is simpler to operate; vLLM extracts more performance from the same hardware at higher concurrency.
6. How many concurrent users can Ollama handle?
With OLLAMA_NUM_PARALLEL=4, a single Ollama instance can handle 4 simultaneous generation requests. Each concurrent request adds VRAM overhead for the KV cache. For a 7B model on an RTX 4090, you can comfortably serve 4-8 concurrent users. For higher concurrency, deploy multiple Ollama instances behind a load balancer, each with its own GPU.
7. Can multiple users share the same Ollama instance?
Yes. Ollama's API is stateless — multiple clients can send requests to the same instance. Use OLLAMA_NUM_PARALLEL to control concurrency, and put nginx in front for authentication and rate limiting. Model weights are shared in memory across all concurrent requests; only the KV cache is per-request.
8. How do I create custom Modelfiles for my organization?
Create a Modelfile with your system prompt, parameters, and base model. See the Modelfile section above for detailed examples. Custom Modelfiles are the recommended way to standardize model behavior across your organization — version-control your Modelfiles alongside your application code.
9. Does Ollama work with Windows WSL2?
Yes. Install the Linux version of Ollama inside WSL2. NVIDIA GPU passthrough works automatically if you have the Windows NVIDIA drivers installed (WSL2-specific CUDA drivers are no longer needed with recent driver versions). Performance is nearly identical to native Linux. This is actually the recommended approach for Windows development environments.
10. How do I add authentication to the Ollama API?
Ollama does not include built-in authentication. The recommended approach is to place Ollama behind a reverse proxy (nginx, Caddy, or Traefik) with authentication. Options include: HTTP Basic Auth for simple setups, OAuth2 Proxy for SSO integration, mutual TLS (mTLS) for service-to-service authentication, or an API gateway like Kong for enterprise-grade access control. See the Security Hardening section above for nginx configuration examples.
Conclusion: Your Ollama Enterprise Deployment Checklist
Ollama has matured into a production-ready platform for running open-source LLMs on your own infrastructure. Here is a summary checklist for enterprise deployment:
Planning:
- Identify your use cases and select appropriate models
- Calculate VRAM requirements based on model sizes and concurrency needs
- Review model licenses for your commercial use case
- Decide deployment topology: single server, Docker, or Kubernetes
Infrastructure:
- Provision GPU hardware (NVIDIA recommended for production)
- Install Ollama and verify GPU detection
- Pre-pull all required models
- Configure persistent storage for model cache
Security:
- Bind Ollama to localhost (never expose directly to the network)
- Deploy nginx reverse proxy with TLS and authentication
- Implement rate limiting at the proxy layer
- Enable audit logging for compliance
- For air-gapped: prepare offline model packages
Performance:
- Enable Flash Attention
- Set OLLAMA_NUM_PARALLEL based on your GPU capacity
- Configure model keep-alive for production services
- Tune context window size for your workload
Operations:
- Set up health checks and monitoring
- Configure GPU utilization alerting
- Document model update procedures
- Establish backup procedures for custom Modelfiles
Ollama eliminates the complexity of local LLM deployment while giving you full control over your AI infrastructure — no data leaves your premises, no per-token costs, and no vendor lock-in. For European enterprises navigating GDPR, the EU AI Act, and data sovereignty requirements, it is the most practical path to production AI.
