Most teams overspend on LLM inference by 3-10x. This guide covers the engineering techniques that cut costs by 60-90% without sacrificing output quality -- from model routing and semantic caching to fine-tuning economics and self-hosting break-even analysis.
LLM costs have a nasty habit of growing exponentially. What starts as a manageable $200/day prototype quickly becomes a $2,000/day production nightmare. The math is simple but brutal: per-token pricing × growing usage × context window inflation = exponential cost curves.
Here is a real scenario we see repeatedly: A team builds a customer support chatbot. In development, they test with short conversations and simple queries. Cost: $8/day. They launch to 500 users. Conversations get longer, context windows fill up, retry logic fires on timeouts, and the system prompt grows with each edge case fix. Within three weeks, the same chatbot costs $2,400/day -- a 300x increase that nobody budgeted for.
A B2B SaaS company launched an AI assistant using GPT-4o for all queries and saw exactly this kind of runaway cost trajectory.
After implementing the techniques in this guide (routing + caching + prompt compression), they brought costs down to $320/day at 500 users -- an 87% reduction.
Before optimizing, you need to understand where money goes. LLM costs break down into several distinct categories, and the split varies dramatically by application type.
- **Input tokens:** System prompts, conversation history, retrieved context (RAG), few-shot examples. This is where most of the money goes, and where the biggest savings are.
- **Output tokens:** Generated responses. Output tokens cost 2-4x more per token than input, but volume is typically lower. Verbose responses are the main cost driver.
- **Everything else:** Embedding generation, fine-tuning compute, vector storage, logging, and monitoring infrastructure. Small per-unit costs that add up at scale.
| Model | Provider | Input | Output | Context | Notes |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | Best general-purpose, multimodal |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K | Great for simple tasks, 17x cheaper input than 4o |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | 200K | Strong reasoning, large context window |
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | 200K | Fast, cost-effective for classification |
| Mistral Large | Mistral | $2.00 | $6.00 | 128K | European provider, GDPR-friendly |
| Llama 3.1 70B (self-hosted) | Meta (open-source) | ~$0.30* | ~$0.30* | 128K | GPU cost only, no per-token fee |
* Self-hosted costs are approximate, based on A100 GPU rental at ~$2/hr serving Llama 3.1 70B with vLLM. Actual costs depend on throughput and utilization.
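For intuition on where that ~$0.30 figure comes from, here is the back-of-envelope arithmetic. The throughput number is an assumption, and real numbers swing with batch size, sequence length, and GPU utilization:

```python
# Back-of-envelope cost per million tokens for a self-hosted model.
# Assumptions (not measured): $2/hr A100 rental and ~1,850 tokens/sec
# sustained throughput with vLLM continuous batching at high utilization.
gpu_cost_per_hour = 2.00          # USD
throughput_tokens_per_sec = 1850  # assumed sustained throughput

tokens_per_hour = throughput_tokens_per_sec * 3600        # ~6.66M tokens/hr
cost_per_million = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)

print(f"~${cost_per_million:.2f} per 1M tokens")  # ~$0.30 under these assumptions
```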
GPT-4o input tokens cost $2.50/1M. GPT-4o mini costs $0.15/1M. That is a 17x price difference. For classification, extraction, and simple Q&A, the quality difference is often negligible. Model routing exploits this gap.
Model routing is the single highest-impact optimization. The idea is simple: route easy tasks to cheap models and hard tasks to expensive models. Most production workloads are 70-80% simple tasks that a small model handles perfectly. Typical savings: 60-80%.
- **Complexity-based routing:** A small model or heuristic classifies the query complexity, then routes to the appropriate model tier.
- **Task-based routing:** Route by task type: classification, extraction, summarization, generation, reasoning. Each task maps to an optimal model.
- **Cascade routing:** Start with the cheapest model. If confidence is low or the response fails validation, escalate to a larger model.
- **Verifier-gated routing:** A small verifier model checks if the cheap model's output meets quality thresholds before returning it.
1. **Score query complexity.** Use a lightweight classifier (logistic regression on embeddings, or a rule-based system) to score query complexity on a 0-1 scale. Overhead: ~0.01 ms per query.
2. **Route by threshold.** Score < 0.3 goes to GPT-4o mini ($0.15/1M input). Score 0.3-0.7 goes to Claude 3.5 Haiku ($0.80/1M). Score > 0.7 goes to GPT-4o ($2.50/1M).
3. **Escalate on failure.** If the cheap model returns low-confidence output or fails validation, automatically escalate to the next tier, as in the sketch below. Typically only 5-10% of queries escalate.
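Here is a minimal sketch of that three-tier cascade. The complexity scorer and validator are deliberately naive stand-ins for a real classifier and task-specific checks, and the mid tier calls GPT-4o mini rather than Claude 3.5 Haiku only to keep the example single-provider:

```python
# Minimal three-tier router sketch. The scorer and validator below are naive
# placeholders; in production you'd use a classifier trained on labelled
# queries and output checks specific to your task.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

TIERS = [
    (0.3, "gpt-4o-mini"),  # score < 0.3  -> cheapest tier
    (0.7, "gpt-4o-mini"),  # 0.3 - 0.7    -> mid tier (swap in Claude 3.5 Haiku here)
    (1.1, "gpt-4o"),       # score > 0.7  -> large model
]

def score_complexity(query: str) -> float:
    # Naive heuristic stand-in: long or multi-part questions score higher.
    signals = [len(query) > 200, "why" in query.lower(), query.count("?") > 1]
    return min(1.0, 0.2 + 0.3 * sum(signals))

def passes_validation(text: str) -> bool:
    # Placeholder check: non-empty and doesn't admit failure.
    return bool(text.strip()) and "i don't know" not in text.lower()

def route(query: str) -> str:
    score = score_complexity(query)
    start = next(i for i, (cap, _) in enumerate(TIERS) if score < cap)

    # Cascade: escalate to the next tier when validation fails
    # (typically only 5-10% of queries).
    for _, model in TIERS[start:]:
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": query}]
        ).choices[0].message.content or ""
        if passes_validation(reply):
            return reply
    return reply  # best effort from the largest tier
```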
A customer support platform processing 50,000 queries/day switched from GPT-4o for everything to a routing setup: 72% to GPT-4o mini, 20% to Claude Haiku, 8% to GPT-4o. Monthly cost dropped from $38,000 to $6,200 -- an 84% reduction with no measurable quality degradation on their eval suite.
If a user asks "What is your return policy?" and another asks "How do I return an item?", they want the same answer. Semantic caching detects these similar queries and serves cached responses instead of making redundant API calls. For applications with repetitive query patterns, this alone can cut costs by 30-60%.
| Approach | Hit Rate | Effort | Savings | Best For |
|---|---|---|---|---|
| Exact Match Cache | 10-20% | Low | Low | Repeated identical queries (FAQ bots, autocomplete) |
| Semantic Cache (cosine > 0.95) | 30-50% | Medium | High | Similar questions with same answer (customer support) |
| Prompt-Aware Cache | 40-60% | High | Very High | Same system prompt + similar user queries |
| Prefix Caching (API-level) | Automatic | None | Medium | Shared system prompts across requests (Anthropic, OpenAI) |
1. **Embed the query.** Generate an embedding vector for the user query using a fast embedding model (e.g., text-embedding-3-small at $0.02/1M tokens).
2. **Search the cache.** Use Redis with the vector search module (RediSearch) or a lightweight vector DB. Set the threshold at 0.95+ cosine similarity for high precision.
3. **Serve or store.** On hit: return the cached response in <50ms. On miss: call the LLM, store the result with its embedding and a TTL (e.g., 24 hours for dynamic content, 7 days for static), as in the sketch below.
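A minimal in-memory sketch of that flow, assuming OpenAI's text-embedding-3-small for embeddings. A production version would swap the Python list for Redis vector search or a vector DB, but the lookup logic is the same:

```python
# In-memory semantic cache sketch. Production setups would use Redis/RediSearch
# or a vector DB instead of a Python list; the lookup logic is identical.
import time
import numpy as np
from openai import OpenAI

client = OpenAI()
CACHE: list[dict] = []          # each entry: {"vec", "response", "expires"}
SIMILARITY_THRESHOLD = 0.95     # high precision: only near-duplicate queries hit
TTL_SECONDS = 24 * 3600         # 24h for dynamic content; longer for static

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_answer(query: str) -> str:
    qvec = embed(query)
    now = time.time()
    for entry in CACHE:
        if entry["expires"] < now:
            continue
        sim = float(np.dot(qvec, entry["vec"]) /
                    (np.linalg.norm(qvec) * np.linalg.norm(entry["vec"])))
        if sim >= SIMILARITY_THRESHOLD:
            return entry["response"]                      # cache hit: no LLM call

    # Cache miss: call the LLM, then store the result with its embedding + TTL.
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": query}]
    ).choices[0].message.content or ""
    CACHE.append({"vec": qvec, "response": reply, "expires": now + TTL_SECONDS})
    return reply
```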
Every token in your prompt costs money. Most production prompts contain 30-50% redundant tokens -- verbose instructions, unnecessary examples, and formatting that the model does not need. Prompt optimization is the lowest-effort, highest-return starting point.
- **Compress the system prompt:** Remove redundant instructions, use abbreviations, consolidate rules. A 2000-token system prompt often compresses to 800 tokens with zero quality loss (see the before/after sketch following this list).
- **Cut few-shot examples:** Replace verbose few-shot examples with concise instructions, or fine-tune a small model on the examples instead of passing them with every call.
- **Force structured output:** Use JSON mode or function calling to eliminate verbose prose. "Explain your reasoning" adds 200+ tokens per response.
- **Prune context:** Only include relevant conversation history. Summarize old turns. Remove system messages that the model already learned from fine-tuning.
- **Cap output length:** Set max_tokens appropriately. Use "Be concise" or "Answer in under 100 words" in the prompt. Use stop sequences for early termination.
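To make the compression concrete, here is an illustrative before/after of a support-bot system prompt. The company name, rules, and token counts in the comments are assumptions, chosen only to line up with the savings math below:

```python
# Illustrative only: a verbose system prompt vs. a compressed equivalent.
# Assume the full verbose prompt runs ~2,250 tokens once every policy,
# edge-case fix, and few-shot example is included; the compressed one ~750.

VERBOSE_SYSTEM_PROMPT = """
You are a helpful, friendly, and professional customer support assistant for
Acme Inc. You should always be polite and courteous to the customer. When a
customer asks about returns, you should explain the full returns policy in
detail, including all exceptions... (plus ~1,500 more tokens of rules,
edge-case fixes, and few-shot examples accumulated over time)
"""

COMPRESSED_SYSTEM_PROMPT = """
Acme Inc. support agent. Be concise and polite.
Rules: returns within 30 days w/ receipt; no returns on clearance items;
escalate billing disputes to a human. Answer in under 100 words.
Output JSON: {"answer": str, "escalate": bool}.
"""
```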
A compression like this preserves the same behavior with 67% fewer input tokens. At 50K requests/day with GPT-4o, that saves ~$190/day ($5,700/month) on system prompt tokens alone.
If your workload does not require real-time responses, batch APIs offer an immediate 50% cost reduction with zero engineering effort. OpenAI's Batch API, Anthropic's Message Batches, and most providers offer discounted pricing for asynchronous processing.
For mixed workloads, implement a queue that separates real-time and batch-eligible requests. Use priority queues to route latency-sensitive work to synchronous APIs and everything else to batch endpoints.
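A minimal sketch of that split using standard-library queues; `is_latency_sensitive()` is a placeholder for whatever signal your application actually has (user-facing chat vs. nightly report generation, for example):

```python
# Sketch of splitting traffic between synchronous and batch-eligible paths.
# Latency-sensitive requests go straight to the real-time API; everything else
# is accumulated and flushed periodically to a discounted batch endpoint.
import queue

realtime_q: queue.Queue = queue.Queue()
batch_q: queue.Queue = queue.Queue()

def is_latency_sensitive(request: dict) -> bool:
    # Placeholder signal: user-facing chat is synchronous; reports can wait.
    return request.get("source") == "chat"

def enqueue(request: dict) -> None:
    (realtime_q if is_latency_sensitive(request) else batch_q).put(request)

def flush_batch(max_items: int = 10_000) -> list[dict]:
    """Drain batch-eligible requests, e.g. to build a JSONL file for a batch API."""
    items = []
    while not batch_q.empty() and len(items) < max_items:
        items.append(batch_q.get())
    return items
```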
Fine-tuning lets you replace a large model + complex prompt with a small model that has the behavior baked in. The economics are compelling: a fine-tuned GPT-4o mini can match GPT-4o quality on narrow tasks at 1/15th the inference cost. But fine-tuning has upfront costs and is only worth it at sufficient scale.
| Approach | Cost/1K calls | Quality | Latency | Setup Cost | Break-Even |
|---|---|---|---|---|---|
| GPT-4o + detailed prompt | $25.00 | 95% | High | $0 | N/A |
| GPT-4o mini + few-shot | $1.50 | 88% | Low | $0 | N/A |
| GPT-4o mini fine-tuned | $0.90 | 93% | Low | $50-200 | ~300 |
| Llama 3.1 8B fine-tuned (self-hosted) | $0.10 | 90% | Very Low | $500-2000 | ~2,000 |
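The break-even arithmetic itself is simple, and the answer depends heavily on which baseline you are replacing. A sketch using the illustrative prices from the table and an assumed $200 fine-tuning cost:

```python
# Break-even volume for fine-tuning: one-off setup cost divided by the
# per-call savings relative to the setup being replaced. Prices are the
# illustrative figures from the table above, not quotes.
def break_even_calls(setup_cost: float, old_cost_per_1k: float, new_cost_per_1k: float) -> float:
    savings_per_call = (old_cost_per_1k - new_cost_per_1k) / 1000
    return setup_cost / savings_per_call

# Replacing "GPT-4o + detailed prompt" ($25.00/1K calls) with a fine-tuned
# GPT-4o mini ($0.90/1K calls), assuming a $200 fine-tuning cost:
print(f"{break_even_calls(200, 25.00, 0.90):,.0f} calls")   # ~8,300 calls
# Against the already-cheap "GPT-4o mini + few-shot" baseline ($1.50/1K calls):
print(f"{break_even_calls(200, 1.50, 0.90):,.0f} calls")    # ~333,000 calls
```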
At high volume, self-hosting open-source models (Llama 3.1, Mistral, Qwen) can reduce per-token costs by 80-95%. The trade-off is operational complexity: you need GPU infrastructure, model serving, monitoring, and on-call support. The break-even point depends on your volume.
| Option | 100K req/mo | 1M req/mo | 10M req/mo | Pros | Cons |
|---|---|---|---|---|---|
| OpenAI API (GPT-4o) | $2,500 | $25,000 | $250,000 | No ops, always latest model | Highest marginal cost, vendor lock-in |
| GPU Rental (A100 80GB) | $2,000 | $2,000 | $6,000 | Fixed cost at scale, data stays local | Ops burden, capacity planning |
| Owned Hardware (H100) | $4,500* | $4,500* | $4,500* | Lowest long-term cost, full control | High upfront ($30-40K), depreciation |
* Owned hardware cost amortized over 36 months. Does not include electricity (~$200/mo for H100), rack space, or ops personnel.
Self-host when you have (a) consistent volume above 1M tokens/day, (b) an ML ops team or willingness to build one, (c) data sovereignty requirements (GDPR, HIPAA), or (d) API spend exceeding $5,000/month. Below those thresholds, the operational complexity almost never justifies the savings. Start with serverless inference providers (Together AI, Fireworks) as a middle ground before committing to raw GPU rental.
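The same kind of sanity check works for self-hosting. A sketch using the table's figures (GPT-4o at roughly $0.025 per request vs. a ~$2,000/month rented A100); the raw crossover lands well below the thresholds above, which is exactly why the ops burden, not the GPU bill, should drive the decision:

```python
# Monthly cost crossover between a per-request API and fixed-cost GPU rental.
# Figures are the illustrative ones from the table: ~$0.025/request on the API
# (GPT-4o at ~$25 per 1K calls) vs. ~$2,000/month for a rented A100.
API_COST_PER_REQUEST = 0.025
GPU_RENTAL_PER_MONTH = 2000.0

def cheaper_to_self_host(requests_per_month: int) -> bool:
    return requests_per_month * API_COST_PER_REQUEST > GPU_RENTAL_PER_MONTH

for volume in (50_000, 100_000, 1_000_000):
    print(volume, "req/mo ->", "self-host" if cheaper_to_self_host(volume) else "API")
# Raw crossover at ~80K requests/month under these assumptions -- before
# accounting for serving, monitoring, and on-call costs.
```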
Cost optimization is not a one-time project. Without continuous monitoring, costs creep back up through prompt drift, new features, and changing usage patterns. You need real-time visibility into where every dollar goes.
| Metric | Description | Target | Tool |
|---|---|---|---|
| Cost per Request | Total cost (input + output tokens) per API call, broken down by feature | Track trend, < budget | Custom logging / Helicone |
| Cost per User Session | Aggregate cost across all LLM calls in one user interaction | < $0.05 for most apps | LangSmith / custom |
| Cache Hit Rate | Percentage of requests served from semantic cache | > 30% | Redis metrics / custom |
| Token Efficiency | Ratio of useful output tokens to total tokens consumed | > 60% | Custom analysis |
| Model Routing Distribution | What percentage of traffic goes to each model tier | < 20% to large model | Custom dashboard |
| Daily Spend Rate | Rolling daily cost with anomaly detection for spikes | < 2x daily average | Helicone / alerts |
Tag every LLM call with the feature it serves (e.g., "chat", "search", "summarization", "classification"). This lets you answer "Which feature costs the most?" and "Is the cost per user interaction sustainable?" Without this, you are optimizing blind. Pass metadata like `{"feature": "chat", "user_tier": "free"}` through your LLM proxy headers.
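A minimal sketch of that attribution as a wrapper around the chat call. The hard-coded price table and the `print` stand-in for a real logging pipeline are assumptions you would replace with your own pricing source and observability stack:

```python
# Per-feature cost attribution: wrap every LLM call, compute its cost from the
# usage block, and log it alongside the feature tag. Prices are $ per 1M tokens
# and must be kept in sync with your providers' current pricing.
import time
from openai import OpenAI

client = OpenAI()
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}  # (input, output)

def tracked_chat(model: str, messages: list[dict], feature: str, user_tier: str = "free") -> str:
    start = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    in_price, out_price = PRICES[model]
    cost = (resp.usage.prompt_tokens * in_price +
            resp.usage.completion_tokens * out_price) / 1_000_000

    # Ship this record to your logging pipeline (Helicone, LangSmith, or a plain
    # events table) so spend can be broken down per feature and user tier.
    print({
        "feature": feature, "user_tier": user_tier, "model": model,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "cost_usd": round(cost, 6), "latency_s": round(time.time() - start, 3),
    })
    return resp.choices[0].message.content or ""
```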
Do not try to implement everything at once. Follow this priority order based on effort-to-impact ratio. Each step compounds on the previous ones.
1. **Instrument everything.** Add logging to every LLM call. Track tokens in/out, model used, feature, cost, latency. You cannot optimize what you do not measure.
2. **Compress your prompts.** Review and compress every system prompt. Remove redundancy, shorten instructions, cut unnecessary few-shot examples. Typical savings: 20-40%.
3. **Add model routing.** Set up a basic router. Start with task-based routing (simple rules), then graduate to a classifier. Route 70%+ of traffic to the cheapest viable model.
4. **Add semantic caching.** Deploy a semantic cache for high-traffic endpoints. Start with exact-match, then add embedding similarity. Target a 30%+ hit rate.
5. **Move eligible work to batch.** Identify workloads that do not need real-time responses. Switch to batch endpoints for 50% savings on those calls.
6. **Monitor continuously.** Deploy cost dashboards with per-feature attribution. Set up anomaly alerts. Make LLM cost a first-class operational metric.
7. **Evaluate fine-tuning and self-hosting.** Once you have data on per-task costs and volumes, evaluate whether fine-tuning or self-hosting makes economic sense for your highest-volume tasks.
| Optimization | Effort | Impact | Savings | When to Do It |
|---|---|---|---|---|
| Prompt compression | Low | Medium | 20-40% | Always do first |
| Model routing | Medium | Very High | 60-80% | When > $500/mo spend |
| Semantic caching | Medium | High | 30-60% | When queries are repetitive |
| Batch processing | Low | Medium | 50% on batch-eligible | When latency is not critical |
| Fine-tuning | High | High | 70-90% | When > 10K calls/day on one task |
| Self-hosting | Very High | Very High | 80-95% | When > $10K/mo or data sovereignty |
Starting baseline: $10,000/month on LLM APIs.