Most teams overspend on LLM inference by 3-10x. This guide covers the engineering techniques that cut costs by 60-90% without sacrificing output quality -- from model routing and semantic caching to fine-tuning economics and self-hosting break-even analysis.
LLM costs have a nasty habit of growing exponentially. What starts as a manageable $200/day prototype quickly becomes a $2,000/day production nightmare. The math is simple but brutal: per-token pricing × growing usage × context window inflation = exponential cost curves.
Here is a real scenario we see repeatedly: A team builds a customer support chatbot. In development, they test with short conversations and simple queries. Cost: $8/day. They launch to 500 users. Conversations get longer, context windows fill up, retry logic fires on timeouts, and the system prompt grows with each edge case fix. Within three weeks, the same chatbot costs $2,400/day -- a 300x increase that nobody budgeted for.
A B2B SaaS company launched an AI assistant using GPT-4o for all queries and saw exactly this kind of runaway cost trajectory.
After implementing the techniques in this guide (routing + caching + prompt compression), they brought costs down to $320/day at 500 users -- an 87% reduction.
Before optimizing, you need to understand where money goes. LLM costs break down into several distinct categories, and the split varies dramatically by application type.
- **Input tokens:** System prompts, conversation history, retrieved context (RAG), few-shot examples. This is where most of the money goes, and where the biggest savings are.
- **Output tokens:** Generated responses. Output tokens cost 2-4x more per token than input, but volume is typically lower. Verbose responses are the main cost driver.
- **Everything else:** Embedding generation, fine-tuning compute, vector storage, logging, and monitoring infrastructure. Small per-unit costs that add up at scale.
| Model | Provider | Input | Output | Context | Notes |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | Best general-purpose, multimodal |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K | Great for simple tasks, 17x cheaper input than 4o |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | 200K | Strong reasoning, large context window |
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | 200K | Fast, cost-effective for classification |
| Mistral Large | Mistral | $2.00 | $6.00 | 128K | European provider, GDPR-friendly |
| Llama 3.1 70B (self-hosted) | Meta (open-source) | ~$0.30* | ~$0.30* | 128K | GPU cost only, no per-token fee |
* Self-hosted costs are approximate, based on A100 GPU rental at ~$2/hr serving Llama 3.1 70B with vLLM. Actual costs depend on throughput and utilization.
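For intuition on where that ~$0.30 figure comes from, here is the back-of-envelope arithmetic. The throughput number is an assumption, and real numbers swing with batch size, sequence length, and GPU utilization:

```python
# Back-of-envelope cost per million tokens for a self-hosted model.
# Assumptions (not measured): $2/hr A100 rental and ~1,850 tokens/sec
# sustained throughput with vLLM continuous batching at high utilization.
gpu_cost_per_hour = 2.00          # USD
throughput_tokens_per_sec = 1850  # assumed sustained throughput

tokens_per_hour = throughput_tokens_per_sec * 3600        # ~6.66M tokens/hr
cost_per_million = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)

print(f"~${cost_per_million:.2f} per 1M tokens")  # ~$0.30 under these assumptions
```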
GPT-4o input tokens cost $2.50/1M. GPT-4o mini costs $0.15/1M. That is a 17x price difference. For classification, extraction, and simple Q&A, the quality difference is often negligible. Model routing exploits this gap.
Model routing is the single highest-impact optimization. The idea is simple: route easy tasks to cheap models and hard tasks to expensive models. Most production workloads are 70-80% simple tasks that a small model handles perfectly. Typical savings: 60-80%.
- **Complexity-based routing:** A small model or heuristic classifies the query complexity, then routes to the appropriate model tier.
- **Task-based routing:** Route by task type: classification, extraction, summarization, generation, reasoning. Each task maps to an optimal model.
- **Cascade routing:** Start with the cheapest model. If confidence is low or the response fails validation, escalate to a larger model.
- **Verifier-gated routing:** A small verifier model checks if the cheap model's output meets quality thresholds before returning it.
1. **Score query complexity.** Use a lightweight classifier (logistic regression on embeddings, or a rule-based system) to score query complexity on a 0-1 scale. Overhead: ~0.01 ms per query.
2. **Route by threshold.** Score < 0.3 goes to GPT-4o mini ($0.15/1M input). Score 0.3-0.7 goes to Claude 3.5 Haiku ($0.80/1M). Score > 0.7 goes to GPT-4o ($2.50/1M).
3. **Escalate on failure.** If the cheap model returns low-confidence output or fails validation, automatically escalate to the next tier, as in the sketch below. Typically only 5-10% of queries escalate.
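Here is a minimal sketch of that three-tier cascade. The complexity scorer and validator are deliberately naive stand-ins for a real classifier and task-specific checks, and the mid tier calls GPT-4o mini rather than Claude 3.5 Haiku only to keep the example single-provider:

```python
# Minimal three-tier router sketch. The scorer and validator below are naive
# placeholders; in production you'd use a classifier trained on labelled
# queries and output checks specific to your task.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

TIERS = [
    (0.3, "gpt-4o-mini"),  # score < 0.3  -> cheapest tier
    (0.7, "gpt-4o-mini"),  # 0.3 - 0.7    -> mid tier (swap in Claude 3.5 Haiku here)
    (1.1, "gpt-4o"),       # score > 0.7  -> large model
]

def score_complexity(query: str) -> float:
    # Naive heuristic stand-in: long or multi-part questions score higher.
    signals = [len(query) > 200, "why" in query.lower(), query.count("?") > 1]
    return min(1.0, 0.2 + 0.3 * sum(signals))

def passes_validation(text: str) -> bool:
    # Placeholder check: non-empty and doesn't admit failure.
    return bool(text.strip()) and "i don't know" not in text.lower()

def route(query: str) -> str:
    score = score_complexity(query)
    start = next(i for i, (cap, _) in enumerate(TIERS) if score < cap)

    # Cascade: escalate to the next tier when validation fails
    # (typically only 5-10% of queries).
    for _, model in TIERS[start:]:
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": query}]
        ).choices[0].message.content or ""
        if passes_validation(reply):
            return reply
    return reply  # best effort from the largest tier
```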
A customer support platform processing 50,000 queries/day switched from GPT-4o for everything to a routing setup: 72% to GPT-4o mini, 20% to Claude Haiku, 8% to GPT-4o. Monthly cost dropped from $38,000 to $6,200 -- an 84% reduction with no measurable quality degradation on their eval suite.
If a user asks "What is your return policy?" and another asks "How do I return an item?", they want the same answer. Semantic caching detects these similar queries and serves cached responses instead of making redundant API calls. For applications with repetitive query patterns, this alone can cut costs by 30-60%.
| Approach | Hit Rate | Effort | Savings | Best For |
|---|---|---|---|---|
| Exact Match Cache | 10-20% | Low | Low | Repeated identical queries (FAQ bots, autocomplete) |
| Semantic Cache (cosine > 0.95) | 30-50% | Medium | High | Similar questions with same answer (customer support) |
| Prompt-Aware Cache | 40-60% | High | Very High | Same system prompt + similar user queries |
| Prefix Caching (API-level) | Automatic | None | Medium | Shared system prompts across requests (Anthropic, OpenAI) |
1. **Embed the query.** Generate an embedding vector for the user query using a fast embedding model (e.g., text-embedding-3-small at $0.02/1M tokens).
2. **Search the cache.** Use Redis with the vector search module (RediSearch) or a lightweight vector DB. Set the threshold at 0.95+ cosine similarity for high precision.
3. **Serve or store.** On hit: return the cached response in <50ms. On miss: call the LLM, store the result with its embedding and a TTL (e.g., 24 hours for dynamic content, 7 days for static), as in the sketch below.
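A minimal in-memory sketch of that flow, assuming OpenAI's text-embedding-3-small for embeddings. A production version would swap the Python list for Redis vector search or a vector DB, but the lookup logic is the same:

```python
# In-memory semantic cache sketch. Production setups would use Redis/RediSearch
# or a vector DB instead of a Python list; the lookup logic is identical.
import time
import numpy as np
from openai import OpenAI

client = OpenAI()
CACHE: list[dict] = []          # each entry: {"vec", "response", "expires"}
SIMILARITY_THRESHOLD = 0.95     # high precision: only near-duplicate queries hit
TTL_SECONDS = 24 * 3600         # 24h for dynamic content; longer for static

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_answer(query: str) -> str:
    qvec = embed(query)
    now = time.time()
    for entry in CACHE:
        if entry["expires"] < now:
            continue
        sim = float(np.dot(qvec, entry["vec"]) /
                    (np.linalg.norm(qvec) * np.linalg.norm(entry["vec"])))
        if sim >= SIMILARITY_THRESHOLD:
            return entry["response"]                      # cache hit: no LLM call

    # Cache miss: call the LLM, then store the result with its embedding + TTL.
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": query}]
    ).choices[0].message.content or ""
    CACHE.append({"vec": qvec, "response": reply, "expires": now + TTL_SECONDS})
    return reply
```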
Every token in your prompt costs money. Most production prompts contain 30-50% redundant tokens -- verbose instructions, unnecessary examples, and formatting that the model does not need. Prompt optimization is the lowest-effort, highest-return starting point.
- **Compress the system prompt:** Remove redundant instructions, use abbreviations, consolidate rules. A 2000-token system prompt often compresses to 800 tokens with zero quality loss (see the before/after sketch following this list).
- **Cut few-shot examples:** Replace verbose few-shot examples with concise instructions, or fine-tune a small model on the examples instead of passing them with every call.
- **Force structured output:** Use JSON mode or function calling to eliminate verbose prose. "Explain your reasoning" adds 200+ tokens per response.
- **Prune context:** Only include relevant conversation history. Summarize old turns. Remove system messages that the model already learned from fine-tuning.
- **Cap output length:** Set max_tokens appropriately. Use "Be concise" or "Answer in under 100 words" in the prompt. Use stop sequences for early termination.
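To make the compression concrete, here is an illustrative before/after of a support-bot system prompt. The company name, rules, and token counts in the comments are assumptions, chosen only to line up with the savings math below:

```python
# Illustrative only: a verbose system prompt vs. a compressed equivalent.
# Assume the full verbose prompt runs ~2,250 tokens once every policy,
# edge-case fix, and few-shot example is included; the compressed one ~750.

VERBOSE_SYSTEM_PROMPT = """
You are a helpful, friendly, and professional customer support assistant for
Acme Inc. You should always be polite and courteous to the customer. When a
customer asks about returns, you should explain the full returns policy in
detail, including all exceptions... (plus ~1,500 more tokens of rules,
edge-case fixes, and few-shot examples accumulated over time)
"""

COMPRESSED_SYSTEM_PROMPT = """
Acme Inc. support agent. Be concise and polite.
Rules: returns within 30 days w/ receipt; no returns on clearance items;
escalate billing disputes to a human. Answer in under 100 words.
Output JSON: {"answer": str, "escalate": bool}.
"""
```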
A compression like this preserves the same behavior with 67% fewer input tokens. At 50K requests/day with GPT-4o, that saves ~$190/day ($5,700/month) on system prompt tokens alone.
If your workload does not require real-time responses, batch APIs offer an immediate 50% cost reduction with zero engineering effort. OpenAI's Batch API, Anthropic's Message Batches, and most providers offer discounted pricing for asynchronous processing.
For mixed workloads, implement a queue that separates real-time and batch-eligible requests. Use priority queues to route latency-sensitive work to synchronous APIs and everything else to batch endpoints.
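A minimal sketch of that split using standard-library queues; `is_latency_sensitive()` is a placeholder for whatever signal your application actually has (user-facing chat vs. nightly report generation, for example):

```python
# Sketch of splitting traffic between synchronous and batch-eligible paths.
# Latency-sensitive requests go straight to the real-time API; everything else
# is accumulated and flushed periodically to a discounted batch endpoint.
import queue

realtime_q: queue.Queue = queue.Queue()
batch_q: queue.Queue = queue.Queue()

def is_latency_sensitive(request: dict) -> bool:
    # Placeholder signal: user-facing chat is synchronous; reports can wait.
    return request.get("source") == "chat"

def enqueue(request: dict) -> None:
    (realtime_q if is_latency_sensitive(request) else batch_q).put(request)

def flush_batch(max_items: int = 10_000) -> list[dict]:
    """Drain batch-eligible requests, e.g. to build a JSONL file for a batch API."""
    items = []
    while not batch_q.empty() and len(items) < max_items:
        items.append(batch_q.get())
    return items
```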
Fine-tuning lets you replace a large model + complex prompt with a small model that has the behavior baked in. The economics are compelling: a fine-tuned GPT-4o mini can match GPT-4o quality on narrow tasks at 1/15th the inference cost. But fine-tuning has upfront costs and is only worth it at sufficient scale.
| Approach | Cost/1K calls | Quality | Latency | Setup Cost | Break-Even |
|---|---|---|---|---|---|
| GPT-4o + detailed prompt | $25.00 | 95% | High | $0 | N/A |
| GPT-4o mini + few-shot | $1.50 | 88% | Low | $0 | N/A |
| GPT-4o mini fine-tuned | $0.90 | 93% | Low | $50-200 | ~300 |
| Llama 3.1 8B fine-tuned (self-hosted) | $0.10 | 90% | Very Low | $500-2000 | ~2,000 |
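The break-even arithmetic itself is simple, and the answer depends heavily on which baseline you are replacing. A sketch using the illustrative prices from the table and an assumed $200 fine-tuning cost:

```python
# Break-even volume for fine-tuning: one-off setup cost divided by the
# per-call savings relative to the setup being replaced. Prices are the
# illustrative figures from the table above, not quotes.
def break_even_calls(setup_cost: float, old_cost_per_1k: float, new_cost_per_1k: float) -> float:
    savings_per_call = (old_cost_per_1k - new_cost_per_1k) / 1000
    return setup_cost / savings_per_call

# Replacing "GPT-4o + detailed prompt" ($25.00/1K calls) with a fine-tuned
# GPT-4o mini ($0.90/1K calls), assuming a $200 fine-tuning cost:
print(f"{break_even_calls(200, 25.00, 0.90):,.0f} calls")   # ~8,300 calls
# Against the already-cheap "GPT-4o mini + few-shot" baseline ($1.50/1K calls):
print(f"{break_even_calls(200, 1.50, 0.90):,.0f} calls")    # ~333,000 calls
```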
At high volume, self-hosting open-source models (Llama 3.1, Mistral, Qwen) can reduce per-token costs by 80-95%. The trade-off is operational complexity: you need GPU infrastructure, model serving, monitoring, and on-call support. The break-even point depends on your volume.
| Option | 100K req/mo | 1M req/mo | 10M req/mo | Pros | Cons |
|---|---|---|---|---|---|
| OpenAI API (GPT-4o) | $2,500 | $25,000 | $250,000 | No ops, always latest model | Highest marginal cost, vendor lock-in |
| GPU Rental (A100 80GB) | $2,000 | $2,000 | $6,000 | Fixed cost at scale, data stays local | Ops burden, capacity planning |
| Owned Hardware (H100) | $4,500* | $4,500* | $4,500* | Lowest long-term cost, full control | High upfront ($30-40K), depreciation |
* Owned hardware cost amortized over 36 months. Does not include electricity (~$200/mo for H100), rack space, or ops personnel.
Self-host when you have (a) consistent volume above 1M tokens/day, (b) an ML ops team or willingness to build one, (c) data sovereignty requirements (GDPR, HIPAA), or (d) API spend exceeding $5,000/month. Below those thresholds, the operational complexity almost never justifies the savings. Start with serverless inference providers (Together AI, Fireworks) as a middle ground before committing to raw GPU rental.
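The same kind of sanity check works for self-hosting. A sketch using the table's figures (GPT-4o at roughly $0.025 per request vs. a ~$2,000/month rented A100); the raw crossover lands well below the thresholds above, which is exactly why the ops burden, not the GPU bill, should drive the decision:

```python
# Monthly cost crossover between a per-request API and fixed-cost GPU rental.
# Figures are the illustrative ones from the table: ~$0.025/request on the API
# (GPT-4o at ~$25 per 1K calls) vs. ~$2,000/month for a rented A100.
API_COST_PER_REQUEST = 0.025
GPU_RENTAL_PER_MONTH = 2000.0

def cheaper_to_self_host(requests_per_month: int) -> bool:
    return requests_per_month * API_COST_PER_REQUEST > GPU_RENTAL_PER_MONTH

for volume in (50_000, 100_000, 1_000_000):
    print(volume, "req/mo ->", "self-host" if cheaper_to_self_host(volume) else "API")
# Raw crossover at ~80K requests/month under these assumptions -- before
# accounting for serving, monitoring, and on-call costs.
```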
Cost optimization is not a one-time project. Without continuous monitoring, costs creep back up through prompt drift, new features, and changing usage patterns. You need real-time visibility into where every dollar goes.
| Metric | Description | Target | Tool |
|---|---|---|---|
| Cost per Request | Total cost (input + output tokens) per API call, broken down by feature | Track trend, < budget | Custom logging / Helicone |
| Cost per User Session | Aggregate cost across all LLM calls in one user interaction | < $0.05 for most apps | LangSmith / custom |
| Cache Hit Rate | Percentage of requests served from semantic cache | > 30% | Redis metrics / custom |
| Token Efficiency | Ratio of useful output tokens to total tokens consumed | > 60% | Custom analysis |
| Model Routing Distribution | What percentage of traffic goes to each model tier | < 20% to large model | Custom dashboard |
| Daily Spend Rate | Rolling daily cost with anomaly detection for spikes | < 2x daily average | Helicone / alerts |
Tag every LLM call with the feature it serves (e.g., "chat", "search", "summarization", "classification"). This lets you answer "Which feature costs the most?" and "Is the cost per user interaction sustainable?" Without this, you are optimizing blind. Pass metadata like `{"feature": "chat", "user_tier": "free"}` through your LLM proxy headers.
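A minimal sketch of that attribution as a wrapper around the chat call. The hard-coded price table and the `print` stand-in for a real logging pipeline are assumptions you would replace with your own pricing source and observability stack:

```python
# Per-feature cost attribution: wrap every LLM call, compute its cost from the
# usage block, and log it alongside the feature tag. Prices are $ per 1M tokens
# and must be kept in sync with your providers' current pricing.
import time
from openai import OpenAI

client = OpenAI()
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}  # (input, output)

def tracked_chat(model: str, messages: list[dict], feature: str, user_tier: str = "free") -> str:
    start = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    in_price, out_price = PRICES[model]
    cost = (resp.usage.prompt_tokens * in_price +
            resp.usage.completion_tokens * out_price) / 1_000_000

    # Ship this record to your logging pipeline (Helicone, LangSmith, or a plain
    # events table) so spend can be broken down per feature and user tier.
    print({
        "feature": feature, "user_tier": user_tier, "model": model,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "cost_usd": round(cost, 6), "latency_s": round(time.time() - start, 3),
    })
    return resp.choices[0].message.content or ""
```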
Do not try to implement everything at once. Follow this priority order based on effort-to-impact ratio. Each step compounds on the previous ones.
1. **Instrument everything.** Add logging to every LLM call. Track tokens in/out, model used, feature, cost, latency. You cannot optimize what you do not measure.
2. **Compress your prompts.** Review and compress every system prompt. Remove redundancy, shorten instructions, cut unnecessary few-shot examples. Typical savings: 20-40%.
3. **Add model routing.** Set up a basic router. Start with task-based routing (simple rules), then graduate to a classifier. Route 70%+ of traffic to the cheapest viable model.
4. **Add semantic caching.** Deploy a semantic cache for high-traffic endpoints. Start with exact-match, then add embedding similarity. Target a 30%+ hit rate.
5. **Move eligible work to batch.** Identify workloads that do not need real-time responses. Switch to batch endpoints for 50% savings on those calls.
6. **Monitor continuously.** Deploy cost dashboards with per-feature attribution. Set up anomaly alerts. Make LLM cost a first-class operational metric.
7. **Evaluate fine-tuning and self-hosting.** Once you have data on per-task costs and volumes, evaluate whether fine-tuning or self-hosting makes economic sense for your highest-volume tasks.
| Optimization | Effort | Impact | Savings | When to Do It |
|---|---|---|---|---|
| Prompt compression | Low | Medium | 20-40% | Always do first |
| Model routing | Medium | Very High | 60-80% | When > $500/mo spend |
| Semantic caching | Medium | High | 30-60% | When queries are repetitive |
| Batch processing | Low | Medium | 50% on batch-eligible | When latency is not critical |
| Fine-tuning | High | High | 70-90% | When > 10K calls/day on one task |
| Self-hosting | Very High | Very High | 80-95% | When > $10K/mo or data sovereignty |
Starting baseline: $10,000/month on LLM APIs.