A complete decision framework for evaluating AI vendors across 8 dimensions. From the $2M mistake pattern through 25 RFP questions, 12 red flags, and a real case study — everything you need to select the right AI vendor and avoid costly lock-in.
A European fintech chose their LLM vendor based on a 45-minute demo and a favourable benchmark blog post. Eighteen months later, they spent $2.1M migrating off it. The model had been deprecated, their compliance team rejected the vendor's data processing agreement, and per-token costs had tripled since their initial budget. None of this was unforeseeable. All of it would have been caught by a structured evaluation.
This story is not unusual. In conversations with over 80 engineering leaders across Europe, the same failure modes appear repeatedly. The root cause is almost never the technology. It is the process — or the absence of one.
Provider-specific prompt formats, function calling schemas, and SDK patterns accumulate into invisible migration debt. Average engineering cost to switch LLM providers mid-project: $50K–$200K and 3–6 months. Most teams don't discover the dependency until they receive a deprecation notice or pricing increase.
Public benchmarks (MMLU, GPQA, HumanEval) measure general academic capability. Your production workload is not general. A model ranking #1 on MMLU may rank #4 on your specific contract extraction or customer support task. Decisions based on benchmarks without domain-specific piloting routinely disappoint.
Per-token API pricing is only 40–60% of actual AI infrastructure spend. Egress fees, fine-tuning compute, compliance audits, support tier upgrades, and migration engineering are the invisible majority. Teams that budget only for tokens routinely see 2–3x cost overruns in year two.
Every AI vendor selection should be evaluated across these eight dimensions. The default weights below suit an enterprise deploying LLM infrastructure in a regulated European context — adjust weights to match your specific priorities. A healthcare CISO will weight Security at 35%. A startup racing to market may weight Technical Performance at 40%.
Weights must sum to 100. Sections 3, 4, and 5 provide deep-dives on the three highest-weight dimensions.
- **Technical Performance (default weight 25%):** Model quality on your specific tasks, latency, throughput, and accuracy under realistic conditions.
- **Security & Compliance (20%):** Certifications (SOC 2, ISO 27001, HIPAA), data residency, GDPR posture, EU AI Act alignment.
- **Total Cost of Ownership (15%):** API pricing, training costs, hidden fees, egress, support tiers, and migration engineering overhead.
- **Support & SLAs (10%):** Uptime guarantees, support response times, dedicated CSM, enterprise tier availability.
- **Integration & Ecosystem (10%):** SDK quality, framework compatibility (LangChain, LlamaIndex), CI/CD integration, documentation.
- **Vendor Roadmap & Stability (10%):** Financial runway, model release cadence, deprecation policy, alignment with your product roadmap.
- **Compliance & Regulatory Fit (5%):** Sector-specific requirements: HIPAA for healthcare, PCI-DSS for fintech, EU AI Act risk categorization.
- **Exit Strategy & Portability (5%):** Data export mechanisms, model portability, migration path, contractual exit clauses.
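The eight dimensions and their default weights can be encoded in a small scoring helper. A minimal sketch; the example vendor scores are placeholders, not recommendations:

```python
# Default weights for the eight evaluation dimensions (must sum to 100).
DEFAULT_WEIGHTS = {
    "Technical Performance": 25,
    "Security & Compliance": 20,
    "Total Cost of Ownership": 15,
    "Support & SLAs": 10,
    "Integration & Ecosystem": 10,
    "Vendor Roadmap & Stability": 10,
    "Compliance & Regulatory Fit": 5,
    "Exit Strategy & Portability": 5,
}

def weighted_total(scores: dict[str, int],
                   weights: dict[str, int] = DEFAULT_WEIGHTS) -> float:
    """Score each dimension 1-10, scale by weight, sum to a 0-100 total."""
    assert sum(weights.values()) == 100, "weights must sum to 100"
    # A 9/10 score on a 25%-weight dimension contributes 22.5 points.
    return sum(scores[dim] * weights[dim] / 10 for dim in weights)

# Hypothetical vendor scored 7/10 on every dimension:
example = {dim: 7 for dim in DEFAULT_WEIGHTS}
print(weighted_total(example))  # -> 70.0
```

Adjusting the weights to your context (e.g. Security at 35% for a healthcare CISO) is a one-line change, which is the point: the framework stays fixed while the priorities move.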
```mermaid
flowchart TD
    A([Start: Vendor Evaluation]) --> B[Discovery & Requirements]
    B --> B1[Define use case & constraints]
    B --> B2[Set must-have criteria]
    B --> B3[Identify 15-20 candidate vendors]
    B1 & B2 & B3 --> C[Initial Shortlist]
    C --> C1[Apply MoSCoW filter]
    C1 --> C2{Passes must-haves?}
    C2 -- No --> X1[Eliminate]
    C2 -- Yes --> D[PoC / Pilot Phase]
    D --> D1[Technical benchmark on your data]
    D --> D2[Security review & DPA check]
    D --> D3[Pricing & TCO modelling]
    D1 & D2 & D3 --> E[Weighted Scoring Matrix]
    E --> E1[Score top 3 vendors]
    E1 --> F[Commercial Negotiation]
    F --> F1[SLA terms]
    F --> F2[Data processing agreement]
    F --> F3[Exit clause negotiation]
    F1 & F2 & F3 --> G([Vendor Selected])
    style A fill:#1a1a2e,stroke:#7c3aed,color:#e2e8f0
    style G fill:#0d1f12,stroke:#22c55e,color:#e2e8f0
    style X1 fill:#1f0d0d,stroke:#ef4444,color:#e2e8f0
    style C2 fill:#1e1b4b,stroke:#6366f1,color:#e2e8f0
```

Default weight: 25%
Technical performance evaluation has three components: benchmarking methodology, latency and throughput measurement, and accuracy testing on your specific domain. All three must be run before committing.
Public benchmarks are a starting point, not a decision input. MMLU tests broad academic knowledge. HumanEval tests Python code generation. Neither tests your specific task. Build a domain-specific evaluation set from real production data before running any vendor comparison.
Never evaluate latency with a single request. Measure under realistic concurrent load using your expected production traffic pattern. Vendor demo latency is always best-case single-request.
| Metric | What It Measures | Acceptable Threshold | How to Measure |
|---|---|---|---|
| P50 Latency | Median response time | < 400ms for simple tasks | Load test at 1x prod volume |
| P95 Latency | 95th percentile — the user experience floor | < 1,200ms for complex tasks | Load test at 2x prod volume |
| P99 Latency | Worst-case — worst 1% of users | < 3,000ms (SLA ceiling) | Load test at 3x prod volume |
| Time to First Token | Perceived speed for streaming responses | < 300ms at P95 | Measure TTFT separately from total latency |
| Tokens/second | Generation throughput per request | > 40 tokens/s for real-time UX | Token count / total generation time |
| Rate limit capacity | Max concurrent requests / tokens per minute | ≥ 2x peak production volume | Review docs + test burst behaviour |
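The table above can be measured with a small concurrent load harness. This sketch simulates the vendor call with a random delay; in a real test, replace `call_model` with your vendor's actual SDK call:

```python
import asyncio
import random
import statistics
import time

async def call_model(prompt: str) -> float:
    """Placeholder for a real API call -- swap in your vendor's SDK here.
    Returns the observed latency in seconds (simulated below)."""
    started = time.perf_counter()
    await asyncio.sleep(random.uniform(0.1, 0.6))  # stand-in for network + inference
    return time.perf_counter() - started

async def load_test(concurrency: int, total_requests: int) -> dict[str, float]:
    """Fire total_requests with bounded concurrency; report P50/P95/P99."""
    sem = asyncio.Semaphore(concurrency)

    async def one() -> float:
        async with sem:
            return await call_model("test prompt")

    latencies = await asyncio.gather(*(one() for _ in range(total_requests)))
    qs = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

print(asyncio.run(load_test(concurrency=20, total_requests=200)))
```

Run the same harness at 1x, 2x, and 3x production volume to populate the P50/P95/P99 rows, and time the first streamed chunk separately to measure TTFT.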
Default weight: 20%
Security and compliance is the most common reason AI vendor selections fail post-commit. These checks must happen before the PoC, not after. A vendor that cannot clear the compliance bar is eliminated regardless of technical performance.
| Provider | EU Region | Data Never Leaves EU | Self-Hosted Option | DPA Available |
|---|---|---|---|---|
| OpenAI (direct) | Not available | No — US servers | No | Yes (Enterprise) |
| OpenAI via Azure | Yes (Sweden, France, Netherlands) | Yes (PTU) | No | Yes (Azure DPA) |
| Anthropic (direct) | Not available | No — US servers | No | Yes (Enterprise) |
| Anthropic via Bedrock | Yes (Frankfurt, Ireland) | Yes | No | Yes (AWS DPA) |
| Mistral (direct) | Yes (France) | Yes — EU-native | Open weights | Yes (standard) |
| Google Vertex AI | Yes (Belgium, Netherlands) | Yes (regional endpoint) | No | Yes (GCP DPA) |
Default weight: 15%
TCO modelling for AI vendors spans five cost categories. Most teams budget only for the first, so the full picture is usually 2–3x higher than initial estimates. Build a 3-year model before committing.

1. **API and token costs:** the only cost most teams include in their budget.
2. **Fine-tuning compute:** typically adds 20–40% to API costs for teams using fine-tuning.
3. **Egress and support tiers:** often 30–60% of API costs for mature production deployments.
4. **Compliance and audits:** one-time and annual recurring costs totalling $10K–$50K/year for regulated industries.
5. **Migration engineering:** the most underestimated category. Estimate 3–6 months of migration work if switching mid-project.
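A back-of-envelope 3-year model using the mid-range multipliers quoted above. The defaults are illustrative assumptions, not a pricing quote; plug in your own figures:

```python
def three_year_tco(annual_api_cost: float,
                   fine_tuning_factor: float = 0.30,     # fine-tuning: 20-40% of API costs
                   egress_support_factor: float = 0.45,  # egress/support: 30-60% when mature
                   compliance_per_year: float = 30_000,  # audits: $10K-$50K/yr if regulated
                   migration_reserve: float = 125_000,   # one-time $50K-$200K switching cost
                   ) -> float:
    """Sum all five cost categories over three years (mid-range defaults)."""
    annual = annual_api_cost * (1 + fine_tuning_factor + egress_support_factor) \
             + compliance_per_year
    return 3 * annual + migration_reserve

tokens_only = 3 * 200_000                        # what most teams budget: tokens alone
full = three_year_tco(annual_api_cost=200_000)   # -> 1265000.0
print(full, full / tokens_only)                  # overrun multiple lands in the 2-3x range
```

With $200K/year of token spend, the full model comes to roughly $1.27M over three years, about 2.1x the tokens-only budget, which is exactly the overrun pattern described above.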
A worked example comparing four vendors for a European enterprise LLM deployment. Score each vendor 1–10 per dimension, multiply by dimension weight, and sum for the weighted total.
| Dimension | Weight | Vendor A (US hyperscaler) | Vendor B (Cloud platform) | Vendor C (EU-native) | Vendor D (Open-source host) |
|---|---|---|---|---|---|
| Technical Performance | 25% | 9/10 (22.5) | 8/10 (20.0) | 7/10 (17.5) | 6/10 (15.0) |
| Security & Compliance | 20% | 5/10 (10.0) | 8/10 (16.0) | 10/10 (20.0) | 7/10 (14.0) |
| Total Cost of Ownership | 15% | 6/10 (9.0) | 7/10 (10.5) | 8/10 (12.0) | 9/10 (13.5) |
| Support & SLAs | 10% | 8/10 (8.0) | 9/10 (9.0) | 6/10 (6.0) | 5/10 (5.0) |
| Integration & Ecosystem | 10% | 9/10 (9.0) | 7/10 (7.0) | 6/10 (6.0) | 5/10 (5.0) |
| Vendor Roadmap & Stability | 10% | 8/10 (8.0) | 7/10 (7.0) | 9/10 (9.0) | 6/10 (6.0) |
| Compliance & Regulatory Fit | 5% | 4/10 (2.0) | 7/10 (3.5) | 10/10 (5.0) | 8/10 (4.0) |
| Exit Strategy & Portability | 5% | 4/10 (2.0) | 6/10 (3.0) | 9/10 (4.5) | 8/10 (4.0) |
| **Weighted Total** | 100% | 70.5 | 76.0 | **80.0 (Winner)** | 66.5 |
Vendor C (EU-native) wins despite scoring lower on Technical Performance and Integration. The heavy weight on Security & Compliance (20%) and Regulatory Fit (5%) reflects the enterprise context. A startup without compliance requirements would see a different winner.
Tiebreaker rule: If two vendors are within 5 points of each other, run a 2-week parallel pilot on production-scale traffic. The matrix narrows the field — real-world data on your workload makes the final call.
Weight adjustment: Before scoring, have your key stakeholders (CTO, CISO, CFO, DPO) independently assign weights and then average or negotiate. Different weights produce different winners — the weighting conversation is as important as the scoring.
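The stakeholder-weighting step can be made mechanical: each stakeholder submits a weight vector summing to 100, and the per-dimension averages are renormalized. A sketch with hypothetical weight assignments (the numbers below are invented for illustration):

```python
# Hypothetical per-stakeholder weights for the eight dimensions (each row sums to 100).
stakeholder_weights = {
    "CTO":  {"Technical": 40, "Security": 15, "TCO": 10, "Support": 5,
             "Integration": 15, "Roadmap": 10, "Regulatory": 2, "Exit": 3},
    "CISO": {"Technical": 10, "Security": 35, "TCO": 5, "Support": 10,
             "Integration": 5, "Roadmap": 10, "Regulatory": 15, "Exit": 10},
    "CFO":  {"Technical": 15, "Security": 10, "TCO": 40, "Support": 10,
             "Integration": 5, "Roadmap": 10, "Regulatory": 5, "Exit": 5},
    "DPO":  {"Technical": 5, "Security": 30, "TCO": 5, "Support": 5,
             "Integration": 5, "Roadmap": 5, "Regulatory": 30, "Exit": 15},
}

def negotiated_weights(proposals: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each dimension across stakeholders, then renormalize to 100."""
    dims = next(iter(proposals.values())).keys()
    avg = {d: sum(p[d] for p in proposals.values()) / len(proposals) for d in dims}
    total = sum(avg.values())
    return {d: round(avg[d] * 100 / total, 1) for d in dims}

print(negotiated_weights(stakeholder_weights))
```

Averaging is a starting point for the negotiation, not a substitute for it: a large spread on one dimension (Security here ranges from 10 to 35) is itself a signal that stakeholders disagree about what the deployment is for.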
Send these questions to every vendor under consideration before running a pilot. Vendors who refuse to answer or whose answers are vague signal problems. Require written responses — verbal answers from a sales engineer are not contractually binding.
These are observable signals that correlate strongly with production failures, compliance problems, or relationship deterioration. Critical flags are hard stops — do not proceed. High flags require deep investigation. Medium flags are caution signals to manage contractually.
| # | Red Flag | Severity | What It Signals |
|---|---|---|---|
| 1 | No public status page or historical uptime data | Critical | Vendor has something to hide about reliability. Every serious production provider publishes incident history. |
| 2 | Training opt-out requires legal review, not a UI toggle | Critical | Your proprietary prompts and business data are likely being used for model training. Non-negotiable for enterprise. |
| 3 | No SOC 2 Type II report available (only Type I) | Critical | Type I is a point-in-time snapshot with no sustained controls evidence. Type II covers a 6–12 month operating period. |
| 4 | GDPR/DPA documentation requires sales escalation | Critical | A DPA should be self-service or standard. Escalation requirements signal either legal immaturity or deliberate friction. |
| 5 | Pricing requires a sales call for basic tier information | High | Hidden pricing usually means it varies based on perceived budget, creating unpredictability in your cost forecasting. |
| 6 | Model deprecation notice shorter than 6 months | High | Production systems cannot migrate safely in under 6 months. Short deprecation windows destroy engineering plans. |
| 7 | No self-hosted or VPC deployment option for enterprise tier | High | For regulated industries or high-sensitivity data, shared tenancy is often unacceptable. No self-hosted = no deal. |
| 8 | SDK is a thin REST wrapper with no retry/backoff logic | High | Engineering maturity signal. Production-grade SDKs handle retries, streaming, rate limit backoff, and error classification. |
| 9 | Rate limits not documented or changed without prior notice | Medium | Undocumented or volatile rate limits make capacity planning impossible and cause unexpected production failures. |
| 10 | No data residency commitment in writing | Medium | Verbal assurances are not enforceable. Data residency requirements must be in the DPA or MSA, not in a sales deck. |
| 11 | Company founded less than 18 months ago with no enterprise referenceable customers | Medium | Early-stage vendors may pivot, run out of funding, or be acquired. For production AI infrastructure, longevity matters. |
| 12 | No exit clause or data deletion guarantee in standard contract | Medium | What happens to your data and fine-tuned models when you leave? If the contract is silent, assume the worst. |
- **Critical:** Hard stop. Eliminate the vendor immediately unless you can get contractual remediation.
- **High:** Requires detailed investigation and a written mitigation plan before proceeding.
- **Medium:** Caution signal. Manage via contractual protections or documented risk acceptance.
Most vendor evaluations stall because teams try to evaluate too many options in parallel. This 2-week process uses progressive elimination to get to 3 qualified finalists efficiently, saving PoC effort for vendors who actually deserve it.
1. Cast a wide net: 15–20 candidate vendors.
2. Apply hard must-have criteria.
3. Deep-dive on the remaining 6–8 vendors.
4. Hold a 30-minute call with each vendor; ask the 25 RFP questions.
5. Apply the weighted scoring matrix to the top 3–4 vendors.
Apply these as binary pass/fail gates. Any vendor failing a Must Have is eliminated immediately — no exceptions.
3-month process • 12 vendors evaluated • Decision rationale documented
A pan-European retail bank with operations in 7 countries needed an LLM vendor for internal document search and contract analysis. With 52,000 documents, PII-heavy content, and regulatory requirements across multiple jurisdictions, the stakes were high. Here is how they ran the evaluation.
The selected vendor was a European-headquartered provider with native EU data residency. Despite ranking third on raw model performance benchmarks, it ranked first once the 30% weight assigned to Security & Compliance was applied. The two technically superior vendors were both US-headquartered with no EU-only data residency guarantee at the time of evaluation.
The contractual exit clause negotiated gave the bank the right to export all fine-tuned adapters and switch providers with 90 days' notice. This single term reduced the migration risk premium in the risk model by €400K — the cost of assumed future migration engineering.
12-month outcome: The bank processed 890,000 document queries in the first year at a TCO 30% below initial estimates. The vendor expanded EU coverage, which further strengthened the relationship. The structured evaluation process was adopted as the standard for all future AI vendor selections.
Selecting a vendor is the beginning, not the end. Vendor relationships degrade without active management. The teams that get the best outcomes treat vendor management as an ongoing discipline with regular cadence, documented SLA tracking, and clear escalation paths.
| Metric | SLA Target | Measurement | Escalation Trigger |
|---|---|---|---|
| API Uptime | ≥ 99.9% monthly | Synthetic monitoring every 60s from EU region | P1 incident if downtime > 15 minutes |
| P95 Latency | < 800ms for standard requests | 95th percentile of response times over rolling 24h window | Alert if P95 exceeds 1,200ms for > 5 minutes |
| Error Rate | < 0.5% 5xx errors per hour | Error rate across all API endpoints, excluding client errors | Escalate to vendor if > 1% for two consecutive hours |
| Rate Limit Headroom | ≥ 30% spare capacity vs contracted limits | Daily peak usage vs contracted rate limit ceiling | Request limit increase when headroom < 20% for 5 consecutive days |
| Cost per 1K API Calls | Within 10% of modelled baseline | Rolling 7-day average vs original TCO model | Review and renegotiate if sustained > 20% above baseline |
| Quarterly Business Review | Held every 90 days | Vendor roadmap update, incident review, pricing review, SLA compliance report | Trigger formal performance review if any Critical SLA missed |
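The escalation triggers in the table reduce to simple threshold checks against metrics you already collect. A sketch, with hypothetical field names standing in for your monitoring system's schema:

```python
from dataclasses import dataclass

@dataclass
class SlaSnapshot:
    monthly_uptime_pct: float       # from synthetic monitoring
    p95_latency_ms: float           # rolling 24h window
    hourly_5xx_rate_pct: float      # excluding client errors
    rate_limit_headroom_pct: float  # spare capacity vs contracted ceiling

def escalations(s: SlaSnapshot) -> list[str]:
    """Return the escalation actions triggered by the current snapshot."""
    actions = []
    if s.monthly_uptime_pct < 99.9:
        actions.append("Raise P1 incident: uptime SLA breached")
    if s.p95_latency_ms > 1200:
        actions.append("Alert: P95 latency above 1,200ms")
    if s.hourly_5xx_rate_pct > 1.0:
        actions.append("Escalate to vendor: sustained 5xx errors")
    if s.rate_limit_headroom_pct < 20:
        actions.append("Request rate limit increase")
    return actions

print(escalations(SlaSnapshot(99.95, 1450.0, 0.2, 15.0)))
# -> ['Alert: P95 latency above 1,200ms', 'Request rate limit increase']
```

The value of codifying the triggers is less the automation than the paper trail: each escalation is timestamped evidence for the quarterly business review and, if it comes to that, the renegotiation.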
Start 3 months before contract renewal. This is your leverage window.
The single most effective way to reduce vendor lock-in is to abstract your LLM calls behind a routing layer from day one. This is 1–3 days of engineering investment that eliminates months of migration risk.