A complete decision framework for evaluating AI vendors across 8 dimensions. From the $2M mistake pattern through 25 RFP questions, 12 red flags, and a real case study — everything you need to select the right AI vendor and avoid costly lock-in.
A European fintech chose their LLM vendor based on a 45-minute demo and a favourable benchmark blog post. Eighteen months later, they spent $2.1M migrating off it. The model had been deprecated, their compliance team rejected the vendor's data processing agreement, and per-token costs had tripled since their initial budget. None of this was unforeseeable. All of it would have been caught by a structured evaluation.
This story is not unusual. In conversations with over 80 engineering leaders across Europe, the same failure modes appear repeatedly. The root cause is almost never the technology. It is the process — or the absence of one.
Provider-specific prompt formats, function calling schemas, and SDK patterns accumulate into invisible migration debt. Average engineering cost to switch LLM providers mid-project: $50K–$200K and 3–6 months. Most teams don't discover the dependency until they receive a deprecation notice or pricing increase.
Public benchmarks (MMLU, GPQA, HumanEval) measure general academic capability. Your production workload is not general. A model ranking #1 on MMLU may rank #4 on your specific contract extraction or customer support task. Decisions based on benchmarks without domain-specific piloting routinely disappoint.
Per-token API pricing is only 40–60% of actual AI infrastructure spend. Egress fees, fine-tuning compute, compliance audits, support tier upgrades, and migration engineering are the invisible majority. Teams that budget only for tokens routinely see 2–3x cost overruns in year two.
Every AI vendor selection should be evaluated across these eight dimensions. The default weights below suit an enterprise deploying LLM infrastructure in a regulated European context — adjust weights to match your specific priorities. A healthcare CISO will weight Security at 35%. A startup racing to market may weight Technical Performance at 40%.
Weights must sum to 100. Sections 3, 4, and 5 provide deep-dives on the three highest-weight dimensions.
- **Technical Performance (default weight 25%):** Model quality on your specific tasks, latency, throughput, and accuracy under realistic conditions.
- **Security & Compliance (20%):** Certifications (SOC 2, ISO 27001, HIPAA), data residency, GDPR posture, EU AI Act alignment.
- **Total Cost of Ownership (15%):** API pricing, training costs, hidden fees, egress, support tiers, and migration engineering overhead.
- **Support & SLAs (10%):** Uptime guarantees, support response times, dedicated CSM, enterprise tier availability.
- **Integration & Ecosystem (10%):** SDK quality, framework compatibility (LangChain, LlamaIndex), CI/CD integration, documentation.
- **Vendor Roadmap & Stability (10%):** Financial runway, model release cadence, deprecation policy, alignment with your product roadmap.
- **Compliance & Regulatory Fit (5%):** Sector-specific requirements: HIPAA for healthcare, PCI-DSS for fintech, EU AI Act risk categorization.
- **Exit Strategy & Portability (5%):** Data export mechanisms, model portability, migration path, contractual exit clauses.
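The eight dimensions and their default weights can be encoded in a small scoring helper. A minimal sketch; the example vendor scores are placeholders, not recommendations:

```python
# Default weights for the eight evaluation dimensions (must sum to 100).
DEFAULT_WEIGHTS = {
    "Technical Performance": 25,
    "Security & Compliance": 20,
    "Total Cost of Ownership": 15,
    "Support & SLAs": 10,
    "Integration & Ecosystem": 10,
    "Vendor Roadmap & Stability": 10,
    "Compliance & Regulatory Fit": 5,
    "Exit Strategy & Portability": 5,
}

def weighted_total(scores: dict[str, int],
                   weights: dict[str, int] = DEFAULT_WEIGHTS) -> float:
    """Score each dimension 1-10, scale by weight, sum to a 0-100 total."""
    assert sum(weights.values()) == 100, "weights must sum to 100"
    # A 9/10 score on a 25%-weight dimension contributes 22.5 points.
    return sum(scores[dim] * weights[dim] / 10 for dim in weights)

# Hypothetical vendor scored 7/10 on every dimension:
example = {dim: 7 for dim in DEFAULT_WEIGHTS}
print(weighted_total(example))  # -> 70.0
```

Adjusting the weights to your context (e.g. Security at 35% for a healthcare CISO) is a one-line change, which is the point: the framework stays fixed while the priorities move.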
```mermaid
flowchart TD
    A([Start: Vendor Evaluation]) --> B[Discovery & Requirements]
    B --> B1[Define use case & constraints]
    B --> B2[Set must-have criteria]
    B --> B3[Identify 15-20 candidate vendors]
    B1 & B2 & B3 --> C[Initial Shortlist]
    C --> C1[Apply MoSCoW filter]
    C1 --> C2{Passes must-haves?}
    C2 -- No --> X1[Eliminate]
    C2 -- Yes --> D[PoC / Pilot Phase]
    D --> D1[Technical benchmark on your data]
    D --> D2[Security review & DPA check]
    D --> D3[Pricing & TCO modelling]
    D1 & D2 & D3 --> E[Weighted Scoring Matrix]
    E --> E1[Score top 3 vendors]
    E1 --> F[Commercial Negotiation]
    F --> F1[SLA terms]
    F --> F2[Data processing agreement]
    F --> F3[Exit clause negotiation]
    F1 & F2 & F3 --> G([Vendor Selected])
    style A fill:#1a1a2e,stroke:#7c3aed,color:#e2e8f0
    style G fill:#0d1f12,stroke:#22c55e,color:#e2e8f0
    style X1 fill:#1f0d0d,stroke:#ef4444,color:#e2e8f0
    style C2 fill:#1e1b4b,stroke:#6366f1,color:#e2e8f0
```

Default weight: 25%
Technical performance evaluation has three components: benchmarking methodology, latency and throughput measurement, and accuracy testing on your specific domain. All three must be run before committing.
Public benchmarks are a starting point, not a decision input. MMLU tests broad academic knowledge. HumanEval tests Python code generation. Neither tests your specific task. Build a domain-specific evaluation set from real production data before running any vendor comparison.
Never evaluate latency with a single request. Measure under realistic concurrent load using your expected production traffic pattern. Vendor demo latency is always best-case single-request.
| Metric | What It Measures | Acceptable Threshold | How to Measure |
|---|---|---|---|
| P50 Latency | Median response time | < 400ms for simple tasks | Load test at 1x prod volume |
| P95 Latency | 95th percentile — the user experience floor | < 1,200ms for complex tasks | Load test at 2x prod volume |
| P99 Latency | Worst-case — worst 1% of users | < 3,000ms (SLA ceiling) | Load test at 3x prod volume |
| Time to First Token | Perceived speed for streaming responses | < 300ms at P95 | Measure TTFT separately from total latency |
| Tokens/second | Generation throughput per request | > 40 tokens/s for real-time UX | Token count / total generation time |
| Rate limit capacity | Max concurrent requests / tokens per minute | ≥ 2x peak production volume | Review docs + test burst behaviour |
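The table above can be measured with a small concurrent load harness. This sketch simulates the vendor call with a random delay; in a real test, replace `call_model` with your vendor's actual SDK call:

```python
import asyncio
import random
import statistics
import time

async def call_model(prompt: str) -> float:
    """Placeholder for a real API call -- swap in your vendor's SDK here.
    Returns the observed latency in seconds (simulated below)."""
    started = time.perf_counter()
    await asyncio.sleep(random.uniform(0.1, 0.6))  # stand-in for network + inference
    return time.perf_counter() - started

async def load_test(concurrency: int, total_requests: int) -> dict[str, float]:
    """Fire total_requests with bounded concurrency; report P50/P95/P99."""
    sem = asyncio.Semaphore(concurrency)

    async def one() -> float:
        async with sem:
            return await call_model("test prompt")

    latencies = await asyncio.gather(*(one() for _ in range(total_requests)))
    qs = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

print(asyncio.run(load_test(concurrency=20, total_requests=200)))
```

Run the same harness at 1x, 2x, and 3x production volume to populate the P50/P95/P99 rows, and time the first streamed chunk separately to measure TTFT.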
Default weight: 20%
Security and compliance is the most common reason AI vendor selections fail post-commit. These checks must happen before the PoC, not after. A vendor that cannot clear the compliance bar is eliminated regardless of technical performance.
| Provider | EU Region | Data Never Leaves EU | Self-Hosted Option | DPA Available |
|---|---|---|---|---|
| OpenAI (direct) | Not available | No — US servers | No | Yes (Enterprise) |
| OpenAI via Azure | Yes (Sweden, France, Netherlands) | Yes (PTU) | No | Yes (Azure DPA) |
| Anthropic (direct) | Not available | No — US servers | No | Yes (Enterprise) |
| Anthropic via Bedrock | Yes (Frankfurt, Ireland) | Yes | No | Yes (AWS DPA) |
| Mistral (direct) | Yes (France) | Yes — EU-native | Open weights | Yes (standard) |
| Google Vertex AI | Yes (Belgium, Netherlands) | Yes (regional endpoint) | No | Yes (GCP DPA) |
Default weight: 15%
TCO modelling for AI vendors spans five cost categories. Most teams budget only for the first, so the full picture is usually 2–3x higher than initial estimates. Build a 3-year model before committing.

1. **API and token costs:** the only cost most teams include in their budget.
2. **Fine-tuning compute:** typically adds 20–40% to API costs for teams using fine-tuning.
3. **Egress and support tiers:** often 30–60% of API costs for mature production deployments.
4. **Compliance and audits:** one-time and annual recurring costs totalling $10K–$50K/year for regulated industries.
5. **Migration engineering:** the most underestimated category. Estimate 3–6 months of migration work if switching mid-project.
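A back-of-envelope 3-year model using the mid-range multipliers quoted above. The defaults are illustrative assumptions, not a pricing quote; plug in your own figures:

```python
def three_year_tco(annual_api_cost: float,
                   fine_tuning_factor: float = 0.30,     # fine-tuning: 20-40% of API costs
                   egress_support_factor: float = 0.45,  # egress/support: 30-60% when mature
                   compliance_per_year: float = 30_000,  # audits: $10K-$50K/yr if regulated
                   migration_reserve: float = 125_000,   # one-time $50K-$200K switching cost
                   ) -> float:
    """Sum all five cost categories over three years (mid-range defaults)."""
    annual = annual_api_cost * (1 + fine_tuning_factor + egress_support_factor) \
             + compliance_per_year
    return 3 * annual + migration_reserve

tokens_only = 3 * 200_000                        # what most teams budget: tokens alone
full = three_year_tco(annual_api_cost=200_000)   # -> 1265000.0
print(full, full / tokens_only)                  # overrun multiple lands in the 2-3x range
```

With $200K/year of token spend, the full model comes to roughly $1.27M over three years, about 2.1x the tokens-only budget, which is exactly the overrun pattern described above.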
A worked example comparing four vendors for a European enterprise LLM deployment. Score each vendor 1–10 per dimension, multiply by dimension weight, and sum for the weighted total.
| Dimension | Weight | Vendor A (US hyperscaler) | Vendor B (Cloud platform) | Vendor C (EU-native) | Vendor D (Open-source host) |
|---|---|---|---|---|---|
| Technical Performance | 25% | 9/10 (22.5) | 8/10 (20.0) | 7/10 (17.5) | 6/10 (15.0) |
| Security & Compliance | 20% | 5/10 (10.0) | 8/10 (16.0) | 10/10 (20.0) | 7/10 (14.0) |
| Total Cost of Ownership | 15% | 6/10 (9.0) | 7/10 (10.5) | 8/10 (12.0) | 9/10 (13.5) |
| Support & SLAs | 10% | 8/10 (8.0) | 9/10 (9.0) | 6/10 (6.0) | 5/10 (5.0) |
| Integration & Ecosystem | 10% | 9/10 (9.0) | 7/10 (7.0) | 6/10 (6.0) | 5/10 (5.0) |
| Vendor Roadmap & Stability | 10% | 8/10 (8.0) | 7/10 (7.0) | 9/10 (9.0) | 6/10 (6.0) |
| Compliance & Regulatory Fit | 5% | 4/10 (2.0) | 7/10 (3.5) | 10/10 (5.0) | 8/10 (4.0) |
| Exit Strategy & Portability | 5% | 4/10 (2.0) | 6/10 (3.0) | 9/10 (4.5) | 8/10 (4.0) |
| **Weighted Total** | 100% | 70.5 | 76.0 | **80.0 (Winner)** | 66.5 |
Vendor C (EU-native) wins despite scoring lower on Technical Performance and Integration. The heavy weight on Security & Compliance (20%) and Regulatory Fit (5%) reflects the enterprise context. A startup without compliance requirements would see a different winner.
Tiebreaker rule: If two vendors are within 5 points of each other, run a 2-week parallel pilot on production-scale traffic. The matrix narrows the field — real-world data on your workload makes the final call.
Weight adjustment: Before scoring, have your key stakeholders (CTO, CISO, CFO, DPO) independently assign weights and then average or negotiate. Different weights produce different winners — the weighting conversation is as important as the scoring.
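The stakeholder-weighting step can be made mechanical: each stakeholder submits a weight vector summing to 100, and the per-dimension averages are renormalized. A sketch with hypothetical weight assignments (the numbers below are invented for illustration):

```python
# Hypothetical per-stakeholder weights for the eight dimensions (each row sums to 100).
stakeholder_weights = {
    "CTO":  {"Technical": 40, "Security": 15, "TCO": 10, "Support": 5,
             "Integration": 15, "Roadmap": 10, "Regulatory": 2, "Exit": 3},
    "CISO": {"Technical": 10, "Security": 35, "TCO": 5, "Support": 10,
             "Integration": 5, "Roadmap": 10, "Regulatory": 15, "Exit": 10},
    "CFO":  {"Technical": 15, "Security": 10, "TCO": 40, "Support": 10,
             "Integration": 5, "Roadmap": 10, "Regulatory": 5, "Exit": 5},
    "DPO":  {"Technical": 5, "Security": 30, "TCO": 5, "Support": 5,
             "Integration": 5, "Roadmap": 5, "Regulatory": 30, "Exit": 15},
}

def negotiated_weights(proposals: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each dimension across stakeholders, then renormalize to 100."""
    dims = next(iter(proposals.values())).keys()
    avg = {d: sum(p[d] for p in proposals.values()) / len(proposals) for d in dims}
    total = sum(avg.values())
    return {d: round(avg[d] * 100 / total, 1) for d in dims}

print(negotiated_weights(stakeholder_weights))
```

Averaging is a starting point for the negotiation, not a substitute for it: a large spread on one dimension (Security here ranges from 10 to 35) is itself a signal that stakeholders disagree about what the deployment is for.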
Send these questions to every vendor under consideration before running a pilot. Vendors who refuse to answer or whose answers are vague signal problems. Require written responses — verbal answers from a sales engineer are not contractually binding.
These are observable signals that correlate strongly with production failures, compliance problems, or relationship deterioration. Critical flags are hard stops — do not proceed. High flags require deep investigation. Medium flags are caution signals to manage contractually.
| # | Red Flag | Severity | What It Signals |
|---|---|---|---|
| 1 | No public status page or historical uptime data | Critical | Vendor has something to hide about reliability. Every serious production provider publishes incident history. |
| 2 | Training opt-out requires legal review, not a UI toggle | Critical | Your proprietary prompts and business data are likely being used for model training. Non-negotiable for enterprise. |
| 3 | No SOC 2 Type II report available (only Type I) | Critical | Type I is a point-in-time snapshot with no sustained controls evidence. Type II covers a 6–12 month operating period. |
| 4 | GDPR/DPA documentation requires sales escalation | Critical | A DPA should be self-service or standard. Escalation requirements signal either legal immaturity or deliberate friction. |
| 5 | Pricing requires a sales call for basic tier information | High | Hidden pricing usually means it varies based on perceived budget, creating unpredictability in your cost forecasting. |
| 6 | Model deprecation notice shorter than 6 months | High | Production systems cannot migrate safely in under 6 months. Short deprecation windows destroy engineering plans. |
| 7 | No self-hosted or VPC deployment option for enterprise tier | High | For regulated industries or high-sensitivity data, shared tenancy is often unacceptable. No self-hosted = no deal. |
| 8 | SDK is a thin REST wrapper with no retry/backoff logic | High | Engineering maturity signal. Production-grade SDKs handle retries, streaming, rate limit backoff, and error classification. |
| 9 | Rate limits not documented or changed without prior notice | Medium | Undocumented or volatile rate limits make capacity planning impossible and cause unexpected production failures. |
| 10 | No data residency commitment in writing | Medium | Verbal assurances are not enforceable. Data residency requirements must be in the DPA or MSA, not in a sales deck. |
| 11 | Company founded less than 18 months ago with no enterprise referenceable customers | Medium | Early-stage vendors may pivot, run out of funding, or be acquired. For production AI infrastructure, longevity matters. |
| 12 | No exit clause or data deletion guarantee in standard contract | Medium | What happens to your data and fine-tuned models when you leave? If the contract is silent, assume the worst. |
- **Critical:** Hard stop. Eliminate the vendor immediately unless you can get contractual remediation.
- **High:** Requires detailed investigation and a written mitigation plan before proceeding.
- **Medium:** Caution signal. Manage via contractual protections or documented risk acceptance.
Most vendor evaluations stall because teams try to evaluate too many options in parallel. This 2-week process uses progressive elimination to get to 3 qualified finalists efficiently, saving PoC effort for vendors who actually deserve it.
1. Cast a wide net: 15–20 candidate vendors.
2. Apply hard must-have criteria.
3. Deep-dive on the remaining 6–8 vendors.
4. Hold a 30-minute call with each vendor; ask the 25 RFP questions.
5. Apply the weighted scoring matrix to the top 3–4 vendors.
Apply these as binary pass/fail gates. Any vendor failing a Must Have is eliminated immediately — no exceptions.
3-month process • 12 vendors evaluated • Decision rationale documented
A pan-European retail bank with operations in 7 countries needed an LLM vendor for internal document search and contract analysis. With 52,000 documents, PII-heavy content, and regulatory requirements across multiple jurisdictions, the stakes were high. Here is how they ran the evaluation.
The selected vendor was a European-headquartered provider with native EU data residency. Despite ranking third on raw model performance benchmarks, it ranked first once the 30% weight assigned to Security & Compliance was applied. The two technically superior vendors were both US-headquartered with no EU-only data residency guarantee at the time of evaluation.
The contractual exit clause negotiated gave the bank the right to export all fine-tuned adapters and switch providers with 90 days' notice. This single term reduced the migration risk premium in the risk model by €400K — the cost of assumed future migration engineering.
12-month outcome: The bank processed 890,000 document queries in the first year at a TCO 30% below initial estimates. The vendor expanded EU coverage, which further strengthened the relationship. The structured evaluation process was adopted as the standard for all future AI vendor selections.
Selecting a vendor is the beginning, not the end. Vendor relationships degrade without active management. The teams that get the best outcomes treat vendor management as an ongoing discipline with regular cadence, documented SLA tracking, and clear escalation paths.
| Metric | SLA Target | Measurement | Escalation Trigger |
|---|---|---|---|
| API Uptime | ≥ 99.9% monthly | Synthetic monitoring every 60s from EU region | P1 incident if downtime > 15 minutes |
| P95 Latency | < 800ms for standard requests | 95th percentile of response times over rolling 24h window | Alert if P95 exceeds 1,200ms for > 5 minutes |
| Error Rate | < 0.5% 5xx errors per hour | Error rate across all API endpoints, excluding client errors | Escalate to vendor if > 1% for two consecutive hours |
| Rate Limit Headroom | ≥ 30% spare capacity vs contracted limits | Daily peak usage vs contracted rate limit ceiling | Request limit increase when headroom < 20% for 5 consecutive days |
| Cost per 1K API Calls | Within 10% of modelled baseline | Rolling 7-day average vs original TCO model | Review and renegotiate if sustained > 20% above baseline |
| Quarterly Business Review | Held every 90 days | Vendor roadmap update, incident review, pricing review, SLA compliance report | Trigger formal performance review if any Critical SLA missed |
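The escalation triggers in the table reduce to simple threshold checks against metrics you already collect. A sketch, with hypothetical field names standing in for your monitoring system's schema:

```python
from dataclasses import dataclass

@dataclass
class SlaSnapshot:
    monthly_uptime_pct: float       # from synthetic monitoring
    p95_latency_ms: float           # rolling 24h window
    hourly_5xx_rate_pct: float      # excluding client errors
    rate_limit_headroom_pct: float  # spare capacity vs contracted ceiling

def escalations(s: SlaSnapshot) -> list[str]:
    """Return the escalation actions triggered by the current snapshot."""
    actions = []
    if s.monthly_uptime_pct < 99.9:
        actions.append("Raise P1 incident: uptime SLA breached")
    if s.p95_latency_ms > 1200:
        actions.append("Alert: P95 latency above 1,200ms")
    if s.hourly_5xx_rate_pct > 1.0:
        actions.append("Escalate to vendor: sustained 5xx errors")
    if s.rate_limit_headroom_pct < 20:
        actions.append("Request rate limit increase")
    return actions

print(escalations(SlaSnapshot(99.95, 1450.0, 0.2, 15.0)))
# -> ['Alert: P95 latency above 1,200ms', 'Request rate limit increase']
```

The value of codifying the triggers is less the automation than the paper trail: each escalation is timestamped evidence for the quarterly business review and, if it comes to that, the renegotiation.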
Start 3 months before contract renewal. This is your leverage window.
The single most effective way to reduce vendor lock-in is to abstract your LLM calls behind a routing layer from day one. This is 1–3 days of engineering investment that eliminates months of migration risk.