June 03, 2026

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs

16 minutes

The Problem Nobody Talks About

Every AI provider goes down. Not maybe. Not occasionally. Regularly.

November 25, 2024: OpenAI suffered widespread timeouts and 503 errors for hours

September 2025: Anthropic published postmortems for three separate Claude API incidents

2024-2025: Cloudflare global incidents cascaded into half the AI services on the internet

If your revenue depends on AI output, a single-provider architecture is a single point of failure with your name on it.

The Solution: Reliability as a Feature

Stop leading with "We use GPT-4" or "We're on Claude." Start leading with numbers:

99.5% of requests succeed

P95 latency under 2.5 seconds

Average cost per request under $0.015

That's a promise a client can hold you to—and it makes you worth more than the person who just says "we use the best model."

The Technical Stack

1. Two-Provider Router with LiteLLM

Not five providers. Not a fancy model cascade. Two.

Primary gets weight of 9, secondary gets weight of 1

LiteLLM retries in-group once, then fails over automatically

Your app hits one endpoint—routing happens behind the proxy

Keep a bypass switch: BYPASS_ROUTER=true for 30-second rollback

Key Configuration:

Set explicit routing order (primary first, secondary only on failure)

2-second stream timeout for time-to-first-token

Pin providers for latency-critical paths

2. Budget Guardrails and Cost Control

The problem: Secondary providers can be 3x more expensive per token

The solution: Budget guardrails in LiteLLM

Maximum cost per request

Maximum tokens in/out

Graceful degradation (truncate context, switch to cheaper model, return cached response)

Observability Stack:

Tag every request: tenant ID, feature, provider, tokens, cost

Pipe into Langfuse or Helicone (both have free tiers)

Three alerts only:

P95 latency over target for 15 minutes → page

Success rate below target for 5 minutes → page

Average cost per request over budget for 15 minutes → page

3. Travel-Mode Cache

The reality: Airport throttling, café wifi drops, connectivity chaos

The solution: Write-through cache + service workers

Every router response written to local cache (Redis, SQLite)

Keyed on normalized prompt version

Service worker intercepts fetch requests, falls back to cache on network failure

Bonus: 60%+ cache hit rates on repetitive prompts = major cost savings

Provider-side optimization:

Anthropic prompt caching for stable blocks (system instructions, tool definitions)

Default 5-minute TTL, optional 1-hour cache

Reduces both latency and input token cost

Client-Facing SLOs

The Language That Wins Deals

Most AI agency proposals: "We use state-of-the-art AI models"

Your proposal:

> "99.5% success rate, p95 latency under 2.5 seconds, average cost per request under $0.015, measured over a rolling 30-day window"

Why this works:

CTO understands your architecture

VP of Operations understands "99.5% uptime"

Different audiences, different languages

SLO vs SLA Distinction

SLI = The measurement (p95 latency)

SLO = The target ("95% of requests complete in under 2.5 seconds")

SLA = The contract (legal commitment with penalties)

Publish SLOs, not SLAs. SLO = transparency commitment. SLA = legal obligation with penalties.

Error Budget Framework

If your target is 99.5% success rate over 30 days:

You're allowed to fail on 0.5% of requests

On 10,000 requests/month = 50 allowed failures

Spend budget on deploys, experiments, provider hiccups

When it's gone, freeze changes and stabilize

The 30-Minute Friday Drill

Why Manual Drills Matter

Don't automate the drill. The point isn't to test the system—it's to test you.

AWS calls these "chaos game days." Google calls them "Wheel of Misfortune exercises."

Drill Structure (30 minutes)

Three roles (even if you're playing all three):

Drill lead runs the clock

Operator flips the switch

Scribe captures what happened

The process:

Revoke your primary provider's API key

Watch the router fail over

Confirm p95 stays within target

Restore the key and verify everything's green

Tie results to error budget: If failover took longer than expected or success rate dipped below SLO, that's a finding. Log it, fix it, run again next quarter.

When It's Boring, It Works

The goal: Make reliability boring.

If your infrastructure is exciting, something's wrong. Ship the boring infrastructure. Sell the boring promise. Win the clients who care about reliability more than hype.

Action Items

This week:

Stand up the router with two providers

Set the three alerts

Run the drill Friday

Next two weeks:

Layer in the cache

Add SLO language to proposals

Implement full observability

Resources

Download the complete Reliability SLO Kit:

SLO one-pager template

Budget guardrail sheet with alert thresholds

Router config

Cache recipe

30-minute drill SOP with rollback steps

Client-safe proposal language

Available on the Resources page

Legal disclaimer: The SLO/SOW language provided is template language, not legal advice. Have your counsel review before shipping to clients.

...more

View all episodes

By Santi, Kira

June 03, 2026

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs

16 minutes

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs

The Problem Nobody Talks About

Every AI provider goes down. Not maybe. Not occasionally. Regularly.

November 25, 2024: OpenAI suffered widespread timeouts and 503 errors for hours

September 2025: Anthropic published postmortems for three separate Claude API incidents

2024-2025: Cloudflare global incidents cascaded into half the AI services on the internet

If your revenue depends on AI output, a single-provider architecture is a single point of failure with your name on it.

The Solution: Reliability as a Feature

Stop leading with "We use GPT-4" or "We're on Claude." Start leading with numbers:

99.5% of requests succeed

P95 latency under 2.5 seconds

Average cost per request under $0.015

That's a promise a client can hold you to—and it makes you worth more than the person who just says "we use the best model."

The Technical Stack

1. Two-Provider Router with LiteLLM

Not five providers. Not a fancy model cascade. Two.

Primary gets weight of 9, secondary gets weight of 1

LiteLLM retries in-group once, then fails over automatically

Your app hits one endpoint—routing happens behind the proxy

Keep a bypass switch: BYPASS_ROUTER=true for 30-second rollback

Key Configuration:

Set explicit routing order (primary first, secondary only on failure)

2-second stream timeout for time-to-first-token

Pin providers for latency-critical paths

2. Budget Guardrails and Cost Control

The problem: Secondary providers can be 3x more expensive per token

The solution: Budget guardrails in LiteLLM

Maximum cost per request

Maximum tokens in/out

Graceful degradation (truncate context, switch to cheaper model, return cached response)

Observability Stack:

Tag every request: tenant ID, feature, provider, tokens, cost

Pipe into Langfuse or Helicone (both have free tiers)

Three alerts only:

P95 latency over target for 15 minutes → page

Success rate below target for 5 minutes → page

Average cost per request over budget for 15 minutes → page

3. Travel-Mode Cache

The reality: Airport throttling, café wifi drops, connectivity chaos

The solution: Write-through cache + service workers

Every router response written to local cache (Redis, SQLite)

Keyed on normalized prompt version

Service worker intercepts fetch requests, falls back to cache on network failure

Bonus: 60%+ cache hit rates on repetitive prompts = major cost savings

Provider-side optimization:

Anthropic prompt caching for stable blocks (system instructions, tool definitions)

Default 5-minute TTL, optional 1-hour cache

Reduces both latency and input token cost

Client-Facing SLOs

The Language That Wins Deals

Most AI agency proposals: "We use state-of-the-art AI models"

Your proposal:

> "99.5% success rate, p95 latency under 2.5 seconds, average cost per request under $0.015, measured over a rolling 30-day window"

Why this works:

CTO understands your architecture

VP of Operations understands "99.5% uptime"

Different audiences, different languages

SLO vs SLA Distinction

SLI = The measurement (p95 latency)

SLO = The target ("95% of requests complete in under 2.5 seconds")

SLA = The contract (legal commitment with penalties)

Publish SLOs, not SLAs. SLO = transparency commitment. SLA = legal obligation with penalties.

Error Budget Framework

If your target is 99.5% success rate over 30 days:

You're allowed to fail on 0.5% of requests

On 10,000 requests/month = 50 allowed failures

Spend budget on deploys, experiments, provider hiccups

When it's gone, freeze changes and stabilize

The 30-Minute Friday Drill

Why Manual Drills Matter

Don't automate the drill. The point isn't to test the system—it's to test you.

AWS calls these "chaos game days." Google calls them "Wheel of Misfortune exercises."

Drill Structure (30 minutes)

Three roles (even if you're playing all three):

Drill lead runs the clock

Operator flips the switch

Scribe captures what happened

The process:

Revoke your primary provider's API key

Watch the router fail over

Confirm p95 stays within target

Restore the key and verify everything's green

Tie results to error budget: If failover took longer than expected or success rate dipped below SLO, that's a finding. Log it, fix it, run again next quarter.

When It's Boring, It Works

The goal: Make reliability boring.

If your infrastructure is exciting, something's wrong. Ship the boring infrastructure. Sell the boring promise. Win the clients who care about reliability more than hype.

Action Items

This week:

Stand up the router with two providers

Set the three alerts

Run the drill Friday

Next two weeks:

Layer in the cache

Add SLO language to proposals

Implement full observability

Resources

Download the complete Reliability SLO Kit:

SLO one-pager template

Budget guardrail sheet with alert thresholds

Router config

Cache recipe

30-minute drill SOP with rollback steps

Client-safe proposal language

Available on the Resources page

Legal disclaimer: The SLO/SOW language provided is template language, not legal advice. Have your counsel review before shipping to clients.

...more

Share Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs

Sign up to save your podcasts

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs