The Stateless Founder

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs


Listen Later

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs
The Problem Nobody Talks About

Every AI provider goes down. Not maybe. Not occasionally. Regularly.

  • November 25, 2024: OpenAI suffered widespread timeouts and 503 errors for hours
  • September 2025: Anthropic published postmortems for three separate Claude API incidents
  • 2024-2025: Cloudflare global incidents cascaded into half the AI services on the internet
  • If your revenue depends on AI output, a single-provider architecture is a single point of failure with your name on it.

    The Solution: Reliability as a Feature

    Stop leading with "We use GPT-4" or "We're on Claude." Start leading with numbers:

    • 99.5% of requests succeed
    • P95 latency under 2.5 seconds
    • Average cost per request under $0.015
    • That's a promise a client can hold you to—and it makes you worth more than the person who just says "we use the best model."

      The Technical Stack
      1. Two-Provider Router with LiteLLM

      Not five providers. Not a fancy model cascade. Two.

      • Primary gets weight of 9, secondary gets weight of 1
      • LiteLLM retries in-group once, then fails over automatically
      • Your app hits one endpoint—routing happens behind the proxy
      • Keep a bypass switch: BYPASS_ROUTER=true for 30-second rollback
      • Key Configuration:

        • Set explicit routing order (primary first, secondary only on failure)
        • 2-second stream timeout for time-to-first-token
        • Pin providers for latency-critical paths
        • 2. Budget Guardrails and Cost Control

          The problem: Secondary providers can be 3x more expensive per token

          The solution: Budget guardrails in LiteLLM

          • Maximum cost per request
          • Maximum tokens in/out
          • Graceful degradation (truncate context, switch to cheaper model, return cached response)
          • Observability Stack:

            • Tag every request: tenant ID, feature, provider, tokens, cost
            • Pipe into Langfuse or Helicone (both have free tiers)
            • Three alerts only:
              1. P95 latency over target for 15 minutes → page
              2. Success rate below target for 5 minutes → page
              3. Average cost per request over budget for 15 minutes → page
              4. 3. Travel-Mode Cache

                The reality: Airport throttling, café wifi drops, connectivity chaos

                The solution: Write-through cache + service workers

                • Every router response written to local cache (Redis, SQLite)
                • Keyed on normalized prompt version
                • Service worker intercepts fetch requests, falls back to cache on network failure
                • Bonus: 60%+ cache hit rates on repetitive prompts = major cost savings
                • Provider-side optimization:

                  • Anthropic prompt caching for stable blocks (system instructions, tool definitions)
                  • Default 5-minute TTL, optional 1-hour cache
                  • Reduces both latency and input token cost
                  • Client-Facing SLOs
                    The Language That Wins Deals

                    Most AI agency proposals: "We use state-of-the-art AI models"

                    Your proposal:

                    > "99.5% success rate, p95 latency under 2.5 seconds, average cost per request under $0.015, measured over a rolling 30-day window"

                    Why this works:

                    • CTO understands your architecture
                    • VP of Operations understands "99.5% uptime"
                    • Different audiences, different languages
                    • SLO vs SLA Distinction
                      • SLI = The measurement (p95 latency)
                      • SLO = The target ("95% of requests complete in under 2.5 seconds")
                      • SLA = The contract (legal commitment with penalties)
                      • Publish SLOs, not SLAs. SLO = transparency commitment. SLA = legal obligation with penalties.

                        Error Budget Framework

                        If your target is 99.5% success rate over 30 days:

                        • You're allowed to fail on 0.5% of requests
                        • On 10,000 requests/month = 50 allowed failures
                        • Spend budget on deploys, experiments, provider hiccups
                        • When it's gone, freeze changes and stabilize
                        • The 30-Minute Friday Drill
                          Why Manual Drills Matter

                          Don't automate the drill. The point isn't to test the system—it's to test you.

                          AWS calls these "chaos game days." Google calls them "Wheel of Misfortune exercises."

                          Drill Structure (30 minutes)

                          Three roles (even if you're playing all three):

                          1. Drill lead runs the clock
                          2. Operator flips the switch
                          3. Scribe captures what happened
                          4. The process:

                            1. Revoke your primary provider's API key
                            2. Watch the router fail over
                            3. Confirm p95 stays within target
                            4. Restore the key and verify everything's green
                            5. Tie results to error budget: If failover took longer than expected or success rate dipped below SLO, that's a finding. Log it, fix it, run again next quarter.

                              When It's Boring, It Works

                              The goal: Make reliability boring.

                              If your infrastructure is exciting, something's wrong. Ship the boring infrastructure. Sell the boring promise. Win the clients who care about reliability more than hype.

                              Action Items

                              This week:

                              1. Stand up the router with two providers
                              2. Set the three alerts
                              3. Run the drill Friday
                              4. Next two weeks:

                                • Layer in the cache
                                • Add SLO language to proposals
                                • Implement full observability
                                • Resources

                                  Download the complete Reliability SLO Kit:

                                  • SLO one-pager template
                                  • Budget guardrail sheet with alert thresholds
                                  • Router config
                                  • Cache recipe
                                  • 30-minute drill SOP with rollback steps
                                  • Client-safe proposal language
                                  • Available on the Resources page

                                    Legal disclaimer: The SLO/SOW language provided is template language, not legal advice. Have your counsel review before shipping to clients.

                                    ...more
                                    View all episodesView all episodes
                                    Download on the App Store

                                    The Stateless FounderBy Santi, Kira