The Stateless Founder

Build a Three-Tier AI Failover That Survives Provider Outages


The Recent Incidents That Changed Everything

April 6-7, 2026: Anthropic's Claude experienced back-to-back days of elevated errors

  • April 6: 15:00-16:30 UTC, login errors affecting Claude.ai and Claude Code
  • April 7: 14:32-15:12 UTC, same symptoms across login, chats, and voice

March 4, 2026: OpenAI logged elevated API error rates for 30 minutes across multiple models due to simultaneous infrastructure actions

March 16, 2026: Google announced Project Spend Caps for the Gemini API, with ~10-minute enforcement delays

The Three-Tier Failover Architecture

Hot Tier

  • Same provider, different model or endpoint
  • Example: Claude Sonnet fails → switch to Claude Haiku
  • Handles partial outages where some models still respond
  • Uses circuit breakers to detect consecutive failures

Warm Tier

  • Completely different provider
  • Example: Anthropic down → route to OpenAI or Gemini
  • Requires an OpenAI-compatible gateway layer to normalize requests across providers
  • Test tool-calling and JSON-mode differences before you need the fallback
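
To make the normalization step concrete, here is a minimal sketch, assuming the app keeps requests in an OpenAI-style chat shape and the gateway has to produce an Anthropic-style body. The function name, the model string, and the default `max_tokens` are hypothetical. It covers only plain text messages; tool calling and JSON mode need their own mapping, which is exactly why they should be tested beforehand.

```python
from typing import Any


def to_anthropic_body(
    openai_body: dict[str, Any],
    model: str,
    default_max_tokens: int = 1024,  # assumption: pick a limit that fits your workload
) -> dict[str, Any]:
    """Convert an OpenAI-style chat request into an Anthropic-style messages body.

    Differences handled here: the system prompt moves from the message list
    to a top-level "system" field, and "max_tokens" is always set.
    """
    system_parts: list[str] = []
    messages: list[dict[str, Any]] = []
    for msg in openai_body["messages"]:
        if msg["role"] == "system":
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    body: dict[str, Any] = {
        "model": model,
        "max_tokens": openai_body.get("max_tokens", default_max_tokens),
        "messages": messages,
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    return body
```

Keeping the conversion in one small function makes it easy to diff provider quirks in a unit test instead of discovering them during an outage.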

Cold Tier

  • Graceful degradation + human-in-the-loop
  • Requests go into a Redis-backed queue (BullMQ)
  • Returns HTTP 202 (accepted, still processing) to the client
  • Triggers notifications to the ops team
  • Pre-written message templates for client communication
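
The three tiers can be sketched as one small router. This is an illustration under stated assumptions, not the episode's actual code: providers are plain callables, an in-memory `deque` stands in for the Redis/BullMQ queue, and `notify` is a hypothetical hook for the ops alert.

```python
from collections import deque
from typing import Callable

Provider = Callable[[str], str]

# Stand-in for the Redis-backed queue; a real setup would enqueue a job instead.
cold_queue: deque = deque()


def route(
    prompt: str,
    hot: list[Provider],
    warm: list[Provider],
    notify: Callable[[str], None],
) -> tuple[int, dict]:
    """Try hot routes, then warm routes; if all fail, queue the job and return 202."""
    for tier_name, routes in (("hot", hot), ("warm", warm)):
        for call in routes:
            try:
                return 200, {"tier": tier_name, "output": call(prompt)}
            except Exception:
                continue  # this route failed, try the next one

    # Cold tier: park the request and tell a human.
    cold_queue.append(prompt)
    notify(f"All providers failed; {len(cold_queue)} job(s) queued")
    return 202, {"tier": "cold", "status": "queued"}
```

In practice each callable would be wrapped in a circuit breaker and retry policy, so a route that is known to be down gets skipped quickly rather than retried on every request.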

Key Technical Patterns

Circuit Breakers (Martin Fowler pattern)

  • Track consecutive failures per route
  • Open the circuit after 5 consecutive failures
  • While open, probe every 30 seconds in a half-open state; close after 2 successes
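
A minimal sketch of that state machine, using the thresholds from the notes (5 failures, 30-second probe interval, 2 successes). The class and method names are hypothetical, and libraries such as Cockatiel or PyBreaker offer more complete versions; this one lets every request through while half-open, which a production breaker would usually limit to a single probe.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open probe -> closed."""

    def __init__(self, failure_threshold=5, probe_interval=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.success_threshold = success_threshold
        self.clock = clock  # injectable so tests do not have to sleep
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a call may go to this route right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.probe_interval:
                self.state = "half_open"  # time to probe
                self.successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # failures must be consecutive to count

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._open()  # probe failed, wait another interval
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

One breaker instance per route keeps a failing model from tripping the breaker for a healthy one.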

Exponential Backoff with Jitter (AWS guidance)

  • Prevents thundering herd during outages
  • First retry: 200ms + random offset
  • Each retry waits longer with randomization
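
As a rough sketch of the schedule described above: the delay doubles from a 200 ms base and a random offset is added on top. The 10-second cap, the retry count, and the function names are assumptions for illustration. AWS's "full jitter" variant draws the whole delay at random from zero up to the exponential value instead; either approach spreads retries out.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 10.0) -> float:
    """Seconds to wait before retry number `attempt` (0 = first retry).

    The exponential part is base * 2**attempt, capped; the jitter adds a
    random offset between 0 and that same value.
    """
    exp = min(cap, base * (2 ** attempt))
    return exp + random.uniform(0, exp)


def call_with_retries(fn, max_attempts: int = 4, sleep=time.sleep):
    """Call fn(); on failure wait with backoff and retry, re-raising at the end."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller fail over
            sleep(backoff_delay(attempt))
```

Tenacity provides the same behavior declaratively in Python; the hand-rolled version is shown only to make the arithmetic visible.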

Idempotency Keys (Stripe pattern)

  • Hash user ID + job ID + input for unique key
  • Prevents duplicate processing on retries
  • Essential for safe retry logic
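
One way to derive such a key, sketched with the standard library. The field names and the in-memory result store are hypothetical; a real deployment would likely keep seen keys in Redis with an expiry.

```python
import hashlib
import json


def idempotency_key(user_id: str, job_id: str, payload: dict) -> str:
    """Hash user ID + job ID + input into a stable hex key.

    sort_keys makes the JSON canonical, so the same input always hashes
    the same way regardless of dict ordering.
    """
    canonical = json.dumps(
        {"user": user_id, "job": job_id, "input": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Stand-in for a shared store of completed jobs.
_results: dict = {}


def process_once(key: str, work):
    """Run work() only the first time a key is seen; replay the result after that."""
    if key in _results:
        return _results[key]
    result = work()
    _results[key] = result
    return result
```

With this in place, a retry that arrives after the first attempt actually succeeded returns the stored result instead of paying for a second model call.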

Budget Guardrails

Google's Project Spend Caps

  • Set in AI Studio under the Spend tab
  • Monthly dollar limits per project
  • ~10-minute enforcement delay, so spend can overshoot the cap briefly
  • A $0 billing account balance stops ALL linked projects

App-Level Protection

  • Webhook endpoint receives the current spend percentage
  • Flips a Redis pause flag at 80% of monthly budget
  • Queue workers check the flag before processing each job
  • Manual resume endpoint clears the flag once spend drops
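
The pause-flag logic is small enough to sketch directly. Everything named here is an assumption: a plain dict stands in for Redis, the `spend_percent` payload field is a made-up shape for whatever sends the alert, and the key name is arbitrary.

```python
# Stand-in for Redis; in production this would be a shared key.
flags: dict = {}

PAUSE_KEY = "ai:paused"      # hypothetical key name
PAUSE_THRESHOLD = 80.0       # percent of monthly budget


def handle_spend_webhook(payload: dict) -> bool:
    """Set the pause flag when spend reaches the threshold.

    Returns the current paused state. The flag is never cleared here,
    so a later low reading cannot silently resume processing.
    """
    if float(payload["spend_percent"]) >= PAUSE_THRESHOLD:
        flags[PAUSE_KEY] = True
    return flags.get(PAUSE_KEY, False)


def worker_should_process() -> bool:
    """Queue workers call this before picking up each job."""
    return not flags.get(PAUSE_KEY, False)


def manual_resume() -> None:
    """Backs the manual resume endpoint: a human decides when to continue."""
    flags.pop(PAUSE_KEY, None)
```

Making resume a deliberate human action keeps a noisy alert source from toggling workers on and off.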

Implementation Overview

Core Components

  • JSON routing config (environment-based URLs)
  • 40-line router function in Node/Python
  • Redis instance for queuing ($5-15/month)
  • Circuit breaker libraries (Cockatiel, PyBreaker)
  • Retry libraries (Tenacity for Python)
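
A routing config along these lines might look like the following. The schema, route names, and environment variable names are invented for illustration; the point is that the JSON names the routes while the actual URLs come from the environment, so the same file works in every deployment.

```python
import json
import os

# Hypothetical config: each route points at an environment variable, not a URL.
ROUTING_CONFIG = """
{
  "hot":  [{"name": "primary-large", "url_env": "HOT_PRIMARY_URL"},
           {"name": "primary-small", "url_env": "HOT_FALLBACK_URL"}],
  "warm": [{"name": "secondary",     "url_env": "WARM_URL"}]
}
"""


def load_routes(config_text: str, env=os.environ) -> dict:
    """Resolve each route's URL from the environment, keeping tier order."""
    config = json.loads(config_text)
    resolved: dict = {}
    for tier, routes in config.items():
        resolved[tier] = []
        for route in routes:
            url = env.get(route["url_env"])
            if url:  # skip routes that are not configured in this environment
                resolved[tier].append({"name": route["name"], "url": url})
    return resolved
```

Skipping unset routes means a staging environment can run with only a hot tier without needing a separate config file.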

Cost Analysis

  • Infrastructure: $20-80/month total
  • Most months land closer to the low end
  • Compare with the cost of one missed deliverable ($8K+ in Santi's case)
  • Even $500/month clients will churn on missed deadlines

The Counterargument: Is This Overengineering?

Valid Concerns

  • Adds complexity that creates new failure modes
  • Gateway layers can introduce latency/quirks
  • Circuit breaker thresholds need calibration
  • Most LLM APIs use global endpoints (not regional)

Proportional Response

  • Start with circuit breaker + one warm provider (20 lines of code)
  • Add cold queue when ready (Redis + notifications)
  • Budget guardrails only if spending enough to matter
  • For non-technical users: Make/n8n error paths + Google Sheets

The Lisbon Test

Can you:

  • Deploy from a café with sketchy wifi? ✓
  • Let async team operate without you online? ✓
  • Survive bad connectivity? ✓

15-Minute Validation

  1. Block hot provider domain locally → confirm warm takeover
  2. Force 500 errors from gateway → confirm circuit opens
  3. Post fake budget alert → confirm pause flag sets
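
For step 2, one low-effort option is a local stub that answers every request with a 500, which you can point the gateway or router at in place of the real provider URL. This is a sketch using only the standard library; the class and function names and the error body are made up.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Always500(BaseHTTPRequestHandler):
    """Answers every GET or POST with HTTP 500 to simulate a provider outage."""

    def _fail(self):
        # Drain any request body so the client is not cut off mid-send.
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        body = b'{"error": "simulated outage"}'
        self.send_response(500)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    do_GET = _fail
    do_POST = _fail

    def log_message(self, *args):
        pass  # keep test output quiet


def start_stub(port: int = 0) -> HTTPServer:
    """Start the stub on localhost in a background thread (port 0 = any free port)."""
    server = HTTPServer(("127.0.0.1", port), Always500)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def status_of(url: str) -> int:
    """Return the HTTP status for a GET, bypassing any configured proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Start the stub, set the hot route's URL to `http://127.0.0.1:<port>`, send five requests, and check that the breaker reports an open circuit and traffic moves to the warm route.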

Resources

Download: Nomad-Proof Model Failover SOP

  • JSON routing config templates
  • Node (Cockatiel + BullMQ) wrapper code
  • Python (Tenacity + PyBreaker) implementation
  • Redis queue setup with pause flags
  • Budget webhook specifications
  • Cost comparison spreadsheet
  • Lisbon Test validation checklist

Action Items

This Week: Pick primary provider + one warm alternative. Write 20 lines of failover code OR build one error path in Make/n8n. Test it.

This Weekend: Implement the full three-tier system if you're running client-facing AI operations.

The next outage window is coming - we just don't know when.


The Stateless Founder, by Santi and Kira