The Stateless Founder

Build a QA Wall That Catches Bad AI Outputs Before Clients See Them


The Problem: When AI Content Goes Wrong
  • 47 deliverables per week through AI workflows across 7 contractors, 4 continents
  • Only 5 out of 47 got human review before shipping to clients
  • The breaking point: Fabricated citation published to 11,000 LinkedIn followers
  • Root cause: No systematic checks between model output and client inbox

The Three-Layer QA Wall
Layer 1: Golden-Set Regression Tests
  • 20 diverse test cases covering every content type you deliver
  • Property-based testing: Brand tone, word count, must-include/avoid elements
  • Automated CI integration: Deploy fails if golden set breaks
  • Version control: Run tests before every prompt change or model swap
  • Tools mentioned:
    • OpenAI Evals framework
    • LangSmith datasets and evaluators
    • Statsig guidance on dataset diversity
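
A golden-set case can be expressed as an input plus the properties its output must satisfy, rather than an exact expected string. The sketch below is one minimal way to do that in plain Python; the `GoldenCase` fields and the `generate` callable (your prompt plus model call) are illustrative assumptions, not the format of the starter pack or of any tool listed above.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One golden-set entry: a prompt plus the properties its output must satisfy."""
    name: str
    prompt: str
    min_words: int
    max_words: int
    must_include: list = field(default_factory=list)
    must_avoid: list = field(default_factory=list)


def check_case(case, output):
    """Return a list of readable failures; an empty list means the case passed."""
    failures = []
    words = len(output.split())
    if not case.min_words <= words <= case.max_words:
        failures.append(
            f"{case.name}: word count {words} outside {case.min_words}-{case.max_words}"
        )
    lowered = output.lower()
    for phrase in case.must_include:
        if phrase.lower() not in lowered:
            failures.append(f"{case.name}: missing required phrase {phrase!r}")
    for phrase in case.must_avoid:
        if phrase.lower() in lowered:
            failures.append(f"{case.name}: contains banned phrase {phrase!r}")
    return failures


def run_golden_set(cases, generate):
    """Run every case through `generate` (hypothetical model call) and collect failures."""
    failures = []
    for case in cases:
        failures.extend(check_case(case, generate(case.prompt)))
    return failures
```

In CI, a wrapper script could call `run_golden_set` and exit non-zero when the returned list is non-empty, which is what makes the deploy fail.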

Layer 2: Heuristics + LLM Judges
Deterministic checks (fast and cheap):
  • Word count validation
  • JSON schema validation
  • URL domain allowlists
  • PII detection with OWASP regex patterns
  • AWS Comprehend for model-based PII detection (90% confidence threshold)
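
The deterministic checks are small enough to sketch with the standard library. Everything named here is an assumption for illustration: the allowlist domains are placeholders, the two regexes are simple stand-ins rather than the OWASP patterns, the JSON check only tests for required keys instead of a full schema, and the model-based PII pass is left out.

```python
import json
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "docs.example.com"}  # hypothetical allowlist
URL_RE = re.compile(r"https?://[^\s)>\"']+")
# Illustrative PII patterns only; swap in vetted patterns for real use.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def word_count_ok(text, min_words, max_words):
    """True when the whitespace-delimited word count is inside the range."""
    return min_words <= len(text.split()) <= max_words


def json_has_keys(raw, required_keys):
    """True when `raw` parses as a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)


def disallowed_urls(text):
    """Return every URL in `text` whose host is not on the allowlist."""
    bad = []
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        if host not in ALLOWED_DOMAINS:
            bad.append(url)
    return bad


def pii_hits(text):
    """Return email-like matches followed by SSN-like matches."""
    return EMAIL_RE.findall(text) + US_SSN_RE.findall(text)
```

Because these run in microseconds and cost nothing, it makes sense to run them before any judge call and skip the judge entirely when one of them fails.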
LLM Judge system:
  • Dual judges: Fast model + primary model
  • Explicit rubric: Faithfulness, Relevance, Style/Tone (1-5 scale)
  • Pass threshold: Average ≥4.0, Faithfulness ≥4.0
  • Rationale required: Judge must cite specific issues
  • Auto-escalation: Route to human if judges disagree by >0.5 points
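
The routing logic that sits on top of the two judges might look like the sketch below. It covers only the decision step, not the model calls. Two interpretations are assumptions: "disagree by >0.5 points" is read as the gap between the two judges' average scores, and a deliverable passes only when both judges pass it.

```python
from dataclasses import dataclass
from statistics import mean

PASS_AVERAGE = 4.0
PASS_FAITHFULNESS = 4.0
MAX_DISAGREEMENT = 0.5


@dataclass
class JudgeScore:
    """One judge's rubric scores (1-5 each) plus its written rationale."""
    faithfulness: float
    relevance: float
    style_tone: float
    rationale: str

    @property
    def average(self):
        return mean([self.faithfulness, self.relevance, self.style_tone])

    def passes(self):
        return self.average >= PASS_AVERAGE and self.faithfulness >= PASS_FAITHFULNESS


def route(fast, primary):
    """Return 'pass', 'fail', or 'human_review' for a pair of judge scores."""
    # A score without a rationale is not trusted; send it to a person.
    if not fast.rationale.strip() or not primary.rationale.strip():
        return "human_review"
    # Judges that disagree too much are a signal, not a verdict.
    if abs(fast.average - primary.average) > MAX_DISAGREEMENT:
        return "human_review"
    if fast.passes() and primary.passes():
        return "pass"
    return "fail"
```

Note that a high average does not rescue a low Faithfulness score: scores of 3, 5, 5 average above 4.0 and still fail.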

Layer 3: Strategic Human Sampling
Sampling math:
  • Detection probability = 1 - (1 - error_rate)^sample_size
  • 20% sampling for new clients (a 20-item sample gives 88% detection at a 10% error rate)
  • 10% sampling for established clients
  • Batch pausing: Stop all client work if one sample fails
  • Random rotation of reviewers with Slack alerts
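
The detection formula is easy to check numerically. The sketch below assumes errors are independent and reviewed items are drawn at random; the function names are illustrative. Note that the 88% figure corresponds to a sample of 20 items, so the number of items reviewed matters more than the percentage.

```python
import math


def detection_probability(error_rate, sample_size):
    """Chance that a random sample of `sample_size` items contains at least one bad item."""
    if not 0 <= error_rate <= 1:
        raise ValueError("error_rate must be between 0 and 1")
    return 1 - (1 - error_rate) ** sample_size


def sample_size_for(error_rate, target):
    """Smallest sample size whose detection probability reaches `target`."""
    if not 0 < error_rate < 1 or not 0 < target < 1:
        raise ValueError("error_rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - error_rate))


print(round(detection_probability(0.10, 20), 2))  # 0.88
print(round(detection_probability(0.10, 10), 2))  # 0.65
print(sample_size_for(0.10, 0.95))                # 29
```

Halving the sample from 20 items to 10 drops detection from about 88% to about 65% at the same error rate, which is why low-volume clients may need a higher sampling percentage to reach the same item count.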

Spend Guardrails
Per-Job Caps
  • max_output_tokens on every API call
  • Short-circuit to human review after 2 retries
  • Prevent runaway costs on stuck jobs
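
A per-job cap can be a small wrapper around the model call. In this sketch `call_model` and `passes_qa` are hypothetical callables you would supply (your API client and your Layer 2 checks), the token cap value is a placeholder, and "2 retries" is read as one initial attempt plus two more.

```python
MAX_OUTPUT_TOKENS = 1200  # hypothetical cap; tune per content type
MAX_RETRIES = 2


def run_job(prompt, call_model, passes_qa):
    """Attempt a job, retrying up to MAX_RETRIES times, then hand off to a human.

    `call_model(prompt, max_output_tokens=...)` and `passes_qa(output)` are
    caller-supplied callables; neither is a real library function.
    """
    output = None
    attempts = 0
    for _ in range(1 + MAX_RETRIES):
        attempts += 1
        # Every call carries the token cap, so a stuck job cannot run up cost.
        output = call_model(prompt, max_output_tokens=MAX_OUTPUT_TOKENS)
        if passes_qa(output):
            return {"status": "done", "output": output, "attempts": attempts}
    # Retries exhausted: stop spending and route to a person.
    return {"status": "human_review", "output": output, "attempts": attempts}
```

The worst-case cost of any single job is then bounded by three capped calls, which is the number to multiply out when setting client budgets.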
Per-Client Monthly Caps
  • Track spend in a SQLite database
  • Slack alerts at 80%, 90%, 100% of monthly budget
  • Automatic queue pausing at 100%
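
One way to track monthly spend is a single SQLite table keyed by client and month. The table and column names below are assumptions, not the starter pack's schema, and the function returns the thresholds that were newly crossed instead of posting to Slack, so the caller decides how to alert.

```python
import sqlite3

ALERT_THRESHOLDS = (0.8, 0.9, 1.0)  # fractions of the monthly budget


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) client_spend table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS client_spend (
               client_id  TEXT NOT NULL,
               month      TEXT NOT NULL,
               budget_usd REAL NOT NULL,
               spent_usd  REAL NOT NULL DEFAULT 0,
               paused     INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (client_id, month))"""
    )
    return conn


def record_spend(conn, client_id, month, cost_usd):
    """Add a cost, pause the queue at 100%, and return thresholds newly crossed."""
    row = conn.execute(
        "SELECT budget_usd, spent_usd FROM client_spend WHERE client_id = ? AND month = ?",
        (client_id, month),
    ).fetchone()
    if row is None:
        raise KeyError(f"no budget row for {client_id} in {month}")
    budget, before = row
    after = before + cost_usd
    # A threshold fires once: only when this charge moves spend across it.
    crossed = [t for t in ALERT_THRESHOLDS if before < t * budget <= after]
    paused = 1 if after >= budget else 0
    conn.execute(
        "UPDATE client_spend SET spent_usd = ?, paused = ? WHERE client_id = ? AND month = ?",
        (after, paused, client_id, month),
    )
    conn.commit()
    return crossed
```

A worker would check the `paused` flag before pulling the next job for a client, and send one Slack message per value in the returned list.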

Offline-First Architecture
SQLite + LiteFS setup:
  • Jobs table, QA events table, client spend tracking
  • Works on a laptop with no internet
  • Syncs when back online
  • The Lisbon Test: Can you run this from a café with sketchy wifi?
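
The three tables can live in one local SQLite file, which is what lets the pipeline keep working offline. The schema below is a guess at a reasonable shape; all table and column names are assumptions. Replication is out of scope here: the application only reads and writes the local file, and the sync layer is assumed to be configured separately.

```python
import sqlite3

# Hypothetical schema: one row per job, one row per QA check, one row per client-month.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id           INTEGER PRIMARY KEY,
    client_id    TEXT NOT NULL,
    content_type TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'queued',
    created_at   TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS qa_events (
    id         INTEGER PRIMARY KEY,
    job_id     INTEGER NOT NULL REFERENCES jobs(id),
    layer      TEXT NOT NULL,      -- e.g. 'heuristics', 'judge', 'human'
    passed     INTEGER NOT NULL,   -- 1 or 0
    detail     TEXT,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS client_spend (
    client_id  TEXT NOT NULL,
    month      TEXT NOT NULL,
    budget_usd REAL NOT NULL,
    spent_usd  REAL NOT NULL DEFAULT 0,
    PRIMARY KEY (client_id, month)
);
"""


def init_db(path):
    """Create the tables if needed and return a connection to the local file."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_qa_event(conn, job_id, layer, passed, detail=""):
    """Append one QA result; works with no network because the file is local."""
    conn.execute(
        "INSERT INTO qa_events (job_id, layer, passed, detail) VALUES (?, ?, ?, ?)",
        (job_id, layer, int(passed), detail),
    )
    conn.commit()
```

Keeping QA events append-only makes later audits simpler: the history of why a deliverable passed or failed is never overwritten.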

Addressing the Counterargument
The concern: Using an LLM to check LLM output adds a second failure mode
The response:
  • Judge is triage, not truth
  • Multiple layers provide redundancy
  • Deterministic checks catch mechanical failures
  • Human sampling catches edge cases
  • Dual judges with disagreement escalation

Resources
QA Wall Starter Pack includes:
  • 20-item golden set template (YAML)
  • Judge rubric with scoring dimensions
  • Heuristics file with OWASP regex patterns
  • Slack alert payloads (Block Kit format)
  • SQLite schema and Python guardrails code

Your One Thing This Week
Stand up Layer 1: Create a 20-item golden set and run it manually against your current prompts. See what breaks. You'll be surprised.

Episode 14 of The Stateless Founder


The Stateless Founder, by Santi and Kira