The Stateless Founder

Build a Three-Layer QA Wall for AI Outputs in 48 Hours



Every AI deliverable you ship without quality checks is a bet against model drift, prompt degradation, and silent failures. This episode builds a three-layer QA wall that catches problems before clients do.

The Cost of Not Checking
  • Human evaluation: $50 per case, 10 minutes
  • LLM judge evaluation: $0.02 per case, 16 seconds
  • At 1,000 cases/week: $50,000 vs $20 in evaluation costs

Layer 1: Rubric-Scored LLM Judge

Deploy an LLM judge against a weighted rubric before every deliverable ships:

Rubric: Five Weighted Criteria Plus Safety Flags
  • Task fulfillment (30%): Did it follow instructions?
  • Factual accuracy (25%): Are claims verifiable?
  • Clarity and structure (15%): Is it well-organized?
  • Style and brand fit (10%): Does it match the client's voice?
  • Citations (10%): Is attribution proper?
  • Safety flags (negative weight): PII leakage, hallucinations
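
One way to turn the rubric into a single number is a weighted average. The sketch below is illustrative only and rests on assumptions not stated in the episode: each criterion is scored on a 1-5 scale, scores are divided by 5 to land in the 0-1 range, and the result is divided by the weight total because the listed weights sum to 90%. Safety flags are left out of the arithmetic here and would be tracked separately. The names `WEIGHTS` and `total_score` are hypothetical.

```python
# Weights as listed in the rubric. They sum to 0.90, so this sketch divides
# by the weight total (an assumption) to keep the final score within 0..1.
WEIGHTS = {
    "task_fulfillment": 0.30,
    "factual_accuracy": 0.25,
    "clarity_structure": 0.15,
    "style_brand_fit": 0.10,
    "citations": 0.10,
}
MAX_SCORE = 5  # assumed 1-5 scale per criterion


def total_score(scores: dict) -> float:
    """Weighted average of per-criterion scores, scaled to the 0..1 range."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    weighted = sum(WEIGHTS[name] * scores[name] / MAX_SCORE for name in WEIGHTS)
    return weighted / sum(WEIGHTS.values())


example = {
    "task_fulfillment": 5,
    "factual_accuracy": 4,
    "clarity_structure": 4,
    "style_brand_fit": 3,
    "citations": 4,
}
print(round(total_score(example), 3))
```

Dividing by the weight total is a design choice. If the missing 10% is meant for a sixth scored criterion, add it to `WEIGHTS` instead.
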

Scoring Thresholds
  • Green (ships automatically): total of 0.8 or higher, no critical flags, top two criteria scored 4+
  • Amber (human edit queue): total from 0.7 up to (but not including) 0.8, or any criterion scored ≤2
  • Red (blocked/escalated): total below 0.7, or any critical flag
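
The thresholds can be expressed as a small routing function. This is a minimal sketch, not the episode's implementation. It assumes "top two criteria" means the two highest-weighted ones (task fulfillment and factual accuracy), and it sends any output that clears 0.8 but misses a green condition to amber, which the notes do not spell out. The function name and dictionary keys are hypothetical.

```python
def route(total: float, scores: dict, critical_flags: list) -> str:
    """Return "green", "amber", or "red" for one judged output."""
    # Red wins over everything else: any critical flag or a total below 0.7.
    if critical_flags or total < 0.7:
        return "red"
    # Assumption: the "top two" criteria are the two with the largest weights.
    top_two_ok = (
        scores["task_fulfillment"] >= 4 and scores["factual_accuracy"] >= 4
    )
    any_low = any(value <= 2 for value in scores.values())
    if total >= 0.8 and top_two_ok and not any_low:
        return "green"
    # Everything in between goes to the human edit queue.
    return "amber"
```

For example, `route(0.85, scores, ["pii_leak"])` returns `"red"` regardless of the total, because critical flags are checked first.
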

Research Backing
  • ICLR 2026 AutoMetrics: +33.4% correlation with humans vs direct LLM-as-judge
  • AAAI 2026 Think-J: Rubric-anchored judges more robust to noisy training data

Layer 2: Golden-Set Replay and Drift Detection

Build a golden set of 40-60 items per output type, scored by humans with agreed-upon labels and rationales.

Weekly Calibration Process
  1. Replay the golden set through your judge
  2. Measure agreement using Cohen's kappa or Kendall's tau
  3. Kappa >0.61 = substantial agreement
  4. Track week-over-week trends
  5. When agreement drops → pause auto-shipping and investigate
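
Steps 2 to 5 can be sketched with a hand-rolled Cohen's kappa over the golden-set labels. The labels, function names, and the choice to pause whenever kappa falls below 0.61 are assumptions for illustration. The episode only says to pause "when agreement drops", so the exact trigger is yours to set.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def should_pause(kappa: float, threshold: float = 0.61) -> bool:
    """Assumed policy: pause auto-shipping when kappa falls below threshold."""
    return kappa < threshold


human_labels = ["green"] * 4 + ["amber"] * 3 + ["red"] * 3
judge_labels = ["green"] * 3 + ["amber"] * 3 + ["red"] * 4
kappa = cohens_kappa(human_labels, judge_labels)
print(round(kappa, 3), should_pause(kappa))
```

Here the two raters agree on 8 of 10 items, and chance agreement is 0.33, so kappa comes out near 0.70 and auto-shipping stays on.
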

Drift Detection
  • PLOS One 2026 study: Weekly Bradley-Terry recalibration achieved τ=0.59-0.68 vs humans
  • Detected three drift patterns: stable, improving, degrading
  • Without weekly replay, you're "shipping and hoping"
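
A simple way to label your own weekly agreement series with the same three words is a least-squares slope. This is not the method from the study cited above. It is a rough sketch, and the `tolerance` of 0.01 per week is a hypothetical cutoff you would tune.

```python
def classify_trend(weekly_agreement: list, tolerance: float = 0.01) -> str:
    """Label a series of weekly agreement scores by its linear trend."""
    n = len(weekly_agreement)
    if n < 2:
        raise ValueError("need at least two weekly readings")
    # Ordinary least-squares slope with week index 0, 1, 2, ... as x.
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_agreement) / n
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_agreement)
    )
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope > tolerance:
        return "improving"
    if slope < -tolerance:
        return "degrading"
    return "stable"


print(classify_trend([0.70, 0.66, 0.62, 0.58]))
```

A four-week window matches the trend shown on the Monday dashboard. Shorter windows react faster but are noisier.
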

Guardrails Against Brittleness
  • Randomize position: Run both A-B and B-A orders (Chatbot Arena method)
  • Separate concerns: Rubric is the workhorse, pairwise is the tiebreaker
  • Never self-judge: Don't let GPT-4o judge GPT-4o outputs
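
The position guardrail can be sketched as a wrapper that asks the pairwise judge twice, once per order, and accepts only a verdict that survives the swap. The `judge` callable is a stand-in for whatever model call you use. Its signature and its "first"/"second" return values are assumptions.

```python
from typing import Callable

# Hypothetical judge interface: takes two candidate texts in presentation
# order and returns "first" or "second" for the one it prefers.
PairwiseJudge = Callable[[str, str], str]


def debiased_preference(judge: PairwiseJudge, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(a, b)   # A shown first
    backward = judge(b, a)  # B shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    # The verdict flipped with the order, so treat it as position bias.
    return "tie"
```

A judge that always favors whichever answer appears first yields `"tie"` on every pair. That makes position bias visible instead of letting it pick winners.
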

Layer 3: Human Sampling with Red/Amber/Green Thresholds

Strategic 5-10% human sampling focused on risk and borderlines:

Sample Composition
  • 50%: Amber decisions (borderlines the judge wasn't sure about)
  • 30%: High-risk greens (long outputs, safety-sensitive, new client styles)
  • 20%: Random greens (keep the judge honest)
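
The 50/30/20 mix could be drawn with a stratified sample like the one below. The item fields `status` and `high_risk` are hypothetical, and the sketch assumes "random greens" come from the greens not already marked high-risk. If a pool is smaller than its quota, it takes what is available rather than backfilling from another pool.

```python
import random
from typing import Optional

# Share of the weekly human sample drawn from each pool.
MIX = {"amber": 0.5, "high_risk_green": 0.3, "random_green": 0.2}


def draw_sample(items: list, rate: float = 0.05, seed: Optional[int] = None) -> list:
    """Pick roughly rate * len(items) items for human review, split by MIX."""
    rng = random.Random(seed)
    target = round(len(items) * rate)
    pools = {
        "amber": [i for i in items if i["status"] == "amber"],
        "high_risk_green": [
            i for i in items if i["status"] == "green" and i["high_risk"]
        ],
        "random_green": [
            i for i in items if i["status"] == "green" and not i["high_risk"]
        ],
    }
    sample = []
    for name, share in MIX.items():
        quota = round(target * share)
        pool = pools[name]
        # Sample without replacement, capped by the pool size.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

With 1,000 items and a 5% rate, the target is 50 reviews: 25 ambers, 15 high-risk greens, and 10 random greens.
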

Dashboard Thresholds
  • Green: Judge precision ≥95%, human disagreement <10%, no critical flags
  • Amber: One metric slipped → raise cutline by 0.02, bump sampling to 15%
  • Red: Critical safety event, 2+ major misses in a 50-item sample, or kappa <0.5
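
The dashboard rules can be checked mechanically each Monday. In this sketch, anything that is neither red nor fully green is treated as amber, including the case where both metrics slip, which the notes do not address. The function and parameter names are hypothetical.

```python
def dashboard_status(
    precision: float,
    disagreement: float,
    kappa: float,
    critical_events: int,
    major_misses: int,
) -> str:
    """Return the weekly R/A/G status from the human-sample metrics."""
    # major_misses is the count found in a 50-item human sample.
    if critical_events > 0 or major_misses >= 2 or kappa < 0.5:
        return "red"
    if precision >= 0.95 and disagreement < 0.10:
        return "green"
    return "amber"


def amber_adjustments(cutline: float) -> tuple:
    """Amber response: raise the cutline by 0.02 and sample 15% of outputs."""
    return round(cutline + 0.02, 2), 0.15


print(dashboard_status(0.93, 0.05, 0.70, 0, 0), amber_adjustments(0.80))
```

Checking red first means a critical safety event overrides otherwise healthy precision and disagreement numbers.
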

Client Value Proposition

"Every output gets scored by a calibrated judge against a six-criterion rubric. Top performers ship automatically. Borderlines get a human edit. A weekly 5-10% human sample feeds a dashboard that updates every Monday."

The Monday Dashboard

Five widgets for a 30-minute weekly review:

  1. Volume and mix: Items processed, percentage green/amber/red
  2. Judge health: Agreement vs golden set with 4-week trend
  3. Human QA metrics: Precision, disagreement rate, sample size
  4. Risk flags: By type and resolution speed
  5. Cost per eval: Track efficiency gains

Cost Analysis: Visa Run Revenue Math
  • Judge costs: $20/week for 1,000 items
  • Human sample: 50-100 items at $15-20/hour
  • Total QA cost: ~$350/week
  • Full human review, for comparison: $50,000/week
  • ROI: If the $350/week spend prevents one client churn, it pays for itself quarterly
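
The episode does not show how the weekly total is derived. One reading that lands near it reuses the 10-minutes-per-case figure from the human evaluation numbers as the review time per sampled item, which is an assumption. The constants below come from the notes, while the function name is hypothetical.

```python
CASES_PER_WEEK = 1_000
JUDGE_COST_PER_CASE = 0.02   # dollars, from the LLM judge figure
HUMAN_COST_PER_CASE = 50.00  # dollars, from the full human review figure
MINUTES_PER_REVIEW = 10      # assumed to carry over to sampled reviews


def weekly_qa_cost(sample_size: int, hourly_rate: float) -> float:
    """Judge cost for every case plus reviewer time for the human sample."""
    judge_cost = CASES_PER_WEEK * JUDGE_COST_PER_CASE
    review_hours = sample_size * MINUTES_PER_REVIEW / 60
    return judge_cost + review_hours * hourly_rate


low = weekly_qa_cost(sample_size=50, hourly_rate=15.0)
high = weekly_qa_cost(sample_size=100, hourly_rate=20.0)
full_human = CASES_PER_WEEK * HUMAN_COST_PER_CASE
print(f"QA wall: ${low:,.0f} to ${high:,.0f} per week")
print(f"Full human review: ${full_human:,.0f} per week")
```

Under these assumptions the wall costs between about $145 and $353 a week, so the ~$350 figure corresponds to the upper end: 100 sampled items at $20/hour.
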

Implementation Checklist

This Week
  1. Build golden set: 40 items from real output (good, borderline, bad)
  2. Score manually: Create foundation for everything else
  3. Schedule Monday review: 30 minutes on calendar

Next Week
  1. Deploy rubric-scored judge on new outputs
  2. Set up weekly golden-set replay
  3. Implement human sampling workflow

Resources

The QA Wall Kit includes:

  • Rubric template with acceptance thresholds
  • Judge prompt pack (rubric + pairwise modes)
  • Human sampling SOP with R/A/G dashboard
  • Monday review checklist

Research Sources
  • ICLR 2026 AutoMetrics: Rubric-style evaluators improve correlation by 33.4%
  • PLOS One 2026: Bias-calibrated LLM judges with weekly recalibration
  • AAAI 2026 Think-J: Generative judges outperform classifier-style approaches
  • UW Health Clinical Study: Cost/latency comparison of human vs LLM evaluation
  • TREC AutoJudge 2026: Live benchmark studying judge vulnerabilities and guardrails

Next episode: Judge fine-tuning vs off-the-shelf models for domain-specific QA


The Stateless Founder
By Santi, Kira