The Stateless Founder

Build a Minimal LLM Evaluation Loop That Catches Regressions While You Sleep


The Problem: Silent AI Failures

When your website goes down, you get an alert. When Stripe breaks, payments fail immediately. But when your LLM starts producing worse outputs—slightly less accurate summaries, off-tone emails, JSON fields that are almost right—nobody tells you. The model doesn't throw an error. It just gets worse.

For nomad founders managing AI workflows across time zones, this silent failure mode is especially dangerous. You're asleep, on a 12-hour bus in Peru, or doing a visa run in Bangkok while your content repurposing tool ships summaries that drop key facts.

The Solution: A Three-Piece Evaluation System
1. Golden Test Sets (15-20 Cases Per Output Type)
  • Real production data only: Synthetic test cases test synthetic problems
  • JSONL format: One line per case, input paired with known-good output
  • Tagged for slicing: Formal tone, has PII, Spanish language, etc.
  • Three common types: Email rewrites, JSON extraction, content summaries
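
As a rough sketch of what a golden set could look like on disk, here is one way to load JSONL cases and slice them by tag. The field names (`id`, `type`, `tags`, `input`, `expected`) and the three sample cases are illustrative assumptions, not a format prescribed in the episode; real cases should come from production traces.

```python
import json

# Hypothetical golden cases. Field names and contents are assumptions for illustration.
GOLDEN_JSONL = """\
{"id": "email-001", "type": "email_rewrite", "tags": ["formal_tone"], "input": "hey, invoice is late", "expected": "Hello, a quick note that the invoice is overdue."}
{"id": "email-002", "type": "email_rewrite", "tags": ["has_pii", "spanish"], "input": "Hola Ana, tu DNI es 123", "expected": "Hola Ana, hemos recibido tu documento."}
{"id": "sum-001", "type": "summary", "tags": ["formal_tone"], "input": "Long report text", "expected": "Short summary"}
"""

def load_cases(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def slice_by_tag(cases, tag):
    """Return only the cases carrying the given tag."""
    return [case for case in cases if tag in case["tags"]]

cases = load_cases(GOLDEN_JSONL)
formal = slice_by_tag(cases, "formal_tone")
print(len(cases), [case["id"] for case in formal])
```

Keeping tags as a plain list makes it cheap to report pass rates per slice (for example, only the `has_pii` cases) without maintaining separate files.
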
2. AI Judge Prompts (G-Eval Pattern)
  • Rubric-guided scoring: Analysis first, then scores per dimension
  • Cross-family judges: Generate with OpenAI, judge with Anthropic (or vice versa)
  • Blind randomized order: Prevents position bias
  • Four dimensions for email rewrites: Instruction-following, tone fit, clarity, PII leak check
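
A minimal sketch of the analysis-first, score-per-dimension idea for email rewrites follows. The prompt wording, the 1 to 5 scale, and the "JSON on the last line" reply format are assumptions chosen for illustration; no model is called, and the judge reply is a hand-written stand-in.

```python
import json

# The four email-rewrite dimensions listed above. The 1-5 scale is an assumption.
DIMENSIONS = ["instruction_following", "tone_fit", "clarity", "pii_leak_check"]

def build_judge_prompt(instruction, source, output):
    """Ask for written analysis first, then one integer score per dimension."""
    rubric = "\n".join(f"- {d}: integer 1 (poor) to 5 (excellent)" for d in DIMENSIONS)
    return (
        "You are grading an email rewrite.\n"
        f"Instruction: {instruction}\n"
        f"Original email: {source}\n"
        f"Rewrite: {output}\n\n"
        "First write a short analysis of the rewrite. Then, on the final line, "
        "return a JSON object with these keys:\n"
        f"{rubric}\n"
    )

def parse_scores(judge_reply):
    """Read the JSON object from the last line and validate every dimension."""
    last_line = judge_reply.strip().splitlines()[-1]
    scores = json.loads(last_line)
    for d in DIMENSIONS:
        if not isinstance(scores.get(d), int) or not 1 <= scores[d] <= 5:
            raise ValueError(f"missing or out-of-range score for {d}")
    return {d: scores[d] for d in DIMENSIONS}

# Hand-written stand-in for a judge reply (hypothetical).
fake_reply = (
    "The rewrite follows the instruction and keeps a formal tone.\n"
    '{"instruction_following": 5, "tone_fit": 4, "clarity": 4, "pii_leak_check": 5}'
)
scores = parse_scores(fake_reply)
print(scores)
```

Validating the reply strictly means a malformed judge response surfaces as an error instead of quietly counting as a pass.
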
3. Pairwise A/B Testing
  • Compare prompt A vs prompt B: Not just absolute scoring
  • Randomized presentation: Judge sees outputs in random order
  • Tie-breaking: Borderline cases escalate to human review
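
The pairwise flow can be sketched roughly as below: shuffle the two outputs, let the judge pick "first", "second", or "tie", then map the verdict back to prompt A or B. The verdict vocabulary and the function names are assumptions; the judge call itself is left out.

```python
import random

def present_pair(output_a, output_b, rng):
    """Shuffle the two outputs; return them along with the hidden label order."""
    order = ["A", "B"]
    rng.shuffle(order)
    by_label = {"A": output_a, "B": output_b}
    return by_label[order[0]], by_label[order[1]], order

def resolve_verdict(verdict, order):
    """Map the judge's 'first' / 'second' / 'tie' back to prompt A or B."""
    if verdict == "tie":
        return "human_review"  # borderline cases escalate to a person
    return order[0] if verdict == "first" else order[1]

def win_rate(results, candidate="B"):
    """Share of decided comparisons won by the candidate prompt."""
    decided = [r for r in results if r != "human_review"]
    return sum(r == candidate for r in decided) / len(decided) if decided else 0.0

first, second, order = present_pair("output from A", "output from B", random.Random())
# Hypothetical resolved verdicts from four comparisons.
results = ["B", "B", "A", "human_review"]
print(f"{win_rate(results):.2f}")  # 2 of the 3 decided comparisons went to B
```

Ties are excluded from the denominator here so that escalated cases do not count for or against the new prompt; counting them as losses would be a stricter alternative.
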
Reliability Mitigations
Judge Bias Problems
  • Self-preference bias: Judges favor their own model family's outputs
  • Position bias: Judges prefer whichever output they see first
  • Verbosity bias: Longer outputs score higher regardless of quality
Solutions
  • Cross-family separation: Never use the same provider for generation and judging
  • Human sampling: 10-20% of live production jobs reviewed weekly
  • Focus sampling: Pull cases where the judge was least confident
  • 95% agreement target: If judge-human disagreement exceeds 5% for two weeks running, recalibrate the judge
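
One way the agreement check and focus sampling might be wired up is sketched below. It assumes each reviewed job yields a (judge passed, human passed) pair and that the judge reports some `confidence` number per job; how that confidence is obtained is not specified in the episode, so treat the field as hypothetical.

```python
def agreement_rate(pairs):
    """pairs: list of (judge_passed, human_passed) booleans for one week."""
    return sum(j == h for j, h in pairs) / len(pairs)

def needs_recalibration(weekly_rates, target=0.95, weeks=2):
    """True when the most recent `weeks` agreement rates all fall below the target."""
    recent = weekly_rates[-weeks:]
    return len(recent) == weeks and all(rate < target for rate in recent)

def focus_sample(jobs, k):
    """Pick the k jobs where the judge reported the lowest confidence."""
    return sorted(jobs, key=lambda job: job["confidence"])[:k]

# Hypothetical week: 20 human-reviewed jobs, 2 disagreements.
week = [(True, True)] * 18 + [(True, False)] * 2
print(agreement_rate(week))
print(needs_recalibration([0.97, 0.90, 0.90]))  # two weeks in a row under 95%
print(needs_recalibration([0.90, 0.97]))        # recovered last week

jobs = [{"id": 1, "confidence": 0.9}, {"id": 2, "confidence": 0.4}, {"id": 3, "confidence": 0.7}]
print([job["id"] for job in focus_sample(jobs, 2)])
```
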
The Monday Scorecard (30 Minutes Weekly)
Six Key Numbers
  1. Pass rate per output type: Email rewrites (90% threshold), summarization (88%)
  2. Win rate from pairwise A/Bs: New prompt vs baseline
  3. P95 latency: 95th percentile response time
  4. Cost per 100 jobs: Token usage × per-token price
  5. Judge agreement: Percentage alignment with human sample
  6. Incidents: Anything that broke during the week
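
Two of these numbers are easy to get subtly wrong, so here is a small sketch of computing pass rate and P95 latency from a week of results. It uses the nearest-rank percentile definition, which is one of several common conventions; the sample data is made up.

```python
def pass_rate(results):
    """results: list of booleans, one per evaluated job."""
    return sum(results) / len(results)

def p95(latencies_ms):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]

# Hypothetical week: 20 jobs, 18 passes, latencies from 100 ms to 2000 ms.
results = [True] * 18 + [False] * 2
latencies = list(range(100, 2100, 100))
print(f"pass rate: {pass_rate(results):.0%}")
print(f"p95 latency: {p95(latencies)} ms")
```

With only 20 samples, P95 is effectively the second-slowest job, so a single slow outlier can move it a lot; more jobs per week give a steadier number.
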
Decision Framework
  • Roll forward: Pass rates stable, costs in line
  • Hold and investigate: A number on the scorecard dipped
  • Roll back: Model deprecation broke judge or generator
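
The three outcomes can be expressed as a small function, sketched here under some assumptions: "stable" is read as "at or above the threshold for that output type", "in line" as "at or under a cost budget you pick", and the deprecation flag is set by hand. None of these exact rules come from the episode.

```python
def monday_decision(pass_rates, thresholds, cost_per_100, cost_budget, deprecation_break=False):
    """Return 'roll back', 'hold and investigate', or 'roll forward'."""
    if deprecation_break:
        return "roll back"
    dipped = [name for name, rate in pass_rates.items() if rate < thresholds[name]]
    if dipped or cost_per_100 > cost_budget:
        return "hold and investigate"
    return "roll forward"

# Thresholds from the scorecard above; the rates and the $6.00 budget are hypothetical.
thresholds = {"email_rewrite": 0.90, "summary": 0.88}
print(monday_decision({"email_rewrite": 0.93, "summary": 0.91}, thresholds, 5.20, 6.00))
print(monday_decision({"email_rewrite": 0.93, "summary": 0.85}, thresholds, 5.20, 6.00))
print(monday_decision({"email_rewrite": 0.93, "summary": 0.91}, thresholds, 5.20, 6.00, deprecation_break=True))
```
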
Implementation Tools
CI Regression Gate
  • Promptfoo: Open source CLI with YAML config
  • GitHub Actions: Automated eval runs on every PR
  • Pass-rate thresholds: Build fails if quality regresses
  • Non-zero exit code: Blocks deployment automatically
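
The gate itself boils down to "compare pass rates to thresholds and return a non-zero exit code on failure". The sketch below shows that logic in plain Python as a stand-in; it does not use or reproduce Promptfoo's own configuration, and the pass rates are invented.

```python
import sys

def gate(pass_rates, thresholds):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for output_type, minimum in thresholds.items():
        rate = pass_rates.get(output_type, 0.0)  # a missing type counts as failing
        if rate < minimum:
            failures.append(f"{output_type}: {rate:.0%} is below the {minimum:.0%} threshold")
    return failures

def main(pass_rates, thresholds):
    """Print failures to stderr and return the exit code CI should use."""
    failures = gate(pass_rates, thresholds)
    for message in failures:
        print(message, file=sys.stderr)
    return 1 if failures else 0

# Hypothetical eval results for one PR.
exit_code = main(
    {"email_rewrite": 0.93, "summary": 0.85},
    {"email_rewrite": 0.90, "summary": 0.88},
)
print("exit code:", exit_code)
```

In an actual CI step the script would end with `sys.exit(exit_code)`, so the workflow run fails and the merge or deploy is blocked.
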
Cost Tracking
  • OpenAI/Anthropic APIs: Return token usage on every call
  • Real example: 4¢ per generation + 1.2¢ per judge call = 5.2¢ per job, or $5.20 per 100 jobs
  • Alert thresholds: Catch cost spikes before monthly review
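
The arithmetic behind the example can be checked with a few lines. The per-job figures (4¢ and 1.2¢) are the ones quoted above; the token counts, the per-1,000-token prices, and the $6.00 alert level are hypothetical placeholders, so substitute your provider's current pricing.

```python
def call_cost_cents(input_tokens, output_tokens, input_price, output_price):
    """Cost of one call. Prices are in cents per 1,000 tokens (hypothetical units)."""
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price

def cost_per_100_jobs_usd(generation_cents, judge_cents):
    """Each job is one generation plus one judge call; 100 cents per dollar."""
    return round((generation_cents + judge_cents) * 100 / 100, 2)

def over_alert(cost_usd, alert_usd):
    """True when the weekly number should trigger an alert."""
    return cost_usd > alert_usd

weekly = cost_per_100_jobs_usd(4, 1.2)
print(f"${weekly:.2f} per 100 jobs")
print(over_alert(weekly, alert_usd=6.00))

# Hypothetical single call: 2,000 input tokens and 500 output tokens.
print(call_cost_cents(2000, 500, input_price=1.0, output_price=4.0))
```
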
Model Deprecation Monitoring
  • Pin model versions: Keep last two working versions in environment variables
  • Watch deprecation pages: OpenAI and Anthropic maintain lifecycle schedules
  • One-line rollback: Pinned configs enable instant reversion
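
A pinned-version setup with a one-line rollback might look roughly like this. The environment variable names (`GEN_MODEL_CURRENT`, `GEN_MODEL_PREVIOUS`, `GEN_MODEL_ROLLBACK`) and the model identifiers are placeholders invented for the sketch, not real model names or an established convention.

```python
import os

def active_model(env=os.environ):
    """Return the current pin, or the previous pin when rollback is switched on."""
    current = env["GEN_MODEL_CURRENT"]    # hypothetical variable name
    previous = env["GEN_MODEL_PREVIOUS"]  # hypothetical variable name
    return previous if env.get("GEN_MODEL_ROLLBACK") == "1" else current

# Placeholder identifiers; use the exact dated version strings your provider publishes.
pins = {"GEN_MODEL_CURRENT": "model-v2-pinned", "GEN_MODEL_PREVIOUS": "model-v1-pinned"}
print(active_model(pins))
print(active_model({**pins, "GEN_MODEL_ROLLBACK": "1"}))
```

Flipping a single flag instead of editing the model name keeps the last known-good version one change away, which matters when the rollback happens from a phone.
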
Weekly Rhythm
  • Friday: Add 3-5 fresh cases from production traces
  • Sunday: Open PR with prompt/model changes, let CI run
  • Monday: Fill scorecard, make decision, assign one action item
  • Daily: Alerts on latency/cost thresholds catch spikes
Monthly Maintenance
  • Refresh golden sets: Replace stale cases with fresh production examples
  • Close stale failures: Archive resolved issues
  • Recalibrate judge: If agreement drops below the 95% target
Start Small: The One-Output-Type Version

Don't try to build all three output types at once. Pick your highest-volume type, build 15 golden cases, wire up one judge prompt, run for two weeks. You'll catch things you didn't know were breaking.

The full three-type system is the mature version. One type is the version that fits in an afternoon and still saves you from Monday morning client complaints.

Resources
  • Starter Kit: JSONL templates, G-Eval judge prompts, Promptfoo CI config
  • Monday Scorecard: Notion template with all six metrics
  • Deprecations Checklist: Model lifecycle monitoring guide
  • Human Sampling Guide: 10-20% review protocols

The vibes-based evaluation method works until it doesn't. When it doesn't, you find out from your customers. This system ensures you know before they do.


The Stateless Founder, by Santi and Kira