Build a Minimal LLM Evaluation Loop That Catches Regressions While You Sleep
The Problem: Silent AI Failures
When your website goes down, you get an alert. When Stripe breaks, payments fail immediately. But when your LLM starts producing worse outputs—slightly less accurate summaries, off-tone emails, JSON fields that are almost right—nobody tells you. The model doesn't throw an error. It just gets worse.
For nomad founders managing AI workflows across time zones, this silent failure mode is especially dangerous. You're asleep, on a 12-hour bus in Peru, or doing a visa run in Bangkok while your content repurposing tool ships summaries that drop key facts.
The Solution: A Three-Piece Evaluation System
1. Golden Test Sets (15-20 Cases Per Output Type)
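As a rough sketch of what a golden set could look like on disk, the snippet below parses a few JSONL lines and slices them by tag. The field names (`id`, `input`, `expected`, `tags`) and the three rows are illustrative placeholders, not a fixed schema and not real production data; in practice each row would come from a production trace.

```python
import json

# Placeholder rows for illustration only. The keys "id", "input",
# "expected", and "tags" are assumed names, not a standard.
SAMPLE_JSONL = """\
{"id": "email-001", "input": "hey can u send the invoice", "expected": "Hello, could you please send the invoice?", "tags": ["formal_tone"]}
{"id": "email-002", "input": "call me at 555-0100 about the contract", "expected": "Please call me about the contract.", "tags": ["formal_tone", "has_pii"]}
{"id": "email-003", "input": "gracias por la reunion", "expected": "Muchas gracias por la reunion de hoy.", "tags": ["spanish"]}
"""

def load_cases(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def slice_by_tag(cases, tag):
    """Return only the cases that carry the given tag."""
    return [case for case in cases if tag in case.get("tags", [])]

cases = load_cases(SAMPLE_JSONL)
pii_cases = slice_by_tag(cases, "has_pii")
print(len(cases), [case["id"] for case in pii_cases])
```

Keeping tags as a plain list makes it cheap to report a pass rate for one slice (for example, only the PII cases) without maintaining separate files.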
- Real production data only: Synthetic test cases test synthetic problems
- JSONL format: One line per case, input paired with known-good output
- Tagged for slicing: Formal tone, has PII, Spanish language, etc.
- Three common types: Email rewrites, JSON extraction, content summaries

2. AI Judge Prompts (G-Eval Pattern)
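One way to picture a rubric-guided judge for email rewrites is a prompt that asks for analysis first and then one score for each of the four dimensions this section names. The sketch below only builds the prompt and parses a reply; the call to the judge model is left out because it depends on your provider's client. The 1-5 scale, the snake_case key names, and both function names are assumptions made for illustration.

```python
import json

# The four email-rewrite dimensions from this section, as assumed JSON keys.
DIMENSIONS = ["instruction_following", "tone_fit", "clarity", "pii_leak_check"]

def build_judge_prompt(instruction, source_text, candidate):
    """Ask for a written analysis first, then one 1-5 score per dimension."""
    rubric = "\n".join(f"- {name}: integer 1-5" for name in DIMENSIONS)
    return (
        "You are grading an email rewrite.\n"
        f"Instruction given to the writer:\n{instruction}\n\n"
        f"Original email:\n{source_text}\n\n"
        f"Rewrite to grade:\n{candidate}\n\n"
        "First write a short analysis of the rewrite. "
        "Then, on the final line, output one flat JSON object with these keys:\n"
        f"{rubric}\n"
    )

def parse_judge_reply(reply):
    """Extract the trailing flat JSON object and validate every score."""
    start, end = reply.rindex("{"), reply.rindex("}")
    scores = json.loads(reply[start:end + 1])
    for name in DIMENSIONS:
        value = scores.get(name)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {name}")
    return scores

# A hand-written reply standing in for real judge output.
fake_reply = (
    "Analysis: the rewrite is polite and keeps the request, "
    "but it repeats the phone number.\n"
    '{"instruction_following": 5, "tone_fit": 4, "clarity": 4, "pii_leak_check": 2}'
)
scores = parse_judge_reply(fake_reply)
print(scores["pii_leak_check"])
```

Rejecting replies with missing or out-of-range scores keeps a malformed judge response from silently counting as a pass.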
- Rubric-guided scoring: Analysis first, then scores per dimension
- Cross-family judges: Generate with OpenAI, judge with Anthropic (or vice versa)
- Blind randomized order: Prevents position bias
- Four dimensions for email rewrites: Instruction-following, tone fit, clarity, PII leak check

3. Pairwise A/B Testing
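A minimal sketch of a pairwise trial, assuming the judge answers with `first`, `second`, or `tie`: the two outputs are shuffled before the judge sees them, the mapping is remembered, and ties are routed to a person. The function names and the `human_review` label are hypothetical.

```python
import random

def make_pairwise_trial(output_a, output_b, rng):
    """Shuffle which prompt's output is shown first and remember the mapping."""
    order = ["A", "B"]
    rng.shuffle(order)
    outputs = {"A": output_a, "B": output_b}
    return {
        "first": outputs[order[0]],
        "second": outputs[order[1]],
        "first_is": order[0],
        "second_is": order[1],
    }

def resolve_verdict(trial, verdict):
    """Map the judge's positional answer back to a prompt label."""
    if verdict == "tie":
        return "human_review"  # borderline cases escalate to a person
    if verdict == "first":
        return trial["first_is"]
    if verdict == "second":
        return trial["second_is"]
    raise ValueError(f"unexpected verdict: {verdict!r}")

def win_rate(winners, label="B"):
    """Share of decided trials won by `label`; escalated ties are excluded."""
    decided = [w for w in winners if w in ("A", "B")]
    return sum(w == label for w in decided) / len(decided) if decided else 0.0

rng = random.Random(0)  # seeded only so the example is repeatable
trial = make_pairwise_trial("baseline output", "new prompt output", rng)
print(resolve_verdict(trial, "first"), win_rate(["A", "B", "B", "human_review"]))
```

Because the judge only ever sees "first" and "second", it has no way to learn which prompt produced which output.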
- Compare prompt A vs prompt B: Not just absolute scoring
- Randomized presentation: Judge sees outputs in random order
- Tie-breaking: Borderline cases escalate to human review

Reliability Mitigations
Judge Bias Problems
- Self-preference bias: Judges favor outputs from their own model family
- Position bias: Judges prefer whichever output they see first
- Verbosity bias: Longer outputs score higher regardless of quality

Solutions
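The human-sampling checks in this section reduce to a little arithmetic. The sketch below computes judge-human agreement, flags recalibration when agreement stays under the 95% target for two consecutive weeks, and picks review candidates. Treating "least confident" as "score closest to the pass threshold" is an assumption, since the article does not define confidence; the field names `score` and `threshold` are likewise hypothetical.

```python
def agreement_rate(pairs):
    """pairs: list of (judge_passed, human_passed) booleans for sampled jobs."""
    if not pairs:
        raise ValueError("no sampled jobs to compare")
    return sum(judge == human for judge, human in pairs) / len(pairs)

def needs_recalibration(weekly_rates, target=0.95, weeks=2):
    """True when the most recent `weeks` agreement rates are all below target."""
    recent = weekly_rates[-weeks:]
    return len(recent) == weeks and all(rate < target for rate in recent)

def least_confident(jobs, k):
    """Assumed proxy: the k jobs whose judge score is nearest the threshold."""
    return sorted(jobs, key=lambda job: abs(job["score"] - job["threshold"]))[:k]

sample = [(True, True), (True, False), (False, False), (True, True)]
print(agreement_rate(sample))
print(needs_recalibration([0.97, 0.93, 0.94]))
```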
- Cross-family separation: Never use the same provider for generation and judging
- Human sampling: Review 10-20% of live production jobs each week
- Focus sampling: Pull the cases where the judge was least confident
- 95% agreement target: If judge-human disagreement exceeds 5% for two weeks, recalibrate the judge

The Monday Scorecard (30 Minutes Weekly)
Six Key Numbers
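Two of the scorecard numbers, pass rate per output type and P95 latency, can be computed from a flat list of job records. This is a sketch under assumptions: the record keys `type` and `passed` are hypothetical, and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
from collections import defaultdict

def pass_rate_by_type(results):
    """results: list of {"type": str, "passed": bool} job records."""
    totals, passes = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record["type"]] += 1
        passes[record["type"]] += int(record["passed"])
    return {kind: passes[kind] / totals[kind] for kind in totals}

def p95(latencies_ms):
    """Nearest-rank 95th percentile, using integer math for the rank."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]

results = [
    {"type": "email", "passed": True},
    {"type": "email", "passed": False},
    {"type": "summary", "passed": True},
]
print(pass_rate_by_type(results), p95(list(range(1, 101))))
```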
- Pass rate per output type: Email rewrites (90% threshold), summarization (88%)
- Win rate from pairwise A/Bs: New prompt vs baseline
- P95 latency: 95th percentile response time
- Cost per 100 jobs: Token usage × per-token price
- Judge agreement: Percentage alignment with human sample
- Incidents: Anything that broke during the week

Decision Framework
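The Monday decision can be written down as a small function so it is applied the same way every week. The pass-rate floors below are the 90% and 88% thresholds from the scorecard; reading "stable" as "at or above the floor", the `cost_budget` parameter, and the returned labels are assumptions for the sake of the sketch.

```python
# Floors taken from the scorecard; the dictionary keys are assumed names.
THRESHOLDS = {"email_rewrite": 0.90, "summarization": 0.88}

def decide(pass_rates, cost_per_100, cost_budget, deprecation_break=False):
    """Return roll_back, hold_and_investigate, or roll_forward."""
    if deprecation_break:
        return "roll_back"  # a deprecated model broke the judge or generator
    dipped = [
        kind for kind, floor in THRESHOLDS.items()
        if pass_rates.get(kind, 0.0) < floor
    ]
    if dipped or cost_per_100 > cost_budget:
        return "hold_and_investigate"
    return "roll_forward"

print(decide({"email_rewrite": 0.93, "summarization": 0.90}, 5.20, 6.00))
```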
- Roll forward: Pass rates stable, costs in line
- Hold and investigate: Something dipped
- Roll back: Model deprecation broke judge or generator

Implementation Tools
CI Regression Gate
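The gate itself is a threshold check whose result becomes the process exit code. The sketch below is a standalone illustration of that idea and is not Promptfoo's own mechanism; the function name and the shape of `results` are assumptions.

```python
def gate_exit_code(results, threshold):
    """Return 0 when the pass rate meets the threshold, 1 otherwise.

    results: list of booleans, one per golden case (True = passed).
    """
    if not results:
        return 1  # an empty run should never count as a pass
    pass_rate = sum(results) / len(results)
    print(f"pass rate {pass_rate:.1%} against threshold {threshold:.0%}")
    return 0 if pass_rate >= threshold else 1

# In a CI step the script would end with something like:
#     import sys; sys.exit(gate_exit_code(results, 0.90))
code = gate_exit_code([True] * 17 + [False] * 3, 0.90)
```

A non-zero exit code is what most CI systems, GitHub Actions included, treat as a failed step, which is how the gate blocks a merge or deploy.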
- Promptfoo: Open source CLI with YAML config
- GitHub Actions: Automated eval runs on every PR
- Pass-rate thresholds: Build fails if quality regresses
- Non-zero exit code: Blocks deployment automatically

Cost Tracking
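The cost arithmetic is simple enough to keep in one helper. In this sketch the per-million-token prices are parameters you supply from your provider's current price list (none are hard-coded here), and the function names and budget parameter are hypothetical. The last lines reproduce the article's example of 4¢ per generation plus 1.2¢ per judge call.

```python
def call_cost_usd(input_tokens, output_tokens, usd_per_1m_input, usd_per_1m_output):
    """Cost of one API call from the token usage the provider returns."""
    total = input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output
    return total / 1_000_000

def cost_per_100_jobs(generation_usd, judge_usd):
    """Each job pays for one generation call and one judge call."""
    return 100 * (generation_usd + judge_usd)

def cost_alert(cost_per_100, budget_per_100):
    """True when spend per 100 jobs exceeds the budget you set."""
    return cost_per_100 > budget_per_100

weekly = cost_per_100_jobs(0.04, 0.012)
print(f"${weekly:.2f} per 100 jobs")
```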
- OpenAI/Anthropic APIs: Return token usage on every call
- Real example: 4¢ per generation + 1.2¢ per judge call = 5.2¢ per job, or $5.20 per 100 jobs
- Alert thresholds: Catch cost spikes before monthly review

Model Deprecation Monitoring
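Pinning through environment variables might look like the sketch below. The variable names (`GENERATOR_MODEL_CURRENT`, `GENERATOR_MODEL_PREVIOUS`, `ROLLBACK`) and the model identifiers are placeholders invented for illustration, not real model names; substitute the dated version strings your provider publishes.

```python
import os

def resolve_model(role, env=None):
    """Pick the pinned model for a role ("GENERATOR" or "JUDGE").

    Setting ROLLBACK=1 switches every role to its previous pinned version.
    """
    env = os.environ if env is None else env
    use_previous = env.get("ROLLBACK", "0") == "1"
    key = f"{role}_MODEL_{'PREVIOUS' if use_previous else 'CURRENT'}"
    model = env.get(key)
    if not model:
        raise KeyError(f"{key} is not set")
    return model

# Placeholder identifiers, passed as a dict so the example has no side effects.
example_env = {
    "GENERATOR_MODEL_CURRENT": "example-generator-v2",
    "GENERATOR_MODEL_PREVIOUS": "example-generator-v1",
}
print(resolve_model("GENERATOR", example_env))
print(resolve_model("GENERATOR", {**example_env, "ROLLBACK": "1"}))
```

With both versions already in the environment, rolling back is a one-variable change instead of a code edit.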
- Pin model versions: Keep last two working versions in environment variables
- Watch deprecation pages: OpenAI and Anthropic maintain lifecycle schedules
- One-line rollback: Pinned configs enable instant reversion

Weekly Rhythm
- Friday: Add 3-5 fresh cases from production traces
- Sunday: Open PR with prompt/model changes, let CI run
- Monday: Fill scorecard, make decision, assign one action item
- Daily: Alerts on latency/cost thresholds catch spikes

Monthly Maintenance
- Refresh golden sets: Replace stale cases with fresh production examples
- Close stale failures: Archive resolved issues
- Recalibrate judge: If agreement drops below the 95% target

Start Small: The One-Output-Type Version
Don't try to build all three output types at once. Pick your highest-volume type, build 15 golden cases, wire up one judge prompt, run for two weeks. You'll catch things you didn't know were breaking.
The full three-type system is the mature version. One type is the version that fits in an afternoon and still saves you from Monday morning client complaints.
Resources
- Starter Kit: JSONL templates, G-Eval judge prompts, Promptfoo CI config
- Monday Scorecard: Notion template with all six metrics
- Deprecations Checklist: Model lifecycle monitoring guide
- Human Sampling Guide: 10-20% review protocols

The vibes-based evaluation method works until it doesn't. When it doesn't, you find out from your customers. This system ensures you know before they do.