Build a QA Wall That Catches Bad AI Outputs Before Clients See Them
The Problem: When AI Content Goes Wrong
- 47 deliverables per week through AI workflows, across 7 contractors on 4 continents
- Only 5 out of 47 got human review before shipping to clients
- The breaking point: a fabricated citation published to 11,000 LinkedIn followers
- Root cause: no systematic checks between model output and client inbox

The Three-Layer QA Wall
Layer 1: Golden-Set Regression Tests
- 20 diverse test cases covering every content type you deliver
- Properties-based testing: brand tone, word count, must-include/avoid elements
- Automated CI integration: deploy fails if the golden set breaks
- Version control: run tests before every prompt change or model swap

Tooling worth a look: the OpenAI Evals framework, LangSmith datasets and evaluators, and Statsig's guidance on dataset diversity.
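Here's a minimal sketch of a golden-set runner in Python, assuming a golden_set.yaml where each case carries an id, a prompt, and properties like min_words/max_words and must_include/must_avoid lists (field names are illustrative, not from the episode); the generate() stub stands in for whatever model call you use. A non-zero exit code is what lets CI fail the deploy when the set breaks.

```python
# golden_set_runner.py -- minimal sketch; field names (prompt, must_include,
# must_avoid, min_words, max_words) are illustrative, not the episode's exact template.
import sys
import yaml  # pip install pyyaml

def generate(prompt: str) -> str:
    """Call your model here (OpenAI, Anthropic, local) -- stubbed out."""
    raise NotImplementedError

def check_case(case: dict, output: str) -> list[str]:
    """Properties-based checks: word count, must-include, must-avoid."""
    failures = []
    words = len(output.split())
    if not case.get("min_words", 0) <= words <= case.get("max_words", 10**6):
        failures.append(f"word count {words} outside bounds")
    for phrase in case.get("must_include", []):
        if phrase.lower() not in output.lower():
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in case.get("must_avoid", []):
        if phrase.lower() in output.lower():
            failures.append(f"contains banned phrase: {phrase!r}")
    return failures

def main(path: str = "golden_set.yaml") -> int:
    cases = yaml.safe_load(open(path))
    failed = 0
    for case in cases:
        problems = check_case(case, generate(case["prompt"]))
        if problems:
            failed += 1
            print(f"FAIL {case['id']}: {problems}")
    print(f"{len(cases) - failed}/{len(cases)} passed")
    return 1 if failed else 0  # non-zero exit fails the CI deploy

if __name__ == "__main__":
    sys.exit(main())
```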
Layer 2: Heuristics + LLM Judges

Deterministic checks (fast and cheap):
- Word count validation
- JSON schema validation
- URL domain allowlists
- PII detection with OWASP regex patterns
- AWS Comprehend for model-based PII detection (90% confidence threshold)

LLM judges (model-graded):

- Dual judges: fast model + primary model
- Explicit rubric: faithfulness, relevance, style/tone (1-5 scale)
- Pass threshold: average ≥ 4.0, faithfulness ≥ 4.0
- Rationale required: the judge must cite specific issues
- Auto-escalation: route to a human if the judges disagree by more than 0.5 points
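A sketch of how the Layer 2 gate could fit together in Python. The ALLOWED_DOMAINS set, the word-count bounds, and the judge() stub are all placeholders to wire to your own config and models; the thresholds (average ≥ 4.0, faithfulness ≥ 4.0, escalate on a >0.5 disagreement) are the rubric numbers above.

```python
# layer2_gate.py -- sketch of the pass/fail/escalate logic; the judge calls,
# ALLOWED_DOMAINS, and the 300-900 word bounds are illustrative placeholders.
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "client-site.com"}  # hypothetical allowlist
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")    # one simple PII pattern

def heuristic_failures(text: str, min_words: int, max_words: int) -> list[str]:
    """Deterministic checks: word count, URL allowlist, basic PII regex."""
    issues = []
    words = len(text.split())
    if not min_words <= words <= max_words:
        issues.append(f"word count {words} outside [{min_words}, {max_words}]")
    for url in re.findall(r"https?://\S+", text):
        if urlparse(url).netloc.removeprefix("www.") not in ALLOWED_DOMAINS:
            issues.append(f"URL off allowlist: {url}")
    if EMAIL_RE.search(text):
        issues.append("possible PII: email address found")
    return issues

def judge(model: str, text: str, source: str) -> dict:
    """Return {'faithfulness': x, 'relevance': y, 'style': z} on a 1-5 scale.
    Stub -- call your fast and primary models with the rubric prompt here."""
    raise NotImplementedError

def verdict(text: str, source: str) -> str:
    if heuristic_failures(text, 300, 900):   # cheap checks run first
        return "fail"
    fast, primary = judge("fast", text, source), judge("primary", text, source)
    avgs = [sum(s.values()) / len(s) for s in (fast, primary)]
    if abs(avgs[0] - avgs[1]) > 0.5:
        return "escalate"                    # judges disagree -> human review
    if min(fast["faithfulness"], primary["faithfulness"]) < 4.0:
        return "fail"
    return "pass" if sum(avgs) / 2 >= 4.0 else "fail"
```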
Layer 3: Strategic Human Sampling

Detection probability = 1 - (1 - error_rate)^sample_size

- 20% sampling for new clients (88% detection at a 10% error rate)
- 10% sampling for established clients
- Batch pausing: stop all client work if one sample fails
- Random rotation of reviewers, with Slack alerts
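The formula is easy to sanity-check. The sample size of 20 below is an assumption chosen to reproduce the quoted 88% figure at a 10% error rate.

```python
# detection_probability.py -- the Layer 3 formula; the sample_size of 20 is an
# assumption that reproduces the ~88% figure quoted for new clients.
def detection_probability(error_rate: float, sample_size: int) -> float:
    """Chance that at least one bad deliverable lands in the reviewed sample."""
    return 1 - (1 - error_rate) ** sample_size

print(detection_probability(0.10, 20))  # ~0.878 -> "88% detection at 10% error rate"
print(detection_probability(0.10, 10))  # ~0.651 with half the sampling effort
```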
Spend Guardrails

Per-Job Caps
- max_output_tokens set on every API call
- Short-circuit to human review after 2 retries
- Prevents runaway costs on stuck jobs
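A sketch of the per-job guardrail, assuming a call_model() wrapper around your provider SDK, a passes_qa callback from Layer 2, and a hypothetical MAX_OUTPUT_TOKENS budget.

```python
# per_job_caps.py -- sketch of the retry/short-circuit guardrail; call_model()
# and MAX_OUTPUT_TOKENS are placeholders for your actual client and budget.
MAX_OUTPUT_TOKENS = 1500   # hypothetical per-job cap
MAX_RETRIES = 2            # after this, the job goes to a human

def call_model(prompt: str, max_output_tokens: int) -> str:
    """Wrap your provider SDK here, passing the token cap on every call."""
    raise NotImplementedError

def run_job(prompt: str, passes_qa) -> dict:
    for attempt in range(MAX_RETRIES + 1):
        output = call_model(prompt, max_output_tokens=MAX_OUTPUT_TOKENS)
        if passes_qa(output):
            return {"status": "shipped", "output": output, "attempts": attempt + 1}
    # Two failed retries: stop burning tokens and route the job to a person.
    return {"status": "needs_human_review", "output": output, "attempts": MAX_RETRIES + 1}
```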
Per-Client Monthly Caps

- Track spend in a SQLite database
- Slack alerts at 80%, 90%, and 100% of the monthly budget
- Automatic queue pausing at 100%
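A sketch of the monthly-cap logic, assuming a client_spend table like the one sketched under Offline-First Architecture below; the Slack webhook call is stubbed and the table/column names are illustrative.

```python
# client_spend_caps.py -- sketch of monthly budget tracking; table/column names
# and the Slack webhook wiring are illustrative, not the episode's exact schema.
import sqlite3

ALERT_THRESHOLDS = (0.80, 0.90, 1.00)

def record_spend(db: sqlite3.Connection, client: str, usd: float, budget: float) -> str:
    db.execute(
        "INSERT INTO client_spend (client, month, usd) VALUES (?, strftime('%Y-%m','now'), ?)",
        (client, usd),
    )
    db.commit()
    total = db.execute(
        "SELECT SUM(usd) FROM client_spend WHERE client=? AND month=strftime('%Y-%m','now')",
        (client,),
    ).fetchone()[0]
    for threshold in ALERT_THRESHOLDS:
        # Fire the alert only on the call that crosses each threshold.
        if total >= budget * threshold > total - usd:
            send_slack_alert(client, total, budget, threshold)
    return "paused" if total >= budget else "active"  # pause the queue at 100%

def send_slack_alert(client: str, total: float, budget: float, threshold: float) -> None:
    """Post a Block Kit payload to your Slack webhook -- stubbed out."""
    print(f"[slack] {client} at {total:.0f}/{budget:.0f} USD ({threshold:.0%} threshold)")
```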
Offline-First Architecture

- Jobs table, QA events table, client spend tracking
- Works on a laptop with no internet
- Syncs when back online
- The Lisbon Test: can you run this from a café with sketchy wifi?
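A sketch of the local SQLite layout; table and column names are illustrative, matching the three stores listed above (jobs, QA events, client spend) rather than the starter pack's exact schema.

```python
# local_store.py -- sketch of the offline-first SQLite layout.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id      INTEGER PRIMARY KEY,
    client  TEXT NOT NULL,
    prompt  TEXT NOT NULL,
    output  TEXT,
    status  TEXT DEFAULT 'queued',   -- queued / shipped / needs_human_review
    synced  INTEGER DEFAULT 0        -- 0 until pushed when back online
);
CREATE TABLE IF NOT EXISTS qa_events (
    id         INTEGER PRIMARY KEY,
    job_id     INTEGER REFERENCES jobs(id),
    layer      TEXT NOT NULL,        -- golden_set / heuristics / judge / human
    verdict    TEXT NOT NULL,        -- pass / fail / escalate
    detail     TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS client_spend (
    id     INTEGER PRIMARY KEY,
    client TEXT NOT NULL,
    month  TEXT NOT NULL,            -- e.g. '2025-06'
    usd    REAL NOT NULL
);
"""

def open_store(path: str = "qa_wall.db") -> sqlite3.Connection:
    """Everything reads/writes locally first; a sync job pushes rows later."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db
```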
Addressing the Counterargument

The concern: using an LLM to check LLM output adds a second failure mode.
The response:

- The judge is triage, not truth
- Multiple layers provide redundancy
- Deterministic checks catch mechanical failures
- Human sampling catches edge cases
- Dual judges with disagreement escalation

Resources
QA Wall Starter Pack includes:
- 20-item golden set template (YAML)
- Judge rubric with scoring dimensions
- Heuristics file with OWASP regex patterns
- Slack alert payloads (Block Kit format)
- SQLite schema and Python guardrails code

Your One Thing This Week
Stand up Layer 1: Create a 20-item golden set and run it manually against your current prompts. See what breaks—you'll be surprised.
Episode 14 of The Stateless Founder