Build a QA Wall That Catches Bad AI Outputs Before Clients See Them
The Problem: When AI Content Goes Wrong
- 47 deliverables per week through AI workflows, across 7 contractors on 4 continents
- Only 5 out of 47 got human review before shipping to clients
- The breaking point: a fabricated citation published to 11,000 LinkedIn followers
- Root cause: no systematic checks between model output and client inbox

The Three-Layer QA Wall
Layer 1: Golden-Set Regression Tests
- 20 diverse test cases covering every content type you deliver
- Properties-based testing: brand tone, word count, must-include/avoid elements
- Automated CI integration: deploy fails if the golden set breaks
- Version control: run tests before every prompt change or model swap

Tooling worth a look: the OpenAI Evals framework, LangSmith datasets and evaluators, and Statsig's guidance on dataset diversity.
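Here's a minimal sketch of a golden-set runner in Python, assuming a golden_set.yaml where each case carries an id, a prompt, and properties like min_words/max_words and must_include/must_avoid lists (field names are illustrative, not from the episode); the generate() stub stands in for whatever model call you use. A non-zero exit code is what lets CI fail the deploy when the set breaks.

```python
# golden_set_runner.py -- minimal sketch; field names (prompt, must_include,
# must_avoid, min_words, max_words) are illustrative, not the episode's exact template.
import sys
import yaml  # pip install pyyaml

def generate(prompt: str) -> str:
    """Call your model here (OpenAI, Anthropic, local) -- stubbed out."""
    raise NotImplementedError

def check_case(case: dict, output: str) -> list[str]:
    """Properties-based checks: word count, must-include, must-avoid."""
    failures = []
    words = len(output.split())
    if not case.get("min_words", 0) <= words <= case.get("max_words", 10**6):
        failures.append(f"word count {words} outside bounds")
    for phrase in case.get("must_include", []):
        if phrase.lower() not in output.lower():
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in case.get("must_avoid", []):
        if phrase.lower() in output.lower():
            failures.append(f"contains banned phrase: {phrase!r}")
    return failures

def main(path: str = "golden_set.yaml") -> int:
    cases = yaml.safe_load(open(path))
    failed = 0
    for case in cases:
        problems = check_case(case, generate(case["prompt"]))
        if problems:
            failed += 1
            print(f"FAIL {case['id']}: {problems}")
    print(f"{len(cases) - failed}/{len(cases)} passed")
    return 1 if failed else 0  # non-zero exit fails the CI deploy

if __name__ == "__main__":
    sys.exit(main())
```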
Layer 2: Heuristics + LLM Judges

Deterministic checks (fast and cheap):
- Word count validation
- JSON schema validation
- URL domain allowlists
- PII detection with OWASP regex patterns
- AWS Comprehend for model-based PII detection (90% confidence threshold)

LLM judges (model-graded):

- Dual judges: fast model + primary model
- Explicit rubric: faithfulness, relevance, style/tone (1-5 scale)
- Pass threshold: average ≥ 4.0, faithfulness ≥ 4.0
- Rationale required: the judge must cite specific issues
- Auto-escalation: route to a human if the judges disagree by more than 0.5 points
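A sketch of how the Layer 2 gate could fit together in Python. The ALLOWED_DOMAINS set, the word-count bounds, and the judge() stub are all placeholders to wire to your own config and models; the thresholds (average ≥ 4.0, faithfulness ≥ 4.0, escalate on a >0.5 disagreement) are the rubric numbers above.

```python
# layer2_gate.py -- sketch of the pass/fail/escalate logic; the judge calls,
# ALLOWED_DOMAINS, and the 300-900 word bounds are illustrative placeholders.
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "client-site.com"}  # hypothetical allowlist
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")    # one simple PII pattern

def heuristic_failures(text: str, min_words: int, max_words: int) -> list[str]:
    """Deterministic checks: word count, URL allowlist, basic PII regex."""
    issues = []
    words = len(text.split())
    if not min_words <= words <= max_words:
        issues.append(f"word count {words} outside [{min_words}, {max_words}]")
    for url in re.findall(r"https?://\S+", text):
        if urlparse(url).netloc.removeprefix("www.") not in ALLOWED_DOMAINS:
            issues.append(f"URL off allowlist: {url}")
    if EMAIL_RE.search(text):
        issues.append("possible PII: email address found")
    return issues

def judge(model: str, text: str, source: str) -> dict:
    """Return {'faithfulness': x, 'relevance': y, 'style': z} on a 1-5 scale.
    Stub -- call your fast and primary models with the rubric prompt here."""
    raise NotImplementedError

def verdict(text: str, source: str) -> str:
    if heuristic_failures(text, 300, 900):   # cheap checks run first
        return "fail"
    fast, primary = judge("fast", text, source), judge("primary", text, source)
    avgs = [sum(s.values()) / len(s) for s in (fast, primary)]
    if abs(avgs[0] - avgs[1]) > 0.5:
        return "escalate"                    # judges disagree -> human review
    if min(fast["faithfulness"], primary["faithfulness"]) < 4.0:
        return "fail"
    return "pass" if sum(avgs) / 2 >= 4.0 else "fail"
```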
Layer 3: Strategic Human Sampling

Detection probability = 1 - (1 - error_rate)^sample_size

- 20% sampling for new clients (88% detection at a 10% error rate)
- 10% sampling for established clients
- Batch pausing: stop all client work if one sample fails
- Random rotation of reviewers, with Slack alerts
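The formula is easy to sanity-check. The sample size of 20 below is an assumption chosen to reproduce the quoted 88% figure at a 10% error rate.

```python
# detection_probability.py -- the Layer 3 formula; the sample_size of 20 is an
# assumption that reproduces the ~88% figure quoted for new clients.
def detection_probability(error_rate: float, sample_size: int) -> float:
    """Chance that at least one bad deliverable lands in the reviewed sample."""
    return 1 - (1 - error_rate) ** sample_size

print(detection_probability(0.10, 20))  # ~0.878 -> "88% detection at 10% error rate"
print(detection_probability(0.10, 10))  # ~0.651 with half the sampling effort
```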
Spend Guardrails

Per-Job Caps
- max_output_tokens set on every API call
- Short-circuit to human review after 2 retries
- Prevents runaway costs on stuck jobs
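A sketch of the per-job guardrail, assuming a call_model() wrapper around your provider SDK, a passes_qa callback from Layer 2, and a hypothetical MAX_OUTPUT_TOKENS budget.

```python
# per_job_caps.py -- sketch of the retry/short-circuit guardrail; call_model()
# and MAX_OUTPUT_TOKENS are placeholders for your actual client and budget.
MAX_OUTPUT_TOKENS = 1500   # hypothetical per-job cap
MAX_RETRIES = 2            # after this, the job goes to a human

def call_model(prompt: str, max_output_tokens: int) -> str:
    """Wrap your provider SDK here, passing the token cap on every call."""
    raise NotImplementedError

def run_job(prompt: str, passes_qa) -> dict:
    for attempt in range(MAX_RETRIES + 1):
        output = call_model(prompt, max_output_tokens=MAX_OUTPUT_TOKENS)
        if passes_qa(output):
            return {"status": "shipped", "output": output, "attempts": attempt + 1}
    # Two failed retries: stop burning tokens and route the job to a person.
    return {"status": "needs_human_review", "output": output, "attempts": MAX_RETRIES + 1}
```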
Per-Client Monthly Caps

- Track spend in a SQLite database
- Slack alerts at 80%, 90%, and 100% of the monthly budget
- Automatic queue pausing at 100%
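A sketch of the monthly-cap logic, assuming a client_spend table like the one sketched under Offline-First Architecture below; the Slack webhook call is stubbed and the table/column names are illustrative.

```python
# client_spend_caps.py -- sketch of monthly budget tracking; table/column names
# and the Slack webhook wiring are illustrative, not the episode's exact schema.
import sqlite3

ALERT_THRESHOLDS = (0.80, 0.90, 1.00)

def record_spend(db: sqlite3.Connection, client: str, usd: float, budget: float) -> str:
    db.execute(
        "INSERT INTO client_spend (client, month, usd) VALUES (?, strftime('%Y-%m','now'), ?)",
        (client, usd),
    )
    db.commit()
    total = db.execute(
        "SELECT SUM(usd) FROM client_spend WHERE client=? AND month=strftime('%Y-%m','now')",
        (client,),
    ).fetchone()[0]
    for threshold in ALERT_THRESHOLDS:
        # Fire the alert only on the call that crosses each threshold.
        if total >= budget * threshold > total - usd:
            send_slack_alert(client, total, budget, threshold)
    return "paused" if total >= budget else "active"  # pause the queue at 100%

def send_slack_alert(client: str, total: float, budget: float, threshold: float) -> None:
    """Post a Block Kit payload to your Slack webhook -- stubbed out."""
    print(f"[slack] {client} at {total:.0f}/{budget:.0f} USD ({threshold:.0%} threshold)")
```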
Offline-First Architecture

- Jobs table, QA events table, client spend tracking
- Works on a laptop with no internet
- Syncs when back online
- The Lisbon Test: can you run this from a café with sketchy wifi?
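A sketch of the local SQLite layout; table and column names are illustrative, matching the three stores listed above (jobs, QA events, client spend) rather than the starter pack's exact schema.

```python
# local_store.py -- sketch of the offline-first SQLite layout.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id      INTEGER PRIMARY KEY,
    client  TEXT NOT NULL,
    prompt  TEXT NOT NULL,
    output  TEXT,
    status  TEXT DEFAULT 'queued',   -- queued / shipped / needs_human_review
    synced  INTEGER DEFAULT 0        -- 0 until pushed when back online
);
CREATE TABLE IF NOT EXISTS qa_events (
    id         INTEGER PRIMARY KEY,
    job_id     INTEGER REFERENCES jobs(id),
    layer      TEXT NOT NULL,        -- golden_set / heuristics / judge / human
    verdict    TEXT NOT NULL,        -- pass / fail / escalate
    detail     TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS client_spend (
    id     INTEGER PRIMARY KEY,
    client TEXT NOT NULL,
    month  TEXT NOT NULL,            -- e.g. '2025-06'
    usd    REAL NOT NULL
);
"""

def open_store(path: str = "qa_wall.db") -> sqlite3.Connection:
    """Everything reads/writes locally first; a sync job pushes rows later."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db
```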
Addressing the Counterargument

The concern: using an LLM to check LLM output adds a second failure mode.
The response:

- The judge is triage, not truth
- Multiple layers provide redundancy
- Deterministic checks catch mechanical failures
- Human sampling catches edge cases
- Dual judges with disagreement escalation

Resources
QA Wall Starter Pack includes:
- 20-item golden set template (YAML)
- Judge rubric with scoring dimensions
- Heuristics file with OWASP regex patterns
- Slack alert payloads (Block Kit format)
- SQLite schema and Python guardrails code

Your One Thing This Week
Stand up Layer 1: Create a 20-item golden set and run it manually against your current prompts. See what breaks—you'll be surprised.
Episode 14 of The Stateless Founder