The Stateless Founder

By Santi, Kira

The Stateless Founder teaches digital nomads how to build location-independent businesses powered by AI and automation. Each week, Santi and Kira break down real business models, workflows, costs, and... more

FAQs about The Stateless Founder:

How many episodes does The Stateless Founder have?

The podcast currently has 25 episodes available.

The Stateless Founder episodes:

May 27, 2026
The 14-Day Partner Sprint: Feed-Drops, Mini-Templates, and the 15-Minute SLA
The Question That Started It All

Someone in Kira's Slack community asked: "I've done three collabs this year. A podcast swap, a newsletter mention, a joint webinar. Each one spiked traffic for like two days and then nothing. How do I make partnerships actually compound instead of just being one-off favors?"

The answer: Stop treating partnerships like networking events. Start treating them like a systematic distribution channel.
The Three Missing Pieces

Most partnership marketing fails because it's missing:
A shared asset that lives beyond the collab - not a moment, but something that keeps working

Tracking that tells you which partner actually moved the needle - so you can prove ROI and repeat what works

A response system - when someone shows up from a partner's audience, you answer in 15 minutes, not 15 hours

The 14-Day Partner Sprint System
Partner Selection: The Adjacency Test

Use these five criteria to filter potential partners:
Does their audience overlap with yours (same job title, same problem)?

Do they cover topics within your top three themes?

Can you ship the collab async?

Is their engagement real (actual clicks and listens, not vanity followers)?

Is there a clear contact you can reach?

Pass rate needed: 4 out of 5. If they only pass 3, the fit is too loose.
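The adjacency test above can be sketched as a simple checklist score. This is a minimal illustration, not the hosts' tooling: the `Prospect` fields and the sample prospects are hypothetical names chosen to mirror the five criteria.

```python
from dataclasses import dataclass

@dataclass
class Prospect:
    name: str
    audience_overlap: bool    # same job title, same problem
    topic_fit: bool           # covers one of your top three themes
    async_friendly: bool      # collab can ship without live calls
    real_engagement: bool     # actual clicks and listens
    reachable_contact: bool   # a clear person to pitch

def adjacency_score(p: Prospect) -> int:
    """Count how many of the five criteria the prospect passes."""
    return sum([p.audience_overlap, p.topic_fit, p.async_friendly,
                p.real_engagement, p.reachable_contact])

def passes_adjacency_test(p: Prospect, required: int = 4) -> bool:
    """4 of 5 is the pass mark; 3 of 5 is too loose a fit."""
    return adjacency_score(p) >= required

# Hypothetical shortlist entries for illustration only.
shortlist = [
    Prospect("Podcast A", True, True, True, True, False),
    Prospect("Newsletter B", True, False, True, False, True),
]
qualified = [p.name for p in shortlist if passes_adjacency_test(p)]
print(qualified)  # ['Podcast A']
```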

Partner types to target:
Podcasters

Community admins

Tool companies

Agencies

Educators (newsletter writers, course creators)

Target: 4 prospects in each category = 20 total on your shortlist
Expected yes rate: 20-30% (plan for 70-80% rejection)
The Assets That Actually Compound

Feed-drops: A full episode from your podcast publishes directly in another podcast's RSS feed. Key requirements:
Host-voiced intro (20-30 seconds)

Talent reads outperform generic announcer reads by 3 points on purchase intent

Realistic conversion: ~0.67% device conversion (Chartable SmartPromos data)

Mini-templates: One-page, co-branded assets that solve a specific problem for the partner's audience
Takes ~3 hours to produce

Gate with email for 7 days, then open up

Personalized assets drive 4x more demo requests than generic content (ON24 benchmarks)

The Measurement Layer

Wire three tracking systems from day one:
UTMs on every link
Source = partner name

Medium = channel type

Campaign = sprint month

Track in GA4: template view, template claim, demo intent

SmartPromos through Chartable
For podcast-to-podcast attribution

Tracks device conversion: did someone who heard the promo subsequently download your show?

Self-reported attribution
"How did you first hear about us?" dropdown on template gates and demo forms

Partner names in the options

Cross-reference against UTM data - when they disagree, trust the human
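The UTM convention above (source = partner name, medium = channel type, campaign = sprint month) could be applied with a small helper like this. It is a sketch using Python's standard library; the URL, partner name, and campaign label are made-up placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_utms(url: str, partner: str, channel: str, sprint_month: str) -> str:
    """Append partner-sprint UTM parameters, keeping any existing query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update({
        "utm_source": partner,         # partner name
        "utm_medium": channel,         # channel type
        "utm_campaign": sprint_month,  # sprint month
    })
    return urlunsplit(parts._replace(query=urlencode(query)))

# Placeholder values, not real partners or pages.
link = add_utms("https://example.com/template", "acme-podcast",
                "feed-drop", "2026-05-sprint")
print(link)
```

Generating every partner link through one function keeps the naming consistent, which makes the later cross-check against self-reported attribution easier.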

The 15-Minute SLA

The setup:
Slack channel that receives any form submission with a partner UTM or the word "referred"

Make or Zapier automation (10 minutes to build)

Coverage blocks that overlap with your biggest partner's audience

The target: 15 minutes to first reply (not to close)
The message: "Hey, thanks for coming via [partner]. Here's a 15-minute fit check - pick a time."

Why it matters: A Harvard Business Review study found that responding within an hour makes you nearly 7x more likely to qualify a lead. Most nomads respond the next morning because they were asleep in a different time zone.
The Sprint Timeline
Day 1: Build the list and wire the tracking

Day 3: Send 20 outreach messages

Days 4-6: Negotiate and produce assets

Days 8-12: Feed-drops and templates go live

Day 13: Pull numbers and send partners a 5-line recap with their stats

Day 14: Debrief, duplicate the board, load 5 new prospects for next sprint

The Compounding Flywheel

After the first sprint:
You have a proven partner and co-created asset

The partner knows you deliver

The asset has a landing page and tracking

Next sprint: skip prospecting for that partner, go straight to "what do we ship next?"

Add 2 new partners to the rotation

Sprint progression:
Sprint 1: 2 partners

Sprint 2: 4 partners

Sprint 3: 6 partners

Each tracked asset keeps collecting emails between sprints.
Why This Beats Cold Outreach for Nomads
Paid ads: Require budget and constant optimization

SEO: Takes months for results

Partnership marketing: Done this way, gives you signal in 14 days

Location independence: Every asset ships async, no Zoom calls required

Resources

Get the complete 14-Day Partner Sprint Kit with outreach scripts, negotiation checklist, Notion calendar, UTM spreadsheet, and SLA routing setup at statelessfounder.com/resources

Your one move this week: Build the 20-name shortlist. Run the adjacency test. If 4 pass, you're ready to sprint.
14min
May 25, 2026
Build a Three-Layer QA Wall for AI Outputs in 48 Hours

Every AI deliverable you ship without quality checks is a bet against model drift, prompt degradation, and silent failures. This episode builds a three-layer QA wall that catches problems before clients do.
The Cost of Not Checking
Human evaluation: $50 per case, 10 minutes

LLM judge evaluation: $0.02 per case, 16 seconds

At 1,000 cases/week: $50,000 vs $20 in evaluation costs

Layer 1: Rubric-Scored LLM Judge

Deploy an LLM judge against a weighted rubric before every deliverable ships:
Five Weighted Criteria Plus Safety Flags
Task fulfillment (30%): Did it follow instructions?

Factual accuracy (25%): Are claims verifiable?

Clarity and structure (15%): Is it well-organized?

Style and brand fit (10%): Matches client voice?

Citations (10%): Proper attribution?

Safety flags (negative weight): PII leakage, hallucinations

Scoring Thresholds
Green (ships automatically): 0.8+ total, no critical flags, top two criteria 4+

Amber (human edit queue): 0.7-0.8 total, or any criterion ≤2

Red (blocked/escalated): <0.7 total or any critical flag
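The rubric and thresholds above can be sketched as a triage function. Several details are assumptions, since the notes do not spell them out: each criterion is scored 1-5 by the judge, the total is a weighted average rescaled to 0-1 and normalized by the sum of the listed weights (which add up to 90%), and safety flags are treated as blocking critical flags rather than as a negative weight.

```python
WEIGHTS = {
    "task_fulfillment": 0.30,
    "factual_accuracy": 0.25,
    "clarity_structure": 0.15,
    "style_brand_fit": 0.10,
    "citations": 0.10,
}
TOP_TWO = ("task_fulfillment", "factual_accuracy")

def total_score(scores: dict) -> float:
    """Weighted average of 1-5 scores, rescaled to the 0-1 range."""
    weighted = sum(w * scores[name] / 5 for name, w in WEIGHTS.items())
    return weighted / sum(WEIGHTS.values())

def triage(scores: dict, critical_flags=()) -> str:
    """Map rubric scores to green (ship), amber (human edit), or red (block)."""
    total = total_score(scores)
    if critical_flags or total < 0.7:
        return "red"
    top_two_ok = all(scores[name] >= 4 for name in TOP_TWO)
    if total >= 0.8 and top_two_ok and min(scores.values()) > 2:
        return "green"
    return "amber"  # 0.7-0.8 total, a criterion at 2 or below, or weak top-two

strong = {"task_fulfillment": 5, "factual_accuracy": 4, "clarity_structure": 4,
          "style_brand_fit": 4, "citations": 4}
middling = {"task_fulfillment": 4, "factual_accuracy": 4, "clarity_structure": 3,
            "style_brand_fit": 3, "citations": 3}
print(triage(strong), triage(middling), triage(strong, ["pii_leak"]))
```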

Research Backing
ICLR 2026 AutoMetrics: +33.4% correlation with humans vs direct LLM-as-judge

AAAI 2026 Think-J: Rubric-anchored judges more robust to noisy training data

Layer 2: Golden-Set Replay and Drift Detection

Build a golden set of 40-60 items per output type, scored by humans with agreed-upon labels and rationales.
Weekly Calibration Process
Replay golden set through your judge

Measure agreement using Cohen's kappa or Kendall's tau

Kappa >0.61 = substantial agreement

Track week-over-week trends

When agreement drops → pause auto-shipping and investigate
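For the weekly replay, Cohen's kappa can be computed without extra dependencies. The sketch below assumes the human labels and judge labels are the same categorical verdicts (for example green/amber/red) in the same item order; the sample labels are invented.

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented golden-set labels: the judge disagrees on one of ten items.
human = ["green"] * 6 + ["amber"] * 3 + ["red"]
judge = ["green"] * 5 + ["amber"] * 4 + ["red"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3), "substantial" if kappa > 0.61 else "investigate")
```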

Drift Detection
PLOS One 2026 study: Weekly Bradley-Terry recalibration achieved τ=0.59-0.68 vs humans

Detected three drift patterns: stable, improving, degrading

Without weekly replay, you're "shipping and hoping"

Guardrails Against Brittleness
Randomize position: Run both A-B and B-A orders (Chatbot Arena method)

Separate concerns: Rubric is workhorse, pairwise is tiebreaker

Never self-judge: Don't let GPT-4o judge GPT-4o outputs

Layer 3: Human Sampling with Red/Amber/Green Thresholds

Strategic 5-10% human sampling focused on risk and borderlines:
Sample Composition
50%: Amber decisions (borderlines judge wasn't sure about)

30%: High-risk greens (long outputs, safety-sensitive, new client styles)

20%: Random greens (keep judge honest)

Dashboard Thresholds
Green: Judge precision ≥95%, human disagreement <10%, no critical flags

Amber: One metric slipped → raise cutline by 0.02, bump sampling to 15%

Red: Critical safety event, 2+ major misses in 50-item sample, or kappa <0.5

Client Value Proposition

"Every output gets scored by a calibrated judge against a six-criterion rubric. Top performers ship automatically. Borderlines get human edit. Weekly 5-10% human sample with dashboard that updates every Monday."
The Monday Dashboard

Five widgets for 30-minute weekly review:
Volume and mix: Items processed, percentage green/amber/red

Judge health: Agreement vs golden set with 4-week trend

Human QA metrics: Precision, disagreement rate, sample size

Risk flags: By type and resolution speed

Cost per eval: Track efficiency gains

Cost Analysis: Visa Run Revenue Math
Judge costs: $20/week for 1,000 items

Human sample: 50-100 items at $15-20/hour

Total QA cost: ~$350/week

vs Full human review: $50,000/week

ROI: If the ~$350/week QA spend prevents even one client churn per quarter, it pays for itself

Implementation Checklist
This Week
Build golden set: 40 items from real output (good, borderline, bad)

Score manually: Create foundation for everything else

Schedule Monday review: 30 minutes on calendar

Next Week
Deploy rubric-scored judge on new outputs

Set up weekly golden-set replay

Implement human sampling workflow

Resources

The QA Wall Kit includes:
Rubric template with acceptance thresholds

Judge prompt pack (rubric + pairwise modes)

Human sampling SOP with R/A/G dashboard

Monday review checklist

Research Sources
ICLR 2026 AutoMetrics: Rubric-style evaluators improve correlation by 33.4%

PLOS One 2026: Bias-calibrated LLM judges with weekly recalibration

AAAI 2026 Think-J: Generative judges outperform classifier-style approaches

UW Health Clinical Study: Cost/latency comparison of human vs LLM evaluation

TREC AutoJudge 2026: Live benchmark studying judge vulnerabilities and guardrails

Next episode: Judge fine-tuning vs off-the-shelf models for domain-specific QA
12min
May 25, 2026
Build a B2B Affiliate Program in 14 Days

Most founders think the next hire they need is a salesperson. They're wrong. The next hire isn't a person at all — it's five partners who already have your buyer's attention and will send them your way for a cut of the revenue.
In This Episode

Santi and Kira walk you through building a complete B2B affiliate program from scratch in just 14 days. You'll get the one-pager template, commission structures, UTM tracking setup, outreach email sequences, cross-border payout procedures, and compliance guidelines.
Key Topics Covered
Referral vs Affiliate Partners: Why the distinction matters for your terms and enablement

Partner Tiers: Creator, Solutions, and Community tiers with different commission structures

The One-Pager: Six essential elements every partner needs to see

Commission Math: Recurring vs lifetime models with real examples from Webflow and Fathom

UTM Tracking: Simple Google Sheets setup for attribution without expensive tools

Compliance Basics: FTC 2023 updates, ASA requirements, and disclosure copy that works

Cross-Border Payouts: W-9/W-8 collection and PayPal/Wise batch payment setup

The 14-Day Sprint: Exact timeline from partner list to first demos

Key Takeaways
Start Small and Selective: Five hand-picked partners beat hundreds of random recruits — GoToMeeting got 725% more paid accounts by cutting partners, not adding them

Structure Recurring Commissions: Pay 30% for 12 months or 25% lifetime so you only pay on revenue you've collected, eliminating upfront risk

Bake in Compliance: Include disclosure copy directly in partner assets to meet 2023 FTC requirements that hold advertisers responsible for affiliate compliance

Real Examples
Webflow: 50% commission on first year subscription revenue through 500+ partners on PartnerStack

Fathom Analytics: 25% lifetime recurring commissions with simple PayPal payouts

GoToMeeting: 725% increase in paid accounts through focused partner recruitment and enablement

The 14-Day Sprint Timeline

Days 1-3: Build prospect list (15 potential partners → 5), draft one-pager, pick commission model, create UTM sheet

Days 4-6: Outreach sequence (4 emails over 12 days), track replies, send preview materials

Day 7: Asset drop with unique URLs, disclosure copy, and creative kit

Days 8-14: Activation, placement confirmation, first demo tracking, and payout queue setup
Resources
Referral Partner Kit: Complete template bundle with one-pager, terms, UTM tracker, outreach emails, payout SOP, and dashboard

FTC Endorsement Guides (2023): Updated disclosure requirements

IRS Publication 515: Cross-border withholding rules for affiliate payments

Compliance Note

We are not tax or legal advisors. This is operational guidance. Confirm everything with your accountant and legal counsel, especially for cross-border payments and disclosure requirements.

Ready to build your partner program? Download the complete Referral Partner Kit and start your 14-day sprint.
15min
May 25, 2026
YouTube SEO for B2B: Build a Search-Led Video Engine That Books Demos
The Roma Norte Demo Story

Kira's sitting in a Mexico City café when her phone buzzes - demo booked. The source? A 6-minute screen share video with 240 views titled "Make.com client onboarding automation, email plus Slack, free template." Not creative, but it answered the exact query someone typed when they had a broken onboarding flow.
Why Search Beats Recommended Feed for B2B

YouTube's Search & Discovery team optimizes for viewer satisfaction and intent matching, not just clicks. When someone searches "Webflow to HubSpot auto-create MQL with UTM capture," they have a job to do today. They're not browsing - they're buying.

The timing advantage: Google's 2025 ranking adjustments surface more video content across search results and AI summaries. Your YouTube videos now compound across surfaces you didn't even publish to.
The Template CTA Pattern

Three B2B companies have perfected the conversion mechanism:
Make.com
Template library with "Get this template" buttons

One click clones entire automation scenarios

YouTube descriptions link directly to template pages

Template click = conversion event + account activation

Webflow University
"Clone in Webflow" duplicates entire projects

Paired with tutorial streams

Stream teaches, cloneable converts

Airtable
"Use template" → "Add base" flow

Tutorial to template pipeline

Working base in your workspace instantly

The key insight: Template CTAs provide zero-friction activation. The viewer gets value immediately, whereas "book a demo" requires timezone math and adds scheduling friction.
Building Your System: The 4-Tier Intent Map

Tier A - "Do the job now" (highest intent)
"Airtable CRM score inbound leads and route to AE in ten minutes"

Person has pipeline problem today

Tier B - Integration unblocking
Tools that unblock adoption of your solution

Tier C - Evaluation
"Make versus Zapier for multi-step client onboarding"

Tier D - Post-purchase fixes
Support and troubleshooting content

30-Minute Topic Map Process
List your 3 core jobs-to-be-done

Pick 1-2 tools your buyers already use per job

Generate 1 Tier A + 1 Tier B query per combination

Add 2 wildcards from C or D

Assign each to a week = 12-week map

Prioritization Criteria (not search volume)
Does a working template exist you can link to?

Can you screen-share the build in under 10 minutes?

Is it a known adoption pain point?

If all three = yes, that's week one.
The Weekly Cadence (5 Hours Total)

Monday-Tuesday: Production (2.5 hours)
Pick buyer query from map

Confirm template link works

Record single-take screen share

Cut dead air, burn in captions

Wednesday: Publish
Description template: benefit first line, template link second line

5-8 chapters with timestamps

Pin comment with template link + common gotchas

End screen to specific next video

Thursday: Repurpose (30 minutes)
Cut 2 Shorts (awareness only - links not clickable)

Write 1 LinkedIn post with video + template links

Use LinkedIn-specific UTMs

Friday: Measurement (20 minutes)
Update tracker with UTM data

Compute demos per 1,000 views

Decide one thing to keep, one to change

Target Metrics
CTR: 4%+ (YouTube's documented range is 2-10%)

Retention: 35% average view duration (internal target for 6-10 minute tutorials)

Conversion: Demos per 1,000 views (the one number that matters)

The Discovery Objection

Objection: "You're leaving reach on the table by only targeting search."

Response: Layer discovery on after building your search foundation. Use Shorts and discovery content to widen top of funnel, but long-form search videos carry the clickable template links and UTMs. Build the net before you drive the fish.
Measurement That Matters

Every template link gets UTM-tagged:
Source: YouTube

Medium: video

Campaign: date + query slug

Content: link placement (description, pinned comment, end screen)

GA4 captures these UTM parameters automatically. Mark template installs and demos as conversion events. Now you can see: this video drove 4 installs and 1 demo, that video drove 12 installs and 0 demos.

The insight: A video with 80 views and 2 demos outperforms a video with 800 views and 0 demos.
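The "demos per 1,000 views" number is simple arithmetic. Here is a small sketch using the two videos from the insight above; the function name is illustrative.

```python
def demos_per_1000_views(demos: int, views: int) -> float:
    """Normalize booked demos by audience size so small videos compare fairly."""
    if views <= 0:
        raise ValueError("views must be positive")
    return 1000 * demos / views

small_video = demos_per_1000_views(demos=2, views=80)
big_video = demos_per_1000_views(demos=0, views=800)
print(small_video, big_video)  # 25.0 0.0
```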
Your Next Action

Pick your first buyer query. Not the most creative one - the most boring, specific, "someone is typing this into YouTube right now because they have this problem today" query you can find. Record 6 minutes. Link the template. Publish.
Resources

Get the complete /t/youtube-seo-engine kit on the Resources page:
Topic map with 4 intent tiers

Script generator prompts

Description templates with chaptering

Repurposing SOP to Shorts and LinkedIn

UTM tracker wired to GA4 conventions

The exact system we just walked through. Duplicate it and start your 12 weeks.

The Stateless Founder teaches digital nomads how to build location-independent businesses powered by AI and automation. New episodes Monday, Wednesday, Friday at 7 AM PT.
16min
May 18, 2026
Build a Minimal LLM Evaluation Loop That Catches Regressions While You Sleep
The Problem: Silent AI Failures

When your website goes down, you get an alert. When Stripe breaks, payments fail immediately. But when your LLM starts producing worse outputs—slightly less accurate summaries, off-tone emails, JSON fields that are almost right—nobody tells you. The model doesn't throw an error. It just gets worse.

For nomad founders managing AI workflows across time zones, this silent failure mode is especially dangerous. You're asleep, on a 12-hour bus in Peru, or doing a visa run in Bangkok while your content repurposing tool ships summaries that drop key facts.
The Solution: A Three-Piece Evaluation System
1. Golden Test Sets (15-20 Cases Per Output Type)
Real production data only: Synthetic test cases test synthetic problems

JSONL format: One line per case, input paired with known-good output

Tagged for slicing: Formal tone, has PII, Spanish language, etc.

Three common types: Email rewrites, JSON extraction, content summaries
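A golden set in JSONL might look like the sketch below: one JSON object per line, with tags for slicing. The field names (`id`, `type`, `input`, `expected`, `tags`) are assumptions, and the two records are placeholders to show the shape; a real set would come from production traces.

```python
import json

# Placeholder records; real cases should be pulled from production data.
SAMPLE_JSONL = """\
{"id": "email-001", "type": "email_rewrite", "input": "pls send invoice asap", "expected": "Could you please send the invoice when you can?", "tags": ["formal_tone"]}
{"id": "sum-001", "type": "summary", "input": "Long article text", "expected": "Short summary", "tags": ["spanish", "has_pii"]}
"""

def load_cases(jsonl_text: str) -> list:
    """Parse one test case per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def slice_by_tag(cases: list, tag: str) -> list:
    """Return only the cases carrying a given tag, e.g. 'has_pii'."""
    return [case for case in cases if tag in case.get("tags", [])]

cases = load_cases(SAMPLE_JSONL)
pii_cases = slice_by_tag(cases, "has_pii")
print(len(cases), [case["id"] for case in pii_cases])
```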

2. AI Judge Prompts (G-Eval Pattern)
Rubric-guided scoring: Analysis first, then scores per dimension

Cross-family judges: Generate with OpenAI, judge with Anthropic (or vice versa)

Blind randomized order: Prevents position bias

Four dimensions for email rewrites: Instruction-following, tone fit, clarity, PII leak check

3. Pairwise A/B Testing
Compare prompt A vs prompt B: Not just absolute scoring

Randomized presentation: Judge sees outputs in random order

Tie-breaking: Borderline cases escalate to human review

Reliability Mitigations
Judge Bias Problems
Self-preference bias: Judges favor their own model family's outputs

Position bias: Prefer whatever they see first or whatever is longer

Verbosity bias: Longer outputs score higher regardless of quality

Solutions
Cross-family separation: Never use same provider for generation and judging

Human sampling: 10-20% of live production jobs reviewed weekly

Focus sampling: Pull cases where judge was least confident

95% agreement target: If judge-human disagreement exceeds 5% for two weeks, recalibrate

The Monday Scorecard (30 Minutes Weekly)
Six Key Numbers
Pass rate per output type: Email rewrites (90% threshold), summarization (88%)

Win rate from pairwise A/Bs: New prompt vs baseline

P95 latency: 95th percentile response time

Cost per 100 jobs: Token usage × per-token price

Judge agreement: Percentage alignment with human sample

Incidents: Anything that broke during the week

Decision Framework
Roll forward: Pass rates stable, costs in line

Hold and investigate: Something dipped

Roll back: Model deprecation broke judge or generator

Implementation Tools
CI Regression Gate
Promptfoo: Open source CLI with YAML config

GitHub Actions: Automated eval runs on every PR

Pass-rate thresholds: Build fails if quality regresses

Non-zero exit code: Blocks deployment automatically
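The gate logic itself is tool-agnostic. The sketch below is not Promptfoo's API; it only illustrates the idea of turning per-type pass rates into an exit code, using the 90% and 88% thresholds from the scorecard. The result tuples are invented.

```python
THRESHOLDS = {"email_rewrite": 0.90, "summary": 0.88}

def pass_rates(results: list) -> dict:
    """results is a list of (output_type, passed) pairs from an eval run."""
    totals, passes = {}, {}
    for output_type, passed in results:
        totals[output_type] = totals.get(output_type, 0) + 1
        passes[output_type] = passes.get(output_type, 0) + int(passed)
    return {t: passes[t] / totals[t] for t in totals}

def gate(results: list, thresholds: dict = THRESHOLDS) -> int:
    """Return 0 if every output type meets its threshold, else 1."""
    failures = {t: rate for t, rate in pass_rates(results).items()
                if rate < thresholds.get(t, 1.0)}  # unknown types must be perfect
    for output_type, rate in failures.items():
        print(f"FAIL {output_type}: {rate:.0%} is below threshold")
    return 1 if failures else 0

# Invented run: email rewrites at 95%, summaries at 85%.
results = ([("email_rewrite", True)] * 19 + [("email_rewrite", False)]
           + [("summary", True)] * 17 + [("summary", False)] * 3)
exit_code = gate(results)
# In a CI job, finish with: raise SystemExit(exit_code)
```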

Cost Tracking
OpenAI/Anthropic APIs: Return token usage on every call

Real example: 4¢ per generation + 1.2¢ per judge call = $5.20 per 100 jobs

Alert thresholds: Catch cost spikes before monthly review
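The cost figure above can be reproduced from per-call numbers. In this sketch the per-token prices and the 25% spike tolerance are hypothetical placeholders; only the 4¢ and 1.2¢ per-call figures come from the example.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one API call from the token usage returned with the response."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

def cost_per_100_jobs(generation_cost: float, judge_cost: float) -> float:
    """Each job is one generation call plus one judge call."""
    return round(100 * (generation_cost + judge_cost), 2)

def cost_spike(current: float, baseline: float, tolerance: float = 0.25) -> bool:
    """Flag a run that costs noticeably more than the baseline (tolerance is assumed)."""
    return current > baseline * (1 + tolerance)

baseline = cost_per_100_jobs(generation_cost=0.04, judge_cost=0.012)
print(baseline)                    # 5.2
print(cost_spike(7.00, baseline))  # True
```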

Model Deprecation Monitoring
Pin model versions: Keep last two working versions in environment variables

Watch deprecation pages: OpenAI and Anthropic maintain lifecycle schedules

One-line rollback: Pinned configs enable instant reversion

Weekly Rhythm
Friday: Add 3-5 fresh cases from production traces

Sunday: Open PR with prompt/model changes, let CI run

Monday: Fill scorecard, make decision, assign one action item

Daily: Alerts on latency/cost thresholds catch spikes

Monthly Maintenance
Refresh golden sets: Replace stale cases with fresh production examples

Close stale failures: Archive resolved issues

Recalibrate judge: If agreement drops below 95% target

Start Small: The One-Output-Type Version

Don't try to build all three output types at once. Pick your highest-volume type, build 15 golden cases, wire up one judge prompt, run for two weeks. You'll catch things you didn't know were breaking.

The full three-type system is the mature version. One type is the version that fits in an afternoon and still saves you from Monday morning client complaints.
Resources
Starter Kit: JSONL templates, G-Eval judge prompts, Promptfoo CI config

Monday Scorecard: Notion template with all six metrics

Deprecations Checklist: Model lifecycle monitoring guide

Human Sampling Guide: 10-20% review protocols

The vibes-based evaluation method works until it doesn't. When it doesn't, you find out from your customers. This system ensures you know before they do.
15min
May 15, 2026
Build Self-Serve Revenue While You Sleep: Weekend Setup Guide
The Self-Serve Revenue Problem

43% of SaaS companies now run hybrid pricing models (base fee + usage), but most nomad founders are still losing revenue to:
Timezone gaps when buyers want to purchase

Failed credit card payments with no recovery system

Manual onboarding calls that don't scale across time zones

Three Self-Serve Patterns You Can Ship This Weekend
Pattern 1: Template + Add-Ons

Stack:
Stripe Checkout or Payment Links for one-time purchases

Stripe Billing for recurring add-ons (monthly updates, premium templates)

Optional: Tally forms for gated delivery

Costs:
Stripe: 2.9% + $0.30 per card charge

Billing: Additional 0.7% per paid invoice

Activation Event: Template duplicated AND first checklist item completed within 24 hours
Pattern 2: Micro-SaaS with Hybrid Pricing

Stack:
Stripe Billing with subscriptions

Usage meters for hybrid pricing

Customer portal for self-service management

Usage caps to prevent runaway costs

Example Pricing: $29/month base + $0.15 per AI job after 100 jobs

Activation Event: First successful job completed within 24-48 hours
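The example pricing above works out as follows. This is plain arithmetic for checking invoices, not Stripe code; the function and parameter names are illustrative.

```python
def monthly_invoice(jobs_used: int, base_fee: float = 29.00,
                    included_jobs: int = 100, overage_price: float = 0.15) -> float:
    """Base fee plus a per-job charge for usage beyond the included allowance."""
    billable_jobs = max(0, jobs_used - included_jobs)
    return round(base_fee + billable_jobs * overage_price, 2)

print(monthly_invoice(80))   # 29.0, still inside the 100 included jobs
print(monthly_invoice(250))  # 51.5, 150 overage jobs at $0.15
```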
Pattern 3: Productized Services

Stack:
Stripe Payment Links for deposits

Calendly Free (1 event type, unlimited bookings)

Tally forms for intake

Activation Event: Self-scheduled kickoff AND deposit paid within 24 hours
Dunning & Recovery Automation
Stripe Configuration
Go to Billing → Subscriptions and Emails → Manage Failed Payments

Enable Smart Retries (ML-driven retry timing)

Turn on all customer emails: failed payment, trial ending, upcoming invoice, expiring card

Add custom 7-14 day save sequence: Day 0, Day 3, Day 7

Include one-click card update links

Paddle Configuration
Built-in Retain system: 4 emails over 10-12 days

30-day total retry window

Native SMS and in-app prompt support

Multi-channel recovery without custom development

Recovery Results: Founders report recovering $2,400+/month and reducing involuntary churn from 1.0% to 0.3% monthly.
Reply Router for 15-Minute Response Times

System Design:
Classify incoming replies by intent (buying, expansion, billing, support)

Use lightweight LLM classifier

Check sender's local timezone and business hours

Page on-call person via Slack/SMS for high-intent messages

Auto-acknowledge outside business hours with response time commitment
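The routing decision can be sketched as below. Assumptions to note: a keyword matcher stands in for the lightweight LLM classifier, the sender's timezone is passed as a UTC offset, business hours are taken as 09:00-18:00 on weekdays in the sender's local time, and the action names are made up.

```python
from datetime import datetime, timedelta, timezone

HIGH_INTENT = {"buying", "expansion"}

def classify_intent(message: str) -> str:
    """Keyword stand-in for the LLM classifier (illustrative only)."""
    text = message.lower()
    if any(word in text for word in ("pricing", "buy", "purchase", "quote")):
        return "buying"
    if any(word in text for word in ("upgrade", "more seats", "add users")):
        return "expansion"
    if any(word in text for word in ("invoice", "refund", "charge")):
        return "billing"
    return "support"

def in_business_hours(now_utc: datetime, utc_offset_hours: float,
                      start: int = 9, end: int = 18) -> bool:
    """Check weekday business hours in the sender's local time."""
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.weekday() < 5 and start <= local.hour < end

def route(message: str, sender_utc_offset_hours: float, now_utc: datetime):
    intent = classify_intent(message)
    if not in_business_hours(now_utc, sender_utc_offset_hours):
        return intent, "auto_acknowledge"  # reply with a response-time commitment
    if intent in HIGH_INTENT:
        return intent, "page_on_call"      # Slack/SMS page
    return intent, "queue"

now = datetime(2026, 5, 15, 1, 0, tzinfo=timezone.utc)  # Friday 01:00 UTC
print(route("Can I get pricing?", 9, now))   # sender at UTC+9, 10:00 local
print(route("Can I get pricing?", -7, now))  # sender at UTC-7, 18:00 Thursday
```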

Research Backing: Responding within 5 minutes makes you 21x more likely to qualify leads vs. 30-minute response times.
Key Metrics to Track
Activation Rate: % of signups hitting aha event within defined window

Day-One Retention

Trial-to-Paid Conversion

Involuntary Churn: Failed payments as % of MRR

Recovery Rate: Broken out by decline reason (expired card vs. insufficient funds)

Alert Threshold: If activation rate drops below 30% for two consecutive weeks, stop acquisition and fix onboarding.
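The alert rule above is easy to automate. A minimal sketch, assuming activation rates are stored as one fraction per week with the most recent week last:

```python
def activation_rate(activated: int, signups: int) -> float:
    """Share of signups that hit the activation event inside the window."""
    return activated / signups if signups else 0.0

def should_pause_acquisition(weekly_rates: list, floor: float = 0.30,
                             weeks: int = 2) -> bool:
    """True when the last `weeks` rates are all below the floor."""
    recent = weekly_rates[-weeks:]
    return len(recent) == weeks and all(rate < floor for rate in recent)

print(should_pause_acquisition([0.41, 0.28, 0.27]))  # True
print(should_pause_acquisition([0.28, 0.35, 0.29]))  # False
```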
Weekend Implementation Challenge

This Weekend:
Pick your pattern

Set up checkout/paywall

Enable Smart Retries and email sequence

Next Week: Add reply router
Week After: Layer in SMS for high-value accounts
Resources
Self-Serve in a Weekend Config Pack: Flowchart, Stripe/Paddle checklists, webhook maps, email templates, and reply-router specifications

All templates and configurations available on the Resources page

"The Lisbon Test for self-serve: Can a buyer in Tokyo try your product, hit a paywall, pay, and get started while you're asleep in Portugal? If yes, you've built something location-independent. If no, you've built a job with a nice view."
16min
May 13, 2026
Build AI-First SOPs That Survive Model Changes

When models change on provider schedules you don't control, your prompts break. Today we build the fix: an AI-first SOP template that treats prompts as versioned assets.
The Problem: Brittle Prompts in a Moving Target Environment
OpenAI retired GPT-4o from ChatGPT February 13, 2026 (hard cutoff)

Traditional SOPs say "use GPT-4o" with no version, expiration, or fallback

Result: contractors debugging prompts that aren't broken when models disappear

The AI-First SOP Schema
Header (Metadata Block)
Owner name + backup owner (critical for async teams)

Status: draft/approved/deprecated

SOP version number

Model tag with specific release date

Temperature band (0-0.2 for compliance, 0.3-0.6 for creative)

Steps with Versioned Prompts
Each model call gets unique prompt key + version number
Semantic versioning: major.minor.patch
Major: Output shape changes (text → JSON)

Minor: Instructions change, output contract same

Patch: Typo fixes, threshold tweaks

Full label: prompt_key@major.minor.patch#model_tag+dataset_hash
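The versioning rules can be sketched as two small helpers. The label shape `key@version#model_tag+dataset_hash` is an assumption pieced together from the surrounding notes, and the prompt key, model tag, and hash below are invented placeholders.

```python
def bump(version: str, change: str) -> str:
    """Apply a semantic-version bump: major, minor, or patch."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # output shape changes, e.g. text to JSON
        return f"{major + 1}.0.0"
    if change == "minor":   # instructions change, output contract stays the same
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # typo fixes, threshold tweaks
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

def full_label(prompt_key: str, version: str, model_tag: str, dataset_hash: str) -> str:
    """Assumed label shape: key@version#model_tag+dataset_hash."""
    return f"{prompt_key}@{version}#{model_tag}+{dataset_hash}"

new_version = bump("1.4.2", "minor")
print(full_label("email_rewrite", new_version, "model-2026-01-15", "a1b2c3"))
```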

Input/Output Schemas
Field name, type, required/optional, description

JSON Schema for technical teams, simple tables for everyone else

Contractors don't guess what prompts expect

Failure Modes & Guardrails
OWASP Top 10 for LLM Applications (v2.0, 2025) catalogs common risks

Document specific failure modes for each workflow

Attach guardrail policy IDs (AWS Bedrock, NeMo Guardrails)

Version guardrail policies too

Tooling Options
Small Teams (≤3 people): Pure Notion
Database with owner, status, SemVer, model tag, last edited time

Page history provides diffs for rollback

Button stamps changelog entry when publishing new version

Setup time: 45 minutes

Bigger Teams: Dedicated Platforms
PromptLayer: Registry with release labels, rollback, analytics
Speak scaled 1→11 markets training non-technical teams to version prompts
Humanloop: Version control with .prompt files that sync to Git
Note: Platform sunset notice flagged in 2025 docs

Platform Risk Mitigation
Keep SemVer convention, model tags, changelog in your SOP

These survive any platform migration

Tool can disappear; versioning scheme persists

The 30-Day Change Review Process
What to Check Monthly
OpenAI deprecations page

Azure model retirement tables

Anthropic deprecation docs

Vertex AI deprecation page

When Something's Flagged
Pull affected SOPs

Rerun evals on replacement model (even just 5 test cases)

If outputs hold: update model tag, bump version

If outputs don't hold: patch prompt before deadline

Real Example: Meticulate
Scaled to 1.5M LLM requests using PromptLayer

Tagged every call by function and model

When prompts regressed: search failing runs, find working version, rollback

Versioned workflow enabled hotfixes in hours vs days

The Cost of Not Having This
3AM messages from confused contractors

2 hours debugging prompts that aren't broken

Client complaints on LinkedIn in front of 11K followers

Margins drifting as pricing changes go unnoticed

Implementation

This week: Pick your most critical AI workflow—the one that would hurt most if it broke tomorrow. Build its SOP first. Pin the model version, write the failure modes, set the 30-day review date.

Template: Grab the AI-First SOP template in the show notes. Duplicate it, fill in your model tag and inputs, get versioned prompts with built-in changelog by end of day.

Resources
AI-First SOP Template (Notion) - Complete template with 3 worked examples

OpenAI API Deprecations

OWASP Top 10 for LLM Applications v2.0

Semantic Versioning Spec

Case Studies Mentioned
Speak: Language learning app scaled 1→11 markets using PromptLayer for non-technical prompt editing

Meticulate: Scaled to 1.5M LLM requests with versioned prompt workflow for rapid rollbacks
15min
May 11, 2026
Stop Interviews: Use a 90-Minute AI-Graded Skills Test
The Problem

That founder in Bangkok spent 11 hours across 5 calls in 4 time zones to hire one contractor—who ghosted after the trial project. Sound familiar?

Resume screens and portfolio reviews don't tell you if someone can actually handle malformed JSON at 2 AM when you're asleep on the other side of the planet.
The Solution: AI-Graded Skills Tests

Replace interviews with a paid, 90-minute async skills test graded by a calibrated LLM judge with human sampling on borderlines.
Core Architecture

Golden Set Calibration
Build 6-10 test items per role: 4 happy-path scenarios, 2-3 edge cases, 1 failure-handling test

For automation builders: clean webhook payload, Euro currency with commas, missing email field, duplicate event requiring idempotency logic

Run 3-5 internal testers through the same test to calibrate rubric weights

Pairwise Judging with Permutation Debiasing
Never use raw 1-10 scores—LLM judges show systematic position bias

Show candidate work vs. golden answer side-by-side: "Which better satisfies this rubric?"

Flip order and run again—if model picks same winner both times, reliable signal

If it flips, flag for human review
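The flip-and-rerun check can be sketched with the judge abstracted as a callable. The judge signature (rubric, first answer, second answer, returning "first" or "second") is an assumption, and the two stub judges exist only to show the behavior; a real judge would call an LLM.

```python
from typing import Callable

# Hypothetical judge interface: returns "first" or "second".
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, rubric: str, candidate: str, golden: str) -> str:
    """Run the comparison in both orders; disagreement goes to a human."""
    pass_one = "candidate" if judge(rubric, candidate, golden) == "first" else "golden"
    pass_two = "candidate" if judge(rubric, golden, candidate) == "second" else "golden"
    return pass_one if pass_one == pass_two else "human_review"

def always_first(rubric: str, first: str, second: str) -> str:
    """Stub with pure position bias."""
    return "first"

def longer_wins(rubric: str, first: str, second: str) -> str:
    """Stub that is order-independent (it prefers the longer answer)."""
    return "first" if len(first) >= len(second) else "second"

rubric = "Handles a missing email field without crashing"
print(debiased_verdict(always_first, rubric, "candidate answer", "gold"))  # human_review
print(debiased_verdict(longer_wins, rubric, "candidate answer", "gold"))   # candidate
```

The order-swap only removes position bias; the `longer_wins` stub shows that a verbosity-biased judge still passes this check, which is one reason the human sampling layer stays in place.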

Confidence Bands for Decisioning
Compute win rate across all items (% of time candidate beat gold standard)

Calculate 95% Wilson confidence interval around that number

Pass: lower bound above 60%

Borderline: win rate 55-65% or interval straddles 60%

Reject: below 55% with upper bound under 60%
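The confidence-band rules can be written out with the standard Wilson score interval. This is a sketch under the stated thresholds; treating everything that is neither a pass nor a reject as borderline is an interpretation of the rules above.

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a win rate of wins/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def decide(wins: int, n: int, cutoff: float = 0.60) -> str:
    win_rate = wins / n
    low, high = wilson_interval(wins, n)
    if low > cutoff:
        return "pass"
    if win_rate < 0.55 and high < cutoff:
        return "reject"
    return "borderline"  # 55-65% win rate, or the interval straddles 60%

for wins in (10, 9, 2):
    print(wins, decide(wins, 10))
```

With only 6-10 items per test the interval is wide: at 10 items, 9 wins gives a lower bound of roughly 0.596, so it lands in borderline and goes to human review.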

Human Sampling Protocol
Every borderline case gets human review

Sample 10-20% of clear passes (stratified by role/region) to check for model drift

Route any critical criterion failure (e.g., factual accuracy in content) to human regardless of overall score

Content Ops Grading

Four weighted criteria:
Factual accuracy: 35% (marked critical—auto-routes to human if flagged)

Structure: 25%

Voice adherence: 25%

Brief compliance: 15%
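The weights above combine into a single score like this. The sketch assumes each criterion arrives as a value between 0 and 1 (for example a per-criterion win rate) and that the judge reports flagged criteria as a set; both are assumptions, and the names are hypothetical.

```python
WEIGHTS = {
    "factual_accuracy": 0.35,
    "structure": 0.25,
    "voice_adherence": 0.25,
    "brief_compliance": 0.15,
}
CRITICAL = {"factual_accuracy"}  # a flag here routes to a human regardless of score


def grade(scores: dict[str, float], flagged: set[str]) -> dict:
    """Combine per-criterion scores (0 to 1) into a weighted total."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return {
        "weighted_score": round(total, 4),
        "route_to_human": bool(flagged & CRITICAL),
    }


result = grade(
    {"factual_accuracy": 1.0, "structure": 0.8, "voice_adherence": 0.6, "brief_compliance": 1.0},
    flagged=set(),
)
print(result)
```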

Anti-Cheat Without Surveillance

Required Layer:
Randomized inputs (rotate variants monthly)

Time-boxed links (portal locks at 90 minutes)

Honor statement checkbox

Optional Additions:
Tab-switch logging

Basic plagiarism detection

Avoid: Screen recording, keystroke logging, webcam monitoring—you're hiring async contractors, not surveilling them.
Fair Payment Structure

Regional Pay Bands (90-minute stipend):

Content Ops:
Southeast Asia: $30

Western Europe: $60

US: $68

Automation Builders:
Southeast Asia: $45

Western Europe: $83

US: $98

Based on Upwork median rates and Automattic's $25/hour trial standard.
Appeal Process
5-day window for human re-review requests

Rubric feedback provided either way

Brand signal: "We take your time seriously enough to build transparent systems"

Research Foundation
Stanford SCALE Autorubric: Per-criterion rubric checks with few-shot calibration

Chatbot Arena methodology: Pairwise comparison with confidence-aware ranking

Position bias studies: 100k+ evaluation instances show systematic bias in LLM judges

G-Eval correlation: GPT-4 achieves ~0.51 Spearman with humans on summarization—good but not perfect

Quality Flags & Transparency
Log every prompt, model version, score (HELM-style reporting)

Version everything, changelog everything

Defend every decision with audit trail

10-20% human sampling concentrated on borderlines and critical criteria

The Math

Traditional hiring: 11 hours of interviews + bad hire that costs a client

AI-graded test: $400 for 10 candidates + 40 minutes reviewing 2 borderline cases

The math isn't close.
Resources

The Contractor Skills Test Pack includes:
Golden-set datasets for automation builder and content ops roles

Pairwise grader prompts with permutation logic

Rubric weights and confidence-band calculator

Human sampling SOP and anti-cheat checklist

Regional pay-band tables

Candidate-facing one-pager for Notion

Next Steps
Grab the Contractor Skills Test Pack

Swap in your role and stack

Run 3 internal testers to calibrate bands

Post your first test by Friday

Ship it before your next visa run.
14min
May 08, 2026
Build an AI Org Chart That Works While You Sleep
The Oaxaca Disaster

Kira's 11 PM wake-up call in Oaxaca: contractor in Lagos finished fourteen blog posts, but the Berlin editor was on PTO with no backup assigned. Result? Nine posts reviewed while falling asleep at a tiny Airbnb desk, five shipped unreviewed, and one had the wrong client name in the headline. The 7 AM apology call from a mezcal hangover was the moment she realized her agency didn't have an org chart—it had her.
The Four-Role Framework

Not four people—four roles. One person can hold multiple roles when you're small:
Builder: Ships the thing. Writes drafts, builds automations, pushes code

Operator: Owns quality, schedules, budgets, client communications

Reviewer: Independent check. Cannot be the Builder on the same task

Agent/Dispatcher: Routes work, maintains schedules, pages people when things break

Three Scalable Patterns
Pattern 1: Solo + Contractors
Founder: Builder + Operator

Contractor 1: Secondary Builder

Contractor 2: Reviewer

Make automation: Dispatcher with 30-minute human backstop

Pattern 2: Pod Model (3-5 people)
Lead writer (Builder)

Ops person (Operator)

Rotating editor (Reviewer)

Published SLAs: 24h priority campaigns, 48h everything else

Auto-approve on silence if automated checks pass

Pattern 3: Agency Cell + Dispatcher
Multiple pods handling different clients/products

Traffic Manager routes work and maintains coverage

UTC coverage grid shows overlap windows

SLA Matrix & Escalation
Response Time Targets
Revenue-critical leads: 15-minute acknowledgment during sender's business hours

Code reviews: 4-hour first look during business hours

Content approvals: 24-48 hours

Support requests: Same business day

Severity Tiers (Atlassian Framework)
SEV 1: Revenue impact now → immediate paging

SEV 2: Major degradation/deadline today → 30-60 minute window

SEV 3: Normal work → business hours

Two-Layer Escalation
On-call Agent (15-minute acknowledgment window)

Auto-escalate to Operator if missed
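The two-layer rule reduces to a small timing check. A sketch under assumptions: the function name and return labels are hypothetical, and a late acknowledgment is treated as already escalated.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ACK_WINDOW = timedelta(minutes=15)


def current_owner(paged_at: datetime, acked_at: Optional[datetime], now: datetime) -> str:
    """Return who owns the page right now: 'agent' or 'operator'."""
    if acked_at is not None and acked_at - paged_at <= ACK_WINDOW:
        return "agent"  # acknowledged inside the window
    if now - paged_at > ACK_WINDOW:
        return "operator"  # window missed, auto-escalate
    return "agent"  # still inside the window, waiting for the acknowledgment


t0 = datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
print(current_owner(t0, None, t0 + timedelta(minutes=16)))
```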

Coverage Grid & Handoffs
UTC Coverage Grid

Spreadsheet with columns: name, role, UTC offset, work start/end, PTO dates. Calculate overlap hours between Builder in Bogotá and Reviewer in Bangkok.
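The overlap calculation can be checked outside the spreadsheet. A minimal sketch, assuming whole-hour UTC offsets and whole-hour shift boundaries; the 9-to-5 and evening shifts are made-up inputs.

```python
def utc_hours(utc_offset: int, start_local: int, end_local: int) -> set[int]:
    """Whole UTC hours (0-23) covered by a local shift; handles shifts that cross midnight."""
    length = (end_local - start_local) % 24
    start_utc = (start_local - utc_offset) % 24
    return {(start_utc + h) % 24 for h in range(length)}


def overlap_hours(a: set[int], b: set[int]) -> int:
    return len(a & b)


builder_bogota = utc_hours(-5, 9, 17)    # 09:00-17:00 local is 14:00-22:00 UTC
reviewer_bangkok = utc_hours(7, 9, 17)   # 09:00-17:00 local is 02:00-10:00 UTC
print(overlap_hours(builder_bogota, reviewer_bangkok))

# If the Bangkok reviewer instead works 20:00-04:00 local (13:00-21:00 UTC):
reviewer_evening = utc_hours(7, 20, 4)
print(overlap_hours(builder_bogota, reviewer_evening))
```

Two standard 9-to-5 days in Bogotá and Bangkok share no hours at all, which is exactly the kind of gap the grid is meant to surface before a handoff depends on it.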
Five-Field Handoff Packet

Before passing work across time zones:
Context: What we're doing and for whom

Constraints: Deadlines, budgets, brand rules

Last good output: Most recent working version

Budget left: Hours or dollars remaining

Fallback: What to do if blocked for 12 hours

Receiving person must comment "I own it" and restate next checkpoint in UTC.
The Lisbon Test for Handoffs

Could this work keep moving for 24 hours while you're offline? If any of the five fields is blank, you don't have a handoff—you have a hope.
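The blank-field check is easy to automate before a handoff is accepted. A sketch with hypothetical field names mapped from the five fields above, and made-up example values.

```python
from dataclasses import dataclass, fields


@dataclass
class HandoffPacket:
    context: str = ""
    constraints: str = ""
    last_good_output: str = ""
    budget_left: str = ""
    fallback: str = ""

    def missing_fields(self) -> list[str]:
        """Names of fields that are blank or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_handoff(self) -> bool:
        # Any blank field fails the test: a hope, not a handoff.
        return not self.missing_fields()


packet = HandoffPacket(
    context="Weekly newsletter for client X",
    constraints="Due Friday 14:00 UTC, brand voice guide v3",
    last_good_output="Draft 2 in the shared folder",
    budget_left="3 hours",
)
print(packet.is_handoff(), packet.missing_fields())
```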
Reviewer Rotation

GitLab's "Reviewer Roulette": Random assignment from a pool. For small teams, use a shared doc rotating weekly assignments with visible backup coverage.
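Both variants fit in a few lines. This is an illustrative sketch, not GitLab's implementation: the names are hypothetical, and the exclusion rule comes from the Reviewer role definition (the Reviewer cannot be the Builder on the same task).

```python
import random
from typing import Optional


def pick_reviewer(pool: list[str], builder: str, rng: Optional[random.Random] = None) -> str:
    """Random assignment from the pool, never the Builder of the same task."""
    rng = rng or random.Random()
    eligible = [person for person in pool if person != builder]
    if not eligible:
        raise ValueError("No eligible reviewer: add a backup to the pool")
    return rng.choice(eligible)


def weekly_rotation(pool: list[str], week_number: int) -> tuple[str, str]:
    """Small-team variant: (primary, backup) reviewer for a given week number."""
    primary = pool[week_number % len(pool)]
    backup = pool[(week_number + 1) % len(pool)]
    return primary, backup


pool = ["ana", "ben", "chi"]
print(pick_reviewer(pool, builder="ana", rng=random.Random(1)))
print(weekly_rotation(pool, week_number=4))
```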
Blameless Postmortems

Google SRE template: What happened, timeline in UTC, root cause, what worked, what failed, three ranked fixes with owners and due dates. Run within 72 hours while details are fresh. Goal: fix the system, never punish.
EU AI Act Compliance Ready

August 2, 2026 applicability date for most provisions. Named Reviewers, documented approval chains, and evidence logs aren't just good ops—they're compliance readiness for human oversight requirements.
Minimum Viable Process

Start with:
One-page RACI per offer (not per task)

UTC coverage grid in Google Sheets

Five-field handoff packet

Two-tier escalation (15-minute window only for revenue-critical leads)

Pilot on one client for two weeks, then iterate

This Week's Action
Download the AI-Augmented Org Packet (RACI template, SLA matrix, coverage grid, escalation tree, handoff checklist)

Duplicate and fill in roles for one client/product

Assign backup for every single role

Run one red-team handoff: hand a real task to a backup overnight

If it ships without you touching it, your org chart works

Resources:
AI-Augmented Org Packet - Complete templates and frameworks

GitLab Reviewer Roulette - Rotation system reference

Atlassian Incident Response - Severity framework

Google SRE Postmortem Culture - Blameless postmortem template
16min
May 08, 2026
Build Your Own LLM Cost Meter Before the Next Provider Change
The Problem: Fragmented Cost Visibility

Provider dashboards only show you one slice of your AI spend. When you're calling OpenAI, Anthropic, and Gemini, you have three separate billing views that don't talk to each other. You can't see total cost per customer, per job, or which workflow is eating your margins.
The Solution: A Vendor-Agnostic Data Layer

Every API call writes one row to one table, same format, regardless of provider. The schema is simpler than you think: 19 fields that capture everything you need for cost attribution, audit compliance, and budget enforcement.
The 19-Field Schema

Core Fields:
event_id (UUID)

timestamp (UTC)

actor_type (user/agent/human_reviewer)

provider (normalized: "openai", "anthropic", "google")

model (normalized: "gpt-5.4-mini")

region

Cost Fields:
input_tokens

output_tokens

latency_ms

cost_usd

Attribution Fields:
job_id

customer_id

Audit Fields (EU AI Act Ready):
pii_flag

review_required

reviewer_id

decision (approved/rejected/edited)

Quality Fields:
eval_score

confidence

error_code

Three Implementation Paths
Path 1: Spreadsheet First (Under 1 Hour)
Google Sheets with Apps Script webhook

Looker Studio dashboard

Weekly email digest

Perfect for solo builders

Path 2: Open Source + Data Ownership
Langfuse for LLM observability

Export to Parquet files

Query with DuckDB

30-day hot, 180-day warm retention

Path 3: Lightweight SaaS
PostHog for event capture

Metronome for usage meters

Built-in alerting

Fastest path if you're already in the ecosystem

Spend Caps and Degraded Mode

Set two numbers:
Daily cap in dollars

Rolling 7-day cap

When caps trip, don't just stop. Degrade gracefully:
Queue non-urgent requests

Route to cheaper models automatically

Flag for human review if full capability needed
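The cap check and the three degraded behaviors can be expressed as one routing function. A sketch under assumptions: the function name, the return labels, the cap values in the example, and the priority order among the degraded options are all illustrative.

```python
def route_request(
    spent_today: float,
    spent_7d: float,
    daily_cap: float,
    weekly_cap: float,
    urgent: bool,
    needs_full_capability: bool,
) -> str:
    """Decide how to handle a request given current spend against both caps."""
    if spent_today < daily_cap and spent_7d < weekly_cap:
        return "normal"
    # A cap has tripped: degrade instead of stopping.
    if needs_full_capability:
        return "human_review"   # a cheaper model would not be good enough
    if not urgent:
        return "queue"          # wait until spend drops back under the cap
    return "cheaper_model"      # urgent but tolerant of a smaller model


print(route_request(12.0, 40.0, daily_cap=10.0, weekly_cap=60.0,
                    urgent=True, needs_full_capability=False))
```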

EU AI Act Compliance

The Act becomes generally applicable August 2, 2026. The audit fields in this schema create the evidence trail you need:
Log which calls touched PII

Track human review decisions

Maintain 6-month retention minimum

Resources

Indie LLM Cost Meter Starter - Complete template with:
CSV schema and Google Sheets setup

Apps Script webhook code

DuckDB SQL queries

Slack alert recipes

Looker Studio dashboard template

Key Pricing References (May 2026)
GPT-5.5: $5/1M input tokens, $30/1M output tokens

All providers enforce rate limits via 429 responses

Normalize provider names at ingestion to avoid SQL grouping errors
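A sketch of the two ingestion steps these notes imply: computing `cost_usd` from token counts and normalizing provider names before they reach the table. The GPT-5.5 prices are the ones listed above; the lowercase model key, the alias table, and the function names are assumptions.

```python
# (input price, output price) in USD per 1M tokens, from the reference above.
PRICES_PER_MILLION = {"gpt-5.5": (5.00, 30.00)}

# Hypothetical alias table: every spelling seen in logs maps to one canonical name.
PROVIDER_ALIASES = {
    "openai": "openai",
    "open ai": "openai",
    "anthropic": "anthropic",
    "google": "google",
    "gemini": "google",
}


def normalize_provider(raw: str) -> str:
    key = raw.strip().lower()
    if key not in PROVIDER_ALIASES:
        raise ValueError(f"Unknown provider: {raw!r}")
    return PROVIDER_ALIASES[key]


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES_PER_MILLION[model]
    return round((input_tokens * in_price + output_tokens * out_price) / 1_000_000, 6)


print(normalize_provider(" OpenAI "), cost_usd("gpt-5.5", 1200, 300))
```

Without normalization, "OpenAI" and "openai" land in separate GROUP BY buckets and each understates that provider's spend.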

Action Item

Add the schema to your next LLM call. Log one row. Once you see that first row land, you'll never go back to checking billing dashboards manually.
15min
