The Stateless Founder

Stop Interviews: Use a 90-Minute AI-Graded Skills Test


The Problem

A founder in Bangkok spent 11 hours across 5 calls in 4 time zones to hire one contractor, who then ghosted after the trial project. Sound familiar?

Resume screens and portfolio reviews don't tell you if someone can actually handle malformed JSON at 2 AM when you're asleep on the other side of the planet.

The Solution: AI-Graded Skills Tests

Replace interviews with a paid, 90-minute async skills test graded by a calibrated LLM judge with human sampling on borderlines.

Core Architecture

Golden Set Calibration

  • Build 6-10 test items per role: 4 happy-path scenarios, 2-3 edge cases, 1 failure-handling test
  • For automation builders: clean webhook payload, Euro currency with commas, missing email field, duplicate event requiring idempotency logic
  • Run 3-5 internal testers through the same test to calibrate rubric weights
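As a rough illustration, the automation-builder items above could be stored as input/expected pairs and run against a submission. Everything named here is a hypothetical stand-in: the payload fields (`event_id`, `amount`, `currency`, `email`), the `reference_handler`, and the expected outputs are assumptions for the sketch, not the contents of the Test Pack.

```python
from decimal import Decimal

def parse_euro_amount(text):
    """Parse '1.234,56' style amounts (dot = thousands, comma = decimal)."""
    return Decimal(text.replace(".", "").replace(",", "."))

def reference_handler(events):
    """Hypothetical gold-standard handler: returns (accepted, rejected)."""
    seen, accepted, rejected = set(), [], []
    for ev in events:
        if ev["event_id"] in seen:      # idempotency: ignore repeated events
            continue
        seen.add(ev["event_id"])
        if not ev.get("email"):         # missing or empty email field
            rejected.append(ev["event_id"])
            continue
        raw = ev["amount"]
        amount = parse_euro_amount(raw) if ev["currency"] == "EUR" else Decimal(raw)
        accepted.append((ev["event_id"], amount))
    return accepted, rejected

def _event(event_id, amount="49.00", currency="USD", email="a@example.com"):
    ev = {"event_id": event_id, "amount": amount, "currency": currency}
    if email is not None:
        ev["email"] = email
    return ev

# One item per scenario named in the list above (illustrative values).
GOLDEN_SET = [
    {"name": "clean webhook payload",
     "events": [_event("e1")],
     "expected": ([("e1", Decimal("49.00"))], [])},
    {"name": "Euro currency with commas",
     "events": [_event("e2", amount="1.234,56", currency="EUR")],
     "expected": ([("e2", Decimal("1234.56"))], [])},
    {"name": "missing email field",
     "events": [_event("e3", email=None)],
     "expected": ([], ["e3"])},
    {"name": "duplicate event",
     "events": [_event("e4"), _event("e4")],
     "expected": ([("e4", Decimal("49.00"))], [])},
]

def run_golden_set(handler):
    """Map each golden item name to True/False for the given handler."""
    return {item["name"]: handler(item["events"]) == item["expected"]
            for item in GOLDEN_SET}
```

Running the internal testers' handlers through `run_golden_set` is one way to see which items separate strong work from weak work before setting rubric weights.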

Pairwise Judging with Permutation Debiasing

  • Avoid raw 1-10 scores: absolute scores from LLM judges tend to be poorly calibrated, so compare pairs instead
  • Show candidate work vs. golden answer side-by-side: "Which better satisfies this rubric?"
  • Pairwise judges show systematic position bias, so flip the order and run again; if the model picks the same winner both times, treat it as a reliable signal
  • If the verdict flips, flag for human review
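The flip-and-rerun check can be sketched in a few lines. The judge here is an assumed callable that takes `(rubric, first_answer, second_answer)` and returns `"first"` or `"second"`; in practice it would wrap an LLM call. The two stub judges exist only to show the behavior.

```python
from typing import Callable

# Hypothetical judge signature: (rubric, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, rubric: str, candidate: str, golden: str) -> str:
    """Return 'candidate', 'golden', or 'human_review' if the winner changes with order."""
    # Pass 1: candidate shown first.
    pass_one = "candidate" if judge(rubric, candidate, golden) == "first" else "golden"
    # Pass 2: same pair with the order flipped.
    pass_two = "golden" if judge(rubric, golden, candidate) == "first" else "candidate"
    return pass_one if pass_one == pass_two else "human_review"

# Stub judges standing in for a real model call.
def position_biased_judge(rubric, first, second):
    return "first"  # always prefers whatever it sees first

def length_judge(rubric, first, second):
    return "first" if len(first) >= len(second) else "second"
```

A judge that only ever prefers the first slot gets caught: `debiased_verdict(position_biased_judge, ...)` always comes back as `"human_review"`.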

Confidence Bands for Decisioning

  • Compute win rate across all items (% of items where the candidate beat the gold standard)
  • Calculate the 95% Wilson confidence interval around that number
  • Pass: lower bound above 60%
  • Borderline: win rate 55-65%, or interval straddles 60%
  • Reject: win rate below 55% with upper bound under 60%
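Here is a minimal sketch of the band logic, using the standard Wilson score interval with z = 1.96. One assumption is mine: the bands are checked in the order pass, reject, then borderline as the catch-all, which resolves cases the rules leave ambiguous.

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a win rate of wins/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def decide(wins: int, n: int, threshold: float = 0.60) -> str:
    """Classify a candidate as 'pass', 'reject', or 'borderline'."""
    rate = wins / n
    low, high = wilson_interval(wins, n)
    if low > threshold:
        return "pass"
    if rate < 0.55 and high < threshold:
        return "reject"
    # Assumption: anything not clearly pass or reject goes to human review.
    return "borderline"
```

Under these assumptions the interval is wide at golden-set sizes. With 10 items, only a perfect 10/10 clears the pass band (lower bound about 0.72); 9/10 has a lower bound of about 0.596 and lands in borderline.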

Human Sampling Protocol

  • Every borderline case gets human review
  • Sample 10-20% of clear passes (stratified by role/region) to check for model drift
  • Route any critical criterion failure (e.g., factual accuracy in content) to human regardless of overall score
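One way to build the review queue from these three rules is sketched below. The record fields (`id`, `role`, `region`, `band`, `critical_fail`), the 15% default rate, and the "at least one per stratum" floor are assumptions for illustration.

```python
import math
import random
from collections import defaultdict

def review_queue(results, sample_rate=0.15, seed=0):
    """Return candidate ids that need a human look.

    results: list of dicts with keys id, role, region, band, critical_fail.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible for audits
    # Rules 1 and 3: every borderline and every critical failure is reviewed.
    queue = [r["id"] for r in results
             if r["band"] == "borderline" or r["critical_fail"]]
    # Rule 2: stratified sample of clear passes, grouped by (role, region).
    strata = defaultdict(list)
    for r in results:
        if r["band"] == "pass" and not r["critical_fail"]:
            strata[(r["role"], r["region"])].append(r["id"])
    for ids in strata.values():
        k = max(1, math.ceil(sample_rate * len(ids)))  # assumed floor of one per stratum
        queue.extend(rng.sample(ids, k))
    return queue
```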

Content Ops Grading

Four weighted criteria:

  • Factual accuracy: 35% (marked critical; auto-routes to human if flagged)
  • Structure: 25%
  • Voice adherence: 25%
  • Brief compliance: 15%
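The weighting and the critical-criterion override might be combined like this. It assumes each criterion arrives as a score between 0 and 1 (for example, a per-criterion win rate against the golden answer) plus a set of criteria the judge flagged; both input shapes are hypothetical.

```python
WEIGHTS = {
    "factual_accuracy": 0.35,
    "structure": 0.25,
    "voice_adherence": 0.25,
    "brief_compliance": 0.15,
}
CRITICAL = {"factual_accuracy"}

def grade_content(scores: dict, flags: set) -> dict:
    """scores: criterion -> value in [0, 1]; flags: criteria the judge flagged."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return {
        "weighted_score": round(total, 4),
        # A flagged critical criterion goes to a human whatever the total is.
        "route_to_human": bool(flags & CRITICAL),
    }
```

For instance, scores of 1.0, 0.5, 0.5, and 1.0 in the order listed give a weighted score of 0.75, and a factual-accuracy flag still routes that submission to a human.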

Anti-Cheat Without Surveillance

Required Layer:

  • Randomized inputs (rotate variants monthly)
  • Time-boxed links (portal locks at 90 minutes)
  • Honor statement checkbox

Optional Additions:

  • Tab-switch logging
  • Basic plagiarism detection

Avoid: Screen recording, keystroke logging, webcam monitoring. You're hiring async contractors, not surveilling them.
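The first two required items are simple enough to sketch. This is an assumed design, not the portal's actual logic: `pick_variant` shifts each candidate's variant by one slot per calendar month, and `portal_is_open` treats the 90-minute mark itself as locked.

```python
import hashlib
from datetime import datetime, timedelta

TEST_WINDOW = timedelta(minutes=90)

def pick_variant(variants: list, candidate_id: str, today: datetime) -> str:
    """Spread candidates across variants and rotate the assignment monthly."""
    digest = hashlib.sha256(candidate_id.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big")   # stable per-candidate offset
    month_index = today.year * 12 + today.month  # advances by 1 each month
    return variants[(offset + month_index) % len(variants)]

def portal_is_open(started_at: datetime, now: datetime) -> bool:
    """True while the candidate is inside the 90-minute window."""
    return now - started_at < TEST_WINDOW
```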

Fair Payment Structure

Regional Pay Bands (90-minute stipend):

Content Ops:

  • Southeast Asia: $30
  • Western Europe: $60
  • US: $68

Automation Builders:

  • Southeast Asia: $45
  • Western Europe: $83
  • US: $98

Based on Upwork median rates and Automattic's $25/hour trial standard.
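For budgeting, the pay bands can live in a small lookup table. The stipend figures are the ones listed above; the key names and helper functions are illustrative assumptions.

```python
TEST_HOURS = 1.5  # the test is 90 minutes

STIPENDS_USD = {
    "content_ops": {"southeast_asia": 30, "western_europe": 60, "us": 68},
    "automation_builder": {"southeast_asia": 45, "western_europe": 83, "us": 98},
}

def implied_hourly_rate(role: str, region: str) -> float:
    """Stipend expressed as an hourly rate, rounded to cents."""
    return round(STIPENDS_USD[role][region] / TEST_HOURS, 2)

def batch_cost(candidates) -> int:
    """Total stipend for an iterable of (role, region) pairs."""
    return sum(STIPENDS_USD[role][region] for role, region in candidates)
```

At these stipends the implied hourly rate runs from $20 (content ops, Southeast Asia) to about $65 (automation builders, US).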

Appeal Process

  • 5-day window for human re-review requests
  • Rubric feedback provided either way
  • Brand signal: "We take your time seriously enough to build transparent systems"

Research Foundation

  • Stanford SCALE Autorubric: Per-criterion rubric checks with few-shot calibration
  • Chatbot Arena methodology: Pairwise comparison with confidence-aware ranking
  • Position bias studies: 100k+ evaluation instances show systematic bias in LLM judges
  • G-Eval correlation: GPT-4 achieves ~0.51 Spearman with humans on summarization, which is good but not perfect

Quality Flags & Transparency

  • Log every prompt, model version, score (HELM-style reporting)
  • Version everything, changelog everything
  • Defend every decision with audit trail
  • Human review of every borderline and critical-criterion flag, plus a 10-20% sample of clear passes
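A bare-bones audit trail could be one JSON line per judge call. The field names and the JSONL layout are assumptions for this sketch; the point is that the prompt, model version, rubric version, and verdict are stored together so any decision can be replayed later.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(prompt, model_version, rubric_version, verdict, now=None):
    """Build one log entry for a single judge call."""
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),
        "model_version": model_version,
        "rubric_version": rubric_version,
        "prompt": prompt,
        # The hash makes it cheap to check that two runs used the same prompt.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "verdict": verdict,
    }

def append_jsonl(path, record):
    """Append the record as one line; the file is never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```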

The Math

Traditional hiring: 11 hours of interviews + bad hire that costs a client

AI-graded test: $400 for 10 candidates + 40 minutes reviewing 2 borderline cases

The math isn't close.

Resources

The Contractor Skills Test Pack includes:

  • Golden-set datasets for automation builder and content ops roles
  • Pairwise grader prompts with permutation logic
  • Rubric weights and confidence-band calculator
  • Human sampling SOP and anti-cheat checklist
  • Regional pay-band tables
  • Candidate-facing one-pager for Notion

Next Steps

  1. Grab the Contractor Skills Test Pack
  2. Swap in your role and stack
  3. Run 3 internal testers to calibrate bands
  4. Post your first test by Friday

Ship it before your next visa run.


The Stateless Founder, by Santi and Kira