Bulletproof Your AI Automations: Monitoring, Retries, and 2‑AM Alerts
The $8K Thailand Disaster
Santi lost an $8,000/month client when a Make.com API change broke his automation for four days while he was island-hopping in Thailand. Silent failure, no alerts, forty-eight leads vanishing into the void while he sat on a beach thinking everything was fine.
Why Automations Fail in Production
Rate limits that suddenly drop from 1,000 to 50 requests per hour
OAuth tokens that expire silently after 90 days despite "permanent" documentation
Schema changes where strings become arrays overnight
Provider outages returning HTTP 200 success with error messages in the body
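That last failure mode is worth guarding against explicitly: never trust a 200 status until the body checks out. A minimal Python sketch (the `error` key is an assumption; adapt it to the provider's actual error envelope):

```python
import json

def parse_api_response(status_code: int, body: str) -> dict:
    """Treat a response as successful only after validating the body.

    Some providers return HTTP 200 with an error payload inside;
    checking only the status code turns that into a silent failure.
    """
    if status_code != 200:
        raise RuntimeError(f"HTTP {status_code}: {body[:200]}")
    data = json.loads(body)
    if "error" in data:  # 200 OK, but the provider failed anyway
        raise RuntimeError(f"API error in 200 response: {data['error']}")
    return data
```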
Platform-Specific Error Handling
Zapier: Custom Error Handling + Manager Alerts
Add Custom Error Handling to any step touching external APIs
Click the three-dot menu → "Error Handler" to add an alternate failure path
Critical caveat: Error handlers suppress default email notifications
Build separate Zap: Zapier Manager "New Zap Error" → Slack "Send Channel Message"
Team account gotcha: Notifications route to Zap owner, not monitoring Zap creator
Make: Incomplete Executions + Auto-Retry
Scenario settings → Toggle "Store incomplete executions"
Add Break error handler to modules that might fail
Set attempts (3) and delay (exponential: 1min, 5min, 20min)
Scenario settings → "Number of consecutive errors" (5) for auto-deactivation
Instant trigger caveat: Scenarios on instant triggers deactivate after the first error, not after five
n8n: Global Error Workflow + Per-Node Controls
Create dedicated workflow with Error Trigger node
Route to Slack/Telegram/email with execution context
Template reference: workflow 5629 for multi-channel alerts
Limitation: No error data if trigger node itself fails
On Error: Stop Workflow / Continue / Continue Using Error Output
Retry On Fail: Set max attempts (no built-in exponential backoff)
Use Wait node between retries for rate limit handling
Security Alert: n8n "Ni8mare" Vulnerability
CVSS 10.0 (remote code execution, no authentication)
Affected versions: 1.65–1.120.4
Fixed in: v1.121.0
Exposed instances: ~59,559 as of January 11, 2026
Set up update notifications
Actually apply patches promptly
Restrict public webhook/form endpoints
Rotate credentials regularly
Retry Patterns & Best Practices
Exponential Backoff with Jitter
Add random delays to spread retry attempts
Prevents thundering herd effects when APIs recover
AWS guidance since 2015 for distributed systems
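The pattern can be sketched in a few lines of Python; the function names, base, and cap here are illustrative defaults, not from any platform:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay between 0 and
    min(cap, base * 2**attempt), so clients retrying after an outage
    don't all hit the recovered API at the same instant."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_attempts: int = 4, base: float = 1.0):
    """Retry fn() with jittered exponential delays between attempts;
    re-raise the last error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt, base=base))
```

The same delay schedule works as a manual workaround on platforms without built-in backoff, e.g. computed inside an n8n Code node feeding a Wait node.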
Idempotency Keys
Use unique identifiers that survive retries
Check if records exist before creating duplicates
Prevent multiple charges, notifications, or invoices
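A minimal sketch of the idea: derive the key from the business event itself so every retry maps to the same key. The in-memory set stands in for a durable store, and real code would need to claim the key atomically before running the side effect:

```python
import hashlib

def idempotency_key(customer_id: str, invoice_date: str) -> str:
    """Stable key derived from the business event, so the same event
    always produces the same key no matter how often it's retried."""
    return hashlib.sha256(f"{customer_id}:{invoice_date}".encode()).hexdigest()

_processed: set[str] = set()  # stand-in for a durable store (DB table, data store)

def charge_once(customer_id: str, invoice_date: str, charge_fn) -> bool:
    """Run charge_fn only if this event hasn't been processed yet.
    Returns True if the charge ran, False if it was a duplicate."""
    key = idempotency_key(customer_id, invoice_date)
    if key in _processed:
        return False
    charge_fn()
    _processed.add(key)
    return True
```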
Incident Classification Framework
P0: Customer-facing service completely down
Response: Page immediately (Telegram, phone call)
Actions: Pause billing, notify clients, implement manual workaround
P1: Degraded but limping along
Response: Slack alert within one hour
Actions: Monitor closely, prepare escalation
P2: Broken but nobody notices yet
Response: Queue for tomorrow
Actions: Document and schedule fix
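The framework above can be encoded as a small routing table so alert workflows pick the channel mechanically instead of someone deciding at 2 AM. The channel names and deadlines below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Policy:
    channel: str             # where the alert routes
    respond_within_min: int  # response-time target

# Hypothetical routing table mirroring the P0/P1/P2 framework above
POLICIES = {
    "P0": Policy(channel="phone+telegram", respond_within_min=0),
    "P1": Policy(channel="slack", respond_within_min=60),
    "P2": Policy(channel="ticket-queue", respond_within_min=24 * 60),
}

def classify(customer_facing_down: bool, degraded: bool) -> str:
    """Map symptoms to a priority: full outage -> P0,
    degraded but limping -> P1, everything else -> P2."""
    if customer_facing_down:
        return "P0"
    if degraded:
        return "P1"
    return "P2"
```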
The Minimum Viable Error Stack
Error notifications: Route all platform errors to one channel (Slack/Telegram)
Retry logic: Two attempts minimum with exponential backoff on external APIs
Classification & runbooks: Define P0/P1/P2 with response procedures
Heartbeat monitoring: Check that daily workflows actually ran
Security patches: Update notifications and prompt application for self-hosted
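Heartbeat monitoring is essentially a dead-man's switch: each workflow records a timestamp on success, and a separate check alerts when a workflow goes quiet. A minimal sketch, with a plain dict standing in for Zapier Storage, a Make data store, or a database table:

```python
import time

def record_heartbeat(store: dict, workflow: str) -> None:
    """Called as the last step of a successful run."""
    store[workflow] = time.time()

def stale_workflows(store: dict, expected: list[str], max_age_s: float) -> list[str]:
    """Return workflows that haven't reported a recent heartbeat --
    including ones that never ran at all, which error handlers
    alone can never catch."""
    now = time.time()
    return [w for w in expected if now - store.get(w, 0.0) > max_age_s]
```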
Silent Failures & Volumetric Monitoring
Beyond error handlers, monitor for:
Triggers that stop firing
Successful runs with wrong output
Volume drops below expected thresholds
Zapier: Use Storage to count executions
Make: Data store for volume tracking
n8n: Database writes with periodic queries
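Whichever store holds the counts, the alerting check itself is simple. A hedged sketch: flag a drop when today's volume falls below half of the trailing average (both the window and the 50% floor are assumptions to tune per workflow):

```python
from statistics import mean

def volume_alert(daily_counts: list[int], today: int,
                 floor_ratio: float = 0.5) -> bool:
    """Flag a likely silent failure when today's execution count falls
    below floor_ratio of the trailing average. daily_counts is the
    recent history pulled from wherever executions are counted."""
    if not daily_counts:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(daily_counts)
    return today < baseline * floor_ratio
```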
The Lisbon Test
Can your error handling setup:
Notify you at a Lisbon café with sketchy wifi?
Let you diagnose issues from your phone?
Enable workaround implementation without your laptop?
Mobile-first principle: If you can't fix it from your phone, your setup is too complex.
Downloads
Notion Incident Runbook Template
P0/P1/P2 classifications
Response procedures and escalation paths
Manual workaround documentation
Transient vs permanent vs authentication errors
Identification and handling guidelines
Retry decision matrix
Zapier Manager → Slack
Make webhook → Slack
n8n Error Trigger → Telegram
Key Resources
Zapier Custom Error Handling
Make Incomplete Executions
n8n Error Handling Docs
n8n Multi-Channel Error Template
AWS Exponential Backoff Guide
Stripe Idempotency Best Practices
Build it bulletproof, or don't build it at all.