In Elixir Wizards S15E04, Charles Suggs and Emma Whamond are joined by Somtochi Onyekwere, a software engineer at Fly.io and contributor to the Corrosion distributed database project, to talk about distributed systems, infrastructure resilience, and the growing fragility of centralized cloud platforms.
We discuss what recent outages across major providers reveal about modern infrastructure and why more teams are starting to rethink assumptions around reliability, failover, and system design. Somtochi explains how Fly.io approaches geographic distribution, eventual consistency, and replication across nodes, along with the trade-offs that come with building systems this way.
The conversation explores CRDTs (Conflict-free Replicated Data Types), consensus, split-brain prevention, and what actually happens when distributed systems fail in production. We also talk about testing strategies, rollback planning, property-based testing tools, and how teams can reduce blast radius when things inevitably go wrong.
Along the way, we discuss AI infrastructure, sandboxing AI agents, and how newer workloads may add pressure to already centralized systems. The episode closes with practical advice for developers who want to build more resilient applications without over-complicating their architecture.
Topics Discussed in this Episode:
Corrosion and distributed database replication
Centralized cloud fragility and recent outage patterns
Distributed systems versus traditional cloud architectures
Multi-region deployment strategies for Phoenix applications
CRDTs and conflict resolution in distributed systems
Eventual consistency versus strict consistency tradeoffs
Consensus, leader election, and split-brain prevention
Testing failover and recovery scenarios
Property-based testing and Antithesis
Rollback planning for database schema migrations
Reducing blast radius through system isolation
Health checks and blue-green deployment strategies
Fly Proxy request routing and replay behavior
Cross-region synchronization and replication challenges
Single points of failure inside “redundant” systems
Backup restoration testing and disaster recovery planning
Network partitions and failure handling in production
Infrastructure monitoring and operational visibility
AI infrastructure workloads and operational strain
Sandboxing and securing AI agents
Sprites and AI workflows at Fly.io
Latency improvements from geographic distribution
Distributed systems tradeoffs in real-world environments
Transitive dependency failures across cloud providers
Practical resilience strategies for modern engineering teams
Links Mentioned:
https://github.com/superfly/corrosion
https://docs.gitops.weaveworks.org/
FluxCD https://fluxcd.io/
Fly.io Stateful Sandbox Environments https://sprites.dev/
Cloudflare Workers AI Inference Platform https://www.cloudflare.com/products/workers-ai/
“An AI Agent Just Destroyed Our Production Data. It Confessed in Writing” Twitter post from PocketOS founder: https://x.com/lifeof_jer/status/2048103471019434248
Oct 2025 AWS Outage https://www.theguardian.com/technology/2025/oct/24/amazon-reveals-cause-of-aws-outage
Dec 2025 Cloudflare Outage https://www.theguardian.com/technology/2025/dec/05/another-cloudflare-outage-takes-down-websites-linkedin-zoom
July 2024 CrowdStrike Outage https://www.ibm.com/think/news/recent-crowdstrike-outage-what-you-should-know
March 2026 Stryker Cyber Attack https://www.stryker.com/us/en/about/news/2026/a-message-to-our-customers-03-2026.html
https://aws.amazon.com/
https://cloud.google.com/
https://azure.microsoft.com/en-us
https://fly.io/docs/elixir/
CRDTs!! https://smartlogic.io/podcast/elixir-wizards/s13-e03-local-first-liveview-svelte-pwa/
https://antithesis.com/docs/resources/property_based_testing/
https://hex.pm/packages/proper