Welcome to the 4th episode of Architecture Deep Dive with Oleksii Petrov!
In today’s podcast, our guest Andriy Lupa, Tech Lead at Day.io, shares how his team rebuilt a high-load, event-driven system that serves the Brazilian market and processes millions of punch events daily. We dive into the legacy pain points, the architectural shift to Kafka and Temporal, and how signals and workflow design changed their scaling strategy. Andriy walks us through the journey from 100% CPU and cascading failures to 5–7x scalability, including the move from Postgres to Cassandra and the real trade-offs behind it. We also discuss durable timers, safe cron jobs, migration with feature flags, versioning challenges, and when Temporal is and isn’t the right tool.
Link to Andriy’s talk: "Scaling in space and time with Temporal" 🔗 https://youtu.be/jxHVcGbwZWM
What you should subscribe to:
– More interesting content for developers: https://fwdays.com/en/events
– Fwdays Twitter: https://twitter.com/fwdays
– Oleksii Petrov's Telegram channel: https://t.me/OleksiiTheArchitect
– Oleksii Petrov's LinkedIn: https://www.linkedin.com/in/alexhelkar/
– Andriy Lupa's LinkedIn: https://www.linkedin.com/in/andriy-lupa?utm_source=share_via&utm_content=profile&utm_medium=member_android
Timestamps:
00:00 - Intro
01:43 - What is Day.io and what problems does it solve?
05:11 - Legacy architecture: where did the system start, and what pain points emerged over time?
11:06 - Why Temporal and what alternatives were considered?
16:48 - Real-world workflows: what business processes are modeled as Temporal workflows?
21:07 - Observability and debugging: how Temporal UI and metrics help in production
23:37 - System architecture overview: how events flow from microservices into workflows
25:39 - Signals vs activities: how the workflow design evolved in practice
29:37 - V1: 100% CPU and the domino effect
31:30 - V2: Signals as a performance breakthrough
34:32 - Why wasn’t tuning alone enough?
36:13 - V3: Merging activities for 5–7x scalability
37:57 - Temporal at scale: reducing pressure on the cluster
39:45 - Postgres vs Cassandra for Temporal persistence
42:50 - Lessons learned from running Cassandra in production
45:17 - Reliability under load and avoiding risky shortcuts
46:30 - Real-world Temporal use cases beyond metrics
51:14 - Using signals for async workflows and external integrations
53:21 - Migrating customers without a big-bang release
01:00:01 - When is Temporal the right tool?
01:04:51 - What does the team wish they had known before starting?
01:06:51 - Final advice and recommendations
01:11:22 - Don’t forget to subscribe and like!