The Swyx Mixtape

[Weekend Drop] Temporal — the iPhone of System Design



This is the audio version of the essay I published on Monday.

I'm excited to finally share why I've joined Temporal.io as Head of Developer Experience. It's taken me months to precisely pin down why I have been obsessed with Workflows in general and Temporal in particular.


It boils down to 3 core opinions: Orchestration, Event Sourcing, and Workflows-as-Code.


Target audience: product-focused developers who have some understanding of system design, but limited distributed systems experience and no familiarity with workflow engines


30 Second Pitch


The most valuable, mission-critical workloads in any software company are long-running and tie together multiple services.

  • Because this work relies on unreliable networks and systems:
    • You want to standardize timeouts and retries.
    • You want to offer "reliability on rails" to every team.
  • Because this work is so important:
    • You must never drop any work.
    • You must log all progress.
  • Because this work is complex:
    • You want to easily model dynamic asynchronous logic...
    • ...and reuse, test, version and migrate it.


Finally, you want all this to scale: the same programming model going from small use cases to millions of users without re-platforming. Temporal is the best way to do all this — by writing idiomatic code known as "workflows".
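
To make that concrete, here is a minimal sketch of what such a workflow could look like with Temporal's TypeScript SDK. The ./activities module and the callSystemA/B/C activity names are hypothetical stand-ins.

// workflow.ts: a minimal sketch of a Temporal Workflow in the TypeScript SDK.
// The ./activities module and the callSystemA/B/C names are hypothetical.
import { proxyActivities } from '@temporalio/workflow'
import type * as activities from './activities'

// Every Activity call gets a timeout and retry policy declared once, up front.
const { callSystemA, callSystemB, callSystemC } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
  retry: { maximumAttempts: 5 },
})

export async function orderWorkflow(orderId: string): Promise<void> {
  // Plain sequential code; Temporal persists progress after each step,
  // so a crash partway through resumes here instead of starting over.
  await callSystemA(orderId)
  await callSystemB(orderId)
  await callSystemC(orderId)
}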


Requirement 1: Orchestration


Suppose you are executing some business logic that calls System A, then System B, and then System C. Easy enough, right?
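
In code, the happy path is just three calls in a row. Here is a sketch in TypeScript, with hypothetical clients standing in for the three systems:

// A sketch of the happy path: call A, then B, then C in sequence.
// The three clients are hypothetical stand-ins for the real systems.
type Client = { process: (input: unknown) => Promise<unknown> }
declare const systemA: Client, systemB: Client, systemC: Client

async function handleRequest(input: unknown): Promise<void> {
  const resultA = await systemA.process(input)
  const resultB = await systemB.process(resultA)
  await systemC.process(resultB)
}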


But:

  • System B has rate limiting, so sometimes it fails right away and you're just expected to try again some time later.
  • System C goes down a lot — and when it does, it doesn't actively report a failure. Your program is perfectly happy to wait an infinite amount of time and never retry C.


You could deal with B by just looping until you get a successful response, but that ties up compute resources. The better approach is probably to persist the incomplete task in a database and set up a cron job to periodically retry the call.
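
A minimal sketch of that approach, assuming a hypothetical tasks table and a cron entry point:

// Sketch: persist the incomplete call, then retry it from a cron job.
// The taskDb interface and callSystemB function are hypothetical.
declare const taskDb: {
  loadPendingTasks: (system: string) => Promise<{ id: string; payload: unknown }[]>
  markDone: (id: string) => Promise<void>
}
declare function callSystemB(payload: unknown): Promise<void>

// Invoked every few minutes by a cron job.
export async function retryPendingBCalls(): Promise<void> {
  for (const task of await taskDb.loadPendingTasks('B')) {
    try {
      await callSystemB(task.payload)
      await taskDb.markDone(task.id)
    } catch {
      // Still rate limited; leave the task in place for the next run.
    }
  }
}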


Dealing with C is similar, but with a twist. You still need the same retry machinery you built for B, but you also need another (shorter-lived, independent) scheduler to place a reasonable timeout on C's execution time, since C doesn't report failures when it goes down.
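
A sketch of that extra piece: wrap C's call in a deadline using a plain Promise race. The 30-second value is an arbitrary assumption.

// Sketch: give C's call a deadline, since C never reports its own failures.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms),
    ),
  ])
}

declare function callSystemC(payload: unknown): Promise<void>

async function callCWithDeadline(payload: unknown): Promise<void> {
  // If C hangs, this rejects after 30 seconds and the same
  // persist-and-retry machinery used for B can take over.
  await withTimeout(callSystemC(payload), 30_000)
}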


Do this often enough and you soon realize that timeouts and retries are standard production-grade requirements whenever you cross a system boundary, whether you are calling an external API or just a different service owned by your own team.


Instead of writing custom timeout and retry code for every single service every time, is there a better way? Sure, we could centralize it!
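
One way to centralize it: a single shared helper that applies the same timeout and retry policy to every cross-system call. A sketch, reusing the withTimeout helper from above; the policy numbers are assumptions.

// Sketch: one shared helper so every call site gets the same policy,
// instead of each service reinventing its own timeouts and retries.
interface CallPolicy {
  timeoutMs: number
  maxAttempts: number
  backoffMs: number
}

const defaultPolicy: CallPolicy = { timeoutMs: 30_000, maxAttempts: 5, backoffMs: 1_000 }

async function callWithPolicy<T>(
  fn: () => Promise<T>,
  policy: CallPolicy = defaultPolicy,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await withTimeout(fn(), policy.timeoutMs)
    } catch (err) {
      lastError = err
      // Linear backoff between attempts; real systems usually use exponential backoff.
      await new Promise((resolve) => setTimeout(resolve, policy.backoffMs * attempt))
    }
  }
  throw lastError
}

// Usage: await callWithPolicy(() => systemB.process(payload))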


We have just rediscovered the need for orchestration over choreography. There are various names for the combined A-B-C system orchestration we are doing — depending on who you ask, this is called a Job Runner, a Pipeline, or a Workflow.


Honestly, what interests me (more than the deduplication of code) is the deduplication of infrastructure. The maintainer of each system no longer has to provision the additional infrastructure needed for this stateful, potentially long-running work. This drastically simplifies maintenance — you can shrink your systems down to as small as a single serverless function — and makes it easier to spin up new ones, with the retry and timeout standards you now expect from every production-grade service. Workflow orchestrators are "reliability on rails".


But there's a risk of course — you've just added a centralized dependency to every part of your distributed system. What if it ALSO goes down?


Requirement 2: Event Sourcing


The work that your code does is mission critical. What does that really mean?

  • We cannot drop anything. All requests to start work must either result in error or success - no "it was supposed to be running but got lost somewhere" mismatch in expectations.
  • During execution, we must be able to resume from any downtime. If any part of the system goes down, we must be able to pick up where we left off.
  • We need the entire history of what happened when, for legal compliance, in case something went wrong, or if we want to analyze metadata across runs.


There are two ways to track all this state. The usual way starts with a simple task queue, and then adds logging:

(async function workLoop() {
  const nextTask = taskQueue.pop()
  await logEvent('starting task:', nextTask.ID)
  try {
    await doWork(nextTask) // this could fail!
  } catch (err) {
    await logEvent('reverting task:', nextTask.ID, err)
    taskQueue.push(nextTask) // put the task back so it gets retried
  }
  await logEvent('completed task:', nextTask.ID)
  setTimeout(workLoop, 0) // schedule the next loop iteration
})()


But logs-as-afterthought has a bunch of problems.

  • The logging is not tightly paired with the queue updates. If it is possible for one to succeed but the other to fail, you either have unreliable logs or dropped work — unacceptable for mission-critical work. This could also happen if the central work loop itself goes down while tasks are executing.
  • At the local level, you can fix this with batch transactions (sketched below). Between systems, you can use two-phase commits. But this is a messy business and it further bloats your business code with a ton of boilerplate — IF (a big if) you have the discipline to instrument every single state change in your code.
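
For the local case, here is a sketch of what the batch-transaction fix looks like; the store API is hypothetical.

// Sketch: make the queue update and the log entry succeed or fail together
// by putting them in a single database transaction. The store API is hypothetical.
interface Tx {
  popTask: () => Promise<{ id: string; payload: unknown }>
  appendLog: (event: string, taskId: string) => Promise<void>
}
declare const store: {
  transaction: (fn: (tx: Tx) => Promise<void>) => Promise<void>
}

async function claimNextTask(): Promise<void> {
  await store.transaction(async (tx) => {
    const task = await tx.popTask()
    await tx.appendLog('starting task', task.id)
    // If either write fails, both roll back: no dropped work, no missing log entry.
  })
}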


The alternative to logs-as-afterthought is logs-as-tr...
