The Swyx Mixtape

[Weekend Drop] Temporal — the iPhone of System Design



This is the audio version of the essay I published on Monday.

I'm excited to finally share why I've joined Temporal.io as Head of Developer Experience. It's taken me months to precisely pin down why I have been obsessed with Workflows in general and Temporal in particular.


It boils down to 3 core opinions: Orchestration, Event Sourcing, and Workflows-as-Code.


Target audience: product-focused developers who have some understanding of system design, but limited distributed systems experience and no familiarity with workflow engines


30 Second Pitch


The most valuable, mission-critical workloads in any software company are long-running and tie together multiple services.

  • Because this work relies on unreliable networks and systems:
    • You want to standardize timeouts and retries.
    • You want to offer "reliability on rails" to every team.
  • Because this work is so important:
    • You must never drop any work.
    • You must log all progress.
  • Because this work is complex:
    • You want to easily model dynamic asynchronous logic...
    • ...and reuse, test, version and migrate it.


Finally, you want all this to scale: the same programming model going from small use cases to millions of users without re-platforming. Temporal is the best way to do all this — by writing idiomatic code known as "workflows".
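
To make that concrete, here is a minimal sketch of what such a workflow could look like with Temporal's TypeScript SDK. The ./activities module and the callSystemA/B/C activity names are hypothetical stand-ins.

// workflow.ts: a minimal sketch of a Temporal Workflow in the TypeScript SDK.
// The ./activities module and the callSystemA/B/C names are hypothetical.
import { proxyActivities } from '@temporalio/workflow'
import type * as activities from './activities'

// Every Activity call gets a timeout and retry policy declared once, up front.
const { callSystemA, callSystemB, callSystemC } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
  retry: { maximumAttempts: 5 },
})

export async function orderWorkflow(orderId: string): Promise<void> {
  // Plain sequential code; Temporal persists progress after each step,
  // so a crash partway through resumes here instead of starting over.
  await callSystemA(orderId)
  await callSystemB(orderId)
  await callSystemC(orderId)
}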


Requirement 1: Orchestration


Suppose you are executing some business logic that calls System A, then System B, and then System C. Easy enough, right?
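
In code, the happy path is just three calls in a row. Here is a sketch in TypeScript, with hypothetical clients standing in for the three systems:

// A sketch of the happy path: call A, then B, then C in sequence.
// The three clients are hypothetical stand-ins for the real systems.
type Client = { process: (input: unknown) => Promise<unknown> }
declare const systemA: Client, systemB: Client, systemC: Client

async function handleRequest(input: unknown): Promise<void> {
  const resultA = await systemA.process(input)
  const resultB = await systemB.process(resultA)
  await systemC.process(resultB)
}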


But:

  • System B has rate limiting, so sometimes it fails right away and you're just expected to try again some time later.
  • System C goes down a lot — and when it does, it doesn't actively report a failure. Your program is perfectly happy to wait an infinite amount of time and never retry C.


You could deal with B by just looping until you get a successful response, but that ties up compute resources. The better approach is probably to persist the incomplete task in a database and set up a cron job to periodically retry the call.
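
A minimal sketch of that approach, assuming a hypothetical tasks table and a cron entry point:

// Sketch: persist the incomplete call, then retry it from a cron job.
// The taskDb interface and callSystemB function are hypothetical.
declare const taskDb: {
  loadPendingTasks: (system: string) => Promise<{ id: string; payload: unknown }[]>
  markDone: (id: string) => Promise<void>
}
declare function callSystemB(payload: unknown): Promise<void>

// Invoked every few minutes by a cron job.
export async function retryPendingBCalls(): Promise<void> {
  for (const task of await taskDb.loadPendingTasks('B')) {
    try {
      await callSystemB(task.payload)
      await taskDb.markDone(task.id)
    } catch {
      // Still rate limited; leave the task in place for the next run.
    }
  }
}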


Dealing with C is similar, but with a twist. You still need the same retry machinery you built for B, but you also need another (shorter-lived, independent) scheduler to place a reasonable timeout on C's execution time, since C doesn't report failures when it goes down.
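
A sketch of that extra piece: wrap C's call in a deadline using a plain Promise race. The 30-second value is an arbitrary assumption.

// Sketch: give C's call a deadline, since C never reports its own failures.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms),
    ),
  ])
}

declare function callSystemC(payload: unknown): Promise<void>

async function callCWithDeadline(payload: unknown): Promise<void> {
  // If C hangs, this rejects after 30 seconds and the same
  // persist-and-retry machinery used for B can take over.
  await withTimeout(callSystemC(payload), 30_000)
}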


Do this often enough and you soon realize that timeouts and retries are standard production-grade requirements whenever you cross a system boundary, whether you are calling an external API or just a different service owned by your own team.


Instead of writing custom timeout and retry code for every single service every time, is there a better way? Sure, we could centralize it!
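
One way to centralize it: a single shared helper that applies the same timeout and retry policy to every cross-system call. A sketch, reusing the withTimeout helper from above; the policy numbers are assumptions.

// Sketch: one shared helper so every call site gets the same policy,
// instead of each service reinventing its own timeouts and retries.
interface CallPolicy {
  timeoutMs: number
  maxAttempts: number
  backoffMs: number
}

const defaultPolicy: CallPolicy = { timeoutMs: 30_000, maxAttempts: 5, backoffMs: 1_000 }

async function callWithPolicy<T>(
  fn: () => Promise<T>,
  policy: CallPolicy = defaultPolicy,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await withTimeout(fn(), policy.timeoutMs)
    } catch (err) {
      lastError = err
      // Linear backoff between attempts; real systems usually use exponential backoff.
      await new Promise((resolve) => setTimeout(resolve, policy.backoffMs * attempt))
    }
  }
  throw lastError
}

// Usage: await callWithPolicy(() => systemB.process(payload))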


We have just rediscovered the need for orchestration over choreography. There are various names for the combined A-B-C system orchestration we are doing — depending on who you ask, this is called a Job Runner, a Pipeline, or a Workflow.


Honestly, what interests me (more than the deduplication of code) is the deduplication of infrastructure. The maintainer of each system no longer has to provision the additional infrastructure needed for this stateful, potentially long-running work. This drastically simplifies maintenance — you can shrink your systems down to as small as a single serverless function — and makes it easier to spin up new ones, with the retry and timeout standards you now expect from every production-grade service. Workflow orchestrators are "reliability on rails".


But there's a risk of course — you've just added a centralized dependency to every part of your distributed system. What if it ALSO goes down?


Requirement 2: Event Sourcing


The work that your code does is mission critical. What does that really mean?

  • We cannot drop anything. All requests to start work must either result in error or success - no "it was supposed to be running but got lost somewhere" mismatch in expectations.
  • During execution, we must be able to resume from any downtime. If any part of the system goes down, we must be able to pick up where we left off.
  • We need the entire history of what happened when, for legal compliance, in case something went wrong, or if we want to analyze metadata across runs.


There are two ways to track all this state. The usual way starts with a simple task queue, and then adds logging:

(async function workLoop() {
  const nextTask = taskQueue.pop()
  await logEvent('starting task:', nextTask.ID)
  try {
    await doWork(nextTask) // this could fail!
  } catch (err) {
    await logEvent('reverting task:', nextTask.ID, err)
    taskQueue.push(nextTask) // put the task back so it gets retried
  }
  await logEvent('completed task:', nextTask.ID)
  setTimeout(workLoop, 0) // schedule the next loop iteration
})()


But logs-as-afterthought has a bunch of problems.

  • The logging is not tightly paired with the queue updates. If it is possible for one to succeed but the other to fail, you either have unreliable logs or dropped work — unacceptable for mission-critical work. This could also happen if the central work loop itself goes down while tasks are executing.
  • At the local level, you can fix this with batch transactions (sketched below). Between systems, you can use two-phase commits. But this is a messy business and it further bloats your business code with a ton of boilerplate — IF (a big if) you have the discipline to instrument every single state change in your code.
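
For the local case, here is a sketch of what the batch-transaction fix looks like; the store API is hypothetical.

// Sketch: make the queue update and the log entry succeed or fail together
// by putting them in a single database transaction. The store API is hypothetical.
interface Tx {
  popTask: () => Promise<{ id: string; payload: unknown }>
  appendLog: (event: string, taskId: string) => Promise<void>
}
declare const store: {
  transaction: (fn: (tx: Tx) => Promise<void>) => Promise<void>
}

async function claimNextTask(): Promise<void> {
  await store.transaction(async (tx) => {
    const task = await tx.popTask()
    await tx.appendLog('starting task', task.id)
    // If either write fails, both roll back: no dropped work, no missing log entry.
  })
}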


The alternative to logs-as-afterthought is logs-as-tr...
