November 11, 2021

Sidekiq Race Conditions [Chris Toomey]

11 minutes

Listen to The Bikeshed (24mins in) https://www.bikeshed.fm/313

Transcript

So we had a bug that occurred in the application where something was supposed to have happened. And then there was an email that needed to go out to tell the user that this thing had happened. And the bug popped up within AppSignal and said something was nil that shouldn't have been nil.

Particularly, we're using a gem called Time For a Boolean, which is by Caleb Hearth. And he's a former thoughtboter and maintains this wonderful gem that instead of having a Boolean for like, is this thing approved, or is it paid? Or is it processed? You use a timestamp. And then this gem gives you nice Boolean-like methods on top of that timestamp. Because it turns out, very often just having the Boolean of like, this was paid, it turns out you really want to know when it was paid. That would be a really useful piece of information. And so, while you're still in Postgres land, it's nice to be able to reach for this and have the affordances of the Boolean-like interface but also have the timestamp where available.

So anyway, the email was trying to process but that timestamp...let's pretend that it was paid as the one that matters here so paid at was nil, which was very concerning. Because this was the email that's like, hey, that thing was processed. Or let's say it was processed, actually, because that's closer to what it was. Hey, this thing was processed, and here's an email notification to tell you that. But the process timestamp was nil. I was like, oh no. Oh no. And so when I saw this pop up, I was like, this is very bad. Everything is very bad. Oh goodness.

Turns out what had happened was...because I very quickly chased after this, looked in the background job queue, looked in Sidekiq's UI, and the job was gone. So it had been processed. I was like, wait a minute, how? How did this fix itself? Like, that's not the kind of bug that resolves itself, except, in this case, it was. This was an interaction that I'd run into many times before. Sidekiq was immediately processing the job. But the job was being enqueued from within the context of a database transaction. And the database transaction had not been committed yet. But Sidekiq was already off to the races trying to process.

So the record that was being worked on, the database record, had local changes within the context of that transaction, but that hadn't been committed. Sidekiq then reads that record from the database, but it's now out of sync because that tiny bit of Sidekiq is apparently very fast off to the races immediately. And so there's just this tiny little bit of time that can occur. And this is also a fun one where this isn't going to happen every time. It's only going to happen sometimes. Like, if the queue had a couple of other things in it, Sidekiq probably would have not gotten to this until the database transaction had fully closed.

So the failure mode here is super annoying. But the solution is pretty easy. You just have to make sure that you enqueue outside of the database transaction. But I'm going to be honest, that's difficult to always do right.

STEPH: That's a gnarly bug or something to investigate that I don't think I have run into before. Could you talk a little bit more about enqueueing the job outside the database transaction?

CHRIS: Sure. And I think I've talked about this on a previous episode a while back because I have run into this one a few times. But I think it is sufficiently rare; like, you need almost a perfect storm because the database transaction is going to close very quickly. Sidekiq needs to be all that much more speedy in picking up the job in order for this to happen.

But basically, the idea is within some processing logic that we have in our system; we find a record, we do some work. And then we need to update that record to assign this timestamp or whatever it is. And then we also want to inform the user, so we're going to enqueue a job to send the email notification. But for all of the database work, we are wrapping it in a transaction because we want it to either succeed or fail atomically. So there are three different records that we need to update. We want all of them to be updated or none of them to be updated. So, therefore, we wrap it in a transaction.

And the way we had written, this was to also enqueue the job from within the transaction. That wasn't something we were actively intentionally doing because those are different systems. It doesn't really mean anything. But we were still within the block of ApplicationRecord.transaction do. We're now inside of that block. We're doing all of the record updates. And then the last piece of work that we want to think about is enqueueing the job to send the email.

The problem is if we're still within that database transaction if it's yet to be committed, then when Sidekiq picks up that job to run it, it will see the prior state of the world. And it's only if the Sidekiq job waits a little bit that then the database transaction will have been committed. The record is now updated and available to be read by Sidekiq in the correct updated state.

And so there's this tiny little bit of inconsistency that can happen. It's basically because Sidekiq is going out to Redis, which is a distinct system. It doesn't have any knowledge of the database transaction at play. That's why I sometimes consider using a Postgres-backed background job system because then actually the job can be as part of the database transaction.

STEPH: Cool. That's helpful. That makes a lot of sense the way you explained the whole you're actually enqueueing the job from inside that transaction. I'm curious, that prompts another question. In the case where you mentioned you're using a transaction because you want to make sure that if something fails to update so, everything gets updated together, in the event that something does fail to update because you were previously enqueueing that job from the transaction, does that mean that the update could have failed but that email would still have gone out?

CHRIS: That does not. And the reason for that is because we're within dry-monad world. And so dry-monad will implicitly capture the ActiveRecord rollback, which I think is an exception that gets raised or somehow...But basically, if that database transaction fails for any reason and ends up getting rolled back, then dry-monads will not continue processing through the rest of the sequential operation. And so, therefore, even if we move the enqueuing of the email outside of the database transaction, the sequential nature of that processing and the dry-monad stuff that we have in play will handle that. And I think that would more generally be true because I think Rails raises an exception on rollback. Not certain there. But I know in our case, we're fine on that. And we have actually explicitly checked7 for that sort of thing.

STEPH: So I meant a slightly different question because that makes sense to me everything that you just said where if it's outside of the transaction, then that sequential order won't fire because of that ActiveRecord migration error. But when you have the enqueuing inside of the transaction because then that's going to be inside of the sequential order, maybe before the rollback error gets raised. Does that make sense?

CHRIS: Yes. I think what you're asking is basically like, do we make sure to not send the job if the rest of the stuff didn't succeed?

STEPH: I'm just wondering from a transaction perspective, actually. If you have a...

...more

View all episodes

By Swyx

11 ratings