Code[ish]

I Was There: Stories of Production Incidents II


Listen Later

Corey Martin leads the discussion with two developers about production incidents they were personally involved in. Their goal is to inform listeners on how they discovered these issues, how they resolved them, and what they learned along the way.

Ifat Ribon is a Senior Developer at LaunchPad Lab, a web and mobile application development agency headquartered in Chicago. For one of their clients, they developed an application to assist with the scheduling of janitorial services. It was built with a fairly simple Ruby on Rails backend, leveraging Sidekiq to process background jobs. As part of its feature set, the app would send text messages to let employees know their schedule for the week; these schedules were assembled by querying the database several times, fetching frequencies and availabilities of workers. Unfortunately, a client noticed a discrepancy between how many notices were being sent out, versus how many jobs they knew they had: of the 400 jobs total, only 150 had notifications. It turned out that all of the available database connections were being exhausted--but that was only half of the issue. Sidekiq was attempting to process far too many jobs at once, and each of these jobs were responsible for connecting to the database, exhausting the available pool. The solution Ifat settled on was to reduce the number of parallel jobs processed while increasing the number of connections to the database. From this experience, she also learned the importance of understanding how all these different systems interconnect.

Christopher Ostrowski is Chief Technology Officer at Dutchie an e-commerce platform for the cannabis industry. One Christmas Eve, while celebrating with his family, Chris began receiving pager notifications warning him about some sluggish API response times. Since it didn't really have any significant end user impact, he ignored it and went back to the festivities. As the night went on, the warnings became significant alerts, and he pulled together a response team with colleagues to figure out what was going on. By all accounts, the website was functioning, but curiously, the rate of orders began to drop off. Through some investigation, they realized what was going on. Customers' order numbers were assigned a random, non-sequential six digit numbers. Dutchie was about to track its one-millionth order, a huge milestone. Before any orders are created, though, the app generates a six digit number, and tries to create one that doesn't already exist. The database was constantly being hit, as less and less six digit numbers were available for use. The solution ended up being rather simple: the order number limit was increased to nine digits. Although they had monitoring in place, the data was set up as an aggregate reporting; even though the "create order" API was slow, all of the others were low, keeping the average within tolerable levels. Christopher's solution to avoid this in the future was to set up more groupings for "essential" API endpoints, to alert the team sooner for latency issues on core business functionality.

Links from this episode
  • LaunchPad Lab is a web and mobile application development agency
  • Dutchie is an e-commerce platform for the cannabis industry
...more
View all episodesView all episodes
Download on the App Store

Code[ish]By Heroku from Salesforce

  • 4.7
  • 4.7
  • 4.7
  • 4.7
  • 4.7

4.7

18 ratings


More shows like Code[ish]

View all
Motley Fool Money by The Motley Fool

Motley Fool Money

3,212 Listeners

The Changelog: Software Development, Open Source by Changelog Media

The Changelog: Software Development, Open Source

288 Listeners

Up First from NPR by NPR

Up First from NPR

56,692 Listeners

CoRecursive: Coding Stories by Adam Gordon Bell - Software Developer

CoRecursive: Coding Stories

189 Listeners

Elis James and John Robins by BBC Radio 5 Live

Elis James and John Robins

332 Listeners

Tech Lead Journal by Henry Suryawirawan

Tech Lead Journal

13 Listeners