This episode of the TechOps series digs into high availability troubleshooting. Not just high availability, and not just troubleshooting, but what it actually takes to manage, maintain, and fix HA systems. It is part of a longer discussion we've been having, and there are some really interesting ideas in the middle of it that I hope will shape your thinking as you build high availability systems, along with the diagnostics and troubleshooting that people working in very complex high availability environments need.
Transcript: https://otter.ai/u/wM__4w1YIzZnhVdgLuXLsDDu0Ng?utm_source=copy_url
References:
https://status.openai.com/incidents/ctrsv3lwd797