Real World DevOps

Understanding Observability (and Monitoring) with Christine Yen



About Christine Yen

Christine delights in being a developer in a room full of ops folks. As a cofounder of Honeycomb.io, a tool for engineering teams to understand their production systems, she cares deeply about bridging the gap between devs and ops with technological and cultural improvements. Before Honeycomb, she built out an analytics product at Parse (bought by Facebook) and wrote software at a few now-defunct startups.

Links Referenced: 

  • https://www.honeycomb.io
  • https://www.heavybit.com/library/podcasts/o11ycast/

Transcript

Mike Julian: This is the Real World DevOps podcast and I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, to the authors of great books, to fantastic public speakers, I want to introduce you to the most interesting people I can find.


Mike Julian: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time-series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to InfluxData for helping to make this podcast possible.


Mike Julian: Hi folks. Welcome to another episode of the Real World DevOps podcast. I'm your host, Mike Julian. This week's episode is a conversation I've been wanting to have for quite some time: I'm chatting with Christine Yen, CEO and co-founder of Honeycomb, and previously an engineer at Parse. Welcome to the show.


Christine Yen: Hello. Thanks for having me.


Mike Julian: I want to start this conversation off with what might sound like a really foundational question: what are we talking about when we're all talking about observability? What do we mean?


Christine Yen: When I think about observability, and when I talk about observability, I like to frame it in my head as the ability to ask questions of our systems. And the reason we've got that word, rather than just saying, "Okay, well, monitoring is asking questions about our system," is that we really feel like observability is about being a little bit more flexible and ad hoc about asking those questions. Monitoring brings to mind defining very strict parameters within which to watch your systems, or thresholds, or putting your systems in a jail cell and monitoring their behavior. Whereas we're saying, "Okay, our systems are going to do things, but those things aren't necessarily bad. Let's be able to understand what's happening and why." Let's observe the data that your systems are putting out, and think about how asking more free-form questions might change how you think about your systems and what to do with that data.


Mike Julian: When you say asking questions what do you mean?


Christine Yen: When I say asking questions of my system, I mean being able to proactively investigate and dig deeper into data, rather than passively sitting back and looking at the answers I've curated in the past. To illustrate this, and to compare observability a little more directly with monitoring, especially traditional monitoring: when we curate these dashboards, what we're essentially doing is looking at sets of answers to questions that we posed when we pulled those dashboards together. So if a dashboard has existed for six months, the graphs that I'm looking at to answer a question like "What's going on in my system?" are answers to the questions I had in mind six months ago, when I tried to figure out what information I would need to tell whether my system was healthy or not. In contrast, an observability tool should let you ask, "Is my system healthy? What does healthy mean today? What do I care about today?" And if I see some sort of anomaly in a graph, or I see something odd, I should be able to keep investigating that thread without losing track of where I am, or again relying on answers from past questions.


Mike Julian: So does that mean that curating these dashboards to begin with is just the wrong way to go? Like is it just a bad idea?


Christine Yen: I think dashboards can be useful, but I think that overuse of them has led to a lot of really bad habits in our industry.


Mike Julian: Yeah, tell me more about the bad habits there.


Christine Yen: An analogy I like to use is going to the doctor when you're not feeling well. A doctor looks at you and asks, "What doesn't feel well? Oh, it's your head. What kind of pain are you feeling in your head? Is it acute? Is it just kind of a dull ache? Oh, it's acute. Where in your head?" They're asking progressively more detailed questions based on what they learned at each step along the way. Honestly, this kind of parallels how humans naturally solve problems. In contrast, the bad habits that dashboards lead us to build would be the equivalent of a doctor saying, "Oh, well, based on the charts from the last three times you visited, you broke your ankle and you skinned your knee." Pretend you go to the doctor with a skinned knee: "Oh okay, you broke your ankle last time, did you break your ankle again? No. Okay, did you ... How's your knee doing?"


With dashboards, we have built up this belief that the answers to past questions we've asked are going to continue to be relevant today. And there's no guarantee that they are. Especially for engineering teams that are staying on top of incidents, responding, and fixing the things they've found along the way, you're going to continue to run into new problems, new kinds of strange interactions between routine components. And you're going to need to be able to ask new questions about what your systems are doing today.


Mike Julian: It seems like with that dashboard problem, we have the same issue with alerting. I've started calling this a kind of reflexive alerting strategy, where it's like, "Oh God, we just had this problem. Well, we better add a new alert so we catch it next time it happens." Well, how many times is that new alert going to fire? Probably never. You're probably never going to see that issue again. Dashboards are the same way. What you're describing, God, I've seen this 100 billion times: someone curates a dashboard, and then when an alert fires it's, "Okay, the first thing to do now that we have this alert is go look at the dashboards and see what went wrong." And I'm like, "Well, no, the graphs look fine." So no problem, but clearly the site's down.


Christine Yen: Yeah, there's a term that we've been playing with: dashboard blindness. If it doesn't show up on a dashboard, it clearly hasn't happened, or it just can't exist, because people start to feel like, "Okay, we have so many dashboards...
