Serverless Chats

Episode #51: Globally Resilient Architectures with Adrian Hornsby


About Adrian Hornsby:

Adrian Hornsby is a Technical Evangelist working with AWS and passionate about everything cloud. Adrian has more than 15 years of experience in the IT industry, having worked as a software and systems engineer, as a backend, web, and mobile developer, and as part of DevOps teams, where his focus has been on cloud infrastructure and site reliability: writing application software, deploying servers, and managing large-scale architectures. Today, Adrian tends to get super excited by AI and IoT, and especially by the convergence of the two technologies.

  • Twitter: twitter.com/adhorn
  • Medium: medium.com/@adhorn
  • Dev.to: dev.to/adhorn

Watch this episode on YouTube: https://youtu.be/6o2owe2VHMo


Transcript:

Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm speaking with Adrian Hornsby. Hey, Adrian, thanks for joining me.

Adrian: Hey, Jeremy, how are you?

Jeremy: So you are a principal developer advocate for architecture at AWS. So why don't you tell the listeners a little bit about your background and what it is you do at AWS?

Adrian: Okay, cool. So first of all, thanks for having me on your show. I'm a huge fan of your show. As for my background, it's a mix of industry and research. Actually, I started my career at the university doing some research, and then moved to Nokia research and eventually some startups, always around distributed systems and real-time networks and things like this. And the particular thing is that much of the work I've done has always been on AWS, since the very beginning. So it kind of felt very natural eventually to join AWS, which was about four years and a few months ago. I joined as a solutions architect, and then quickly moved into an evangelist role, mostly doing architectures and resiliency and a lot of breaking things, chaos engineering types of things.

Jeremy: Awesome. Well, speaking of resilient architectures, that's what I wanted to speak with you about today, because you have on your Medium blog, which is awesome, by the way. I mean, I go there-

Adrian: Thank you.

Jeremy: Every time I go and read something there, you think you know it all, and then you read something by Adrian and you learn something new, which is absolutely amazing. But I want to talk to you about this, because it's something I think ties into serverless pretty well: this idea that, especially as serverless developers, we take for granted that there are a bunch of things happening for us behind the scenes.

And so we get a lot of this infrastructure management out of the box, we get some failover out of the box, we get some of these things. But that really only scratches the surface, and there's so much further we can go to build truly resilient applications. And you have an excellent series on your blog called the Resilient Architecture Collection, and I'd love to go through it, because I think this is the kind of thing where you start thinking about global distribution, you start thinking about latency. You and I have been having a lot of latency issues trying to record this episode, because you're all the way in Helsinki and I'm over in the United States. These are things to start thinking about.

So I want to jump in first with this idea of embracing failure at scale. And I love this idea, because when we build small systems, we think about reliability, right? We try to get as many nines as we possibly can. But when you get to the level of global distribution, distributed systems that are sending messages between components, that are sending messages across the Atlantic Ocean or the Pacific Ocean, with data going all over the place, this idea of failure, or at least partial failure, has become the new normal.

Adrian: Yeah, I think things have changed a lot in the last few years. I mean, before, you had the monolithic application, and you were trying to make sure your monolithic application was always up and running, right? I think there was even some competition around uptimes; it was very popular back then to look at the uptimes of servers and say, "My server's been up for 16 years, wow, awesome." But now we've slowly moved away from monoliths to microservice architectures, and especially, I think, as we move to the cloud and use more third-party services, systems become naturally more distributed, and they go over the internet, which is everything but a reliable means of communication.

So you have network latency, you have network failures. There are a lot more things that can go wrong. And I think understanding and accepting that anything, at any time, can fail is actually a very important thing, because it means that you accept failure as a first-class citizen for your application. And then you need to write code and design applications so that at any moment in time there can be failures, and that's called partial failure mode, as you said. It's a very different concept from what it used to be back in the day, and it means that you need to design your application with different characteristics and different behavior.
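
For a concrete sense of what designing for partial failure mode can look like, here is a minimal Python sketch, not from the episode: a call to a downstream service that assumes the dependency can fail or hang at any time, so it bounds the wait with a timeout, retries with backoff, and falls back to a degraded response instead of crashing. The endpoint, the fetch_recommendations function, and the fallback list are hypothetical, and the requests library is assumed to be available.

    import random
    import time

    import requests  # assumed available; any HTTP client with timeouts works


    def fetch_recommendations(user_id: str) -> list[str]:
        """Return personalized recommendations, or a static list on partial failure."""
        url = f"https://recommendations.example.com/users/{user_id}"  # hypothetical endpoint
        for attempt in range(3):
            try:
                # Never wait forever: a short timeout turns a hung dependency
                # into a failure we can actually handle.
                response = requests.get(url, timeout=1.0)
                response.raise_for_status()
                return response.json()["items"]
            except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
                # Exponential backoff with a little jitter before retrying.
                time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
        # Degraded mode: the page still renders, just without personalization.
        return ["bestseller-1", "bestseller-2", "bestseller-3"]

The point is that the caller keeps working with reduced functionality when the dependency misbehaves, rather than turning a partial failure into a total one.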

Jeremy: Right. And so if you're designing your system with these different characteristics, and you're thinking ahead to this idea of resiliency, and again, you have a whole bunch of stuff that you do on chaos engineering as well, which is this idea of injecting failure into the system to see what happens when something breaks. But that is quite an investment, and not only an investment in learning, right? You have to learn all these different parts of the cloud, all these failover systems, and what's available from that standpoint, but it's also an investment in terms of building your application out that way.
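
Since chaos engineering comes up here, the following is an illustrative toy in Python, not Adrian's tooling or any specific chaos library: a decorator that, with some probability, adds latency or raises an error inside a function, which is the basic fault-injection idea behind these experiments. The inject_failure decorator and get_order_total function are hypothetical.

    import functools
    import random
    import time


    def inject_failure(delay_seconds: float = 0.5, error_rate: float = 0.1, enabled: bool = True):
        """Wrap a function and randomly delay it or make it fail."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if enabled:
                    time.sleep(delay_seconds)  # simulate added network latency
                    if random.random() < error_rate:
                        raise RuntimeError("chaos: injected failure")
                return func(*args, **kwargs)
            return wrapper
        return decorator


    @inject_failure(delay_seconds=0.2, error_rate=0.2)
    def get_order_total(order_id: str) -> float:
        # Hypothetical business logic under test.
        return 42.0

In a real experiment you would scope this to one dependency at a time, run it in a controlled environment, and watch how your timeouts, retries, and alarms actually behave.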

So you mentioned in the article this idea of the investment of building in this resiliency versus what the lost revenue might be if something fails. So if your billing service goes down, or your payment service goes down, and you can't charge credit cards anymore, is that just the end of it? Like you just say, "Hey, we can't charge your credit card, so our site's down." Versus building something that says, "Well, we can't charge your credit card right now, but we can take your credit card number, and we can calculate the order total and those sorts of things." So what is that trade-off that companies should be looking for, in terms of, as you put it, lost revenue versus the investment in building these resilient architectures?
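
To make the billing example concrete, here is a minimal Python sketch of the "take the order now, charge the card later" idea, not from the episode: if the payment provider is unavailable, the charge request is persisted to a queue for a worker to retry later, so the sale is not lost. The queue, the charge_card client, and the exception type are hypothetical stand-ins; in practice the queue would be something durable such as Amazon SQS.

    import json
    import queue
    import uuid

    pending_charges = queue.Queue()  # stand-in for a durable queue such as SQS


    class PaymentServiceUnavailable(Exception):
        """Raised by the (hypothetical) payment client when the provider is down."""


    def charge_card(card_token: str, amount: float) -> None:
        # Placeholder for the real payment provider call.
        raise PaymentServiceUnavailable("payment provider is down")


    def place_order(cart_total: float, card_token: str) -> dict:
        """Always accept the order; charge immediately if possible, otherwise defer."""
        order_id = str(uuid.uuid4())
        try:
            charge_card(card_token, cart_total)
            status = "charged"
        except PaymentServiceUnavailable:
            # Degraded mode: don't lose the sale, just defer the charge and
            # let a separate worker drain the queue once payments recover.
            pending_charges.put(json.dumps(
                {"order_id": order_id, "card_token": card_token, "amount": cart_total}
            ))
            status = "charge_pending"
        return {"order_id": order_id, "status": status}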

Adrian: Yeah, it's a very good question. I think, first and foremost, it always starts from the business side. It's about understanding what the requirements are in terms of availability, because the more nines of availability you want, as a matter of fact, the more work you're going to have to put in, and the more resources you're going to have to use, and those resources are money, right?

And especially, I think, the work around availability and reliability is not really linear. At the beginning you have a lot of gain with small work, but the more nines you want, the bigger the investment you have to make to gain each additional nine, right?
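
For a rough sense of the scale involved, here is a quick back-of-the-envelope in Python, not from the episode: each extra nine cuts the allowed downtime per year by a factor of ten, which is part of why the effort is so non-linear.

    # Allowed downtime per year for each level of availability.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.99, 0.999, 0.9999, 0.99999):
        downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.3%} availability -> about {downtime_minutes:,.1f} minutes of downtime per year")

Going from two nines to five nines means shrinking an annual downtime budget of roughly three and a half days down to about five minutes.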

So it's increasingly hard to reach more nines. So you really have to think about what it is that you want to achieve as a busines...
