Serverless Chats

Episode #2: Building Resilient Serverless Applications with Nitzan Shapira


About Nitzan Shapira

Nitzan Shapira is the co-founder and CEO at Epsagon, a distributed tracing product that provides automated monitoring and troubleshooting for modern applications. Nitzan writes for his own blog, as well as the Epsagon blog as a frequent contributor. You can find him speaking and helping out at serverless events across the globe, including Tel Aviv, where he recently organized the city’s June 4th ServerlessDays event. In addition to his contributions to the serverless community, Nitzan has more than 12 years of experience in programming, machine learning, cyber-security, and reverse engineering.

  • Twitter: @nitzanshapira
  • Blog: epsagon.com/blog
  • Epsagon: epsagon.com

Transcript

Jeremy: Hi everyone. I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Nitzan Shapira. Hey, Nitzan. Thanks for joining me.


Nitzan: Thanks for having me.


Jeremy: You are the CEO and co-founder of Epsagon, one of those hot serverless startups out of Israel. Why don’t you tell the listeners a little bit more about yourself and what Epsagon is up to.


Nitzan: Yes, definitely. As you mentioned, I'm one of the founders and the CEO of Epsagon. I'm based out of Israel and San Francisco, currently kind of in between. I'm an engineer, a computer engineer with a background in cyber security and embedded systems. It's more of a low-level background. In recent years also, of course, [I've worked with] the cloud all the way to serverless. Epsagon is a company focused on monitoring and troubleshooting for modern applications. So the entire field of cloud applications that are built with microservices, serverless, managed services, where you don't have access to the host, very distributed: how do you understand what's going on in production? How can you troubleshoot issues as fast as possible? Do it automatically and in a way that is suitable for this kind of modern environment. For example, using agents is something that you cannot do.


Jeremy: I wanted to talk to you about building resilient serverless applications. I think you have the right experience for this with what you do. But now that we're building serverless applications, and we're going beyond traditional applications as well as traditional microservices - if microservices can be considered traditional - you're starting to break things down into multiple functions. You obviously are using a lot of third-party services or managed services from the cloud provider. My question here to get us started is what is the main difference between a traditional application, whether server-based or container-based in microservices, and moving to this serverless environment?


Nitzan: Sure. I think the main difference is that a lot of things are out of your control now, which is a good thing, because this is what you want when you go serverless. But on the other hand, you lose control over some of the things that are going on in your application. So when things don't go well, it can be very difficult to know where they broke. Then if you want to build something that's resilient, that's going to work at high scale, with very high reliability and without many surprises, you really have to think about all the different scenarios that can go wrong, which is not just "my code had an exception." Maybe I got a timeout; I got an out of memory condition; I got a series of events that didn't go well - synchronous events, perhaps - and it seems that everything worked but it actually didn't. How do I know about these problems, even if everything seems okay? The number of problems that can happen is just growing when you go serverless.


Jeremy: I think that makes a ton of sense. Why don't we dive into this and start talking about some of these individual problems or some of these differences, and maybe we can start with troubleshooting? What's different when you're troubleshooting a serverless application versus a more traditional, server-based application?


Nitzan: There are several key differences. The first one is that when you go serverless, you go distributed in a very significant way, more than with containers, for example, because those functions are kind of nanoservices. When you combine them together, we are seeing organizations with 5,000 functions or more, which is just a very high number of nodes in the graph, if you look at it this way. It's very, very distributed. When something breaks, usually there are many more components involved in the chain of events, so it's going to be much more complicated to track what happened and find the cause of the problem. So the distributed nature would be a very important thing.

The other thing is the new things that can go wrong. All those timeouts, all those out of memory conditions, they happen all the time and [it's] very, very difficult to predict them. It's not something people are used to when they work with traditional services. And finally, the possibilities that you have as an engineer or DevOps to understand what's going on in your application are again more limited because you have no access to the host, so you can't install agents and so on. All you get is basically the basic logs and metrics that the cloud providers give you, which makes it even more difficult to know what's going on in the application layer and not just the simple metrics, because they are usually not going to be enough to troubleshoot a complicated problem.
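
To make the timeout point concrete, here is a minimal Python sketch of one way a function could notice that it is running out of time and leave a trace before the platform stops it. It assumes an AWS Lambda-style context object exposing `get_remaining_time_in_millis()`; the `SAFETY_MARGIN_MS` threshold, `process_item`, and `FakeContext` are hypothetical names introduced only for illustration and are not something described in the episode.

```python
import json
import logging

logger = logging.getLogger("timeout_guard")

# Hypothetical threshold: stop early if less than this much time is left.
SAFETY_MARGIN_MS = 500


def process_item(item):
    """Placeholder for real work (hypothetical)."""
    return item * 2


def handler(event, context):
    """Process items one by one, bailing out before the invocation times out."""
    results = []
    for index, item in enumerate(event["items"]):
        remaining = context.get_remaining_time_in_millis()
        if remaining < SAFETY_MARGIN_MS:
            # Record how far we got, so the partial run is visible in the logs.
            logger.warning(json.dumps({
                "reason": "approaching_timeout",
                "remaining_ms": remaining,
                "processed": index,
                "total": len(event["items"]),
            }))
            return {"status": "partial", "processed": index, "results": results}
        results.append(process_item(item))
    return {"status": "complete", "processed": len(results), "results": results}


class FakeContext:
    """Local stand-in for the runtime context (assumption, for testing only)."""

    def __init__(self, readings):
        self._readings = iter(readings)

    def get_remaining_time_in_millis(self):
        return next(self._readings)


if __name__ == "__main__":
    # The third reading is below the margin, so only two items are processed.
    print(handler({"items": [1, 2, 3]}, FakeContext([3000, 2000, 100])))
```

A guard like this only covers timeouts the code can see coming; an out of memory condition would still end the invocation without any application-level log.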


Jeremy: Yeah, and I think with something like Lambda, or any function as a service, these are ephemeral compute. You have many execution environments, or containers spinning up in the background, but those go away. You can't go back and look at the logs and see what that server did. And really the only logs available to you are the ones dumped to CloudWatch, for example, and those are only there if your application actually sends logs. It's not logged automatically.


Nitzan: That's exactly the challenge, because once bad things happen, usually you didn't think about them before, and then you don't have the information that you're looking for in the log. Then you also don't have anywhere to connect to, to investigate, because, as you mentioned, it's ephemeral. That makes things very difficult because you can't think about everything that can possibly happen and put it in the log. On the other hand, you really have nowhere to go to after the thing happens. So you don't really have anything to do, just by using the logs. This is basically the conclusion.
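
As an illustration of the logging problem discussed here, the following Python sketch wraps a handler so that every invocation records its incoming event and any exception, instead of relying on log statements someone remembered to write. It assumes an AWS Lambda-style context with an `aws_request_id` attribute; the `log_invocation` decorator and the example `handler` are hypothetical and are not Epsagon's approach.

```python
import functools
import json
import logging

logger = logging.getLogger("invocations")
logger.setLevel(logging.INFO)


def log_invocation(func):
    """Log the start, end, and any failure of each invocation as JSON lines."""

    @functools.wraps(func)
    def wrapper(event, context):
        # Fall back to "unknown" when run outside the real runtime.
        request_id = getattr(context, "aws_request_id", "unknown")
        logger.info(json.dumps(
            {"request_id": request_id, "stage": "start", "event": event},
            default=str,
        ))
        try:
            result = func(event, context)
        except Exception as exc:
            logger.error(json.dumps({
                "request_id": request_id,
                "stage": "error",
                "error_type": type(exc).__name__,
                "message": str(exc),
            }))
            raise  # Let the platform see the failure as well.
        logger.info(json.dumps({"request_id": request_id, "stage": "end"}))
        return result

    return wrapper


@log_invocation
def handler(event, context):
    """Hypothetical business logic: sum a list of amounts."""
    return {"total": sum(event["amounts"])}
```

Logging the whole event captures the input that triggered a failure, but it can also capture sensitive data, so a real deployment would likely need to redact fields first.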


Jeremy: Also, if you're using a number of remote services or managed services from the provider, where does the debugger go there? How do you see the flow of information? You have a lot of events. You have these highly event-driven applications with information flying all over the place. How do you keep track of that? Where do you see those logs?


Nitzan: Generally you don't see it, and that's the big challenge. This is, of course, why we are building a tool to help you. Generally speaking, the events that are going through the system are usually much more meaningful than the logs, from what we saw. If you actu...


Serverless Chats by Jeremy Daly & Rebecca Marshburn
