This podcast discusses HTTP 502 (Bad Gateway) errors in distributed architectures, with a focus on health checks.
Consider a client application that calls a session service through an API gateway to generate user identification records. Implementation gaps in either the gateway or the session service can degrade the user's experience.
If the API gateway has no rate limit configured, clients can call the session service repeatedly at high frequency.
When the session service is overloaded, it can no longer return a valid response, and the API gateway reports this as a 502 (Bad Gateway) error, indicating that the request could not be processed. That error is what the client receives.
The session service can become overloaded under heavy traffic, for example when 25,000 users push data every 10 minutes.
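To put that figure in perspective, here is a rough calculation in Python. It assumes each user sends exactly one request per 10-minute interval, which the podcast does not state, and the 5-second burst window is purely illustrative.

```python
def requests_per_second(users: int, window_seconds: float) -> float:
    """Request rate if `users` each send one request within `window_seconds`."""
    return users / window_seconds


# Pushes spread evenly over the whole 10-minute interval.
average = requests_per_second(25_000, 10 * 60)

# Assumption: all clients fire within the same 5 seconds (e.g. a shared timer).
burst = requests_per_second(25_000, 5)

print(f"average: {average:.1f} req/s, burst: {burst:.0f} req/s")
# prints "average: 41.7 req/s, burst: 5000 req/s"
```

The average looks modest, but synchronized clients can concentrate the same volume into a spike that is two orders of magnitude higher.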
Several practices can prevent this problem: configure a rate limit in the API gateway; throttle transactions per second (TPS) in the session service; run health checks against the session service; scale the session service with load; implement error handling and retries in the client (and/or in the gateway); and show the client appropriate messages during retries.
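As an illustration of the first two practices, rate limiting and TPS control are often built on a token bucket. The sketch below is one possible approach, not what any particular gateway product does; the class name `TokenBucket`, the helper `gateway_status`, and the choice of HTTP 429 for rejected calls are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable so the limiter can be tested
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def gateway_status(bucket: TokenBucket) -> int:
    """Hypothetical gateway hook: 200 means forward the call, 429 means reject it."""
    return 200 if bucket.allow() else 429
```

Rejecting excess calls at the gateway keeps them from ever reaching the session service, so the overload that leads to a 502 is less likely to build up.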
For example, implementing health checks in the session service can help identify failures or bottlenecks, allowing for a quick response, such as redirecting requests to a working replica or triggering a corrective action.
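A minimal sketch of that idea follows: each replica is probed periodically, and a replica that fails several probes in a row is taken out of rotation until it recovers. The names `ReplicaPool`, `run_health_checks`, and `failure_threshold` are hypothetical, and a real probe would typically be an HTTP request to a health endpoint rather than the plain callable used here.

```python
from typing import Callable, Dict, List, Optional


class ReplicaPool:
    """Tracks replica health and routes only to replicas that pass their probes."""

    def __init__(self, probes: Dict[str, Callable[[], bool]], failure_threshold: int = 2):
        self.probes = probes                        # replica name -> probe callable
        self.failure_threshold = failure_threshold  # consecutive failures tolerated
        self.failures = {name: 0 for name in probes}
        self._next = 0

    def run_health_checks(self) -> None:
        for name, probe in self.probes.items():
            try:
                ok = probe()
            except Exception:
                ok = False                          # a crashing probe counts as a failure
            # Reset the counter on success, otherwise count one more failure.
            self.failures[name] = 0 if ok else self.failures[name] + 1

    def healthy(self) -> List[str]:
        return [n for n in self.probes if self.failures[n] < self.failure_threshold]

    def pick(self) -> Optional[str]:
        """Round-robin over healthy replicas; None if every replica is down."""
        candidates = self.healthy()
        if not candidates:
            return None
        choice = candidates[self._next % len(candidates)]
        self._next += 1
        return choice
```

Requiring more than one consecutive failure before removing a replica avoids reacting to a single slow probe, at the cost of a slightly later reaction to a real outage.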
During the retry process, providing clear messages to the client about the ongoing attempt and the possibility of success in subsequent attempts can help avoid frustration and improve the user experience.
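One way to combine retries with user-facing messages is exponential backoff with jitter, sketched below. The function name `call_with_retries`, the `send` and `notify` callbacks, the message wording, and the set of retryable status codes are all assumptions made for this example, not details given in the podcast.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {502, 503, 504}  # assumed set of transient gateway/server errors


def call_with_retries(
    send: Callable[[], Tuple[int, str]],   # performs the request, returns (status, body)
    notify: Callable[[str], None],         # shows a message to the user
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt == max_attempts:
            break
        # The base delay doubles each attempt; random jitter keeps clients
        # from retrying in lockstep and recreating the original spike.
        delay = base_delay * 2 ** (attempt - 1)
        delay += random.uniform(0, delay)
        notify(f"Attempt {attempt} of {max_attempts} failed (HTTP {status}); "
               f"retrying in {delay:.1f} s.")
        sleep(delay)
    notify("The service is still unavailable. Please try again later.")
    return status, body
```

Telling the user which attempt is in progress and when the next one will happen sets expectations, and the final message makes it clear that the client has stopped trying.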
Ensuring system scalability and stability is essential to provide a consistent and reliable experience for end users.
The podcast also notes that the limits of the session service's dependencies, such as a database, matter. If the database reaches its connection limit, the session service cannot open new connections and its own requests start to fail.
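One common way to respect such a limit is to cap how many connections the session service itself will use and to fail fast when the cap is reached. The sketch below models only the bookkeeping with a semaphore; `ConnectionBudget` and its parameters are hypothetical names, and a real service would more likely rely on the pooling built into its database driver.

```python
import threading
from contextlib import contextmanager


class ConnectionBudget:
    """Caps concurrent database connections below the database's own limit."""

    def __init__(self, max_connections: int, wait_seconds: float = 0.1):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._wait_seconds = wait_seconds

    @contextmanager
    def connection(self):
        # Wait briefly for a free slot, then fail fast instead of piling
        # more work onto a database that is already saturated.
        if not self._slots.acquire(timeout=self._wait_seconds):
            raise TimeoutError("no database connection available")
        try:
            yield  # a real implementation would yield an open connection here
        finally:
            self._slots.release()
```

A caller would wrap each database operation in `with budget.connection():` and translate a `TimeoutError` into a deliberate error response, rather than letting requests queue until they time out.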