Cache Wisely: How You Can Prevent Distributed System Failures
This article describes approaches to protect systems from the hidden scaling bottlenecks that can creep in when caching is introduced.
Caching is often implemented as a generic solution when we think about improving the latency and availability characteristics of dependency service calls. Latency improves because we avoid the network round trip to the dependency service, and availability improves because we no longer need to worry about temporary downtimes of the dependency service, as the cache can serve the response we are looking for. It is important to note that caching does not help if requests to a dependency service lead to a distinct response every time, or if a client makes vastly different request types with little overlap between responses. There are also additional constraints on caching if our service cannot tolerate stale data.
We won’t be delving into caching types, techniques, and applicability, as those are covered broadly on the internet. Instead, we will focus on a less talked about risk of caching, one that tends to be ignored as systems evolve and that puts the system at risk of a broad outage.
When To Use Caching
In many cases, caching is deployed to mask a known scaling bottleneck in a dependency service, or it ends up hiding a scaling deficiency that develops in the dependency service over time. For instance, as our service starts making fewer calls to the dependency service, its owners start believing that this reduced load is the norm for steady-state traffic. If our cache hit rate is 90%, meaning 9 out of 10 calls are served by the cache, then the dependency service only sees 10% of the actual traffic. If client-side caching stops working due to an outage or a bug, the dependency service suddenly sees 10x its usual traffic. In almost all cases, this surge will overload the dependency service and cause an outage. If the dependency service is a data store, this can bring down multiple other services that depend on that data store.
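To make the arithmetic concrete, here is a minimal sketch with illustrative numbers (the TPS and hit ratio are hypothetical):

```python
# Illustrative sketch: dependency traffic seen with and without a working cache.
# The numbers (client_tps, hit_ratio) are hypothetical.

client_tps = 1000        # total requests per second from the client fleet
hit_ratio = 0.90         # 90% of requests are served from the cache

dependency_tps_with_cache = client_tps * (1 - hit_ratio)   # 100 TPS
dependency_tps_without_cache = client_tps                  # 1000 TPS

surge_factor = dependency_tps_without_cache / dependency_tps_with_cache
print(f"With cache: {dependency_tps_with_cache:.0f} TPS")
print(f"Cache down: {dependency_tps_without_cache:.0f} TPS ({surge_factor:.0f}x the usual load)")
```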
To prevent such outages, both the client and the service should consider the following recommendations to protect their systems.
Recommendations
For clients, it is important to stop treating the cache as a "good to have" optimization and instead treat it as a critical component that needs the same treatment and scrutiny as a regular service. This includes monitoring and alarming on a cache hit ratio threshold, as well as on the overall traffic sent to the dependency service.
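As a rough sketch of what that monitoring might look like, the class below tracks hits and misses and flags when the ratio falls under a threshold; the threshold value and the inline check are illustrative assumptions, as a real setup would emit these counters to a metrics and alarming system:

```python
import threading

class CacheHitRatioMonitor:
    """Tracks cache hits/misses and flags when the hit ratio falls below a threshold.

    A minimal sketch; in practice these counters would feed a metrics/alarming
    system (e.g., emitted once per minute) rather than being checked inline.
    """

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.hits = 0
        self.misses = 0
        self._lock = threading.Lock()

    def record(self, hit: bool) -> None:
        with self._lock:
            if hit:
                self.hits += 1
            else:
                self.misses += 1

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 1.0

    def breached(self) -> bool:
        # True when the hit ratio has dropped below the alarm threshold,
        # meaning the dependency service is absorbing more traffic than usual.
        return self.hit_ratio() < self.threshold
```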
Any change to the caching business logic also needs to go through the same testing rigor in development environments and pre-production stages. Deployments to servers participating in caching should ensure either that the stored state is transferred to the new servers coming up post-deployment, or that the drop in cache hit rate during the deployment is tolerable for the dependency service. If a large number of cache-serving servers are taken down during a deployment, the proportional drop in cache hit ratio can put significant pressure on the dependency service.
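One possible way to soften that hit-rate drop is to warm a new server's cache before it takes traffic. The sketch below is only an assumption about how that could look; `hot_keys` and `fetch_from_dependency` are hypothetical placeholders for a key snapshot exported by the old server and the real dependency client:

```python
def warm_cache(cache: dict, hot_keys: list[str], fetch_from_dependency) -> None:
    """Pre-populate a new server's local cache before it starts serving traffic.

    `hot_keys` could come from a snapshot exported by the server being replaced;
    `fetch_from_dependency` stands in for the real client call. Warming before
    taking traffic avoids a sudden drop in hit ratio during the deployment.
    """
    for key in hot_keys:
        if key not in cache:
            cache[key] = fetch_from_dependency(key)
```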
Clients also need to implement guardrails to control the overall traffic, measured as transactions per second (TPS), sent to the dependency service. Algorithms like the token bucket can help cap the TPS from the client fleet when the caching fleet goes down. This needs to be periodically tested by taking down caching instances and observing how clients send traffic to the dependency service. Clients should also consider implementing a negative caching strategy with a small time-to-live (TTL). Negative caching means the client stores the error response from the dependency service, ensuring the dependency is not bombarded with retry requests while it is having an extended outage.
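Below is a minimal sketch of both ideas, a token bucket capping calls to the dependency and a small negative cache for error responses; the rate, capacity, and TTL values are illustrative, not prescriptive:

```python
import time

class TokenBucket:
    """Token bucket limiting calls to the dependency service.

    `rate` tokens are added per second up to `capacity`; a call to the
    dependency proceeds only if a token is available. When the caching fleet
    is down, this caps the traffic the dependency sees instead of letting
    every cache miss pass straight through.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Negative caching: remember error responses briefly so retries don't pile onto
# a dependency that is already struggling. The TTL value is illustrative.
NEGATIVE_TTL_SECONDS = 5
negative_cache: dict[str, float] = {}   # key -> expiry timestamp

def should_skip_call(key: str) -> bool:
    expiry = negative_cache.get(key)
    return expiry is not None and time.monotonic() < expiry

def record_failure(key: str) -> None:
    negative_cache[key] = time.monotonic() + NEGATIVE_TTL_SECONDS
```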
Similarly, on the service side, load-shedding mechanisms need to be implemented to protect the service from getting overloaded. Overloaded here means that the service is unable to respond within the client-side timeout. Note that increased load usually manifests as increased latency, since server resources become oversubscribed and responses slow down. The goal is to respond before the client-side timeout and to start rejecting requests once overall latency begins to approach that timeout.
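One simple way to express that rule, sketched under the assumption that the service tracks the latency of recently completed requests, is to shed load once a high latency percentile gets close to the client-side timeout budget:

```python
import collections
import statistics

class LatencyLoadShedder:
    """Rejects new requests when recent latency approaches the client timeout.

    A simple sketch: track the latency of the last N completed requests and
    shed load once a high percentile gets close to the client-side timeout,
    so that accepted requests still finish in time (protecting goodput).
    """

    def __init__(self, client_timeout_s: float, headroom: float = 0.8, window: int = 200):
        self.budget = client_timeout_s * headroom   # start shedding before the hard timeout
        self.samples = collections.deque(maxlen=window)

    def record(self, latency_s: float) -> None:
        self.samples.append(latency_s)

    def should_reject(self) -> bool:
        if len(self.samples) < 20:        # not enough data yet; accept the request
            return False
        p99 = statistics.quantiles(self.samples, n=100)[98]
        return p99 >= self.budget
```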
There are different techniques to prevent overloading; one of the simplest is to restrict the number of connections from the Application Load Balancer (ALB) to your service host. However, this can mean indiscriminately dropping requests; if that is not desirable, prioritization techniques can be implemented in the application layer of the service to drop less important requests first. The objective of load shedding is to protect the service's goodput, i.e., the requests served within the client-side timeout, as overall load grows. The service also needs to periodically run load tests to validate the maximum TPS a service host can handle, which allows fine-tuning of the ALB connection limit. The techniques introduced here to protect a service's goodput should be widely applicable, but there are more approaches that readers can explore depending on their service's needs.
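A priority-based shedder could look roughly like the sketch below; the priority classes, utilization thresholds, and `max_inflight` value are illustrative and would normally be derived from the load tests mentioned above:

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0      # e.g., user-facing write paths
    NORMAL = 1
    BACKGROUND = 2    # e.g., batch or prefetch traffic

class PriorityShedder:
    """Drops lower-priority requests first as load approaches the tested limit.

    `max_inflight` would come from load tests that establish how much a single
    host can handle; the utilization thresholds below are illustrative.
    """

    def __init__(self, max_inflight: int):
        self.max_inflight = max_inflight
        self.inflight = 0

    def admit(self, priority: Priority) -> bool:
        utilization = self.inflight / self.max_inflight
        if utilization >= 1.0:
            return False                  # host is full: shed everything
        if utilization >= 0.9 and priority != Priority.CRITICAL:
            return False                  # keep remaining capacity for critical calls
        if utilization >= 0.7 and priority == Priority.BACKGROUND:
            return False                  # background traffic is dropped first
        self.inflight += 1
        return True

    def release(self) -> None:
        self.inflight = max(0, self.inflight - 1)
```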
Conclusion
Caching offers immediate benefits for availability and latency at a low cost. However, neglecting the areas discussed above can expose hidden scaling bottlenecks when the cache goes down, potentially leading to system failures. Regular diligence to ensure the system keeps functioning properly even when the cache is down is crucial to preventing catastrophic outages that would affect your system's reliability.
Here is an interesting read about a large-scale outage triggered by cache misses.