Gray Failure – A Silent Killer of Cloud-Scale Systems

Cloud computing has revolutionized the way businesses operate. Companies have migrated their applications to the cloud to benefit from its scalability, cost-efficiency, and robustness. However, the cloud is not immune to failures. One of the most significant challenges cloud practitioners face today is gray failure: a silent killer of cloud-scale systems that can cause major availability breakdowns and performance anomalies. In this newsletter, we discuss the concept of gray failure, its definition, and examples from Azure cloud services.

What is Gray Failure?

Gray failure is a type of component failure that is difficult to detect because of its subtle manifestations. It is characterized by severe performance degradation, random packet loss, flaky I/O, memory thrashing, capacity pressure, and non-fatal exceptions.

Unlike other types of failures, gray failures are not fatal, but they are enough to degrade the quality of service and cause system instability. Our first-hand experience with production cloud systems reveals that gray failure is behind most cloud incidents.

A Definition of Gray Failure

Modeling and defining gray failure is crucial to addressing the problem. A system experiences gray failure when at least one of its apps observes that the system is unhealthy, while the system's own observer still sees it as healthy. In other words, gray failure is a form of differential observability. The figure below illustrates the concept of gray failure from the perspective of a system and its client apps.
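
To make differential observability concrete, here is a minimal Python sketch (the Observation class and is_gray_failure function are illustrative names, not the paper's formalism): the system's own observer and the client apps each report a health verdict, and gray failure is flagged when they disagree in this particular direction.

```python
# Minimal sketch of differential observability (illustrative names only).
# A gray failure is when at least one app says "unhealthy" while the
# system's own observer still says "healthy".

from dataclasses import dataclass

@dataclass
class Observation:
    observer: str   # "system" for the system's own observer, or an app name
    healthy: bool

def is_gray_failure(observations: list[Observation]) -> bool:
    system_says_healthy = all(
        o.healthy for o in observations if o.observer == "system"
    )
    some_app_says_unhealthy = any(
        not o.healthy for o in observations if o.observer != "system"
    )
    return system_says_healthy and some_app_says_unhealthy

if __name__ == "__main__":
    obs = [
        Observation("system", healthy=True),   # internal probe passes
        Observation("app-1", healthy=True),
        Observation("app-2", healthy=False),   # this app sees timeouts
    ]
    print(is_gray_failure(obs))  # True: differential observability
```

Note that the definition is directional: it is the apps that see the problem while the system's own observer does not.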

The Cycle of Gray

Gray failure is not a one-time occurrence; it is a cycle that many systems tend to follow. The cycle starts with minor faults that the system tends to suppress, followed by a degraded mode that is visible to apps externally but not to the system's observer.

Eventually, the degradation may reach a point that takes the system down, at which point the observer finally recognizes the problem. A typical example is a memory leak. During a gray failure with intermittent misbehavior, the system can go around this cycle many times.
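
A toy simulation (illustrative thresholds, not Azure code) captures the shape of the cycle for a slow memory leak: client-visible latency degrades long before a coarse liveness check trips.

```python
# Toy simulation of the "cycle of gray" for a slow memory leak (illustrative
# thresholds only). Clients notice rising latency well before the coarse
# liveness check trips; only near exhaustion does the observer see a problem.

MEMORY_LIMIT_MB = 1024
LEAK_PER_TICK_MB = 64

def client_latency_ms(used_mb: float) -> float:
    # Latency climbs as the heap fills and the process starts thrashing.
    pressure = used_mb / MEMORY_LIMIT_MB
    return 5 + 500 * pressure ** 4

def liveness_check(used_mb: float) -> str:
    # The system's observer only reacts when the process is about to die.
    return "healthy" if used_mb < 0.95 * MEMORY_LIMIT_MB else "down"

used_mb = 0.0
for tick in range(1, 100):
    used_mb += LEAK_PER_TICK_MB
    status = liveness_check(used_mb)
    if status == "down":
        print(f"tick {tick}: memory exhausted; observer finally reports '{status}'")
        break
    if client_latency_ms(used_mb) > 100:
        print(f"tick {tick}: clients see {client_latency_ms(used_mb):.0f} ms, "
              f"observer still says '{status}'")
```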

Let's look at some examples of gray failure from Azure cloud services:

More redundancy == less availability!?

Data center networks can route around failing switches, but they typically do not re-route packets when switches experience intermittent failures such as random and silent packet drops. Such gray failures surface to applications as glitches or increased latency. As a consequence, increasing redundancy sometimes lowers availability.
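
A back-of-the-envelope sketch (all numbers made up) shows why: if routing cannot see gray switches, every additional switch is another place for silent drops to hide, while traffic keeps being spread across all of them.

```python
# Back-of-the-envelope sketch (all numbers made up). With per-switch
# gray-failure probability g, a larger redundant fabric is more likely to
# contain at least one switch that silently drops packets -- and because such
# a switch still looks healthy, routing keeps spreading traffic onto it.

g = 0.001       # assumed probability that any single switch is gray-failing
fanout = 50     # a scatter-gather request fans out across ~50 distinct paths

for n_switches in (8, 64, 512):
    p_some_gray = 1 - (1 - g) ** n_switches
    # Rough approximation: given one gray switch, the chance that at least one
    # of the fan-out's paths goes through it is 1 - (1 - 1/n)^fanout.
    p_hit_given_gray = 1 - (1 - 1 / n_switches) ** fanout
    p_request_impaired = p_some_gray * p_hit_given_gray
    print(f"{n_switches:4d} switches: P(some switch gray) = {p_some_gray:.3f}, "
          f"P(fan-out request impaired) ~ {p_request_impaired:.4f}")
```

Under these made-up assumptions, the probability that a fan-out request touches a silently failing switch grows as the fabric gets bigger, which is the "more redundancy, less availability" effect in miniature.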

Works for me!

The failure detector may not spot a virtual machine (VM) that is internally experiencing severe network issues, because it receives heartbeats via a host agent that communicates with the VM over local RPC. Thus the problem goes unnoticed until a user reports it, leading to long gaps between the point of user impact and the point of resolution.
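
The gap is easy to reproduce in miniature. The sketch below (with hypothetical probe functions, not Azure's actual agent) contrasts a host-local heartbeat with an end-to-end probe that takes the same path clients use.

```python
# Sketch of why a host-local heartbeat can miss a network gray failure.
# probe_local_rpc() and probe_external_path() are hypothetical stand-ins: the
# first talks to the VM over a host-internal channel, the second follows the
# same path a real client would.

def probe_local_rpc(vm: dict) -> bool:
    # Host agent <-> VM over a local channel: succeeds even if the VM's
    # external network path is dropping packets.
    return vm["guest_agent_running"]

def probe_external_path(vm: dict) -> bool:
    # End-to-end probe through the client-facing network path.
    return vm["guest_agent_running"] and vm["external_net_ok"]

vm = {"guest_agent_running": True, "external_net_ok": False}  # gray-failing VM

print("local heartbeat:  ", "healthy" if probe_local_rpc(vm) else "unhealthy")
print("client-path probe:", "healthy" if probe_external_path(vm) else "unhealthy")
# The local heartbeat reports healthy; only the client-path probe exposes
# the differential observability.
```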

When the cure is worse than the disease

A bug caused the storage manager not to detect a severe capacity constraint on an Azure storage data server. As the storage manager kept sending requests to the overloaded data server, the server eventually crashed and rebooted; the reboot did not fix the underlying capacity problem.

The failure detector picked up on the continual rebooting, concluded the data server was irreparable, and took it out of service. This put pressure on the remaining healthy servers, causing more servers to degrade and experience the same ultimate fate, leading to a catastrophic cascading failure.
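
A toy capacity model (illustrative numbers only) shows how removing an overloaded server without relieving the underlying demand can cascade through the rest of the cluster.

```python
# Toy cascade sketch (illustrative numbers). Taking the overloaded server out
# of service redistributes its load onto the survivors, which can push them
# over capacity as well and repeat the cycle until nothing is left.

TOTAL_LOAD = 950.0   # aggregate demand on the storage cluster
CAPACITY = 100.0     # per-server capacity before it degrades and crash-loops
servers = 10

# The capacity-constrained data server crash-loops; the failure detector
# declares it irreparable and removes it, but the demand does not go away.
servers -= 1

while servers > 0:
    per_server = TOTAL_LOAD / servers
    if per_server <= CAPACITY:
        print(f"{servers} servers at {per_server:.1f} load each -> stable")
        break
    print(f"{servers} servers at {per_server:.1f} load each -> over capacity, "
          "another server degrades and is removed")
    servers -= 1
else:
    print("no servers left: cascading failure")
```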

Another incident

In 2018, a software update for a popular fitness tracking app caused it to drain the batteries of users' phones at an alarming rate. In an attempt to fix the issue, the company released another update that reduced the battery drain but introduced a new bug that caused the app to crash frequently. This led to many users uninstalling the app and switching to competitors, resulting in a significant loss of market share for the company. In this case, the attempted cure (the second update) was worse than the original issue (battery drain), leading to even more problems.

Finger-pointing at dawn

Sometimes, a VM cannot access its virtual disk because of a storage or network issue, causing the VM to crash. If no failure detector catches the underlying storage or network problem, the compute-cluster failure detector may incorrectly attribute the failure to the compute stack in the VM. Such ambiguity leads to finger-pointing between the teams responsible for the compute stack and those responsible for the storage and network infrastructure.
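
One way to reduce the ambiguity, sketched below with hypothetical signal names, is to correlate observations from all layers for the crash window before assigning blame, rather than letting each layer's detector attribute failures in isolation.

```python
# Sketch of cross-layer correlation before blame (hypothetical signal names).
# Rather than letting the compute-cluster detector attribute every VM crash to
# the compute stack, consult storage and network signals from the same window.

def attribute_vm_crash(signals: dict) -> str:
    if signals.get("virtual_disk_io_errors", 0) > 0:
        return "storage"
    if signals.get("packet_loss_pct", 0.0) > 1.0:
        return "network"
    return "compute"

crash_window_signals = {
    "virtual_disk_io_errors": 37,  # the VM could not reach its virtual disk
    "packet_loss_pct": 0.2,
}
print("crash attributed to:", attribute_vm_crash(crash_window_signals))  # storage
```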

To avoid such situations, it is important to have a comprehensive monitoring system that can detect issues at every layer of the infrastructure: the compute stack, the storage infrastructure, and the network connectivity between them.

Clear communication channels between teams also help prevent finger-pointing and ensure that issues are resolved quickly. A culture of collaboration and shared responsibility, where teams work together to identify and solve problems rather than blaming each other, matters as much as the tooling.

Regular testing and maintenance of the infrastructure, including backups, system updates, and load testing, can surface problems before they turn into failures.

In short, avoiding finger-pointing at dawn takes robust infrastructure, effective monitoring, and shared responsibility across teams.

Readings