Sunday, November 8, 2015

Calculating Durability and Availability of Distributed Systems

"All models are wrong but some are useful" -- George Box

In this post I am going to derive expressions for the availability and durability of distributed storage systems. In particular, I will go over a statistical model of the Mean Time to Failure (MTTF) of such systems. This model captures how long it takes to lose an object stored in such a system (data durability). While durability deals with permanent loss of stored data, availability deals with temporary loss of the ability to read or write data. As we will see, a similar model applies to availability.

Introduction

In the early 2000s, storage systems began moving away from expensive, powerful, server-grade machines towards distributed, replicated storage built from commodity hardware. The argument in favor of this transition is a statistical one. Consider a server-grade system which provides an availability of 99.99% (4 9s). Then 0.01% of all requests to such a system fail. A business like Google Ads, which sees 230 million clicks per day with an average CPC of $3, stands to lose nearly $70,000 every day from those failures alone. Therefore, internet businesses want to run on services which provide far higher availability (maybe 5 or 6 9s). The same can be argued about durability.

Definitions

It is easy to find the availability of a single box because hardware components come with a rated life. However, when many such hardware components come together, statistical analysis is required. Before we go further into the analysis, let us give a few definitions:
object: a blob of data stored in the storage system
replica: a copy of an object
replication group: a group of replicas storing the same object
n: number of replicas
m: number of nodes (computers) in the storage system
N: total number of objects in the storage system

We assume that a distributed storage system saves multiple copies of data, each on a different node for both high durability and availability. These copies of data are called replicas and together form a replication group. We also assume that this storage system uses a consensus protocol like Paxos to make sure the replicas agree on the stored data.

Problems

Availability 

Given that an object is replicated as n replicas stored on n distinct nodes and accessed through a Paxos-based consensus protocol, what is the probability that a read or write to the storage system will fail? In addition, what is the mean time to failure of a read or write of an object?

Durability

Given that an object is replicated as n replicas stored on n distinct nodes, what is the probability that the object will be lost due to failures of individual nodes? In addition, what is the mean time to data loss?

Solution

We model such a storage system as a continuous Markov process. In this section, we will go over the two flavors of Markov chain models, discrete and continuous, explain why a continuous Markov chain represents this system, and discuss the results.

Markov Chain

A Markov chain is a random process which moves from one state to another. The transition of a random variable in a Markov chain from one state to another depends only on the current state. In other words, the transitions are memoryless. The memoryless property implies that the probability of going from state R=n to R=m is the same as from R=n+1 to R=m+1 [3]. For example, consider the following game, called Gambler's ruin.


The game is played as follows. A gambler starts with r dollars. He wins a round of the game with probability p; each win increases his wealth by 1 dollar and each loss decreases it by 1 dollar. The Gambler's ruin problem asks how many rounds it takes for the gambler to lose all his wealth.
This is modeled using a Markov chain: each dollar value of the gambler's wealth is represented as a state, and each round of the game is a transition from one state to another. Now consider the probability P(R) of ruin (wealth reaching 0) at a state R.


P(R) = (1 − p) · P(R−1) + p · P(R+1)

This expression tells us that the probability of ruin with wealth R is the sum of two terms: the probability of moving to state R−1 (by losing this round) and then reaching ruin from R−1, and the probability of moving to state R+1 (by winning this round) and then reaching ruin from R+1.
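Before moving on, the recurrence is easy to sanity-check by simulating the game directly. A small sketch (the function name is mine); with p = 0.4 and 10 dollars, the standard result is that the expected number of rounds to ruin is 10/(0.6 − 0.4) = 50:

```python
import random

def ruin_rounds(wealth, p, cap=10**6):
    """Simulate Gambler's ruin: count rounds until wealth hits 0
    (returns None if the safety cap is reached first)."""
    rounds = 0
    while wealth > 0 and rounds < cap:
        wealth += 1 if random.random() < p else -1
        rounds += 1
    return rounds if wealth == 0 else None

# With p < 0.5 ruin is certain; average many runs to estimate the mean.
random.seed(1)
runs = [ruin_rounds(10, 0.4) for _ in range(1000)]
mean_rounds = sum(runs) / len(runs)  # should land near 50
```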

This was an example of a discrete Markov chain, so called because the process advances in distinctly defined rounds. The other flavor is the continuous Markov chain.

Continuous Markov chains differ from discrete ones in that there is no sequence of moves. They represent systems where state transitions happen independently of each other and at different rates. Consider an example of such a system, called the birth-death process.


In this process, the population undergoes births and deaths of its constituents at independent rates: the birth rate is μ and the death rate is λ. There is no fixed sequence in moving from state R (population R) to R−1 or R+1. Such a process is modeled as a continuous Markov chain. Also, once the population reaches 0 there is no way to come back to state 1. Such a process is called an absorbing Markov process, and state 0 is called an absorbing state.

Distributed system as a continuous Markov Chain model

A distributed storage system can be modeled as a birth-death process. Consider a replication group consisting of n replicas of an object, each stored on a different machine. Machines in such a system fail at a constant rate λ, and failed machines are replaced (brought back up) at a rate μ. Notice that, as in the birth-death process, state 0 is an absorbing state.

It differs from a birth-death process in that the population cannot grow beyond n machines. This boundary condition, however, is not a problem, because we can frame the analysis so that it ceases to be an issue. [1] defines an epoch: at the start of an epoch the system has just reached state R−1, i.e. one node has just failed. At the end of the epoch the system either reaches state 0, the absorbing state, or returns to (and stays in) state R, at which point the next failure starts a new epoch. In other words, at every node failure the system is either in the middle of an epoch or starting a new one. In the following sections we consider a distributed system that maintains 3 copies of an item for simplicity, although the results are general enough to apply to any number of replicas.

Assumptions

1. Node failures are independent of each other.
2. Failure and recovery times of nodes are exponentially distributed. This is a direct consequence of the memoryless property of Markov chains [3]. Combined with the independence assumption, it means that if the probability of a single node being down is p, the probability of m particular nodes being down simultaneously is p^m.

Model

[1] describes the properties of such a model. For a distributed storage system that maintains 3 replicas, the mean time to data loss in this model is given by the following expression; a more general formula can be found in [1]:

MTTF ≈ γ² / (3λ)

where γ (the repair ratio [1]) is the ratio of the recovery rate (μ) to the failure rate (λ) of machines: γ = μ/λ.

Let us briefly discuss how this expression is derived. The authors of [1] consider an epoch: the amount of time a system of R (= 3) replicas takes to go from state R−1 to either state 0 or state R. At the beginning of an epoch a replica fails, leaving the system in state R−1. At the end of the epoch, the system has either been repaired back to R replicas or has reached the absorbing state of 0 replicas. In the former case, the system waits in state R until the next failure, at which point a new epoch starts.

The authors of [1] take the mean time to data loss to be the product of two factors: the number of epochs and the duration of each epoch. The number of epochs is calculated from the probability Q* that the system reaches the absorbing state from state R−1 within an epoch; the expected number of epochs is 1/Q*. The duration of each epoch is the time spent in state R, plus a weighted average of the time it takes to reach the end of the epoch from the states adjacent to R−1, the weights being the probabilities of moving to each of them.
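The epoch argument yields a closed form, but the mean time to absorption can also be computed numerically straight from the chain's first-step equations, which makes a useful cross-check. A stdlib-only sketch (the repair convention here, one repair in flight per failed replica, is a modeling assumption of this sketch, and the function name is mine):

```python
def mttf(n, lam, mu):
    """Expected time from n healthy replicas to total loss (state 0).

    Assumed model: in state k (k healthy replicas) the failure rate is
    k*lam and the repair rate is (n - k)*mu. Solves the first-step
    equations with T[0] = 0 by Gaussian elimination.
    """
    # Unknowns T[1..n]; row k-1 encodes state k:
    #   (k*lam + (n-k)*mu)*T[k] - k*lam*T[k-1] - (n-k)*mu*T[k+1] = 1
    A = [[0.0] * n for _ in range(n)]
    b = [1.0] * n
    for k in range(1, n + 1):
        row = k - 1
        fail, repair = k * lam, (n - k) * mu
        A[row][row] = fail + repair
        if k > 1:
            A[row][row - 1] = -fail
        if k < n:
            A[row][row + 1] = -repair
    # Forward elimination with partial pivoting.
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    # Back substitution.
    T = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(A[i][c] * T[c] for c in range(i + 1, n))
        T[i] = s / A[i][i]
    return T[n - 1]
```

For example, `mttf(2, 1.0, 0.0)` returns 1.5 (with no repair, the expected time is 1/(2λ) + 1/λ), and `mttf(3, 1/35040, 1.0)` lands on the order of 10^13 hours, the same scale as the durability figure discussed later in the post.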

Availability

To find a model for availability, we first need to define it. We consider a Paxos-like consensus protocol running over the replicas to replicate a write or serve a consistent read. Recall that in a quorum-based system like Paxos (a quorum being a majority of replicas, e.g. 2 of 3 or 3 of 5, agreeing on a proposed value), it is possible to read stale data from a replica that did not receive an update in time and did not participate in the write quorum. A consistent read is therefore defined as deterministically reading the last written value; in a Paxos-like system, a consistent read is possible if the read is done from a majority of the replicas.

In such a system, we define unavailability as the inability to process a write or a consistent read. In both cases, a majority of replicas must be able to agree on the value. Therefore, an object in the storage system is unavailable if a majority of the replicas hosting it are unreachable. Notice that under this definition the replicas need not have lost the data: they may be unavailable because of network partitions, garbage-collection pauses, bugs, and so on.
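The quorum condition is worth making concrete. A minimal sketch (names are mine):

```python
def is_available(n_replicas, reachable):
    """A write or consistent read needs a majority quorum, so an object
    is available only if a majority of its replicas are reachable."""
    majority = n_replicas // 2 + 1
    return reachable >= majority

# 3-way replication tolerates one unreachable replica, but not two:
# is_available(3, 2) is True, is_available(3, 1) is False.
```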

Availability can be modeled with the same continuous Markov model [2]. A distributed system with 3 replicas of an object is unavailable if 2 of the replicas are unavailable, so we consider the state transition from R=3 to R=1. Because of the memoryless property of Markov chains, this is the same as considering the transition from R=2 to R=0. The expression for this becomes the following:


MTTF ≈ γ / λ

As discussed earlier, γ = μ/λ. The failure rate λ can be estimated from how often a node becomes unavailable. If we assume a node is unavailable once every week, the mean time to failure is 7 days (= 7·24·60 minutes) and λ = 1/MTTF = 1/(7·24·60) per minute. If it takes about 30 minutes to notify an operator and for the operator to fix the problem, then μ = 1/MTTR = 1/30 per minute. This gives an MTTF of ~2,352 days, i.e. an availability of about five 9s (30 minutes of downtime every 2,352 days).
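The arithmetic can be reproduced directly; this sketch assumes the leading-order expression MTTF ≈ γ/λ, which matches the figures above:

```python
mtbf_node = 7 * 24 * 60        # node unavailable about once a week, in minutes
mttr = 30                      # minutes to page an operator and fix the node
lam = 1 / mtbf_node            # failure rate, per minute
mu = 1 / mttr                  # recovery rate, per minute
gamma = mu / lam               # repair ratio = 336

mttf_minutes = gamma / lam             # assumed leading-order expression
mttf_days = mttf_minutes / (24 * 60)   # ~2352 days
unavailability = mttr / mttf_minutes   # ~8.9e-6, about five 9s
```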


Durability

The durability of a system is defined as the probability that stored data is not lost, and it is measured in terms of mean time to data loss. In a replicated distributed storage system with 3 replicas, this is the time until all three replicas are lost before the fault can be repaired.

This can also be modeled using the continuous Markov model above [1]. Data is lost when the system moves from state R=3 to the absorbing state R=0. The probability of this happening within an epoch is approximately:

Q* ≈ 1/γ²

and the mean time to data loss is:

MTTF ≈ γ² / (3λ)
For durability we model failures as irrecoverable disk failures. Assuming a disk wears out in 4 years [4], λ = 1/MTTF = 1/(4·365·24) per hour. Recovery of a lost drive involves notifying an operator, who finds a new node and moves the replica to it. Assuming this recovery takes about 1 hour, μ = 1/MTTR = 1 per hour. The MTTF for data then comes out to ~1,620,698,112 years, implying more than 8 9s of durability (data lost once in roughly 1.6 billion years).
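Again, the numbers are easy to reproduce; this sketch assumes the leading-order expression MTTF ≈ γ²/(3λ), which lands within about a percent of the figure in the text:

```python
lam = 1 / (4 * 365 * 24)   # irrecoverable disk failure once per ~4 years, per hour
mu = 1.0                   # ~1 hour to re-replicate onto a new node, per hour
gamma = mu / lam           # repair ratio = 35040

mttdl_hours = gamma**2 / (3 * lam)     # assumed leading-order expression
mttdl_years = mttdl_hours / (365 * 24) # ~1.6 billion years
```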

Practical Considerations

The above numbers for availability and durability are impressive, but they are for a single data object. Modern distributed systems hardly ever hold a single object: data is sharded across many nodes to achieve higher throughput and to support large data sizes. If N (= 1,000,000) objects are distributed across m nodes, the mean time until some object is lost drops to about 1,620 years, and the mean time until some object is unavailable drops to about 3.4 minutes.
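The scaling step is simple division under an independence approximation (a sketch; in real systems failures are correlated across objects that share disks, racks, or power):

```python
N = 1_000_000                   # objects in the system
mttdl_object_years = 1.62e9     # per-object mean time to data loss (from above)
mttf_avail_object_days = 2352   # per-object mean time to unavailability (from above)

# Approximating object failures as independent, the first failure among
# N objects arrives about N times sooner than for any single object.
fleet_loss_years = mttdl_object_years / N                      # ~1620 years
fleet_unavail_minutes = mttf_avail_object_days * 24 * 60 / N   # ~3.4 minutes
```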

In both of these cases, the only knobs we have are the mean time to repair, the replication factor, and the number of data objects. Since data is bound to grow, the number of objects is not in our control. Increasing the replication factor is expensive because it directly increases hardware costs. Therefore, the low-hanging fruit is decreasing the mean time to repair.

Since humans are the bottleneck when it comes to repair, there is a huge push to automate mundane maintenance tasks, which makes a control layer like Norbert or Helix imperative. Automation, in addition, is cost-effective.

Conclusion

In this post, we have seen how to model the durability and availability of distributed storage systems. The figures we obtained with modest failure and recovery rates far exceed those of single-node systems. Therefore, in addition to higher throughput, distributed systems give us a higher degree of operational reliability. However, pushed to current data sizes, distributed systems become more vulnerable to failures and are an operational pain, which is why more and more maintenance tasks are being automated.

References


[1] "Analysis of Long-Running Replicated Systems", Sriram Ramabhadran and Joseph Pasquale, Proc. of the 25th IEEE Conference on Computer Communications (INFOCOM), 2006.

[2] "Availability in Globally Distributed Storage Systems", Daniel Ford et al., OSDI 2010.

[3] Continuous Markov Models Tutorial, Queens University (link, link)

[4] "A Large-Scale Study of Flash Memory Failures in the Field", Justin Meza et al., SIGMETRICS 2015.
