Wednesday, November 25, 2015

B-tree Survey - Variants

Audience

In this post, I will summarize the salient points of the paper "The Ubiquitous B-Tree" by Douglas Comer. I assume that readers are familiar with B-Trees and can generally recall the algorithms for insertion, deletion, and retrieval in a B-Tree as given on Wikipedia.

Motivation

The B-Tree is one of the most important inventions in computer science and finds application in virtual memory systems, file systems, databases, etc. It has clear advantages over other indexing techniques. To quote from [1]:
B+-trees offer the following advantages: dynamic allocation and release of storage, guaranteed storage utilization of 50%, and no need for periodic "reorganization" of the entire file. 


What is a B-Tree?

A B-Tree is a balanced n-ary tree which guarantees that the height of the tree is O(log_d n), where n is the number of keys (and their corresponding values) and d is the order of the tree. The order dictates how many keys each node can store: each node stores between d and 2d keys (inclusive) along with pointers to their values. In addition, each internal (non-leaf) node has between d+1 and 2d+1 pointers to its children, while leaf nodes have between d and 2d pointers to their values. Each update operation maintains these bounds by splitting or merging nodes. For details, see Wikipedia.
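To make the structure concrete, here is a minimal sketch of a node layout and a point lookup in Python. BTreeNode and search are my own illustrative names, not code from the survey:

import bisect

class BTreeNode:
    def __init__(self, keys, values=None, children=None):
        self.keys = keys                  # sorted list of keys in this node
        self.values = values or []        # values[i] corresponds to keys[i]
        self.children = children or []    # empty for leaf nodes

    def is_leaf(self):
        return not self.children

def search(node, key):
    """Return the value stored for key, or None if absent."""
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        if node.is_leaf():
            return None
        node = node.children[i]   # descend into the subtree that may hold key
    return None

Here children[i] covers keys smaller than keys[i] and children[-1] covers keys larger than the last key, which is the usual convention.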

Because of the guarantee on height, there are strong guarantees on update and read times, which are proportional to the height of the tree. In this post we will only consider B-Trees that need to be stored on disk because the number of keys exceeds the available memory. To work on such a tree, only a portion of it can be loaded into memory at a time, which makes IO the dominant cost. Several variants and optimizations have therefore been developed over time to reduce the IO cost of B-Tree traversal.

Variations of B-Tree

The B*-tree strengthens the requirement that each node be at least half full to being at least 2/3 full. This leads to better disk utilization and therefore more efficient search, although the survey does not discuss why one should stop at 2/3.

This variation also allows lazy splitting and merging of internal nodes by moving keys around on inserts and deletes. For example, if a delete removes a key from an internal node and leaves it with fewer keys than the order of the tree, there are two options: the node can merge with a neighbor, or it can borrow keys from a neighbor so that it again holds at least as many keys as the order requires. This results in fewer structural changes to the tree and therefore faster inserts and deletes.
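A rough sketch of that borrow-or-merge decision follows. Only keys are shown (child pointers and values are omitted for brevity), and the helper name fix_underflow is an illustrative assumption rather than anything from the survey:

def fix_underflow(parent, idx, order):
    """Repair parent.children[idx], which has fewer than `order` keys."""
    node = parent.children[idx]
    left = parent.children[idx - 1] if idx > 0 else None
    right = parent.children[idx + 1] if idx + 1 < len(parent.children) else None

    if left and len(left.keys) > order:
        # Borrow: rotate the separator down and the left sibling's last key up.
        node.keys.insert(0, parent.keys[idx - 1])
        parent.keys[idx - 1] = left.keys.pop()
    elif right and len(right.keys) > order:
        # Borrow from the right sibling instead.
        node.keys.append(parent.keys[idx])
        parent.keys[idx] = right.keys.pop(0)
    else:
        # Merge with a sibling around the separator key; the parent may now
        # underflow in turn and need the same treatment.
        sep = idx - 1 if left else idx
        merged = parent.children[sep].keys + [parent.keys[sep]] + parent.children[sep + 1].keys
        parent.children[sep].keys = merged
        del parent.keys[sep]
        del parent.children[sep + 1]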

The B+Tree is a popular version of the B-Tree in which only leaf nodes contain data and internal nodes serve purely as an index into the leaves. This allows the two kinds of nodes, whose functions are now completely separate, to have different structure and handling. Frequently, the leaf nodes are also linked left-to-right. The brilliance of this structure deserves more attention.

 Image credit: http://stackoverflow.com/questions/870218/b-trees-b-trees-difference

Consider deletes on such a tree. As long as the leaves remain at least half full, there is no need to change the rest of the tree, even if the deleted key appears in one of the index nodes. Even if a lookup for a deleted key is initiated, its absence will be noticed on reaching the leaf. This simplifies delete processing. Similarly, insert processing does not change the index structure until a leaf splits.

The B+Tree also makes range scans very efficient. In the database world, range scans are common operations, initiated for example by a BETWEEN clause in SQL. A B+Tree looks up the keys corresponding to the two boundaries and then traverses the linked leaves from left to right between them. Contrast this with a B-Tree, which has to move back and forth between internal and leaf nodes to achieve the same.
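A minimal sketch of such a range scan over linked leaves, assuming each leaf holds sorted keys and a next pointer (the Leaf class and range_scan are illustrative names, not from the survey):

import bisect

class Leaf:
    def __init__(self, keys, values, next_leaf=None):
        self.keys = keys          # sorted keys stored in this leaf
        self.values = values
        self.next = next_leaf     # left-to-right sibling link

def range_scan(first_leaf, low, high):
    """Yield (key, value) pairs with low <= key <= high.
    `first_leaf` is the leaf reached by descending the index for `low`."""
    leaf = first_leaf
    while leaf is not None:
        start = bisect.bisect_left(leaf.keys, low)
        for k, v in zip(leaf.keys[start:], leaf.values[start:]):
            if k > high:
                return
            yield k, v
        leaf = leaf.next   # follow the sibling pointer; no index traversal needed

For example, list(range_scan(leaf, 10, 42)) walks only the leaves, touching the index just once to find the starting leaf.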

The Prefix B+Tree's motivation comes from the observation that the index (internal) nodes of a B+Tree only help navigate to the leaf nodes; the keys in the internal nodes act as separators between leaf nodes. The key observation is that a separator need not be a full key. [1] gives an example: if computer and electronics are keys in different leaf nodes, then the separator between them can be as short as e instead of electronics. This leads to smaller separators, therefore smaller trees, and consequently less IO.



However, this may not help if the shortest separator is (almost) as long as the keys themselves, e.g. programmer and programmers. The authors of the Prefix B+Tree suggest allowing slightly imbalanced leaf nodes so that short separators can be chosen. This local optimization does not change the properties or costs associated with the B+Tree.
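A small sketch of how such a separator could be chosen between the largest key of the left leaf and the smallest key of the right leaf (my own illustration of the idea, not code from the paper):

def shortest_separator(left_max, right_min):
    """Return the shortest string s with left_max < s <= right_min.
    E.g. shortest_separator("computer", "electronics") == "e"."""
    assert left_max < right_min
    for i in range(len(right_min)):
        prefix = right_min[: i + 1]
        if prefix > left_max:
            return prefix   # first prefix of right_min that already exceeds left_max
    return right_min

For the programmer/programmers pair, this returns the full key "programmers", which is exactly the degenerate case described above.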

Less important variants

The Binary B-Tree is useful when one-level stores (virtual memory) are to be implemented. It is a B-Tree of order 1; to avoid the space wasted by half-filled nodes, the right pointer of a node may point either to a child node or to the next node at the same level.

The 2-3 Tree is a B-Tree of order 1 and is useful as an internal (main memory) data structure. Its authors take the cost of comparisons as well as node accesses into account while building the tree and report improved storage utilization.

Knuth's version of the B-Tree proposes a different 'order' at different nodes, based on the observation that leaf nodes and the root node are rarely full. However, this leads to high maintenance cost, especially because secondary storage favors fixed-length nodes.

B-Tree implementation optimizations 

Interpolation leaf search [4] tries to estimate the leaf node being sought based on the input key and the key distribution in the file, without walking the tree. If the estimated leaf is not the right one, a conventional search can resume; if the estimate is correct, the disk accesses for the internal nodes are saved.
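A toy sketch of the idea, assuming uniformly distributed integer keys spread over equally full leaves (the function and its parameters are illustrative assumptions, not from [4]):

def estimate_leaf(key, min_key, max_key, num_leaves):
    """Guess which leaf holds `key`, assuming keys are spread uniformly
    over [min_key, max_key] across `num_leaves` equally full leaves."""
    if max_key == min_key:
        return 0
    fraction = (key - min_key) / (max_key - min_key)
    return min(num_leaves - 1, max(0, int(fraction * num_leaves)))

# If the guessed leaf does not contain the expected key range, fall back
# to a normal root-to-leaf traversal of the tree.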

Compression can work on both pointers and keys. Pointers can be compressed using a base-and-offset scheme: the base is common to a tree (or a subtree) and the offsets are unique to each pointer, and the two are combined to get the final node pointer. Keys can be compressed using standard redundancy-reduction techniques.
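A minimal sketch of base-plus-offset pointer compression; the 16-bit offset width is an arbitrary choice for illustration:

def compress_pointer(ptr, base):
    """Store a small offset per child instead of a full pointer; the base is
    shared by the whole node (or subtree)."""
    offset = ptr - base
    assert 0 <= offset < 2**16, "offset must fit the chosen width"
    return offset

def decompress_pointer(offset, base):
    return base + offset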

Variable-length keys can be a result of compression or a property of the keys themselves, e.g. string keys. Promoting shorter keys over longer ones during inserts has been shown to improve storage utilization and lower latency.

Concurrency: I will go into concurrency protocols for B+Trees in a later post and only give a summary here. There are two types of locks: read and update. Readers hold at most two read locks as they traverse the tree: one on the current node and one on its parent. As a reader moves from a node to its child, it releases the lock on the parent of the current node, acquires the lock on the child, and keeps the lock on the current node. Writers cannot take the update lock on a node while a read lock is held on it.
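A rough sketch of that read-side lock coupling (often called "crabbing"); the node fields and lock interface below are assumptions for illustration, not from the survey:

def read_traverse(root, key):
    """Descend from root to leaf holding at most two read locks at a time.
    node.lock is assumed to be a shared (read) lock with acquire()/release(),
    and node.is_leaf(), node.child_for(key), node.lookup(key) are assumed."""
    parent_lock = None
    node = root
    node.lock.acquire()                  # hold a read lock on the current node
    while not node.is_leaf():
        child = node.child_for(key)      # child whose subtree may contain key
        child.lock.acquire()             # lock the child first...
        if parent_lock is not None:
            parent_lock.release()        # ...then drop the oldest of the held locks
        parent_lock, node = node.lock, child
    value = node.lookup(key)
    if parent_lock is not None:
        parent_lock.release()
    node.lock.release()
    return value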

Writers take the update lock in two phases. As they move along the search path, they take a reservation on each node on the path. A reservation is the right to take the lock later, and only one writer can hold a reservation on a node at a time. The reservation can later be converted to a lock or released by the writer. This scheme throttles writers, and doing the conversion lazily may incur extra traversals.

An optimization is to never split a node while going back up the tree and to split only while going down. This removes the need for a writer to take reservations, at the cost of (slightly?) increased storage requirements and (slightly) increased latency. It can be shown that if a node is split whenever it is "nearly" full on the way down, there will never be a need to split on the way back up.

B-Tree key traversal algorithms

In this section, we go over the algorithms used to search for a key within a node. All B-Tree operations start by seeking the node containing the key: as a reader (thread/process) reaches a node, it has to find the pointer to the correct child node. This can be done in several ways, including the following:

1. Linear search
2. Binary search can be used if the number of keys in the node is large. As Knuth points out in Sorting and Searching, linear search should be used for smaller arrays and binary search for larger ones.
3. Square root search [2, 3] implements a B-Tree-like search method on an array. The array of pointers in the node is divided into √m partitions, where m is the number of pointers in the node. The goal is to find the right partition and then do a linear search within that partition for the input key. The right partition is found by comparing the first key of each partition with the input key (a sketch follows this list).
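Here is the sketch of square-root search mentioned above, over a sorted array of keys; it is my own illustration of the idea in [2, 3], not code from those papers:

import math

def sqrt_search(keys, target):
    """Return the index of target in the sorted list `keys`, or -1."""
    m = len(keys)
    if m == 0:
        return -1
    step = max(1, math.isqrt(m))          # partition size ~ sqrt(m)
    # Find the last partition whose first key does not exceed the target.
    start = 0
    for i in range(0, m, step):
        if keys[i] > target:
            break
        start = i
    # Linear search within that partition.
    for j in range(start, min(start + step, m)):
        if keys[j] == target:
            return j
    return -1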


Conclusion

In this post, we briefly surveyed the main variants of the B-Tree, the most notable of which is the B+Tree. We also saw various optimization strategies that further improve the performance of B-Trees, such as the Prefix B-Tree, compression, and interpolation search. Implementing B-Trees in real systems poses further engineering challenges, such as security, concurrency, and recovery, which I will cover in later posts.

References
[1] Comer, Douglas. "The Ubiquitous B-Tree." ACM Computing Surveys (CSUR) 11.2 (1979): 121-137.
[2] Maruyama, K., and Smith, S. "Analysis of design alternatives for virtual memory indexes." Communications of the ACM 20.4 (April 1977): 245-254.
[3] Strong, H. Raymond, George Markowsky, and Ashok K. Chandra. "Search within a Page." Journal of the ACM (JACM) 26.3 (1979): 457-482.
[4] Ghosh, S., and Senko, M. "File organization: on the selection of random access index points for sequential files." Journal of the ACM 16.4 (Oct. 1969): 569-579.

Sunday, November 8, 2015

Calculating Durability and Availability of Distributed Systems

"All models are wrong but some are useful" -- George Box

In this post I am going to derive expressions for the availability and durability of distributed storage systems. In particular, I will go over a statistical model of the Mean Time To Failure (MTTF) of such systems. This model captures the time it takes to lose data stored in such a system (data durability). While durability deals with the permanent loss of stored data, availability deals with the temporary loss of the ability to read or write data. As we will see, a similar model applies to availability.

Introduction

The move away from storage systems running on expensive, powerful, server-grade machines towards distributed, replicated storage built from commodity hardware started in the early 2000s. The argument in favor of this transition is a statistical one. Consider a server-grade system which provides an availability of 99.99% (four 9s); 0.01% of all requests to such a system fail. A business like Google Ads, which sees 230 million clicks per day at an average CPC of $3 (roughly $690M per day), would then stand to lose on the order of $69,000 of revenue every day. Therefore, internet businesses want to run on services which provide far higher availability (maybe five or six 9s). The same can be argued about durability.

Definitions

It is easy to find the availability of a single box because hardware components come with a rated life. However, when many such hardware components come together, statistical analysis is required. Before we go further into the analysis, let us give a few definitions:
object: a blob of data stored in the storage system
replica: a copy of an object
replication group: the group of replicas storing the same object
n: number of replicas
m: number of nodes (computers) in the storage system
N: total number of objects in the storage system

We assume that a distributed storage system saves multiple copies of data, each on a different node for both high durability and availability. These copies of data are called replicas and together form a replication group. We also assume that this storage system uses a consensus protocol like Paxos to make sure the replicas agree on the stored data.

Problems

Availability 

Given that an object is replicated as n replicas stored randomly on n nodes and uses a Paxos-based consensus protocol, what is the probability that a read / write to the storage system will fail? In addition, what is the mean time to failure of a read or write of an object?

Durability

Given that an object is replicated as n replicas stored randomly on n nodes, what is the probability that the object will be lost due to failures of individual nodes? In addition, what is the mean time to data loss?

Solution

We model such a storage system as a continuous-time Markov process. In this section, we go over Markov chain models, both discrete and continuous, explain why a continuous Markov chain represents this system well, and discuss the resulting expressions.

Markov Chain

A Markov chain is a random process that moves from one state to another. The transition of a random variable in a Markov chain from one state to another depends only on the current state; in other words, the transitions are memoryless. The memoryless property implies that the probability of going from state R=n to R=m is the same as that of going from R=n+1 to R=m+1 [3]. As an example, consider the game called Gambler's Ruin, shown in the following figure.


The game is played as follows. A gambler starts with r dollars. He wins a round of the game with probability p; each win increases his wealth by 1 dollar and each loss decreases it by 1 dollar. The Gambler's Ruin problem seeks the expected number of rounds it takes for the gambler to lose all his wealth.
This is modeled using a Markov chain. Each dollar value of the gambler's wealth is represented as a state, as shown in the figure above, and each round of the game is modeled as a transition from one state to another. Now consider the probability P_R of ruin (wealth reaching 0) from a state R.
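Conditioning on the outcome of a single round gives:

P_R = (1 - p) * P_{R-1} + p * P_{R+1}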



This expression says that the probability of ruin with wealth R is the sum of two terms: the probability of moving to state R-1 (by losing this round) and then getting ruined from R-1, and the probability of moving to state R+1 (by winning this round) and then getting ruined from R+1.

This was an example of a discrete Markov chain, so called because the process advances in distinct rounds. The other flavor is the continuous Markov chain.

Continuous Markov chain models differ from discrete ones in that there is no sequence of moves: state transitions happen independently of each other and at different rates. Consider an example of such a system called the birth-death process.


In this process, the population undergoes births and deaths of its constituents at independent rates; the birth rate is shown as μ and the death rate as λ. There is no fixed sequence in moving from state R (with population R) to R-1 or R+1. Such a process is modeled as a continuous Markov chain. Also, once the population reaches 0 there is no way to come back to state 1; such a process is called an absorbing Markov process, and state 0 is called an absorbing state.

Distributed system as a continuous Markov Chain model

A distributed storage system can be modeled as a birth-death process. Consider a replication group consisting of n replicas of an object, each stored on a different machine. Machines fail at a constant rate λ and failed machines are replaced (come back up) at a rate μ. Notice that, as in the birth-death process, state 0 is an absorbing state.

It differs from a general birth-death process in that the population cannot grow beyond n replicas. This boundary condition is handled by defining the problem in terms of epochs, as [1] does. An epoch starts when a node fails and the system enters state R-1 (where R is the number of replicas). The epoch ends when the system either reaches the absorbing state 0 or returns to state R; in the latter case the system stays in state R until the next failure, which starts a new epoch. In other words, the previous epoch continues until a new one is started. In the following sections we consider a distributed system which maintains 3 copies of each item for simplicity, although the results are general enough to apply to any number of replicas.

Assumptions

1. Node failures are independent of each other.
2. Failure and recovery times of nodes are exponentially distributed, which is what gives the Markov chain its memoryless property [3]. Combined with the independence assumption, this means that if the probability of a single node failure is p, the probability of m particular nodes failing is p^m.

Model

[1] describes the properties of such a model. For a distributed storage system which maintains 3 replicas, the mean time to data loss in this model is given by the following expression (a more general formula can be found in [1]):


where γ (the repair ratio [1]) is the ratio of the recovery rate (μ) to the failure rate (λ) of machines, i.e. γ = μ/λ.

Let us briefly discuss how this expression is derived. The authors of [1] consider an epoch: the amount of time a system with R (=3) replicas takes to go from state R-1 to either 0 or R. At the beginning of the epoch, a replica fails, leaving the system in state R-1. At the end of the epoch, the system has either been repaired back to R replicas or has reached the absorbing state of 0 replicas. In the former case, the system waits in state R until the next failure, at which point a new epoch starts.

The authors of [1] express the mean time to data loss as the product of two factors: the expected number of epochs and the expected duration of an epoch. The number of epochs is obtained by finding the probability Q* that the system reaches the absorbing state from state R-1, where it sits at the beginning of an epoch; the expected number of epochs is then 1/Q*. The duration of an epoch is computed by a first-step analysis from state R-1: the expected time spent in state R-1, plus a weighted average of the time to the end of the epoch from states R and R-2, the weights being the probabilities of moving to each of those states from R-1.
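One way to sanity-check such a model is to solve the absorbing Markov chain numerically. In the sketch below, state k fails at rate k*λ and is repaired at rate μ; these rates and the function name are my own assumptions for illustration and differ in detail from the exact model and closed form in [1]:

import numpy as np

def mean_time_to_data_loss(n, lam, mu):
    """Expected time from n live replicas to 0 (absorbing), for a chain where
    state k loses a replica at rate k*lam and gains one (k -> k+1) at rate mu
    for k < n. These rates are an illustrative assumption."""
    A = np.zeros((n, n))       # unknowns T_1 .. T_n, with T_0 = 0
    b = np.ones(n)
    for k in range(1, n + 1):
        row = k - 1
        out_rate = k * lam + (mu if k < n else 0.0)
        A[row, row] = out_rate
        if k - 1 >= 1:
            A[row, row - 1] = -k * lam       # failure: k -> k-1
        if k < n:
            A[row, row + 1] = -mu            # repair:  k -> k+1
        # (the transition to the absorbing state 0 contributes T_0 = 0)
    T = np.linalg.solve(A, b)
    return T[-1]                             # start with all n replicas alive

# Example: mean_time_to_data_loss(3, 1/(4*365*24), 1.0) is ~7e12 hours (~8e8 years).

With the durability numbers used later in this post (λ = 1/(4*365*24) per hour, μ = 1 per hour, n = 3), this chain gives a mean time to data loss on the order of 10^8 to 10^9 years, the same ballpark as the figure quoted below; the exact constant depends on the assumed repair behavior.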

Availability

To build a model for availability, we first need to define it. We consider a Paxos-like consensus protocol running over the replicas to replicate a write or serve a consistent read. Recall that in a quorum-based system like Paxos (where a majority of replicas, e.g. 2 of 3 or 3 of 5, must agree on the proposed value), it is possible to read stale data from a replica which did not receive the update in time and did not participate in the write quorum. A consistent read is therefore defined as deterministically reading the last written value; in a Paxos-like system this is possible if the read is served by a majority of the replicas.

In such a system we define unavailability as the inability of the system to process a write or a consistent read. In both cases, a majority of replicas must be able to agree on the value. Therefore, an object of the storage system is unavailable if a majority of the replicas hosting it are not reachable. Notice that under this definition the replicas need not have lost the data; they may be unavailable because of network partitions, garbage collection pauses, bugs, etc.

Availability can be modeled with the same continuous Markov model [2] and follows a similar expression. A distributed system with 3 replicas of an item cannot serve it if 2 of the replicas are unavailable, so we consider the state transition from R=3 to R=1. Because of the memoryless property of Markov chains, this is the same as considering the transition from R=2 to R=0. The expression becomes the following:



As discussed earlier, γ = μ/λ. The failure rate λ can be estimated from how often a node becomes unavailable. If we assume that a node is unavailable once every week, the mean time to failure is 7 days (= 7*24*60 minutes) and λ = 1/MTTF = 1/(7*24*60) per minute. If the time to notify an operator and for the operator to fix the problem is about 30 minutes, then μ = 1/MTTR = 1/30 per minute. The MTTF then comes to ~2,352 days, giving an availability of about five 9s (30 minutes of downtime in 2,352 days).
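A quick back-of-envelope check of these numbers; the dominant-term approximation MTTF ≈ γ/λ used below is my simplification, chosen because it reproduces the figure quoted above:

MTTF_node_minutes = 7 * 24 * 60          # node unavailable about once a week
MTTR_minutes = 30                        # operator notified and fixes in ~30 min
lam = 1.0 / MTTF_node_minutes
mu = 1.0 / MTTR_minutes
gamma = mu / lam                         # repair ratio ~ 336

mttf_minutes = gamma / lam               # assumed dominant term, ~3,386,880 minutes
print(mttf_minutes / (24 * 60))          # ~2352 days
print(1 - MTTR_minutes / mttf_minutes)   # ~0.999991, about five 9s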


Durability

Durability of a system is defined as the probability that stored data is not lost, and is measured in terms of the mean time to data loss. In a replicated distributed storage system with 3 replicas, this is the time until all three replicas are lost before the failures can be repaired.

This can also be modeled using the continuous Markov model above [1]. Here, loss of data corresponds to the system moving from state R=3 to R=0. The probability of this is given as:


and the mean time to data loss is:

For durability, we model failures as irrecoverable disk failures. Assuming a disk wears out in 4 years [4], λ = 1/MTTF = 1/(4*365*24) per hour. Recovery of a lost drive involves notifying an operator, who finds a new node and moves the replica onto it. Assuming this recovery takes about 1 hour, μ = 1/MTTR = 1 per hour. The MTTF for data then comes to ~1,620,698,112 years, which implies 8 9s of durability (data lost once in 1,620,698,112 years).
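A similar sanity check for the durability numbers; the dominant term γ²/(3λ) is an assumption of this sketch rather than the exact expression from [1], but it lands close to the figure quoted above:

MTTF_disk_hours = 4 * 365 * 24           # disk wears out in about 4 years
MTTR_hours = 1                           # operator moves the replica in ~1 hour
lam = 1.0 / MTTF_disk_hours
mu = 1.0 / MTTR_hours
gamma = mu / lam                         # repair ratio ~ 35,040

mttdl_hours = gamma**2 / (3 * lam)       # assumed dominant-term approximation
print(mttdl_hours / (24 * 365))          # ~1.6e9 years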

Practical Considerations

The above numbers for availability and durability look impressive, but they are for a single data object. Modern distributed systems hardly ever hold a single object: data is sharded across multiple nodes to achieve higher throughput and support large data sizes. If N (= 1,000,000) independent data objects are distributed across m nodes, the mean time until some object is lost drops to roughly 1,620,698,112 / 1,000,000 ≈ 1,620 years, and the mean time until some object is unavailable drops to roughly 3,386,880 / 1,000,000 ≈ 3.4 minutes.

In both of these cases, the only knobs we have are the mean time to repair, the replication factor, and the number of data objects. Since data is bound to grow, the number of data objects is not in our control. Increasing the replication factor is expensive because it directly affects hardware costs. Therefore, the low-hanging fruit is decreasing the mean time to repair.

Since humans are the bottleneck when it comes to repair, there is a huge push to automate mundane maintenance tasks. It is therefore imperative to have a control layer like Norbert, Helix, etc. Automation, in addition, is cost-effective.

Conclusion

In this post, we have seen how to model the durability and availability of distributed storage systems. The figures we obtained with modest failure and recovery rates far exceed those of single-node systems. Therefore, in addition to giving us higher throughput, distributed systems give us a higher degree of operational reliability. However, when pushed to current data sizes, distributed systems become more vulnerable to failures and are an operational pain, which is why there is a push to automate more and more of the maintenance tasks.

References


[1] Ramabhadran, Sriram, and Joseph Pasquale. "Analysis of Long-Running Replicated Systems." In Proc. of the 25th IEEE Conference on Computer Communications (INFOCOM).

[2] Ford, Daniel, et al. "Availability in Globally Distributed Storage Systems." OSDI 2010.

[3] Continuous Markov Models Tutorial, Queens University (link, link)

[4] Meza, Justin, et al. "A Large-Scale Study of Flash Memory Failures in the Field."