Audience
In this post, I will summarize the salient points of the paper "The Ubiquitous B-Tree" by Douglas Comer. I assume that readers are aware of B-Tree and can generally recall, the algorithms for insertion/ deletion/ retrieval in a B-Tree as given in wikipediaMotivation
B-Tree is one of the most important invention in CS and finds application in virtual memory systems, file systems, databases etc. It has clear advantages over other indexing techniques. To quote from [1]B+-trees offer the following advantages: dynamic allocation and release of storage, guaranteed storage utilization of 50%, and no need for periodic "reorganization" of the entire file.
What is a B-Tree?
A B-Tree is a balanced n-ary tree which guarantees that the maximum height of the tree is logdn where n is the number of keys (and their corresponding values) and d is the order of the tree. Order of the tree dictates the number of keys each node can store. Each node can store between d and 2d keys (inclusive) and pointers to their values. In addition each internal (non-leaf) node has between d+1 and 2d+1 pointers to its children. While leaf nodes have between d and 2d pointers to their values. Each update operation maintains this property by splitting or merging nodes. For details please look at wikipedia.
Because of the guarantees around height, there are strong guarantees around update and read time which is proportional to the height of the tree. In this post we will only consider B-Trees which need to be stored on the disk because the number of keys is more than the memory available. Therefore, to work on the tree, only a portion of the tree can be loaded in memory. This makes IO the dominant cost in working on the tree. Therefore, several variants and optimizations have been developed over time to reduce the IO cost of B-Tree traversal.
Variations of B-Tree
B*tree relaxes the requirement that B-Tree should be at least half full to at least 2/3 full. This leads to better disk utilization and therefore, efficient search. Although, the survey doesn't talk about why stop at 2/3?This variation also allowed lazy splitting and merging of internal nodes by moving keys around on insert-delete. For example, if a delete leads to deletion of a key from the internal nodes such that the number of keys in the node are reduced to being less than the order of the node, then there are two options. One, this node can merge with its neighbors. Secondly, it can borrow keys from its neighbors so that number of keys are again at least as much as its order. This leads to less number of structural changes in the tree and therefore, leads to faster inserts and deletes.
B+Tree is a popular version of B-Tree where only leaf nodes contain data and other nodes just serve as index into the leaf nodes. This also allows for different structure and handling of two nodes the functions of which are now completely isolated. Frequently, the leaf nodes are linked left-to-right. The brilliance of this structure deserves more attention.
Image credit: http://stackoverflow.com/questions/870218/b-trees-b-trees-difference
Consider deletes on such a tree. As long as the leaves are half full, there is no need to change the rest of the tree, even if the key is present in any of the index keys. This is because, even if a look up for a deleted leaf is initiated, its absence will be noticed on reaching the leaf tree. This simplifies delete processing. Similarly, insert processing doesn't lead to changes in index structure until a leaf splits.
Also, B+Tree makes range scans very efficient. In database world, range scans are common operations initiated by BETWEEN clause in SQL. A B+Tree would look up the two keys corresponding to the boundaries and then traverse all the keys from left-to-right. Contrast this against the B-Tree which traverses between various internal and leaf nodes to achieve the same
Prefix B+Tree's motivation comes the observation that index nodes (internal nodes) of a B+tree help navigation to leaf nodes. Therefore, the keys in the internal nodes act as separators of leaf nodes. They key observation is that the separator need not be the key itself. [1] gives an example where computer and electronics are two keys in different leaf nodes, then the separator can be as short as e, instead of electronics. This leads to smaller separators and therefore smaller trees and consequently less IO.
However, this may break if the separator is (almost) as big as the keys e.g. programmer and programmers. The authors of Prefix B+Tree suggest using imbalanced leaf nodes to enable small separators. This local optimization doesn't change properties or costs associated with B+Tree.
Less important variants
Binary B-Tree is useful when one-level stores (virtual memory) are to be implemented. These are binary B-tree(0 order but no node exists with 0 keys). To save space in half filled nodes, the right pointer may point either to a leaf node or to the next internal node.2-3 Tree is a B-Tree of order 1 and is useful as an internal (main memory) data structures. Its authors took into account the cost of comparisons and node accesses while building the tree and have reported improved storage utilization.
Knuth's version of B-Tree proposes different 'order' at different nodes. This is because of the observations that leaf nodes and the root node are rarely full. However, this leads to high maintenance cost specially because secondary storage favors fixed length nodes.
B-Tree implementation optimizations
Interpolation leaf search [4] tries to
estimate the leaf node being sought based on input key and key
distribution in the file without walking the tree.
If the leaf node is not the right one, then a conventional search can
resume. If the estimate is correct, some disk accesses for internal can
be saved.
Variable length keys can be a result of compression or be a feature of the key e.g. string keys. Promoting shorter keys over longer ones during inserts has been shown to improve storage utilization and lower latency.
Concurrency I will go into concurrency protocols for B+Trees in a later post and give a summary here. There are two types of locks: read and update. Readers take at most 2 read locks as they traverse the tree - one on the node and another on its parent. As a reader moves along from a node to its child, it releases the parent (of current node) lock to get the lock of the child (of the current node) and keeps the lock of the current node. Writes cannot take the update lock if read lock is taken.
Writers take write lock in two phases. As they move along the path, they take a reservation on each node on the path. A reservation is a right to take lock and only one node can have such a write at a time. The reservation can be converted to a lock / released later by the writer. This scheme throttles the writers but doing it lazily may incur more traversals.
An optimization is to never split a node while going up and only while going down the tree. This removes the need for a writer to take reservations at the cost of increased (slightly?) storage requirements and increased (slightly) latency. It can be proved that if a node is split when it is "nearly" full, there will never be a need for split during a write.
B-Tree key traversal algorithms
In this section, we go over the algorithms used to search a key in the node. All
B-tree algorithms start with seeking the node containing the key. As a
reader (thread/process) reaches a node, it has to find the pointer to
the child node. This can be done in several ways which include the
following
2. Binary search can be used if the number of keys to look for is large. As Knuth points out in sorting and searching, linear search should be used for smaller arrays and binary search for larger array
3. Square root search [2,3] implements a B-Tree like search method on an array. In this method the array of pointers in the node is divided into √m partitions, where m is the number of pointers in a node. The goal is to find the right partition and then do a linear search in the partition for the input key. The right partition is found by comparing first keys of each partition with the the input key.
Conclusion
In this post, we saw the various variants of B-tree and their implementations in brief. Most notable among those is the B+Tree. In addition, we also saw various optimization strategies to further improve the performance of B-trees such as prefix B-Tree, compression, interpolation search etc. Implementation of B-trees in real system has more engineering challenges like security, concurrency, recovery etc. which I will cover in later posts
References
[1] Comer, Douglas. "Ubiquitous B-tree." ACM Computing Surveys (CSUR) 11.2 (1979): 121-137.
[2] MARUYAMA, K, AND SMITH, S. "Analysis of design alternatives for virtual memory indexes," Commun. ACM 20, 4 (April 1977), 245-25
[3] Strong, H. Raymond, George Markowsky, and Ashok K. Chandra. "Search within a Page." Journal of the ACM (JACM) 26.3 (1979): 457-482.
[4] GHOSH, S, AND SENKO, M. "File organization: on the selection of random access index points for sequential files," J ACM 16, 4 (Oct. 1969), 569-579