Introduction
Beansdb is a distributed key-value storage system designed for large-scale online systems, aiming for high availability and easy management. It takes ideas from Amazon's Dynamo, then simplifies them to Keep It Simple, Stupid.
Why BeansDB?
- It's production-proven at douban
- It's KISS: easy to understand or rewrite
- It's a suitable "first step" for small/medium-scale systems
Distributed Design
The distributed design is very similar to Memcached's.
Client-Driven Routing
Partition into a fixed number of bulks (shards)
[Input: Key, Output: serverHost(s)]
- We have two nodes: node1 and node2.
- Divide the data into 16 bulks. Each bulk can be stored on one node or multiple nodes, and each node can hold one bulk or several bulks. In the host config, assign bulk numbers to nodes, e.g. node1[0~7], node2[8~15].
- key -> hash_value -> bulk number -> serverHost(s)
# key -> hash_value
def fnv1a(s):
    prime = 0x01000193
    h = 0x811c9dc5
    for c in s:
        h ^= ord(c)
        h = (h * prime) & 0xffffffff
    return h

# hash_value -> bulk number
bulk_number = bulk_size * hash_value / hash_space   # hash_space = 1 << 32
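As a runnable sketch of the routing above (the bulk-to-host table mirrors the two-node example; the host names are illustrative):

```python
# Sketch of client-side routing for the 16-bulk, two-node example above.
def fnv1a(s):
    prime = 0x01000193
    h = 0x811c9dc5
    for c in s:
        h ^= ord(c)
        h = (h * prime) & 0xffffffff
    return h

BULK_COUNT = 16
HASH_SPACE = 1 << 32

# bulk number -> hosts; node1 serves bulks 0~7, node2 serves 8~15
BULK_TO_HOSTS = {b: ["node1"] if b < 8 else ["node2"] for b in range(BULK_COUNT)}

def get_servers(key):
    # key -> hash_value -> bulk number -> serverHost(s)
    bulk = BULK_COUNT * fnv1a(key) // HASH_SPACE
    return BULK_TO_HOSTS[bulk]
```

The bulk number is just the top 4 bits of the 32-bit hash, so routing costs one hash plus one table lookup.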
Replication by client write
If we want several copies of the data, the config looks like this:
- assign 1 bulk number to several hosts: node1[0~8], node2[4~12], node3[8~15], node4[11~3]
- set the write quorum: W = 2
# write to every replica server for this key
rs = [s.set(key, value) for s in self._get_servers(key)]
if not rs.count(True) >= self.W:
    # try a get: set returns False when the same content is already in the db
    if self.get(key) != value:
        raise WriteFailedError(key)
return True
As with writes, the client issues multiple deletes to clean up all replicas.
def delete(self, key):
rs = [s.delete(key) for s in self._get_servers(key)]
return rs.count(True) >= self.W
Read-Repair
During a read, the client tries to repair (rewrite) the replicas on any nodes found to be missing the data.
def get(self, key):
ss = self._get_servers(key)
for i, s in enumerate(ss):
r = s.get(key)
if r is not None:
# self heal
for k in range(i):
ss[k].set(key, r)
return r
Data Versioning (consistency)
Metadata is attached to each value to record version information.
On the server side, when saving data ([key, value]) into the tc (Tokyo Cabinet) file, the server checks the local data version (called old_version):
- If version = 0, the write request was sent by a client: set version = old_version + 1
- If version > 0, the write request was sent by sync: perform the set only when version > old_version
typedef struct t_meta {
int32_t version;
uint32_t hash;
uint32_t flag;
time_t modified;
} Meta;
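The version rule above can be sketched as follows; the dict and the `save` name are hypothetical stand-ins for the Tokyo Cabinet file, not BeansDB's actual API:

```python
# Sketch of the server-side version check; the dict stands in for the
# Tokyo Cabinet file (store/save are illustrative names, not BeansDB's API).
store = {}  # key -> (version, value)

def save(key, value, version):
    old_version = store.get(key, (0, None))[0]
    if version == 0:
        # request sent by a client: bump the local version
        store[key] = (old_version + 1, value)
        return True
    # request sent by sync: accept only strictly newer versions
    if version > old_version:
        store[key] = (version, value)
        return True
    return False
```

Rejecting non-newer sync versions makes replaying the same sync traffic idempotent.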
HashTree (Merkle tree) sync
A cron job calls a sync script to synchronize data between servers, so BeansDB can ensure all data is eventually consistent.
Principle
Hash trees or Merkle trees are a type of data structure which contains a tree of summary information about a larger piece of data – for instance a file – used to verify its contents.
In a leaf node, hash_v = hash(data). In an internal node, hash_v = hash(hash_v of its child nodes). So if the data in a leaf node changes, the change propagates from the leaf up to the root. To locate the difference between two hash trees, follow the mismatching path down the tree; it takes only O(log n) steps to reach the changed leaf node.
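A minimal sketch of this search, assuming a binary tree over a power-of-two number of leaves (BeansDB's real tree is 16-ary):

```python
# Minimal hash-tree diff: compare roots, then walk down only where hashes
# mismatch, reaching a changed leaf in O(log n) comparisons.
# Assumes a binary tree over a power-of-two number of leaves.
def build(leaves):
    """Build the tree bottom-up; returns a list of levels, leaves first."""
    level = [hash(("leaf", d)) for d in leaves]
    levels = [level]
    while len(level) > 1:
        level = [hash((level[i], level[i + 1])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

def diff(a, b):
    """Return indices of leaves whose hashes differ between two trees."""
    top = len(a) - 1
    frontier = [0] if a[top][0] != b[top][0] else []
    for depth in range(top, 0, -1):
        nxt = []
        for i in frontier:
            for c in (2 * i, 2 * i + 1):
                if a[depth - 1][c] != b[depth - 1][c]:
                    nxt.append(c)
        frontier = nxt
    return frontier
```

Only the subtrees along mismatching paths are ever compared; equal roots mean no further work.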
douban Implementation
tree structure
- each node has 16 sub-nodes (2^4)
- tree height is <= 8; each hash tree covers at most 2^32 elements
- a leaf node holds at most 128 elements; if more than 128 elements accumulate in a node, create 16 child nodes and distribute all items from the parent to the children. The node then becomes an internal node.
- distribute: divide k_hash (32 bits) into 8 segments; each segment (4 bits) is the child index at one layer: child_idx = 0x0f & (k_hash >> ((7 - node->depth) * 4)). Nodes at the 8th layer have no sub-nodes.
- if the total number of items in all children is not greater than 128, move all items from the children back into the parent and delete the children.
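The distribute step's child-index computation can be sketched as:

```python
# Child index at each layer: the depth-th 4-bit segment of the 32-bit k_hash,
# taken from the most significant end (depth 0 selects among the root's children).
def child_idx(k_hash, depth):
    return 0x0f & (k_hash >> ((7 - depth) * 4))
```

For k_hash = 0x12345678, the path from root to leaf is simply the hex digits 1, 2, 3, ..., 8.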
static const int g_index[] = {0, 1, 17, 289, 4913, 83521, 1419857, 24137569, 410338673};

Layer layout of node sequence numbers (g_index[depth] is the first node of each layer):
depth 0: 0 (root)
depth 1: 1 ... 16
depth 2: 17 ...
depth 3: 289 ...
depth 4: 4913 ...
depth 5: 83521 ...
depth 6: 1419857 ...
depth 7: 24137569 ...
depth 8: 410338673 ...
hash function
- element(item) = [key, k_hash, v_hash]
- node_hash = node_hash * 97 + child[i]_hash
- leaf_node_hash = sum(k_hash * v_hash) of all elements
#define FNV_32_PRIME 0x01000193
#define FNV_32_INIT 0x811c9dc5

typedef unsigned int uint32_t;

static inline uint32_t fnv1a(const char *key, int key_len)
{
    uint32_t h = FNV_32_INIT;
    int i;
    for (i = 0; i < key_len; i++) {
        h ^= (uint32_t)key[i];
        h *= FNV_32_PRIME;
    }
    return h;
}
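The two hashing rules above can be sketched in Python (illustrative; masked to 32 bits to mirror the C arithmetic):

```python
# Leaf hash: sum of k_hash * v_hash over all elements in the leaf.
def leaf_node_hash(items):
    # items: list of (key, k_hash, v_hash)
    return sum(k_hash * v_hash for _, k_hash, v_hash in items) & 0xffffffff

# Internal-node hash: fold the children's hashes with multiplier 97.
def internal_node_hash(child_hashes):
    h = 0
    for ch in child_hashes:
        h = (h * 97 + ch) & 0xffffffff
    return h
```

Because the leaf hash is a sum, inserting or removing one element only needs an incremental add or subtract, not a rescan of the leaf.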
data structure
- use an array (breadth-first) to keep all the nodes
- use an array (breadth-first) to keep the corresponding data for each node: for an internal node, the entry in this array is null; for a leaf node, it must not be null
- Tricks of node position in a tree
- (node - tree->root) is the sequence number of node
- (node - tree->root) - g_index[(int)node->depth] is the offset relative to the first node in the same layer
- g_index[node->depth + 1] + (get_pos(tree, node) << 4) is the first child of node
- get_pos(tree, node) << ((8 - node->depth) * 4) is the minimum k_hash belonging to the node; (get_pos(tree, node) + 1) << ((8 - node->depth) * 4) is the maximum
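These position identities can be checked with a small sketch (g_index copied from the C code above; function names mirror the ones in the list):

```python
g_index = [0, 1, 17, 289, 4913, 83521, 1419857, 24137569, 410338673]

def get_pos(seq, depth):
    # offset of node `seq` relative to the first node in its layer
    return seq - g_index[depth]

def first_child(seq, depth):
    # sequence number of the node's first child
    return g_index[depth + 1] + (get_pos(seq, depth) << 4)

def khash_range(seq, depth):
    # half-open [min, max) range of k_hash values belonging to the node
    pos = get_pos(seq, depth)
    lo = pos << ((8 - depth) * 4)
    hi = (pos + 1) << ((8 - depth) * 4)
    return lo, hi
```

For example, the root (seq 0, depth 0) owns the whole 32-bit k_hash space and its first child is node 1.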
- Tricks of pointer arithmetic
- (Item*)((char*)p + p->length) advances to the next variable-length item in a packed buffer
- (void*)p - (void*)data is the byte offset of an item within its data block
typedef struct t_item Item;
struct t_item {
    uint32_t keyhash;     /* k_hash: fnv1a of the key */
    uint32_t hash;        /* v_hash of the value */
    short ver;
    unsigned char length; /* total size of this item, including the name */
    char name[1];         /* variable-length key, stored past the struct */
};
static Item* create_item(HTree *tree, const char *name, int ver, uint32_t hash)
{
    size_t n = strlen(name);
    Item *it = (Item*)tree->buf;  /* reuse the tree's scratch buffer */
    strncpy(it->name, name, n);
    it->name[n] = 0;
    it->ver = ver;
    it->hash = hash;              /* v_hash */
    it->keyhash = fnv1a(name, n); /* k_hash */
    it->length = sizeof(Item) + n; /* name[1] in the struct already counts the null byte */
    return it;
}