Introduction
Beansdb is a distributed key-value storage system designed for large-scale online systems, aiming for high availability and easy management. It takes ideas from Amazon's Dynamo, then simplifies them to Keep It Simple, Stupid.
Why BeansDB?
- It's production-proven at douban
- It's KISS: easy to understand or rewrite
- It's a suitable "first step" for small/medium-scale systems
Distributed Design
The distributed design is very similar to Memcached's.
Client-Driven Routing
Partition into a fixed number of bulks (shards)
[Input: Key, Output: serverHost(s)]
- We have two nodes: node1 and node2.
- Divide the data into 16 bulks. Each bulk can be stored on one node or multiple nodes, and each node can hold one bulk or several bulks. In the host config, assign bulk numbers to nodes, e.g. node1[0~7], node2[8~15].
- key -> hash_value -> bulk number -> serverHost(s)
# key -> hash_value
def fnv1a(s):
    prime = 0x01000193
    h = 0x811c9dc5
    for c in s:
        h ^= ord(c)
        h = (h * prime) & 0xffffffff
    return h

# hash_value -> bulk number
bulk_number = bulk_size * hash_value / hash_space   # hash_space = 1 << 32
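As a runnable sketch of the routing above (the bulk-to-host table mirrors the two-node example; the host names are illustrative):

```python
# Sketch of client-side routing for the 16-bulk, two-node example above.
def fnv1a(s):
    prime = 0x01000193
    h = 0x811c9dc5
    for c in s:
        h ^= ord(c)
        h = (h * prime) & 0xffffffff
    return h

BULK_COUNT = 16
HASH_SPACE = 1 << 32

# bulk number -> hosts; node1 serves bulks 0~7, node2 serves 8~15
BULK_TO_HOSTS = {b: ["node1"] if b < 8 else ["node2"] for b in range(BULK_COUNT)}

def get_servers(key):
    # key -> hash_value -> bulk number -> serverHost(s)
    bulk = BULK_COUNT * fnv1a(key) // HASH_SPACE
    return BULK_TO_HOSTS[bulk]
```

The bulk number is just the top 4 bits of the 32-bit hash, so routing costs one hash plus one table lookup.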
Replication by client write
If we want several copies of the data, the config looks like this:
- assign 1 bulk number to several hosts: node1[0~8], node2[4~12], node3[8~15], node4[11~3]
- set the write quorum: W = 2
# write to every replica server for this key
rs = [s.set(key, value) for s in self._get_servers(key)]
if not rs.count(True) >= self.W:
    # try a get: set returns False when the same content is already in the db
    if self.get(key) != value:
        raise WriteFailedError(key)
return True
As with writes, the client issues multiple deletes to clean up all replicas.
def delete(self, key):
rs = [s.delete(key) for s in self._get_servers(key)]
return rs.count(True) >= self.W
Read-Repair
During a read, the client tries to repair (rewrite) the replicas on any nodes found to be missing the data.
def get(self, key):
ss = self._get_servers(key)
for i, s in enumerate(ss):
r = s.get(key)
if r is not None:
# self heal
for k in range(i):
ss[k].set(key, r)
return r
Data Versioning (consistency)
Metadata is attached to each value to record version information.
On the server side, when saving data ([key, value]) into the tc (Tokyo Cabinet) file, the server checks the local data version (called old_version):
- If version = 0, the write request was sent by a client: set version = old_version + 1
- If version > 0, the write request was sent by sync: perform the set only when version > old_version
typedef struct t_meta {
int32_t version;
uint32_t hash;
uint32_t flag;
time_t modified;
} Meta;
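The version rule above can be sketched as follows; the dict and the `save` name are hypothetical stand-ins for the Tokyo Cabinet file, not BeansDB's actual API:

```python
# Sketch of the server-side version check; the dict stands in for the
# Tokyo Cabinet file (store/save are illustrative names, not BeansDB's API).
store = {}  # key -> (version, value)

def save(key, value, version):
    old_version = store.get(key, (0, None))[0]
    if version == 0:
        # request sent by a client: bump the local version
        store[key] = (old_version + 1, value)
        return True
    # request sent by sync: accept only strictly newer versions
    if version > old_version:
        store[key] = (version, value)
        return True
    return False
```

Rejecting non-newer sync versions makes replaying the same sync traffic idempotent.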
HashTree (Merkle tree) sync
A cron job calls a sync script to synchronize data between servers, so BeansDB can ensure all data is eventually consistent.
Principle
Hash trees or Merkle trees are a type of data structure which contains a tree of summary information about a larger piece of data – for instance a file – used to verify its contents.
In a leaf node, hash_v = hash(data). In an internal node, hash_v = hash(hash_v of its child nodes). So if the data in a leaf node changes, the change propagates from the leaf up to the root. To locate the difference between two hash trees, follow the mismatching path down the tree; it takes only O(log n) steps to reach the changed leaf node.
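A minimal sketch of this search, assuming a binary tree over a power-of-two number of leaves (BeansDB's real tree is 16-ary):

```python
# Minimal hash-tree diff: compare roots, then walk down only where hashes
# mismatch, reaching a changed leaf in O(log n) comparisons.
# Assumes a binary tree over a power-of-two number of leaves.
def build(leaves):
    """Build the tree bottom-up; returns a list of levels, leaves first."""
    level = [hash(("leaf", d)) for d in leaves]
    levels = [level]
    while len(level) > 1:
        level = [hash((level[i], level[i + 1])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

def diff(a, b):
    """Return indices of leaves whose hashes differ between two trees."""
    top = len(a) - 1
    frontier = [0] if a[top][0] != b[top][0] else []
    for depth in range(top, 0, -1):
        nxt = []
        for i in frontier:
            for c in (2 * i, 2 * i + 1):
                if a[depth - 1][c] != b[depth - 1][c]:
                    nxt.append(c)
        frontier = nxt
    return frontier
```

Only the subtrees along mismatching paths are ever compared; equal roots mean no further work.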
douban Implementation
tree structure
- each node has 16 sub-nodes (2^4)
- tree height is <= 8; each hash tree covers at most 2^32 elements
- a leaf node holds at most 128 elements; if more than 128 elements accumulate in a node, create 16 child nodes and distribute all items from the parent to the children. The node then becomes an internal node.
- distribute: divide k_hash (32 bits) into 8 segments; each segment (4 bits) is the child index at one layer: child_idx = 0x0f & (k_hash >> ((7 - node->depth) * 4)). Nodes at the 8th layer have no sub-nodes.
- if the total number of items in all children is not greater than 128, move all items from the children back into the parent and delete the children.
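The distribute step's child-index computation can be sketched as:

```python
# Child index at each layer: the depth-th 4-bit segment of the 32-bit k_hash,
# taken from the most significant end (depth 0 selects among the root's children).
def child_idx(k_hash, depth):
    return 0x0f & (k_hash >> ((7 - depth) * 4))
```

For k_hash = 0x12345678, the path from root to leaf is simply the hex digits 1, 2, 3, ..., 8.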
static const int g_index[] = {0, 1, 17, 289, 4913, 83521, 1419857, 24137569, 410338673};

Layer layout of node sequence numbers (g_index[depth] is the first node of each layer):
depth 0: 0 (root)
depth 1: 1 ... 16
depth 2: 17 ...
depth 3: 289 ...
depth 4: 4913 ...
depth 5: 83521 ...
depth 6: 1419857 ...
depth 7: 24137569 ...
depth 8: 410338673 ...
hash function
- element(item) = [key, k_hash, v_hash]
- node_hash = node_hash * 97 + child[i]_hash
- leaf_node_hash = sum(k_hash * v_hash) of all elements
#define FNV_32_PRIME 0x01000193
#define FNV_32_INIT 0x811c9dc5

typedef unsigned int uint32_t;

static inline uint32_t fnv1a(const char *key, int key_len)
{
    uint32_t h = FNV_32_INIT;
    int i;
    for (i = 0; i < key_len; i++) {
        h ^= (uint32_t)key[i];
        h *= FNV_32_PRIME;
    }
    return h;
}
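The two hashing rules above can be sketched in Python (illustrative; masked to 32 bits to mirror the C arithmetic):

```python
# Leaf hash: sum of k_hash * v_hash over all elements in the leaf.
def leaf_node_hash(items):
    # items: list of (key, k_hash, v_hash)
    return sum(k_hash * v_hash for _, k_hash, v_hash in items) & 0xffffffff

# Internal-node hash: fold the children's hashes with multiplier 97.
def internal_node_hash(child_hashes):
    h = 0
    for ch in child_hashes:
        h = (h * 97 + ch) & 0xffffffff
    return h
```

Because the leaf hash is a sum, inserting or removing one element only needs an incremental add or subtract, not a rescan of the leaf.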
data structure
- use an array (breadth-first) to keep all the nodes
- use an array (breadth-first) to keep the corresponding data for each node: for an internal node, the entry in this array is null; for a leaf node, it must not be null
- Tricks of node position in a tree
- (node - tree->root) is the sequence number of node
- (node - tree->root) - g_index[(int)node->depth] is the offset relative to the first node in the same layer
- g_index[node->depth + 1] + (get_pos(tree, node) << 4) is the first child of node
- get_pos(tree, node) << ((8 - node->depth) * 4) is the minimum k_hash belonging to the node; (get_pos(tree, node) + 1) << ((8 - node->depth) * 4) is the maximum
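These position identities can be checked with a small sketch (g_index copied from the C code above; function names mirror the ones in the list):

```python
g_index = [0, 1, 17, 289, 4913, 83521, 1419857, 24137569, 410338673]

def get_pos(seq, depth):
    # offset of node `seq` relative to the first node in its layer
    return seq - g_index[depth]

def first_child(seq, depth):
    # sequence number of the node's first child
    return g_index[depth + 1] + (get_pos(seq, depth) << 4)

def khash_range(seq, depth):
    # half-open [min, max) range of k_hash values belonging to the node
    pos = get_pos(seq, depth)
    lo = pos << ((8 - depth) * 4)
    hi = (pos + 1) << ((8 - depth) * 4)
    return lo, hi
```

For example, the root (seq 0, depth 0) owns the whole 32-bit k_hash space and its first child is node 1.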
- Tricks of pointer arithmetic
- (Item*)((char*)p + p->length) advances to the next variable-length item in a packed buffer
- (void*)p - (void*)data is the byte offset of an item within its data block
typedef struct t_item Item;
struct t_item {
    uint32_t keyhash;     /* k_hash: fnv1a of the key */
    uint32_t hash;        /* v_hash of the value */
    short ver;
    unsigned char length; /* total size of this item, including the name */
    char name[1];         /* variable-length key, stored past the struct */
};
static Item* create_item(HTree *tree, const char *name, int ver, uint32_t hash)
{
    size_t n = strlen(name);
    Item *it = (Item*)tree->buf;  /* reuse the tree's scratch buffer */
    strncpy(it->name, name, n);
    it->name[n] = 0;
    it->ver = ver;
    it->hash = hash;              /* v_hash */
    it->keyhash = fnv1a(name, n); /* k_hash */
    it->length = sizeof(Item) + n; /* name[1] in the struct already counts the null byte */
    return it;
}