BeansDB Source Code Reading (1)

 

Introduction

BeansDB is a distributed key-value storage system designed for large-scale online systems, aiming for high availability and easy management. It takes its ideas from Amazon's Dynamo and simplifies them, following the KISS (Keep It Simple, Stupid) principle.

Why BeansDB?

  • It is proven in production at douban
  • It follows KISS: easy to understand or rewrite
  • It is a suitable "first step" for small and medium scale systems

Distributed Design

The distributed design is very similar to that of Memcached.

Client-Driven-Route

Partition by fixed bulk(shard) size

[Input: Key, Output: serverHost(s)]

  • Suppose we have two nodes, node1 and node2.
  • Divide the data into 16 bulks. Each bulk can be stored on one node or on multiple nodes, and each node can hold one or several bulks. The host config assigns bulk numbers to nodes, e.g. node1[0~7], node2[8~15].
  • key -> hash_value -> bulk number -> serverHost(s)
    # key -> hash_value
    def fnv1a(s):
        prime = 0x01000193
        h = 0x811c9dc5
        for c in s:
            h ^= ord(c)
            h = (h * prime) & 0xffffffff
        return h

    # hash_value -> bulk number (bulk_size is the number of bulks, 16 here;
    # 1 << 32 is the size of the hash space)
    bulk_number = bulk_size * hash_value // (1 << 32)
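
The whole chain key -> hash_value -> bulk number -> serverHost(s) can be put together in a few lines. This is only a minimal sketch: the names BULK_COUNT, HOSTS, bulk_of and servers_for are made up for illustration, the host table mirrors the two-node example above, and the key is hashed as UTF-8 bytes, which is an assumption.

```python
BULK_COUNT = 16  # number of bulks, as in the example above

# Hypothetical host config: node name -> bulk numbers it stores.
HOSTS = {"node1": range(0, 8), "node2": range(8, 16)}


def fnv1a(data):
    """32-bit FNV-1a over a bytes object."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def bulk_of(key):
    """Scale the 32-bit hash down to a bulk number in [0, BULK_COUNT)."""
    return BULK_COUNT * fnv1a(key.encode("utf-8")) // (1 << 32)


def servers_for(key):
    """Return every node whose bulk range contains the key's bulk."""
    bulk = bulk_of(key)
    return [name for name, bulks in HOSTS.items() if bulk in bulks]
```

With 16 bulks, the bulk number is simply the top 4 bits of the hash, so keys spread over the nodes in proportion to the number of bulks each node owns.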
    

Replication by client write

To keep several copies of the data, configure the following:

  • Assign each bulk number to several hosts: node1[0~8], node2[4~12], node3[8~15], node4[11~3]
  • Set the write replication factor: W=2
    # write to every replica server for this key
    def set(self, key, value):
        rs = [s.set(key, value) for s in self._get_servers(key)]
        if rs.count(True) < self.W:
            # fall back to a read: set returns False when the same content is already in the db
            if self.get(key) != value:
                raise WriteFailedError(key)
        return True
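
The overlapping ranges in the config above decide which servers hold a given bulk. Here is a rough sketch of that lookup, assuming the ranges are inclusive and that a range such as node4[11~3] wraps around from bulk 15 back to bulk 0. Both points are assumptions, and CONFIG, covers and replicas are names invented for this example.

```python
# The example config, written as inclusive (first, last) bulk ranges.
CONFIG = {
    "node1": (0, 8),
    "node2": (4, 12),
    "node3": (8, 15),
    "node4": (11, 3),  # assumed to wrap around: 11..15 and 0..3
}


def covers(first, last, bulk):
    """True if bulk lies in the inclusive range, allowing wrap-around."""
    if first <= last:
        return first <= bulk <= last
    return bulk >= first or bulk <= last


def replicas(bulk):
    """Nodes that store the given bulk number."""
    return [name for name, (first, last) in CONFIG.items()
            if covers(first, last, bulk)]
```

Under these assumptions every bulk from 0 to 15 lands on at least two nodes, which is what W=2 needs in order to succeed.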
    

As with writes, the client sends the delete to every replica to remove all copies.

def delete(self, key):
        rs = [s.delete(key) for s in self._get_servers(key)]
        return rs.count(True) >= self.W

Read-Repair

On reads, the client repairs (rewrites) the replicas whenever it finds that some nodes are missing the data.

def get(self, key):
        ss = self._get_servers(key)
        for i, s in enumerate(ss):
            r = s.get(key)
            if r is not None:
                # self heal
                for k in range(i):
                    ss[k].set(key, r)
                return r
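
To see the self-heal step in action, the same loop can be run against in-memory stand-ins for the servers. FakeServer and get_with_repair are hypothetical names used only for this sketch; a real client talks to BeansDB nodes over the network.

```python
class FakeServer:
    """In-memory stand-in for one node (for illustration only)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        return True


def get_with_repair(servers, key):
    """Return the first value found and copy it to the servers that missed."""
    for i, s in enumerate(servers):
        r = s.get(key)
        if r is not None:
            # every server before index i returned None, so rewrite them
            for k in range(i):
                servers[k].set(key, r)
            return r
    return None


a, b, c = FakeServer(), FakeServer(), FakeServer()
c.set("k", "v")                       # only the third replica has the key
value = get_with_repair([a, b, c], "k")
```

After the call, a and b hold the value as well. Only servers that were tried before the hit are repaired; replicas later in the list are not checked.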

Data Versioning (consistency)

Metadata is attached to each value to record version information.
On the server side, before saving data ([key, value]) into the TC (Tokyo Cabinet) file, the server checks the version stored locally (called old_version):

  • If version = 0, the write request was sent by a client; the record is stored with version = old_version + 1
  • If version > 0, the write request was sent by sync; the set is applied only when version > old_version
typedef struct t_meta {
    int32_t  version;
    uint32_t hash;
    uint32_t flag;
    time_t   modified;
} Meta;
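
The two version rules can be written as one small decision function. This is a sketch of the rules as described above, not the server's actual code; the function name next_version is invented, and any case the text does not describe is simply treated as "skip".

```python
def next_version(old_version, req_version):
    """Return the version to store, or None if the write should be skipped."""
    if req_version == 0:
        # client write: bump the locally stored version
        return old_version + 1
    if req_version > 0 and req_version > old_version:
        # sync write: accept only data newer than what is stored
        return req_version
    return None
```

For example, a client write over a record at version 3 stores version 4, while a sync write carrying version 2 or 3 is ignored.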

HashTree(Merkle trees) sync

A cron job calls a sync script to synchronize data between servers, so BeansDB can ensure that all data is eventually consistent.

 

Principle

Hash trees, or Merkle trees, are a data structure that holds a tree of summary information about a larger piece of data (for instance a file) and is used to verify its contents.
In a leaf node, hash_v = hash(data). In an internal node, hash_v = hash(hash_v of the child nodes). So if the data in a leaf changes, the change propagates from that leaf up to the root. To locate the differences between two hash trees, follow the path of mismatched hashes down from the root; reaching a changed leaf takes only O(log n) steps.
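
The principle is easy to try out on a tiny tree. The sketch below uses a binary tree and SHA-1 purely for illustration (BeansDB's own tree is 16-ary and uses a different hash, as described further down), and it assumes both trees are built over the same number of leaves. The names h, build and diff are made up for the example.

```python
import hashlib


def h(data):
    return hashlib.sha1(data).digest()


def build(leaves):
    """Return the tree as levels: levels[0] are leaf hashes, levels[-1] is the root."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        # each parent hashes the concatenation of its (up to two) children
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(t1, t2, depth=None, idx=0):
    """Indices of leaves that differ, descending only into mismatched subtrees."""
    if depth is None:
        depth = len(t1) - 1          # start at the root
    if t1[depth][idx] == t2[depth][idx]:
        return []                    # identical subtree, nothing to visit
    if depth == 0:
        return [idx]
    changed = []
    for child in (2 * idx, 2 * idx + 1):
        if child < len(t1[depth - 1]):
            changed += diff(t1, t2, depth - 1, child)
    return changed
```

Comparing build([b"a", b"b", b"c", b"d", b"e"]) with a copy whose fourth leaf was changed visits only the nodes on the path to that leaf; subtrees with equal hashes are never opened.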

douban Implementation

tree structure
  • each node has 16 sub-nodes (2^4)
  • tree height is <= 8. Each hash tree holds at most 2^32 elements
  • a leaf node holds at most 128 elements. If a node has more than 128 elements, 16 child nodes are created under it and all items are distributed from the parent to the children. The node then becomes an internal node.
  • distribute: divide k_hash (32 bits) into 8 segments; each segment (4 bits) is the child index at one layer. child_idx = (0x0f & ( k_hash >> (7 - node->depth) * 4 )). The 8th layer has no sub-nodes.
  • if the total number of items in all children is not greater than 128, then move all items from the children to the parent and delete all the children.
    /* g_index[d] is the array index of the first node at depth d: (16^d - 1) / 15 */
    static const int g_index[] = {0, 1, 17, 273, 4369, 69905, 1118481, 17895697, 286331153};
    
                              0(root)
                          1 ... 16
                        17  ...  
                       273  ...                      
                      4369  ...                   
                     69905  ...                   
                   1118481  ...                   
                  17895697  ...                   
                 286331153  ...  
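
The child_idx formula above reads the key hash four bits at a time, starting from the most significant bits. A small Python sketch of the same expression (child_index is a name chosen for this example):

```python
def child_index(k_hash, depth):
    """4-bit child slot of a 32-bit key hash at the given depth (root is depth 0)."""
    return 0x0F & (k_hash >> ((7 - depth) * 4))


# Path taken through the tree by a key whose hash is 0x12345678.
path = [child_index(0x12345678, d) for d in range(8)]
```

Here path is [1, 2, 3, 4, 5, 6, 7, 8]: each hex digit of the hash picks the child at one layer, so keys that share a hash prefix end up in the same subtree.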
    
hash function
  • element(item) = [key, k_hash, v_hash]
  • node_hash = node_hash * 97 + child[i]_hash
  • leaf_node_hash = sum(k_hash * v_hash) of all elements
    #define FNV_32_PRIME 0x01000193
    #define FNV_32_INIT 0x811c9dc5
    
    typedef unsigned int uint32_t;
    
    static inline uint32_t fnv1a(const char *key, int key_len)
    {
      uint32_t h = FNV_32_INIT;
      int i;
    
      for (i=0; i<key_len; i++) {
          h ^= (uint32_t)key[i];
          h *= FNV_32_PRIME;
      }
    
      return h;
    }
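
The two node-hash rules listed above can be sketched in Python as follows. The 32-bit wrap-around (MASK) and the starting value 0 for an internal node are assumptions made for this example; the function names are invented as well.

```python
MASK = 0xFFFFFFFF  # assumption: hashes wrap as 32-bit unsigned integers


def leaf_hash(items):
    """items: iterable of (k_hash, v_hash) pairs stored in one leaf."""
    return sum(k * v for k, v in items) & MASK


def internal_hash(child_hashes):
    """Fold the child hashes in order: node_hash = node_hash * 97 + child_hash."""
    node_hash = 0  # assumption: the fold starts from zero
    for c in child_hashes:
        node_hash = (node_hash * 97 + c) & MASK
    return node_hash
```

Because the leaf hash is a plain sum, it does not depend on the order in which items were inserted, while the internal hash does depend on the child order, which is fixed by the child index.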
    
data structure
  • an array (breadth first) holds all the nodes
  • a second array (breadth first) holds the corresponding data for each node. For an internal node the entry is null; for a leaf node it should not be null.
  • Tricks of node position in a tree
    • (node - tree->root) is the sequence number of the node
    • (node - tree->root) - g_index[(int)node->depth] is the offset relative to the first node in the same layer
    • g_index[node->depth + 1] + (get_pos(tree, node) << 4) is the first child of the node
    • get_pos(tree, node) << ((8 - node->depth) * 4) is the minimum k_hash that belongs to the node; (get_pos(tree, node) + 1) << ((8 - node->depth) * 4) is the exclusive upper bound.
  • Tricks of pointer arithmetic
    • (Item*)((char*)p + p->length) steps to the next variable-length item
    • (void*)p-(void*)data is the byte offset of p within data
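
The position tricks are plain index arithmetic on a breadth-first array of a 16-ary tree, so they can be checked outside of C. In this sketch layer_start plays the role of g_index and pos_in_layer the role of get_pos; both correspondences are assumptions based on the formulas above, and all function names are invented.

```python
def layer_start(depth):
    """Index of the first node at `depth`: 1 + 16 + ... + 16^(depth-1)."""
    return (16 ** depth - 1) // 15


def pos_in_layer(index, depth):
    """Offset of a node relative to the first node in its layer."""
    return index - layer_start(depth)


def first_child(index, depth):
    """Array index of the first of the node's 16 children."""
    return layer_start(depth + 1) + (pos_in_layer(index, depth) << 4)


def hash_range(index, depth):
    """Half-open range [lo, hi) of key hashes that belong to the node."""
    shift = (8 - depth) * 4
    pos = pos_in_layer(index, depth)
    return pos << shift, (pos + 1) << shift
```

For instance, the root (index 0, depth 0) covers the whole range [0, 2^32) and its first child is node 1; node 1 at depth 1 covers [0, 2^28) and its first child is node 17.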
typedef struct t_item Item;
struct t_item {
    uint32_t keyhash;
    uint32_t hash;
    short    ver;
    unsigned char length;
    char     name[1];
};

static Item* create_item(HTree *tree, const char *name, int ver, uint32_t hash)
{
    size_t n = strlen(name);
    Item *it = (Item*)tree->buf;
    strncpy(it->name, name, n);
    it->name[n] = 0;
    it->ver = ver;
    it->hash = hash;
    it->keyhash = fnv1a(name, n);
    it->length = sizeof(Item) + n;

    return it;
}
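
Since each Item records its own total length, items can be laid out back to back in one buffer and walked by adding length to the current offset, which is what the pointer trick (Item*)((char*)p + p->length) does. The Python sketch below imitates that layout with the struct module. It assumes little-endian fields, no padding before name, and sizeof(Item) == 12; these hold on common platforms but are not guaranteed. pack_item and iter_items are names invented for the example.

```python
import struct

HEADER = struct.Struct("<IIhB")  # keyhash, hash, ver, length: 11 bytes
ITEM_SIZE = 12                   # assumption: sizeof(Item) in C, including name[1]


def pack_item(name, ver, v_hash, keyhash):
    """Serialize one item: header, name bytes, trailing NUL."""
    length = ITEM_SIZE + len(name)   # same formula as create_item
    return HEADER.pack(keyhash, v_hash, ver, length) + name + b"\0"


def iter_items(buf):
    """Walk back-to-back items by advancing `length` bytes each time."""
    off = 0
    while off < len(buf):
        keyhash, v_hash, ver, length = HEADER.unpack_from(buf, off)
        name = buf[off + HEADER.size: off + length - 1]  # drop the NUL
        yield name, ver, v_hash
        off += length


buf = pack_item(b"foo", 1, 10, 111) + pack_item(b"barbaz", 2, 20, 222)
items = list(iter_items(buf))
```

Because length is a single unsigned byte, an item can be at most 255 bytes long, which under these assumptions limits a key name to 243 bytes.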

 

 
