一致性hash算法虚拟节点_一致性hash算法和lvs sh调度算法改进

本文探讨了LVS中的source address hash算法及其问题,提出了一致性哈希作为解决方案。分析了Haproxy中一致性哈希的实现,包括虚拟节点、哈希计算、节点添加删除以及EBtree数据结构。最后,介绍了一个简化版的一致性哈希调度算法,适用于CDN业务。
摘要由CSDN通过智能技术生成

1 LVS-sh调度算法(souce address hash)

本节是LVS中sh调度算法的实现分析;

ip_vs_sh_init_svc() - 创建256个hash bucket的hash table,并将rs映射到这256个bucket中,增加rs的引用计数器;

ip_vs_sh_done_svc() - 释放rs引用计数器,并销毁hash table;

ip_vs_sh_update_svc - 销毁hash

table,重新创建hash table并映射rs;

ip_vs_sh_schedule – 计算hash

key,找到hash bucket对应的rs;

注意:sh调度算法中rs数目必须<=256;

该算法存在的问题:当一台rs

down了之后,rs映射到hash table的顺序会被重置,从而导致当前连接100%损失。

为了解决上述问题,我们第一步自然想到了一致性hash;

下面我们来分析一下haproxy中一致性hash的实现方法。

2 Consistent

hash

本节我们将分析haproxy1.4.5中一致性hash的实现方法。

2.1 Consistent

HASH

以cache为例,实现思想:

1.

根据server(cache)某个参数计算出的key,将server映射到2^31-1的环形空间中;object对象根据自身的key存储到相近的cache server中;当cache A server down之后,只会影响环中其key附近的object;

2.

由于cache

server无法均匀的映射到环形空间中(不均衡问题),所以引入虚拟节点的概念,每台cache server分配X个虚拟节点映射到环中。

Haproxy利用EBtree来管理node,实现一致性hash。

1.

虚拟节点个数

虚拟节点的分配在chash_init_server_tree()中实现;

虚拟节点个数srv->lb_nodes_tot = srv->uweight(32) *

BE_WEIGHT_SCALE(16)=512个;

即1台server会有512个虚拟节点;

2.

hash key的计算

每个虚拟节点的hash

key按如下公式计算:

for (node = 0; node <

srv->lb_nodes_tot; node++) {

……

srv->lb_nodes[node].node.key=chash_hash(srv->puid*SRV_EWGHT_RANGE

+ node);

}

其中,puid为server配置的id号,SRV_EWGHT_RANGE=4K;

3.

添加/删除节点

采用EBtree管理节点,添加删除节点在chash_queue_dequeue_srv()中实现;

chash_queue_dequeue_srv()相应操作函数为eb32_insert()和eb32_delete();

2.2 EBtree

an elastic

binary tree (or EB tree or EBtree) is a binary search tree,

a fast radix tree(参见)

EBtree采用entry包含node和leaf的方法,解决了radix tree中entry个数过多的问题;

Complexity :

lookup  : min=1,

max=log(P), avg=log(N)

insertion from

root : min=1, max=log(P), avg=log(N)

insertion of dups  :

min=1, max=log(D), avg=log(D)/2 after lookup

deletion  : min=1,

max=1, avg=1

prev/next  : min=1,

max=log(P), avg=2 :

数据结构

struct

eb_root {

eb_troot_t

*b[EB_NODE_BRANCHES];

};

struct

eb_node {

struct eb_root branches;

eb_troot_t

*node_p;

eb_troot_t

*leaf_p;

short int

bit;

short int

pfx;

};

The node part itself

is composed of a parent pointer (node_p), a level

(bit), and a root, which itself links to lower nodes

constituting a subtree. The

level part represents the lowest bit position in the key,

above which all bits are equal for all keys in the lower

subtrees.

A root only contains

two links called branches, which represent the value of the

bit below current node's. There is always a left and a

right branch, except for the top of the tree. Left

branch represents bit value 0 while right branch reprensents

value 1.

Branches are typed. This

means that a node knows whether its branches point to other

nodes or to leaves. Parent pointers are typed too.

This means that each node or leaf knows if it is

attached to the left or the right branches of the

upper root.

2.2.1 Branches and Parent

pointers are typed

#define

EB_LEFT

0

#define

EB_RGHT

1

#define

EB_NODE

0

#define

EB_LEAF

1

所有的地址是4字节对齐的,因此采用lowest bit来作标记。

a node pointer will be stored

verbatim (with lowest bit cleared)

a leaf pointer will be stored with

lowest bit set (which equals pointer + 1)

a parent pointer in a left branch

will be stored verbatim (with lowest bit cleared)

a parent pointer in a right branch

will be stored with lowest bit set (which equals pointer +

1)

2.2.2 Eb32tree

1.

eb32_insert

a)

Tree is empty, insert the leaf

part below the left branch;

if

(unlikely(troot == NULL)) {

……

b)

Walk down the tree by

newnode->key and oldnode’s bit;

root = &old->node.branches;

side = (newkey >> old_node_bit) &

EB_NODE_BRANCH_MASK;

troot = root->b[side];

c)

if we have reached a leaf node,

insert above a leaf;

if

(eb_gettag(troot) == EB_LEAF) {

……

d)

if we don't have common bits

anymore/ duplicate tree, insert above the node;

if

((old_node_bit < 0) ||

(((new->key ^ old->key) >> old_node_bit) >=

EB_NODE_BRANCHES))

……

2.

eb32_delete

remove leaf node and marks the node as

unlinked;

a)

获得父节点parent =

eb_root_to_node(eb_untag(node->leaf_p, pside));

b)

获得祖父节点gparent =

eb_untag(parent->node_p, gpside);

c)

祖父节点分支ç父节点分支节点gparent->b[gpside] =

parent->branches.b[!pside];

d)

父节点分支节点 node_p/leaf_p è祖父节点分支;

e)

Mark the parent

unused,if the parent is our own node,safely revmoce it;

parent->node_p = NULL;

if

(!node->node_p)

goto delete_unlink;

f)

New node:Leaf-parent instead node

parent->node_p = node->node_p;

parent->branches = node->branches;

parent->bit = node->bit;

g)

Update new node’s

parent

gpside = eb_gettag(parent->node_p);

gparent = eb_untag(parent->node_p, gpside);

gparent->b[gpside] = eb_dotag(&parent->branches,

EB_NODE);

h)

Update new node’s

branches

for (pside = 0; pside <= 1; pside++) {

if (eb_gettag(parent->branches.b[pside]) == EB_NODE)

{

eb_root_to_node(eb_untag(parent->branches.b[pside],

EB_NODE))->node_p =

eb_dotag(&parent->branches, pside);

} else {

eb_root_to_node(eb_untag(parent->branches.b[pside],

EB_LEAF))->leaf_p =

eb_dotag(&parent->branches, pside);

}

}

3.

eb32_lookup

遍历查找指定key的node;

a)

根据key(x)和node->bit 查找

troot =

node->node.branches.b[(x >> node_bit) &

EB_NODE_BRANCH_MASK];

b)

如果是leaf,则判断key是否相等;

if ((eb_gettag(troot) == EB_LEAF)) {

node = container_of(eb_untag(troot, EB_LEAF),

struct eb32_node, node.branches);

if (node->key == x)

return node;

else

return NULL;

}

c)

如果是node,则判断是否是dup-tree,或者

no more common

bits(return NULL);

4.

eb32_first

a)

遍历左子树,直到叶子节点;

b)

实现:eb_walk_down(root->b[0], EB_LEFT);

5.

eb32_last

a)

遍历右子树,直到叶子节点;

b)

实现:eb_walk_down(root->b[0], EB_RGHT);

6.

eb32_prev

a)

挂在左子树上,当前指针指向父节点,继续执行a;

b)

挂在右子树上,eb32_last遍历父节点左子树;

7.

eb32_next

a)

挂在右子树上,当前指针指向父节点,继续执行a;

b)

挂在左子树上,eb32_first遍历父节点右子树;

3

LVS-scsh调度算法(simple consistent souce address

hash)

通过上述分析,一致性hash实现比较复杂,对于LVS调度来说,显得较重;

为此,我们设计了一个简化的一致性调度算法,本算法可以保证rs down/up时的一致性,无法保证add new rs时的一致性;

本算法适用于CDN业务,因为CDN节点添加新rs频度几乎为0。

本算法其实现原理如下:

ip_vs_sh_init_svc() – 同sh,但分配total_rs个hash bucket;

ip_vs_sh_done_svc() – 同sh;

ip_vs_sh_update_svc – 同sh,但分配total_rs个hash bucket;同时,更新avalid_rs个数;

ip_vs_sh_schedule – 计算hash

key,key mod total_rs找到hash bucket对应的rs;如果rs不可用,则key mod avalid_rs,跳过weight=0的rs,找到对应的rs;

我们主要实现一次hash,二次取模;之所以采取二次取模,而不是取下一个可用rs,是为了分散热点;

其中:

total_rs=svc-> num_dests;

avalid_rs存储在svc->sched_data中;

注:二次hash算法有2个注意点

a)

vip必须配置inhibit-on-failure,以保证rs down时,只是修改weight=0,而不是重新map rs;

b)

新添rs时,会重新map rs,导致hash不一致;

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值