Cockroach Design Translation (15): Node Allocation (via Gossip)

1  Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring every node to know the status of all, or even a large number of, peer nodes, or alternatively requiring a specialized curator or master with sufficiently global knowledge, we use a gossip protocol to efficiently communicate only interesting information between all of the nodes in the cluster. What is interesting information? One example would be whether a particular node has a lot of spare capacity. Each node, when gossiping, compares each topic of gossip to its own state. If its own state is "more interesting" than the least interesting item it has seen recently in that topic, it includes its own state as part of the next gossip session with a peer node. In this way, a node with capacity sufficiently in excess of the mean is quickly discovered by the entire cluster. To avoid piling onto outliers, nodes from the high-capacity set are selected at random for allocation.
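The "more interesting than the least interesting item" rule and the random choice among high-capacity nodes can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `(node_id, capacity)` tuple layout, and the `max_items` bound are hypothetical and do not come from the design document.

```python
import random

def should_gossip_own_state(own_capacity, topic_items, max_items):
    """Return True if this node's spare capacity belongs in the topic.

    topic_items: list of (node_id, capacity) pairs seen recently (assumed layout).
    max_items: hypothetical bound on how many items a topic keeps.
    """
    if len(topic_items) < max_items:
        return True  # the topic still has room, so any state is worth sharing
    least_interesting = min(capacity for _, capacity in topic_items)
    return own_capacity > least_interesting

def pick_allocation_target(topic_items, rng=random):
    """Pick a node uniformly at random from the high-capacity set,
    rather than always choosing the single largest outlier."""
    node_id, _ = rng.choice(topic_items)
    return node_id

topic = [("n1", 120), ("n2", 300), ("n3", 210)]
print(should_gossip_own_state(250, topic, max_items=3))  # True: 250 beats 120
print(should_gossip_own_state(100, topic, max_items=3))  # False: 100 does not
print(pick_allocation_target(topic))
```

Choosing at random spreads new allocations across the whole high-capacity set, which is the "avoid piling onto outliers" behavior described above.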

The gossip protocol itself contains two primary components:

Peer Selection: each node maintains up to N peers with which it regularly communicates. It selects peers with an eye towards maximizing fanout. A peer node which itself communicates with an array of otherwise unknown nodes will be selected over one which communicates with a set containing significant overlap. Each time gossip is initiated, each node's set of peers is exchanged. Each node is then free to incorporate the other's peers as it sees fit. To avoid any node suffering from excess incoming requests, a node may refuse to answer a gossip exchange. Each node is biased towards answering requests from nodes without significant overlap and refusing requests otherwise.

Peers are efficiently selected using a heuristic as described in Agarwal & Trachtenberg (2006).

TBD: how to avoid partitions? Need to work out a simulation of the protocol to tune the behavior and see empirically how well it works.
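To make the fanout idea concrete, here is a rough sketch of overlap-aware peer selection. It is not the Agarwal & Trachtenberg heuristic; it substitutes a plain greedy choice (pick the candidate whose peer set adds the most unknown nodes), and it replaces the "bias" towards answering low-overlap requests with a hard threshold. All names and the `max_overlap` default are assumptions made for illustration.

```python
def overlap(known, peer_set):
    """Fraction of a node's peers that we already know about."""
    if not peer_set:
        return 1.0  # an empty peer set adds no fanout
    return len(known & peer_set) / len(peer_set)

def select_peers(own_id, own_peers, candidates, max_peers):
    """Greedily pick up to max_peers candidates that add the most unknown nodes.

    candidates: dict mapping candidate node id -> set of that node's peers.
    """
    known = set(own_peers) | {own_id}
    remaining = dict(candidates)
    chosen = []
    while remaining and len(chosen) < max_peers:
        # sorted() makes ties deterministic: the smallest id wins.
        best = max(sorted(remaining), key=lambda n: len(remaining[n] - known))
        chosen.append(best)
        known |= remaining.pop(best) | {best}
    return chosen

def should_answer(own_peers, requester_peers, max_overlap=0.5):
    """Answer a gossip request only if the requester's peers do not
    overlap too heavily with our own (hypothetical hard threshold)."""
    return overlap(set(own_peers), set(requester_peers)) <= max_overlap

candidates = {"b": {"a", "c"}, "d": {"e", "f", "g"}, "h": {"e", "f"}}
print(select_peers("a", {"c"}, candidates, max_peers=1))  # ['d']
```

Here node `d` is preferred because all three of its peers are unknown to `a`, while `b` only talks to nodes `a` already knows.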

 

Gossip Selection: what to communicate. Gossip is divided into topics. Load characteristics (capacity per disk, CPU load, and state [e.g. draining, ok, failure]) are used to drive node allocation. Range statistics (range read/write load, missing replicas, unavailable ranges) and network topology (inter-rack bandwidth/latency, inter-datacenter bandwidth/latency, subnet outages) are used for determining when to split ranges, when to recover replicas vs. wait for network connectivity, and for debugging / sysops. In all cases, a set of minimums and a set of maximums are propagated; each node applies its own view of the world to augment the values. Each minimum and maximum value is tagged with the reporting node and other accompanying contextual information. Each topic of gossip has its own protobuf to hold the structured data. The number of items of gossip in each topic is limited by a configurable bound.
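The bounded sets of minimums and maximums per topic might look something like the sketch below. A plain dataclass stands in for the per-topic protobuf, and the names `GossipItem`, `merge_extremes`, and `max_items` are hypothetical; the design text only states that values are tagged with the reporting node and that each topic has a configurable item bound.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class GossipItem:
    value: float       # the measured quantity, e.g. CPU load
    node_id: str       # the reporting node
    context: str = ""  # other accompanying contextual information

def merge_extremes(received, own, max_items):
    """Combine items received from a peer with this node's own reading and
    keep only the max_items smallest and max_items largest values."""
    by_node = {}
    for item in list(received) + [own]:
        by_node[item.node_id] = item  # a later report replaces an earlier one
    items = list(by_node.values())
    minimums = heapq.nsmallest(max_items, items, key=lambda i: i.value)
    maximums = heapq.nlargest(max_items, items, key=lambda i: i.value)
    return minimums, maximums

received = [
    GossipItem(0.9, "n1", "cpu load"),
    GossipItem(0.2, "n2", "cpu load"),
    GossipItem(0.5, "n3", "cpu load"),
]
own = GossipItem(0.1, "n4", "cpu load")
mins, maxs = merge_extremes(received, own, max_items=2)
print([i.node_id for i in mins])  # ['n4', 'n2']
print([i.node_id for i in maxs])  # ['n1', 'n3']
```

Because each node folds in its own reading before passing the sets on, the cluster converges on the extreme values without any node needing the full list.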

For efficiency, nodes assign each new item of gossip a sequence number and keep track of the highest sequence number each peer node has seen. Each round of gossip communicates only the delta containing new items.
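A minimal sketch of the sequence-number bookkeeping, assuming that every delta sent to a peer is delivered successfully (a real implementation would need acknowledgements before advancing the peer's high-water mark). The class and method names are hypothetical.

```python
class GossipLog:
    def __init__(self):
        self.next_seq = 1
        self.items = []      # (seq, payload) pairs in increasing seq order
        self.peer_seen = {}  # peer_id -> highest sequence number that peer has seen

    def add(self, payload):
        """Assign the next sequence number to a new item of gossip."""
        seq = self.next_seq
        self.next_seq += 1
        self.items.append((seq, payload))
        return seq

    def delta_for(self, peer_id):
        """Return only the items the peer has not seen, then record
        the new high-water mark for that peer."""
        seen = self.peer_seen.get(peer_id, 0)
        delta = [(s, p) for s, p in self.items if s > seen]
        if delta:
            self.peer_seen[peer_id] = delta[-1][0]
        return delta

log = GossipLog()
log.add("n2 has spare capacity")
log.add("n5 is draining")
print(log.delta_for("peer-a"))  # both items on the first round
print(log.delta_for("peer-a"))  # [] because nothing is new
```

A second round with the same peer sends nothing until `add` is called again, which is the delta behavior the paragraph describes.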
