Cockroach Design Translation (15): Node Allocation (via Gossip)

1  Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring every node to know about the status of all or even a large number of peer nodes, or alternatively requiring a specialized curator or master with sufficiently global knowledge, we use a gossip protocol to efficiently communicate only interesting information between all of the nodes in the cluster. What’s interesting information? One example would be whether a particular node has a lot of spare capacity. Each node, when gossiping, compares each topic of gossip to its own state. If its own state is somehow “more interesting” than the least interesting item in the topic it’s seen recently, it includes its own state as part of the next gossip session with a peer node. In this way, a node with capacity sufficiently in excess of the mean quickly becomes discovered by the entire cluster. To avoid piling onto outliers, nodes from the high capacity set are selected at random for allocation.
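The two rules in this paragraph can be sketched in a few lines of Python. This is only an illustration, not code from Cockroach: the function names, the use of a single number for "spare capacity", and the `max_items` bound are all assumptions made for the example.

```python
import random


def should_gossip_own_state(own_capacity, seen_capacities, max_items):
    """Decide whether this node's spare capacity is worth gossiping.

    seen_capacities: capacities recently seen in the topic (hypothetical
    representation; the design does not specify one).
    """
    # While the topic still has room, any value is worth sharing.
    if len(seen_capacities) < max_items:
        return True
    # Otherwise share only if we beat the least interesting item seen.
    return own_capacity > min(seen_capacities)


def pick_allocation_target(high_capacity_nodes, rng=random):
    """Pick one node at random from the high capacity set.

    Choosing at random, rather than always taking the single best node,
    is what keeps every allocator from piling onto the same outlier.
    """
    return rng.choice(sorted(high_capacity_nodes))
```

For example, with a topic bounded at three items that already holds capacities 10, 20 and 30, a node with capacity 50 would include its state in the next gossip session, while a node with capacity 5 would stay quiet.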


The gossip protocol itself contains two primary components:


Peer Selection: each node maintains up to N peers with which it regularly communicates. It selects peers with an eye towards maximizing fanout. A peer node which itself communicates with an array of otherwise unknown nodes will be selected over one which communicates with a set containing significant overlap. Each time gossip is initiated, each node’s set of peers is exchanged. Each node is then free to incorporate the other’s peers as it sees fit. To avoid any node suffering from excess incoming requests, a node may refuse to answer a gossip exchange. Each node is biased towards answering requests from nodes without significant overlap and refusing requests otherwise.

Peers are efficiently selected using a heuristic as described in Agarwal & Trachtenberg (2006).
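To make the fanout idea concrete, here is a minimal sketch that scores a candidate by how many nodes it knows that we do not. This is a plain set-difference count chosen for illustration, not the Agarwal & Trachtenberg heuristic; the function names and the `min_new` threshold are assumptions.

```python
def novelty(candidate_peers, known_nodes):
    """Count the nodes a candidate talks to that we do not know yet."""
    return len(set(candidate_peers) - set(known_nodes))


def choose_peer(candidates, known_nodes):
    """Pick the candidate whose peer set overlaps least with ours.

    candidates: dict mapping node id -> set of that node's peers.
    Ties are broken by node id so the result is deterministic.
    """
    return max(sorted(candidates),
               key=lambda n: novelty(candidates[n], known_nodes))


def should_answer(requester_peers, known_nodes, min_new=1):
    """Bias towards answering requesters that bring unfamiliar nodes."""
    return novelty(requester_peers, known_nodes) >= min_new
```

Computing exact set differences requires exchanging full peer sets, which is acceptable in a sketch but is the cost that an estimation heuristic would aim to avoid.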

TBD: how to avoid partitions? Need to work out a simulation of the protocol to tune the behavior and see empirically how well it works.





Gossip Selection: what to communicate. Gossip is divided into topics. Load characteristics (capacity per disk, CPU load, and state [e.g. draining, ok, failure]) are used to drive node allocation. Range statistics (range read/write load, missing replicas, unavailable ranges) and network topology (inter-rack bandwidth/latency, inter-datacenter bandwidth/latency, subnet outages) are used for determining when to split ranges, when to recover replicas vs. wait for network connectivity, and for debugging / sysops. In all cases, a set of minimums and a set of maximums is propagated; each node applies its own view of the world to augment the values. Each minimum and maximum value is tagged with the reporting node and other accompanying contextual information. Each topic of gossip has its own protobuf to hold the structured data. The number of items of gossip in each topic is limited by a configurable bound.
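A topic that keeps a bounded set of minimums and maximums, each tagged with its reporting node, might look roughly like the sketch below. The `Topic` class, its `limit` field, and the use of a Python dataclass in place of a protobuf message are all assumptions for illustration; the other contextual information mentioned above is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One gossip topic holding the extremes reported so far."""
    limit: int                                     # configurable bound per set
    minimums: list = field(default_factory=list)   # (value, node_id) pairs
    maximums: list = field(default_factory=list)   # (value, node_id) pairs

    def add(self, value, node_id):
        """Merge one report, keeping only the `limit` lowest and highest."""
        # A node's latest report replaces any earlier one it made.
        merged = {n: v for v, n in self.minimums + self.maximums}
        merged[node_id] = value
        ordered = sorted((v, n) for n, v in merged.items())
        self.minimums = ordered[:self.limit]
        self.maximums = ordered[-self.limit:]
```

With `limit=2`, reports of 10, 50, 30, 70 and 5 from nodes n1 to n5 leave the minimums as n5 and n1 and the maximums as n2 and n4; the mid-range report from n3 is dropped, which is how the bound keeps each topic small.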


For efficiency, nodes assign each new item of gossip a sequence number and keep track of the highest sequence number each peer node has seen. Each round of gossip communicates only the delta containing new items.
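The sequence-number bookkeeping can be sketched as follows. The `GossipLog` class and its method names are hypothetical, and the sketch assumes every delta is delivered successfully, so the high-water mark is advanced as soon as the delta is computed.

```python
class GossipLog:
    """Per-node log of gossip items with per-peer high-water marks."""

    def __init__(self):
        self.next_seq = 1
        self.items = []            # (seq, item) pairs in increasing seq order
        self.peer_high_water = {}  # peer id -> highest seq that peer has seen

    def add(self, item):
        """Record a new gossip item under the next sequence number."""
        self.items.append((self.next_seq, item))
        self.next_seq += 1

    def delta_for(self, peer_id):
        """Return only the items this peer has not seen yet."""
        seen = self.peer_high_water.get(peer_id, 0)
        delta = [(s, i) for s, i in self.items if s > seen]
        if delta:
            # Assumes delivery succeeds; a real protocol would wait for an ack.
            self.peer_high_water[peer_id] = delta[-1][0]
        return delta
```

A peer that has already received items 1 and 2 gets an empty delta on the next round, and only item 3 once it is added; a peer seen for the first time receives everything.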






