Cassandra Source Code Analysis 4: GMS Cluster Management


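  • Adding a node. The node announces "I join the group" to everyone. This redistributes part of the hash space and requires data transfer (bootstrap). When does the new node start answering requests? Once all group members have a consistent view. If some nodes have updated their member view and others have not, what happens to reads and writes issued at that moment?
  • Removing a node (crash). In principle the data has N replicas; when one machine goes down, the next machine must be found to hold the replica.
  • Restarting a node. A restart must not trigger rebalancing of the partition.
  • Heartbeats between nodes: detecting node status.
  • Data consistency between nodes: how nodes holding replicas of the same data guarantee eventual consistency.
  • Consistency of the node view: how to make sure nodes share the same member view, for example on member join or leave, and how to notify all nodes at minimal network cost.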




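  • When a node starts, it sends a multicast or a UDP broadcast. Existing nodes receive it and update their member views.
  • The existing node with the smallest address (the master node) sends the member list to the new node.
  • Every node maintains heartbeats with all other nodes; if a node dies, it is removed from the list.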


Gossip Protocols

The gossip protocol used in Cassandra

A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second and two nodes efficiently reconcile their persisted membership change histories.

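A student at the University of Illinois made a set of slides (PPT) for the Advanced Operating Systems course

An introduction by Ruohai (若海), a developer at Taobao: Introduction to Gossip (Gossip简介)

The origin is a very highly cited paper: Epidemic algorithms for replicated database maintenance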


Theoretical basis: Epidemic model

Single infected site eventually infects entire population of susceptible sites
In database replication, infected site is the one with the latest update, susceptible sites are those needing the update

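A key point of gossip is that each node only needs to periodically synchronize its member view with one node in the cluster (chosen at random each time) for the member views of all nodes to converge. According to epidemic theory, once one node is infected (a new node syncs with that node), all nodes will eventually be infected.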


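When discussing gossip, another important term is anti-entropy. In a Cassandra article translated by dbthink, anti-entropy and gossip are discussed separately (which is how Cassandra's gossip and anti-entropy implementations are structured). In the paper's own presentation, however, anti-entropy is one form of gossip.
Terminology: in information theory, entropy is a quantitative measure of the amount of information, or equivalently of the uncertainty of a random variable. The greater the variable's uncertainty, the greater the entropy, and the more information is needed to pin it down (Baidu Baike). As I understand it, anti-entropy is the process of turning uncertainty into certainty.
Every site regularly chooses another site at random and, by exchanging database contents with it, resolves any differences between the two.
forsome (site s in Sites) // note "some", not "all": there is no need to sync with every node, which is the core of gossip
    resolveDifferences(localDB, s);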


if (localItem.timeStamp < i.timeStamp)
    localItem.value = i.value; // pull: fetch the update from the other node and apply it locally


if (localItem.timeStamp > i.timeStamp)
    i.value = localItem.value; // push


if (localItem.timeStamp < i.timeStamp)
    localItem.value = i.value; // pull
else if (localItem.timeStamp > i.timeStamp)
    i.value = localItem.value; // push
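
The push-pull variant above can be made concrete with a small self-contained sketch. It is only an illustration, assuming each replica is an in-memory map from key to a timestamped value; the names `AntiEntropySketch` and `Item` are hypothetical and do not come from the paper or from Cassandra.

```java
import java.util.HashMap;
import java.util.Map;

class AntiEntropySketch {
    /** A value tagged with the time of its last update (hypothetical type). */
    static final class Item {
        final String value;
        final long timeStamp;

        Item(String value, long timeStamp) {
            this.value = value;
            this.timeStamp = timeStamp;
        }
    }

    /** Push-pull: after the call both replicas hold the newer item for every key. */
    static void resolveDifferences(Map<String, Item> local, Map<String, Item> remote) {
        // pull: copy entries that are missing or older locally
        for (Map.Entry<String, Item> e : remote.entrySet()) {
            Item mine = local.get(e.getKey());
            if (mine == null || mine.timeStamp < e.getValue().timeStamp) {
                local.put(e.getKey(), e.getValue());
            }
        }
        // push: copy entries that are missing or older on the remote side
        for (Map.Entry<String, Item> e : local.entrySet()) {
            Item theirs = remote.get(e.getKey());
            if (theirs == null || theirs.timeStamp < e.getValue().timeStamp) {
                remote.put(e.getKey(), e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Item> a = new HashMap<>();
        Map<String, Item> b = new HashMap<>();
        a.put("k1", new Item("old", 1));
        b.put("k1", new Item("new", 2));
        a.put("k2", new Item("only-on-a", 5));
        resolveDifferences(a, b);
        System.out.println(a.get("k1").value + " " + b.get("k2").value); // prints "new only-on-a"
    }
}
```

Dropping the first loop gives a push-only exchange, and dropping the second gives pull-only.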

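On the surface, both pull and push eventually spread an update from one node to all nodes, but a probabilistic analysis shows that their propagation speeds differ: pull converges faster than push, meaning the update reaches all nodes sooner (see the Epidemic algorithms paper for the detailed analysis). If it takes log(n) rounds, then the number of communications needed to spread one update is log(n) * n, since every node communicates once per round.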
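
To get a feel for the round and message counts discussed above, here is a rough simulation sketch. It assumes synchronous rounds in which every node contacts exactly one uniformly random peer, and that a single node holds the update at the start; the class `GossipSpreadSketch` and its method names are made up for this illustration and are not part of Cassandra.

```java
import java.util.Random;

class GossipSpreadSketch {
    enum Mode { PUSH, PULL }

    /** Returns the number of rounds until all n nodes hold the update. */
    static int roundsToSpread(int n, Mode mode, long seed) {
        Random random = new Random(seed);
        boolean[] infected = new boolean[n];
        infected[0] = true;               // node 0 starts with the update
        int infectedCount = 1;
        int rounds = 0;
        while (infectedCount < n) {
            // decisions in a round are based on the state at the start of the round
            boolean[] next = infected.clone();
            for (int node = 0; node < n; node++) {
                int peer = random.nextInt(n - 1);
                if (peer >= node) {
                    peer++;               // pick a peer other than ourselves
                }
                if (mode == Mode.PUSH && infected[node] && !next[peer]) {
                    next[peer] = true;    // node pushes its update to the peer
                    infectedCount++;
                } else if (mode == Mode.PULL && infected[peer] && !next[node]) {
                    next[node] = true;    // node pulls the update from the peer
                    infectedCount++;
                }
            }
            infected = next;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        int n = 1000;
        for (Mode mode : Mode.values()) {
            int rounds = roundsToSpread(n, mode, 42L);
            // every node communicates once per round, so messages = rounds * n
            System.out.println(mode + ": rounds=" + rounds + ", messages=" + (long) rounds * n);
        }
    }
}
```

The exact numbers depend on the seed, so no output is listed here; the point is only that the round count grows slowly with n while the message count is rounds times n.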

Cassandra implementation




EndPointState
  - updateTimestamp
  - isAlive
  - isAGossiper
  - hasToken
  - HeartBeatState
      - generation
      - version
  - ApplicationState
      - version MOVE_STATE
          NORMAL,Token(Serial) //initServer
          BOOT,Token(Serial) //startBootstrap
          NORMAL,Token(Serial) //finishBootStrapping
          LEAVING,Token //startLeaving
          LEFT,left,Token //leaveRing
          LEFT,remove,Token //removeToken
      - version LOAD-INFORMATION
          diskUsage



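generation: set to the current time (in seconds) when the system starts (StorageService.initServer -> Gossiper.start).

version: incremented by 1 on every application state change, and by 1 on every heartbeat message.

ApplicationState currently contains two entries. One records whether the system is currently in the normal or the boot state (while booting, the node does not answer read or write requests; see how TokenMetadata maintains and uses all Tokens. The other states, left and leaving, are rarely used, for example when a node is forced out of the ring). The other is the system's current load information (disk usage), used by the load balancer (to be analyzed separately). Likewise, each State has its own version.


generation + version (max) is the basis for ordering two states of the same node, where version is the largest of the heartbeat, MOVE_STATE and LOAD-INFORMATION versions. These two fields are enough to build a GossipDigest object. When nodes synchronize state, they do not send all state information to the peer for comparison; instead each EndPointState is reduced to a GossipDigest, presumably to save on transferred data.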
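
The ordering described above can be sketched as a comparison on the pair (generation, max version). This is a simplified stand-in written for illustration, not the actual Cassandra `GossipDigest` class; the class `DigestOrderingSketch`, the nested `Digest` type and the `maxVersion` helper are assumptions.

```java
class DigestOrderingSketch {
    /** Simplified digest: one endpoint, its generation and its largest state version. */
    static final class Digest implements Comparable<Digest> {
        final String endpoint;
        final int generation;
        final int maxVersion;

        Digest(String endpoint, int generation, int maxVersion) {
            this.endpoint = endpoint;
            this.generation = generation;
            this.maxVersion = maxVersion;
        }

        /** Generation decides first; the max version only breaks ties within one generation. */
        @Override
        public int compareTo(Digest other) {
            if (generation != other.generation) {
                return Integer.compare(generation, other.generation);
            }
            return Integer.compare(maxVersion, other.maxVersion);
        }
    }

    /** The digest version is the largest of the three per-state versions. */
    static int maxVersion(int heartbeatVersion, int moveStateVersion, int loadInfoVersion) {
        return Math.max(heartbeatVersion, Math.max(moveStateVersion, loadInfoVersion));
    }

    public static void main(String[] args) {
        // before a restart: generation 100, versions (heartbeat=50, move=1, load=2)
        Digest beforeRestart = new Digest("10.0.0.1", 100, maxVersion(50, 1, 2));
        // after a restart the generation is newer but the versions start over
        Digest afterRestart = new Digest("10.0.0.1", 200, maxVersion(3, 1, 2));
        System.out.println(afterRestart.compareTo(beforeRestart) > 0); // prints "true"
    }
}
```

Because generation is taken from the start time, a restarted node always sorts after its earlier incarnation even though its version counters have been reset.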

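member status sync 1: (A -> B GossipDigestSynMessage)

As with the anti-entropy described earlier, the node randomly picks one (live) node and sends it all of its GossipDigests (one GossipDigest per known node). In addition, with a certain probability, it randomly picks an unreachable endpoint and sends it the sync message (if that endpoint responds, it is live again). Finally, it also syncs with a randomly chosen seed node if it has not already gossiped to a seed or if there are fewer live endpoints than seeds. The GossipDigests are wrapped in a GossipDigestSynMessage.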


//Gossiper.GossipTimerTask
Message message = makeGossipDigestSynMessage(gDigests);
/* Gossip to some random live member */
boolean gossipedToSeed = doGossipToLiveMember(message);
doGossipToUnreachableMember(message);
if (!gossipedToSeed || liveEndpoints_.size() < seeds_.size())
    doGossipToSeed(message);
doStatusCheck();

member status sync 2: (B -> A GossipDigestAckMessage)


If the max remote version is greater then we request the remote endpoint send us all the data for this endpoint with version greater than the max version number we have locally for this endpoint. If the max remote version is lesser, then we send all the data we have locally for this endpoint with version greater than the max remote version.

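Suppose the local node holds versions (1,2,10) for an endpoint, where 10 is the heartbeat version, which keeps increasing. Another, remote node holds versions (1,2,20) for the same endpoint. The local node then asks the remote endpoint to send the States with version > 10; the remote endpoint sends only the HeartBeatState, because only the heartbeat has version > 10. The local node thus updates the endpoint's versions to (1,2,20).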
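
The (1,2,10) versus (1,2,20) example can be replayed with a short sketch of the "send only what is newer" filter. It assumes the state of one endpoint is reduced to a map from state name to version, and that 1 and 2 belong to MOVE_STATE and LOAD-INFORMATION respectively (the text only says that the third number is the heartbeat). `DeltaStateSketch` and `statesNewerThan` are hypothetical names, not Cassandra methods.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DeltaStateSketch {
    /** Returns the states whose version is greater than the max version the requester already has. */
    static Map<String, Integer> statesNewerThan(Map<String, Integer> versionsByState, int knownMaxVersion) {
        Map<String, Integer> delta = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : versionsByState.entrySet()) {
            if (e.getValue() > knownMaxVersion) {
                delta.put(e.getKey(), e.getValue());
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        // what the remote node knows about the endpoint: versions (1, 2, 20)
        Map<String, Integer> remoteView = new LinkedHashMap<>();
        remoteView.put("MOVE_STATE", 1);
        remoteView.put("LOAD-INFORMATION", 2);
        remoteView.put("HeartBeatState", 20);

        // the local node has max version 10, so it asks for everything newer than 10
        System.out.println(statesNewerThan(remoteView, 10)); // prints "{HeartBeatState=20}"
    }
}
```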


//Gossiper.examineGossiper
synchronized void examineGossiper(List<GossipDigest> gDigestList, List<GossipDigest> deltaGossipDigestList, Map<InetAddress, EndPointState> deltaEpStateMap)
{
    for ( GossipDigest gDigest : gDigestList )
    {
        int remoteGeneration = gDigest.getGeneration();
        int maxRemoteVersion = gDigest.getMaxVersion();
        /* Get state associated with the end point in digest */
        EndPointState epStatePtr = endPointStateMap_.get(gDigest.getEndPoint());
        /*
            Here we need to fire a GossipDigestAckMessage. If we have some data associated with this endpoint locally
            then we follow the "if" path of the logic. If we have absolutely nothing for this endpoint we need to
            request all the data for this endpoint.
        */
        if ( epStatePtr != null )
        {
            int localGeneration = epStatePtr.getHeartBeatState().getGeneration();
            /* get the max version of all keys in the state associated with this endpoint */
            int maxLocalVersion = getMaxEndPointStateVersion(epStatePtr);
            if ( remoteGeneration == localGeneration && maxRemoteVersion == maxLocalVersion )
                continue;
            if ( remoteGeneration > localGeneration )
            {
                /* we request everything from the gossiper */
                requestAll(gDigest, deltaGossipDigestList, remoteGeneration);
            }
            if ( remoteGeneration < localGeneration )
            {
                /* send all data with generation = localgeneration and version > 0 */
                sendAll(gDigest, deltaEpStateMap, 0);
            }
            if ( remoteGeneration == localGeneration )
            {
                /*
                    If the max remote version is greater then we request the remote endpoint send us all the data
                    for this endpoint with version greater than the max version number we have locally for this
                    endpoint.
                    If the max remote version is lesser, then we send all the data we have locally for this endpoint
                    with version greater than the max remote version.
                */
                if ( maxRemoteVersion > maxLocalVersion )
                {
                    deltaGossipDigestList.add( new GossipDigest(gDigest.getEndPoint(), remoteGeneration, maxLocalVersion) );
                }
                if ( maxRemoteVersion < maxLocalVersion )
                {
                    /* send all data with generation = localgeneration and version > maxRemoteVersion */
                    sendAll(gDigest, deltaEpStateMap, maxRemoteVersion);
                }
            }
        }
        else
        {
            /* We are here since we have no data for this endpoint locally so request everything. */
            requestAll(gDigest, deltaGossipDigestList, remoteGeneration);
        }
    }
}

member status sync 3: (A -> B GossipDigestAck2Message)


member status sync 4




//Gossiper.applyApplicationStateLocally
markAlive(ep, localEpStatePtr); //live
applyHeartBeatStateLocally(ep, localEpStatePtr, remoteState);
/* apply ApplicationState */
applyApplicationStateLocally(ep, localEpStatePtr, remoteState);
handleNewJoin(ep, remoteState);

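The basic principle of failure detection is this: if now - last_heart_time is far greater than the average interval between past heartbeats (about 18 times, according to the formula in the code), the node is declared dead and the IFailureDetectionEventListener is notified (Gossiper.convict).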


//FailureDetector.ArrivalWindow
synchronized void add(double value)
{
    double interArrivalTime;
    if ( tLast_ > 0L )
    {
        interArrivalTime = (value - tLast_);
    }
    else
    {
        interArrivalTime = Gossiper.intervalInMillis_ / 2;
    }
    tLast_ = value;
    arrivalIntervals_.add(interArrivalTime);
}

double p(double t)
{
    double mean = mean();
    double exponent = (-1)*(t)/mean;
    return 1 - ( 1 - Math.pow(Math.E, exponent) );
}

double phi(long tnow)
{
    int size = arrivalIntervals_.size();
    double log = 0d;
    if ( size > 0 )
    {
        double t = tnow - tLast_;
        double probability = p(t);
        log = (-1) * Math.log10( probability );
    }
    return log;
}

//FailureDetector.interpret
if ( phi > phiConvictThreshold_ )
{
    for ( IFailureDetectionEventListener listener : fdEvntListeners_ )
    {
        listener.convict(ep);
    }
}
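
The factor of roughly 18 can be recovered from the listing: p(t) simplifies to e^(-t/mean), so phi = -log10(e^(-t/mean)) = t / (mean * ln 10), and phi exceeds a threshold once t / mean exceeds threshold * ln 10. The sketch below checks this numerically. It assumes a convict threshold of 8, which is the value that matches the "about 18 times" figure; the listing does not show the actual value of phiConvictThreshold_, and `PhiSketch` and its method names are made up for this illustration.

```java
class PhiSketch {
    /** phi as computed in the listing, with p(t) simplified to exp(-t / mean). */
    static double phi(double elapsedMillis, double meanIntervalMillis) {
        double probability = Math.exp(-elapsedMillis / meanIntervalMillis);
        return -Math.log10(probability);
    }

    /** Ratio elapsed / mean at which phi reaches the given threshold: threshold * ln(10). */
    static double convictionRatio(double phiThreshold) {
        return phiThreshold * Math.log(10);
    }

    public static void main(String[] args) {
        double threshold = 8.0;            // assumed value, see the note above
        double ratio = convictionRatio(threshold);
        System.out.printf("ratio = %.2f%n", ratio);

        double mean = 1000.0;              // assumed mean heartbeat interval of 1 second
        System.out.printf("phi at that ratio = %.2f%n", phi(ratio * mean, mean));
    }
}
```

With a threshold of 8 the ratio works out to 8 * ln 10, about 18.42, so a node is convicted once the silence lasts a little over 18 average heartbeat intervals.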

