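When does split-brain occur? Suppose a file has two replicas, one on Node1 and one on Node2. After a replica finishes a write, it notifies every other replica that the write is complete, and each of the others records that this replica is OK (see "GlusterFS Data Recovery Mechanism: AFR"). Now suppose the network between Node1 and Node2 goes down. Node1 cannot notify Node2, so Node2 concludes: my write completed, I heard nothing from Node1, so Node1 must be faulty. Likewise, Node2 cannot report its completed write to Node1, so Node1 concludes that its own write completed and that Node2 is faulty. Both nodes still have working connections to their clients and both return success to them, so reads and writes appear to proceed normally. When the file is accessed again, however, Node1 and Node2 each check their own changelog, each finds that it is healthy and the other is not, and each tries to repair the other's content with its own. That is split-brain. Accessing a file in split-brain fails, typically with an input/output error.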
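The scenario above can be sketched as a small Python model. This is only an illustration of the reasoning, not AFR's actual changelog format or code: the `Replica` class and the `write` and `diagnose` functions are hypothetical names, and the "changelog" is reduced to a counter of writes that the other replica never acknowledged.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical, simplified replica: file content plus a changelog."""
    name: str
    content: str = ""
    # pending[peer] > 0 means "I completed writes that `peer` never acknowledged".
    pending: dict = field(default_factory=dict)


def write(local, peer, data, link_up):
    """A client writes `data` through `local`; `peer` only sees it if the link is up."""
    local.content = data
    if link_up:
        # The peer applies the same write and both sides record that all is well.
        peer.content = data
    else:
        # No acknowledgement arrived, so `local` marks `peer` as the faulty side.
        local.pending[peer.name] = local.pending.get(peer.name, 0) + 1


def diagnose(a, b):
    """Decide what a later access would find by comparing both changelogs."""
    a_blames_b = a.pending.get(b.name, 0) > 0
    b_blames_a = b.pending.get(a.name, 0) > 0
    if a_blames_b and b_blames_a:
        return "split-brain"  # each side wants to overwrite the other
    if a_blames_b:
        return f"heal {b.name} from {a.name}"
    if b_blames_a:
        return f"heal {a.name} from {b.name}"
    return "in sync"


node1, node2 = Replica("Node1"), Replica("Node2")
write(node1, node2, "v1", link_up=True)
print(diagnose(node1, node2))   # in sync

# The network between the nodes fails, but each node still serves its clients.
write(node1, node2, "v2 from client A", link_up=False)
write(node2, node1, "v2 from client B", link_up=False)
print(diagnose(node1, node2))   # split-brain
```

If only one side had accepted writes during the partition, the changelogs would point in a single direction and the file could be healed from that side; it is the mutual accusation that makes the conflict unresolvable.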
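Server-side quorum, one of the mechanisms aimed at reducing this risk, is configured with the following commands:
# gluster volume set <volname> cluster.server-quorum-type none/server
# gluster volume set all cluster.server-quorum-ratio <percentage%>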
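cluster.server-quorum-type: none or server. There also seems to be an auto value, though that one appears to belong to the client-side cluster.quorum-type option rather than to this one.
cluster.server-quorum-ratio: a percentage greater than 50%. It is the quorum mechanism's requirement on the number of active peers: more than 50% by default, otherwise the configured value.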
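3) Confirm that cluster.server-quorum-type defaults to none.
4) Confirm that cluster.server-quorum-ratio can be set through all-volume set/reset; it is the only option that can apply to all volumes.
17) After volume set/reset all commands, the /var/lib/glusterd/options file is updated.
21) Even when quorum is not met, volume set/reset operations still work.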
1) If the quorum options are not enabled, there should be no change in the glusterd functionality.
2) Check if the volume set functionality for the following options is working fine.
cluster.server-quorum-type - none/server
cluster.server-quorum-ratio - this is % > 50. If the volume is not set with any ratio the equation for quorum is:
active_peer_count > 50% of all peers in cluster. But when the percentage (P)
is specified the equation for quorum is active_peer_count >= P % of all the
befriended peers in cluster.
3) Check if by default cluster.server-quorum-type is none for a volume.
4) Check if all-volume set/reset is working for cluster.server-quorum-ratio.
This option is the only option allowed as an option for all volumes.
5) When quorum is disabled keep triggering network disconnections between
peers and observe that the bricks are not going down or coming back up.
6) When quorum is enabled keep triggering network disconnections between
peers and observe that the bricks are going down or coming back up.
7) When quorum is disabled keep bringing down just the glusterd processes
and check that the bricks are not affected by this.
8) When quorum is enabled keep bringing down just the glusterd processes and
check that the bricks go down once quorum is no longer met.
NOTE: A glusterd that is not running and a network connection that is down
between two machines are treated equally.
9) Check that when quorum is not met for any of the volumes, volume updates
are not allowed on the machine which does not meet quorum.
10) Check that peer probe/deprobe are not allowed on the machine where
the quorum is not met.
11) Check that, if quorum is enabled on the volume, its bricks do not come up
after the machine is rebooted until the quorum is met.
12) Bricks should come up as soon as glusterd comes up when quorum is disabled.
13) Check glusterd volume/peer operations when the quorum status of peers
is in the process of initializing.
14) Kill glusterd on one of the machines (let's call this M1) in the cluster and
keep killing glusterds on the other machines until the quorum on M1 would be lost.
Bring back the glusterd on M1. Bricks on M1 should not be running once the
glusterd comes back up.
15) Kill glusterd and bring it back up; the machine should not see any brick
restarts if quorum is not enabled.
16) Check that peer detach removes the peer when force option is given even
when quorum is not met.
17) Check the new store file /var/lib/glusterd/options is updated with the
volume set/reset all commands.
18) Peer probe/deprobe should reflect all volume options.
19) Check storing and restoring of all volume options.
20) volume status should work fine even when quorum is not met.
21) volume set/reset of quorum options should work fine even when the quorum
is not met. This is to get the system out of locked-in state of quorum in
desperate circumstances.
NOTE: The % above is a floating point percentage.
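The quorum equation from item 2 and the brick behaviour from items 5 to 8 can be sketched in Python as follows. This is a reading of the rules as written in the test plan, not glusterd's actual implementation: `quorum_met` and `bricks_allowed` are hypothetical names, and the sketch assumes that both peer counts include the local machine.

```python
def quorum_met(active_peers, total_peers, ratio_percent=None):
    """Apply the quorum equation from item 2.

    ratio_percent=None models a cluster where cluster.server-quorum-ratio
    has not been set. Assumption: both counts include the local machine.
    """
    if ratio_percent is None:
        # Default rule: active_peer_count > 50% of all peers (strictly more).
        return 2 * active_peers > total_peers
    # Configured rule: active_peer_count >= P% of all peers.
    # P is a floating point percentage, so 66.5 is as valid as 67.
    return active_peers * 100 >= ratio_percent * total_peers


def bricks_allowed(quorum_type, active_peers, total_peers, ratio_percent=None):
    """Bricks are only affected when cluster.server-quorum-type is 'server'."""
    if quorum_type == "none":
        return True
    return quorum_met(active_peers, total_peers, ratio_percent)


# Two-node cluster split by a network failure: each side sees 1 of 2 peers.
print(quorum_met(1, 2))                    # False
# Three-node cluster with one machine down.
print(quorum_met(2, 3))                    # True
# Four peers with a configured ratio of 51%.
print(quorum_met(2, 4, 51))                # False
print(quorum_met(3, 4, 51))                # True
# With quorum disabled the bricks stay up no matter how many peers are lost.
print(bricks_allowed("none", 1, 4))        # True
print(bricks_allowed("server", 1, 4))      # False
```

The first example connects back to the two-replica scenario: under the default rule neither half of a partitioned two-node cluster has more than 50% of the peers, so with server quorum enabled neither side would keep its bricks running.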