Root cause: insufficient understanding of MongoDB replica sets


Environment:

192.168.12.6    PRIMARY

192.168.12.4    SECONDARY

192.168.12.5    SECONDARY

192.168.2.1     dump node. Its mongod process had already crashed earlier because the disk ran out of space, yet this member is also configured with a vote!


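Timeline:

1. The DBA ran the shutdown command on the secondary node 192.168.12.5.

2. The two remaining hosts, 192.168.12.4 (secondary) and 192.168.12.6 (primary), both ended up in SECONDARY state.

3. The application side reported a large number of errors.

4. The DBA restarted the mongod process on 192.168.12.5, and the replica set recovered.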



Post-mortem:

The following log was captured on 192.168.12.6, the primary at the time:

2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.12.5:27017 - HostUnreachable: Connection reset by peer

2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 192.168.12.5:27017 due to failed operation on a connection

2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected

2019-04-16T15:47:14.196+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.12.5:27017; HostUnreachable: Connection reset by peer

2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.12.5:27017

2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.12.5:27017 - HostUnreachable: Connection refused 

2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 192.168.12.5:27017 due to failed operation on a connection

2019-04-16T15:47:14.196+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.12.5:27017; HostUnreachable: Connection refused

2019-04-16T15:47:14.197+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary

2019-04-16T15:47:14.197+0800 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat

2019-04-16T15:47:14.198+0800 I REPL     [replExecDBWorker-0] transition to SECONDARY

2019-04-16T15:47:14.274+0800 I NETWORK  [conn476944080] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [192.168.3.11:38712]


The replica set configuration is as follows:

set01:SECONDARY> rs.conf()
{
    "_id" : "set01",
    "version" : 130099,
    "members" : [
        {
            "_id" : 6,
            "host" : "192.168.2.1:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : true,
            "priority" : 0,
            "tags" : {
                "dc" : "IDC1",
                "role" : "dump"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 7,
            "host" : "192.168.12.4:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                "dc" : "IDC1"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 8,
            "host" : "192.168.12.5:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                "dc" : "IDC1"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 9,
            "host" : "192.168.12.6:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                "dc" : "IDC1"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : {
        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        }
    }
}


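From the above we can conclude the following. 192.168.2.1 was already down, so once 192.168.12.5 was shut down as well, only 2 of the 4 configured votes were still reachable. Keeping or electing a primary requires a strict majority of all configured votes (3 of 4 here), and exactly half is not enough. The primary therefore stepped down (the "can't see a majority of the set, relinquishing primary" line in the log), no new primary could be elected, and the replica set degraded to read-only: rs.status() showed every reachable member as SECONDARY. This is why it is generally recommended to keep the number of voting members in a replica set odd.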
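The vote arithmetic can be checked with a few lines of plain JavaScript. This is only a minimal sketch: canElectPrimary is a hypothetical helper written for this post (it is not a MongoDB API), and the hosts and votes are copied by hand from the rs.conf() output.

```javascript
// Hypothetical helper (not part of MongoDB): checks whether the voting
// members that are still reachable hold a strict majority of all votes.
function canElectPrimary(members, downHosts) {
  const down = new Set(downHosts);
  // All votes configured in the replica set, reachable or not.
  const totalVotes = members.reduce((sum, m) => sum + m.votes, 0);
  // Votes held by members that are still up.
  const liveVotes = members
    .filter((m) => !down.has(m.host))
    .reduce((sum, m) => sum + m.votes, 0);
  // A primary needs strictly more than half of all configured votes.
  const needed = Math.floor(totalVotes / 2) + 1;
  return { totalVotes, liveVotes, needed, ok: liveVotes >= needed };
}

// Hosts and votes copied from the rs.conf() output in this post.
const members = [
  { host: "192.168.2.1:27017", votes: 1 },  // dump node, already down
  { host: "192.168.12.4:27017", votes: 1 },
  { host: "192.168.12.5:27017", votes: 1 }, // shut down by the DBA
  { host: "192.168.12.6:27017", votes: 1 },
];

const incident = canElectPrimary(members, [
  "192.168.2.1:27017",
  "192.168.12.5:27017",
]);
console.log(incident);
```

For the incident this reports 2 live votes against 3 needed, so ok is false. With only the dump node down, 3 live votes remain and the set can still hold a primary, which is why the problem stayed hidden until the second member was stopped.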


Fix:

    Remove the vote from the dump node (set its votes to 0).
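One way to make this change is sketched below. removeVote is a hypothetical helper written for this post, not a MongoDB function; it only builds a modified copy of the config object. Applying it with rs.reconfig() from a mongo shell connected to the primary is shown in the trailing comment and is an assumption about how the change would be rolled out, not the exact commands that were run.

```javascript
// Hypothetical helper: returns a copy of a replica set config in which
// the member with the given host no longer votes.
function removeVote(cfg, host) {
  var members = cfg.members.map(function (m) {
    if (m.host !== host) {
      return m;
    }
    // MongoDB requires a non-voting member to have priority 0 as well.
    return Object.assign({}, m, { votes: 0, priority: 0 });
  });
  return Object.assign({}, cfg, { members: members });
}

// Trimmed-down sample config; only the fields relevant here are kept.
var sampleCfg = {
  _id: "set01",
  members: [
    { _id: 6, host: "192.168.2.1:27017", priority: 0, votes: 1 },
    { _id: 7, host: "192.168.12.4:27017", priority: 1, votes: 1 },
    { _id: 8, host: "192.168.12.5:27017", priority: 1, votes: 1 },
    { _id: 9, host: "192.168.12.6:27017", priority: 1, votes: 1 },
  ],
};

var fixedCfg = removeVote(sampleCfg, "192.168.2.1:27017");
var totalVotes = fixedCfg.members.reduce(function (sum, m) {
  return sum + m.votes;
}, 0);
console.log("voting members after the change:", totalVotes);

// In the mongo shell, on the primary, this could be applied roughly as:
//   cfg = rs.conf()
//   rs.reconfig(removeVote(cfg, "192.168.2.1:27017"))
```

With the dump node non-voting, 3 votes remain and the majority is 2, so stopping any single data node (for example 192.168.12.5) still leaves the other two able to keep or elect a primary.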



References:

http://www.ttlsa.com/mongodb/mongodb-replicaset-internal/

https://blog.csdn.net/qq_24598601/article/details/81150614