MongoDB一次节点宕机引发的思考(源码剖析)

声明:本文同步发表于 MongoDB 中文社区,传送门:
http://www.mongoing.com/archives/26759

简介

最近一个 MongoDB 集群环境中的某节点异常下电了,导致业务出现了中断,随即又恢复了正常。
通过ELK 告警也监测到了业务报错日志。

运维部对于节点下电的原因进行了排查,发现仅仅是资源分配上的一个失误导致。 在解决了问题之后,大家也对这次中断的也提出了一些问题:

"当前的 MongoDB集群 采用了分片副本集的架构,其中主节点发生故障会产生多大的影响?"
"MongoDB 副本集不是能自动倒换吗,这个是不是秒级的?"

带着这些问题,下面针对副本集的自动Failover机制做一些分析。

日志分析

首先可以确认的是,这次掉电的是一个副本集上的主节点,在掉电的时候,主备关系发生了切换。
从另外的两个备节点找到了对应的日志:

备节点1的日志

2019-05-06T16:51:11.766+0800 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2019-05-06T16:51:11.766+0800 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2019-05-06T16:51:11.766+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 172.30.129.78:30071
2019-05-06T16:51:11.767+0800 I REPL     [ReplicationExecutor] VoteRequester(term 3 dry run) received a yes vote from 172.30.129.7:30071; response message: { term: 3, voteGranted: true, reason: "", ok: 1.0 }
2019-05-06T16:51:11.767+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2019-05-06T16:51:11.768+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 172.30.129.78:30071
2019-05-06T16:51:11.771+0800 I REPL     [ReplicationExecutor] VoteRequester(term 4) received a yes vote from 172.30.129.7:30071; response message: { term: 4, voteGranted: true, reason: "", ok: 1.0 }
2019-05-06T16:51:11.771+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 4
2019-05-06T16:51:11.771+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
2019-05-06T16:51:11.771+0800 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
2019-05-06T16:51:11.771+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host 172.30.129.78:30071 due to bad connection status; 2 connections to that host remain open
2019-05-06T16:51:11.771+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 172.30.129.78:30071
2019-05-06T16:51:13.350+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 172.30.129.78:30071; ExceededTimeLimit: Couldn't get a connection within the time limit

备节点2的日志

2019-05-06T16:51:12.816+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host 172.30.129.78:30071 due to bad connection status; 0 connections to that host remain open
2019-05-06T16:51:12.816+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 172.30.129.78:30071; ExceededTimeLimit: Operation timed out, request was RemoteCommand 72553 -- target:172.30.129.78:30071 db:admin expDate:2019-05
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值