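This post documents how a MongoDB cluster failure was resolved.
On the morning of 2016-3-14, right after the morning meeting, a colleague from the operations team messaged me on QQ: a node in the production MongoDB cluster was down, and he asked me to take a look. I did not take it seriously at first, because the cluster had shown this symptom before, and the simplest fix had been to delete every file under the dbpath directory configured in mongod.conf and then restart mongod. I did exactly that without a second thought.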
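A reckless first attempt
According to the official documentation, when a replica set member has fallen far behind the primary, or when a new node joins the set, MongoDB's initial sync feature should be used to synchronize its data. The documentation gives two ways to do this: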
- Restart the mongod with an empty data directory and let MongoDB’s normal initial syncing feature restore the data. This is the more simple option but may take longer to replace the data. See Procedures.
- Restart the machine with a copy of a recent data directory from another member in the replica set. This procedure can replace the data more quickly but requires more manual steps. See Sync by Copying Data Files from Another Member.
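In short, the first method is simple and blunt: delete everything under dbpath and treat the node as if it were joining the replica set for the first time. The upside is that it is easy to do; the downside is that it can take a very long time when there is a lot of data.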
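The "empty data directory" step can be sketched in Python as follows. This is only an illustration, not the exact commands used in this incident: the function name `empty_data_dir` and its `dry_run` flag are hypothetical, and the sketch assumes mongod has already been stopped and that the path passed in really is the dbpath from mongod.conf.

```python
import shutil
from pathlib import Path
from typing import List


def empty_data_dir(dbpath: str, dry_run: bool = True) -> List[str]:
    """Remove everything inside dbpath but keep the directory itself.

    Hypothetical helper: assumes mongod is stopped before it is called.
    Returns the sorted names of the entries that were (or would be) removed.
    """
    root = Path(dbpath)
    if not root.is_dir():
        raise NotADirectoryError(f"{dbpath} is not a directory")

    entries = sorted(root.iterdir())
    for entry in entries:
        if dry_run:
            # Only report what would be deleted.
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # e.g. the journal/ subdirectory
        else:
            entry.unlink()        # data files, lock file, symlinks
    return [entry.name for entry in entries]
```

Calling it with the default `dry_run=True` first shows which files would disappear. Only after that check would one pass `dry_run=False` and then restart mongod so that initial sync can begin.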
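Checking the logs and trying again
The relief did not last: the node soon went down again, and I started to get nervous. Was initial sync useless in our scenario? I searched the web for "mongodb replica set member always STARTUP2" but found no answer. With no help to be had online, I decided to look at the MongoDB logs, hoping to find something useful there. The moment I opened the log, I was taken aback: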
2016-03-14T10:23:40.921+0800 [initandlisten] ERROR: Out of file descriptors. Waiting one second before trying to accept more connections.
2016-03-14T10:23:41.921+0800 [initandlisten] Listener: accept() returns -1 errno:24 Too many open files
2016-03-14T10:23:41.921+0800 [initandlisten] ERROR: Out of file descriptors. Waiting one second before trying to accept more connections.
2016-03-14T10:23:42.922+08
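errno 24 is EMFILE, meaning the process has reached its limit on open file descriptors. The sketch below is a minimal illustration and not part of the original troubleshooting: it counts how many log lines report this condition and reads the file descriptor limit of the current process. The helper names `count_fd_errors` and `nofile_limits` are hypothetical, and the `resource` module is only available on Unix-like systems.

```python
import re
import resource

# Matches the two messages seen in the log excerpt above.
FD_ERROR = re.compile(r"errno:24 Too many open files|Out of file descriptors")


def count_fd_errors(lines):
    """Count log lines that report file descriptor exhaustion."""
    return sum(1 for line in lines if FD_ERROR.search(line))


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Note that `nofile_limits()` reports the limit of the Python process itself. The limits that apply to a running mongod are those of the mongod process, which on Linux can be read from `/proc/<pid>/limits`.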