MongoDB Sharded Cluster Incident: Handling a Member Stuck in RECOVERING

Original post, 2017-03-12 18:28:59

 

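1. Problem description: secondary stuck in RECOVERING

A colleague from operations reported that queries against the MongoDB secondary were not returning the latest records, which suggested either replication lag or a failure. Running rs.status() on the secondary showed that it was in the RECOVERING state: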

shard1:RECOVERING> rs.status();

{

        "set" : "shard1",

        "date" : ISODate("2017-03-03T03:08:50.882Z"),

        "myState" : 3,

        "members" : [

                {

                        "_id" : 0,

                        "name" : "192.168.3.11:27017",

                        "health" : 1,

                        "state" : 1,

                        "stateStr" : "PRIMARY",

                        "uptime" : 69310,

                        "optime" : Timestamp(1488510526, 3),

                        "optimeDate" : ISODate("2017-03-03T03:08:46Z"),

                        "lastHeartbeat" : ISODate("2017-03-03T03:08:50.416Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:08:49.706Z"),

                        "pingMs" : 0,

                        "electionTime" : Timestamp(1479454146, 1),

                        "electionDate" : ISODate("2016-11-18T07:29:06Z"),

                        "configVersion" : 1

                },

                {

                        "_id" : 1,

                        "name" : "192.168.3.12:27017",

                        "health" : 1,

                        "state" : 3,

                        "stateStr" : "RECOVERING",

                        "uptime" : 69311,

                        "optime" : Timestamp(1471072341, 1),

                        "optimeDate" : ISODate("2016-08-13T07:12:21Z"),

                        "configVersion" : 1,

                        "self" : true

                },

                {

                        "_id" : 2,

                        "name" : "192.168.3.11:27037",

                        "health" : 1,

                        "state" : 7,

                        "stateStr" : "ARBITER",

                        "uptime" : 69310,

                        "lastHeartbeat" : ISODate("2017-03-03T03:08:50.412Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:08:50.322Z"),

                        "pingMs" : 0,

                        "configVersion" : 1

                }

        ],

        "ok" : 1

}

shard1:RECOVERING>
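
To make the gap in the output above easier to see, here is a minimal sketch that takes a status document shaped like the rs.status() result and reports how far each data-bearing member is behind the primary. The helper name findLaggingMembers and its output shape are hypothetical; the sample values are copied from the output above, trimmed to the fields the helper reads.

```javascript
// Summarize a replica set status document such as the one returned by rs.status().
// Hypothetical helper: the name and the output shape are illustrative only.
function findLaggingMembers(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return [];
  return status.members
    // Arbiters hold no data and report no optimeDate, so skip them.
    .filter((m) => m.optimeDate !== undefined && m !== primary)
    .map((m) => ({
      name: m.name,
      stateStr: m.stateStr,
      // Whole days between the primary's last operation and this member's.
      lagDays: Math.floor((primary.optimeDate - m.optimeDate) / 86400000),
    }));
}

// Values copied from the rs.status() output above (trimmed to the relevant fields).
const status = {
  members: [
    { name: "192.168.3.11:27017", stateStr: "PRIMARY", optimeDate: new Date("2017-03-03T03:08:46Z") },
    { name: "192.168.3.12:27017", stateStr: "RECOVERING", optimeDate: new Date("2016-08-13T07:12:21Z") },
    { name: "192.168.3.11:27037", stateStr: "ARBITER" },
  ],
};

const lagging = findLaggingMembers(status);
console.log(lagging);
```

With these values the RECOVERING member turns out to be several months behind the primary, which already hints that ordinary catch-up is unlikely to work.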

 

 

 

 

2. Analyzing replSet error RS102 in the server log

First, find the log path from the running process:

[mongodb@mongodb_m2 ~]$ ps -eaf|grep 27017

mongodb  24630     1  0 Mar02 ?        00:03:41 /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongod --shardsvr --replSet shard1 --port 27017 --dbpath /data/mongodb/shard27017 --oplogSize 2048 --logpath /data/mongodb/logs/shard_m1s1_27017.log --logappend --fork

mongodb  39309 30937  0 10:35 pts/0    00:00:00 grep 27017

[mongodb@mongodb_m2 ~]$

 

Then look through the log for the error message:

more /data/mongodb/logs/shard_m1s1_27017.log

2017-03-03T09:44:59.070+0800 I REPL     [ReplicationExecutor] syncing from: 192.168.3.11:27017

2017-03-03T09:44:59.071+0800 W REPL     [rsBackgroundSync] we are too stale to use 192.168.3.11:27017 as a sync source

2017-03-03T09:44:59.071+0800 I REPL     [ReplicationExecutor] could not find member to sync from

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet error RS102 too stale to catch up

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet our last optime : Aug 13 15:12:21 57aec855:1

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet oldest available is Feb  7 14:13:10 58996576:1

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember

2017-03-03T09:45:18.914+0800 I NETWORK  [conn6420] end connection 192.168.3.11:5804 (3 connections now open)

2017-03-03T09:45:18.915+0800 I NETWORK  [initandlisten] connection accepted from 192.168.3.11:5824 #6423 (4 connections now open)

2017-03-03T09:45:20.195+0800 I NETWORK  [conn6421] end connection 192.168.3.11:5806 (3 connections now open)

2017-03-03T09:45:20.196+0800 I NETWORK  [initandlisten] connection accepted from 192.168.3.11:5829 #6424 (4 connections now open)

 

 

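The lines "our last optime : Aug 13 15:12:21" and "oldest available is Feb  7 14:13:10" explain the failure: the last operation this member applied dates from August 13, 2016, while the oldest entry still held in the sync source's oplog dates from February 7, 2017. The operations in between have already been overwritten on the sync source, so replication cannot resume on its own and the member has to be resynchronized manually. That is what the log suggests; for the detailed replication information we still need to check from the shell.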
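
The staleness check behind RS102 can be sketched in a few lines. This assumes the value before the colon in the log's optime is the timestamp's seconds in hexadecimal, which is consistent with the data here: 0x57aec855 is 1471072341, the same optime that rs.status() shows for this member. The helper names parseLogOptime and isTooStale are hypothetical and are not part of MongoDB.

```javascript
// Decode the "<hex seconds>:<increment>" optime printed in the RS102 log lines.
// Assumption: the part before the colon is the timestamp's seconds in hex.
function parseLogOptime(text) {
  const [hexSeconds, increment] = text.split(":");
  return { seconds: parseInt(hexSeconds, 16), increment: parseInt(increment, 10) };
}

// A member is too stale when its last applied operation is older than the
// oldest operation still held in the sync source's oplog.
function isTooStale(ourLast, oldestAvailable) {
  return ourLast.seconds < oldestAvailable.seconds ||
    (ourLast.seconds === oldestAvailable.seconds &&
      ourLast.increment < oldestAvailable.increment);
}

const ourLast = parseLogOptime("57aec855:1");         // "our last optime"
const oldestAvailable = parseLogOptime("58996576:1"); // "oldest available"

console.log(new Date(ourLast.seconds * 1000).toISOString());
console.log(new Date(oldestAvailable.seconds * 1000).toISOString());
console.log(isTooStale(ourLast, oldestAvailable));
```

The two printed dates are in UTC, so they read eight hours earlier than the GMT+0800 times in the log.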

 

3. Checking replication info on the primary and the secondary

First, the replication info on the secondary:

shard1:RECOVERING>  db.printReplicationInfo();

configured oplog size:   2048.003890991211MB

log length start to end: 11028041secs (3063.34hrs)

oplog first event time:  Thu Apr 07 2016 23:51:40 GMT+0800 (CST)

oplog last event time:   Sat Aug 13 2016 15:12:21 GMT+0800 (CST)

now:                     Fri Mar 03 2017 10:37:25 GMT+0800 (CST)                    

shard1:RECOVERING>

 

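The oplog window here is 3063.34 hours and the configured oplog size is 2 GB. The oplog starts on April 7, 2016 and ends on August 13, 2016, which means this secondary stopped replicating a long time ago.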

 

Now the replication info on the primary:

shard1:PRIMARY>  db.printReplicationInfo();

configured oplog size:   2048.003890991211MB

log length start to end: 2059878secs (572.19hrs)

oplog first event time:  Tue Feb 07 2017 14:31:13 GMT+0800 (CST)

oplog last event time:   Fri Mar 03 2017 10:42:31 GMT+0800 (CST)

now:                     Fri Mar 03 2017 10:42:32 GMT+0800 (CST)                    

shard1:PRIMARY>

 

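The primary's oplog starts on February 7, 2017 and ends on March 3, 2017. The "oldest available" entry reported in the secondary's log is also dated February 7, 2017, so the two are consistent. The secondary can reach the primary and attempts to sync, but its last applied operation (August 13, 2016) is far older than anything left in the primary's oplog, so the sync fails. A manual resync is therefore required.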
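
As a rough check of these numbers, the sketch below recomputes the primary's oplog window from the first and last event times printed above and tests whether the secondary's last event still falls inside it. The function name oplogWindowSeconds is hypothetical; the timestamps are copied from the two db.printReplicationInfo() outputs.

```javascript
// Length of an oplog window in seconds, given its first and last event times.
function oplogWindowSeconds(first, last) {
  return Math.round((last - first) / 1000);
}

// Times copied from the db.printReplicationInfo() outputs above (GMT+0800).
const primaryFirst = new Date("2017-02-07T14:31:13+08:00");
const primaryLast = new Date("2017-03-03T10:42:31+08:00");
const secondaryLast = new Date("2016-08-13T15:12:21+08:00");

const windowSecs = oplogWindowSeconds(primaryFirst, primaryLast);
console.log(windowSecs, (windowSecs / 3600).toFixed(2) + " hrs");

// The secondary can resume from the oplog only if its last applied operation
// is still inside the primary's window.
const canCatchUp = secondaryLast >= primaryFirst;
console.log(canCatchUp);
```

The computed window matches the "log length start to end" value reported by the primary, and the secondary's last event falls before the start of that window.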

 

 

4. Manually resynchronizing the secondary

 

The error log points to the resync documentation (2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember), which describes the following methods.

1 Automatically Sync a Member

         WARNING

         During initial sync, mongod will remove the content of the dbPath.

Steps

You can also force a mongod that is already a member of the set to perform an initial sync by restarting the instance without the content of the dbPath as follows:

         Stop the member’s mongod instance. To ensure a clean shutdown, use the db.shutdownServer() method from the mongo shell or on Linux systems, the mongod --shutdown option.

         Delete all data and sub-directories from the member’s data directory. By removing the data dbPath, MongoDB will perform a complete resync. Consider making a backup first.

        

2 Sync by Copying Data Files from Another Member

 

This approach “seeds” a new or stale member using the data files from an existing member of the replica set. The data files must be sufficiently recent to allow the new member to catch up with the oplog. Otherwise the member would need to perform an initial sync.

(2.1) Copy the Data Files: stop the secondary, then copy the data files from the seed server (here, the primary). The copy must include the local database. mongodump cannot be used for this; only a snapshot backup of the data files is allowed.

(2.2) Sync the Member: start the mongod instance, which then begins applying the oplog.

 

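5. Recovering the secondary

Comparing the two methods: the first (empty the data directory and restart mongod so that it performs an initial sync) is simple to carry out, but recovery takes longer because all the data has to be copied again. The second (copy the data directory from another member of the replica set and restart mongod) recovers faster but involves more manual steps.

 

Weighing the two, simplicity won, so the first method was used here.

 

1) Shut down the mongod server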

shard1:RECOVERING> db.shutdownServer();

2017-03-03T11:10:34.536+0800 I NETWORK  DBClientCursor::init call() failed

server should be down...

2017-03-03T11:10:34.539+0800 I NETWORK  trying reconnect to localhost:27017 (127.0.0.1) failed

2017-03-03T11:10:34.539+0800 W NETWORK  Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused

2017-03-03T11:10:34.539+0800 I NETWORK  reconnect localhost:27017 (127.0.0.1) failed failed couldn't connect to server localhost:27017 (127.0.0.1), connection attempt failed

2017-03-03T11:10:34.543+0800 I NETWORK  trying reconnect to localhost:27017 (127.0.0.1) failed

2017-03-03T11:10:34.543+0800 W NETWORK  Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused

2017-03-03T11:10:34.543+0800 I NETWORK  reconnect localhost:27017 (127.0.0.1) failed failed couldn't connect to server localhost:27017 (127.0.0.1), connection attempt failed

 

 

2) Move the old data directory aside as a backup, create an empty one, then start the mongod instance

[mongodb@mongodb_m2 shard27017]$ mv /data/mongodb/shard27017 /data/mongodb/shard27017_bak

[mongodb@mongodb_m2 shard27017]$ mkdir /data/mongodb/shard27017

[mongodb@mongodb_m2 shard27017]$ /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongod --shardsvr --replSet shard1 --port 27017 --dbpath /data/mongodb/shard27017 --oplogSize 2048 --logpath /data/mongodb/logs/shard_m1s1_27017.log --logappend --fork

about to fork child process, waiting until server is ready for connections.

forked process: 44687

child process started successfully, parent exiting

[mongodb@mongodb_m2 shard27017]$

 

 

 

3) Check the recovery state: the member is now in STARTUP2, and the files in the data directory keep growing as data is copied over

shard1:STARTUP2> rs.status();

{

        "set" : "shard1",

        "date" : ISODate("2017-03-03T03:19:43.367Z"),

        "myState" : 5,

        "syncingTo" : "192.168.3.11:27017",

        "members" : [

                {

                        "_id" : 0,

                        "name" : "192.168.3.11:27017",

                        "health" : 1,

                        "state" : 1,

                        "stateStr" : "PRIMARY",

                        "uptime" : 85,

                        "optime" : Timestamp(1488511178, 8),

                        "optimeDate" : ISODate("2017-03-03T03:19:38Z"),

                        "lastHeartbeat" : ISODate("2017-03-03T03:19:41.796Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:19:41.796Z"),

                        "pingMs" : 0,

                        "electionTime" : Timestamp(1479454146, 1),

                        "electionDate" : ISODate("2016-11-18T07:29:06Z"),

                        "configVersion" : 1

                },

                {

                        "_id" : 1,

                        "name" : "192.168.3.12:27017",

                        "health" : 1,

                        "state" : 5,

                        "stateStr" : "STARTUP2",

                        "uptime" : 141,

                        "optime" : Timestamp(0, 0),

                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),

                        "syncingTo" : "192.168.3.11:27017",

                        "configVersion" : 1,

                        "self" : true

                },

                {

                        "_id" : 2,

                        "name" : "192.168.3.11:27037",

                        "health" : 1,

                        "state" : 7,

                        "stateStr" : "ARBITER",

                        "uptime" : 85,

                        "lastHeartbeat" : ISODate("2017-03-03T03:19:41.796Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:19:41.796Z"),

                        "pingMs" : 0,

                        "configVersion" : 1

                }

        ],

        "ok" : 1

}

shard1:STARTUP2>

 

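6. Checking the result

After the initial sync finished, the member rejoined the set. In the output below, 192.168.3.12:27017 has in fact been elected PRIMARY (electionDate 2017-03-03T03:30:25Z) and 192.168.3.11:27017 is now a SECONDARY syncing from it; both report the same optime.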

[mongodb@mongodb_m2 mongodb]$  /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongo localhost:27017/admin

MongoDB shell version: 3.0.3

connecting to: localhost:27017/admin

Server has startup warnings:

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten]

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten]

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten]

shard1:PRIMARY> rs.status();

{

        "set" : "shard1",

        "date" : ISODate("2017-03-03T03:31:34.528Z"),

        "myState" : 1,

        "members" : [

                {

                        "_id" : 0,

                        "name" : "192.168.3.11:27017",

                        "health" : 1,

                        "state" : 2,

                        "stateStr" : "SECONDARY",

                        "uptime" : 797,

                        "optime" : Timestamp(1488511889, 2),

                        "optimeDate" : ISODate("2017-03-03T03:31:29Z"),

                        "lastHeartbeat" : ISODate("2017-03-03T03:31:32.612Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:31:33.347Z"),

                        "pingMs" : 0,

                        "syncingTo" : "192.168.3.12:27017",

                        "configVersion" : 1

                },

                {

                        "_id" : 1,

                        "name" : "192.168.3.12:27017",

                        "health" : 1,

                        "state" : 1,

                        "stateStr" : "PRIMARY",

                        "uptime" : 852,

                        "optime" : Timestamp(1488511889, 2),

                        "optimeDate" : ISODate("2017-03-03T03:31:29Z"),

                        "electionTime" : Timestamp(1488511825, 1),

                        "electionDate" : ISODate("2017-03-03T03:30:25Z"),

                        "configVersion" : 1,

                        "self" : true

                },

                {

                        "_id" : 2,

                        "name" : "192.168.3.11:27037",

                        "health" : 1,

                        "state" : 7,

                        "stateStr" : "ARBITER",

                        "uptime" : 797,

                        "lastHeartbeat" : ISODate("2017-03-03T03:31:32.612Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:31:33.347Z"),

                        "pingMs" : 0,

                        "configVersion" : 1

                }

        ],

        "ok" : 1

}

shard1:PRIMARY>

 

 

 

 

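7. Rebuilding the oplog

To rebuild the oplog with a different size, drop local.oplog.rs and recreate it as a capped collection (23 GB in this example). This is normally done with the member restarted as a standalone instance, because the oplog of a running replica set member cannot be dropped. The last two commands are equivalent ways of creating the collection, so run only one of them.

> use local

> db.oplog.rs.drop()

> db.createCollection("oplog.rs", {"capped" : true, "size" : 23 * 1024 * 1024 * 1024})

> db.runCommand( { create: "oplog.rs", capped: true, size: (23 * 1024 * 1024 * 1024) } )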

