ZooKeeper outage caused by too many znodes

ZooKeeper clients could not connect to the servers. The server-side log file zookeeper.out (for its directory, see the dataLogDir setting in /usr/local/zookeeper/conf/zoo.cfg) showed the following:


2016-02-16 11:17:44,503 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader

java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
        at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
        at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:272)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:72)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
2016-02-16 11:17:44,503 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
2016-02-16 11:17:44,504 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - Shutting down
2016-02-16 11:17:44,504 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@428] - shutting down
2016-02-16 11:17:44,504 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@670] - LOOKING
2016-02-16 11:17:44,507 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /web/applog/zookeeper/data/version-2/snapshot.119f00000001
2016-02-16 11:17:46,752 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@542] - Notification: 4 (n.leader), 0x119f00000001 (n.zxid), 0x139b (n.round), LOOKING (n.state), 4 (n.sid), 0x14bf (n.peerEPoch), LOOKING (my state)
2016-02-16 11:17:52,916 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection from /10.21.34.240:49942
2016-02-16 11:17:52,916 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-02-16 11:17:52,916 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /10.21.34.240:49942 (no session established for client)
2016-02-16 11:17:52,920 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@213] - Accepted socket connection from /10.21.33.25:21546

2016-02-16 11:17:52,920 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running


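At first I suspected a firewall or some other network problem. Search engines turned up no posts from anyone with a similar experience (which is why I am writing this down). I was not familiar with ZooKeeper, so I had no choice but to study what the configuration options mean myself, and I found several timeout settings that might be related to the exception above: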

# The number of ticks that the initial 
# synchronization phase can take
initLimit=10

# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5


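These are the timeouts for a follower's initial connection to the leader and for synchronization. When the data volume is large, these values need to be raised accordingly, otherwise the operations time out. So I increased both settings (tenfold) and restarted all servers. Sure enough, the errors stopped and the clients could connect.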
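Both limits are counted in ticks, so the real timeout is the limit multiplied by tickTime. Here is a rough sketch of that arithmetic. My tickTime is not shown above, so the 2000 ms used here (the value in ZooKeeper's sample zoo.cfg) is an assumption, and the function names are made up for illustration:

```python
# Assumption: tickTime is 2000 ms, as in ZooKeeper's sample zoo.cfg.
TICK_TIME_MS = 2000


def limit_seconds(ticks, tick_time_ms=TICK_TIME_MS):
    """Convert a limit expressed in ticks into seconds."""
    return ticks * tick_time_ms / 1000.0


def describe(init_limit, sync_limit, tick_time_ms=TICK_TIME_MS):
    """Return both timeouts in seconds for a given pair of limits."""
    return {
        "initLimit_s": limit_seconds(init_limit, tick_time_ms),
        "syncLimit_s": limit_seconds(sync_limit, tick_time_ms),
    }


# Original settings versus the tenfold values.
print("before:", describe(10, 5))
print("after: ", describe(100, 50))
```

Under that assumption, the original initLimit=10 gives a follower 20 seconds to connect and sync with the leader, which is plausibly too short when the snapshot holds millions of znodes.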


To verify this inference, I checked further and found that one node did indeed have a very large number of children:

[zk: localhost:2181(CONNECTED) 2] stat /scheduler/trigger/formatnotify/_SCND
cZxid = 0x8023698c2
ctime = Thu Jul 30 17:19:11 CST 2015
mZxid = 0x8023698c2
mtime = Thu Jul 30 17:19:11 CST 2015
pZxid = 0xd01406c09
cversion = 2216863
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2216675
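A check like this could be scripted so that an oversized node is noticed before it causes trouble. The following is only a sketch that parses the text printed by stat. The function names and the 100,000 threshold are my own assumptions, not anything ZooKeeper defines:

```python
def parse_stat(text):
    """Parse 'key = value' lines from zkCli `stat` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skips the prompt line and anything not in key = value form
            fields[key.strip()] = value.strip()
    return fields


def has_too_many_children(text, threshold=100_000):
    """True if numChildren in the stat output exceeds the (assumed) threshold."""
    return int(parse_stat(text).get("numChildren", 0)) > threshold


SAMPLE = """\
[zk: localhost:2181(CONNECTED) 2] stat /scheduler/trigger/formatnotify/_SCND
ctime = Thu Jul 30 17:19:11 CST 2015
cversion = 2216863
numChildren = 2216675
"""

print(parse_stat(SAMPLE)["numChildren"], has_too_many_children(SAMPLE))
```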

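More than 2 million. So when using ZooKeeper, be sure to clean up after yourself. Otherwise, since ZooKeeper acts as the coordinator, one small oversight can bring down the entire production system!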
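As for what the cleanup might look like, here is a minimal sketch, not the procedure I actually ran. It assumes a client object with get_children(path) and delete(path) methods, which is how kazoo's KazooClient names these calls. The is_stale predicate, the batch size, and the FakeClient stand-in are all hypothetical, and listing a node with millions of children may itself be slow or fail, so treat this as an outline only:

```python
import time


def purge_children(client, parent, is_stale, batch_size=1000, pause_s=0.0):
    """Delete children of `parent` for which is_stale(name) is true.

    `client` is any object exposing get_children(path) and delete(path).
    Returns the number of children deleted.
    """
    deleted = 0
    for name in client.get_children(parent):
        if not is_stale(name):
            continue
        client.delete(parent + "/" + name)
        deleted += 1
        if pause_s and deleted % batch_size == 0:
            time.sleep(pause_s)  # give the ensemble room between batches
    return deleted


class FakeClient:
    """In-memory stand-in so the sketch runs without a server."""

    def __init__(self, tree):
        self.tree = tree  # {parent_path: set of child names}

    def get_children(self, path):
        return sorted(self.tree[path])

    def delete(self, path):
        parent, _, name = path.rpartition("/")
        self.tree[parent].remove(name)


fake = FakeClient({"/jobs": {"done-1", "done-2", "running-3"}})
removed = purge_children(fake, "/jobs", lambda n: n.startswith("done-"))
print(removed, sorted(fake.tree["/jobs"]))
```

Running something like this on a schedule, with a predicate that matches your own notion of a finished entry, keeps the child count from growing without bound.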
