ZooKeeper does not recover

ZooKeeper does not recover from a crash when the disk was full

Description

The disk that ZooKeeper was using filled up. During a snapshot write, I got the following exception:

2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)

Then many subsequent exceptions like:

2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial.
2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

It seems to me that writing the transaction log should be fully atomic to avoid such situations. Is this not the case?

  • Answers

  • Deleting the partially written ZooKeeper snapshot file solved it.

1. I was able to work around the issue by deleting the partially written snapshot file.

2. I believe the exception is being thrown while reading the snapshot, and the "partial transaction" message is not an indication of what is causing it to crash. It sounds right that we should try a different snapshot, but according to the log messages you posted, it sounds like the problem is that we are not catching EOFException.
3. So these exceptions are thrown when ZooKeeper is running? I'm not sure why it's exiting so many times. Do you restart the ZK server if it dies?

4. We run ZooKeeper with runit, so yes, it is restarted when it dies. It ends up in a loop of:

  • No space left on device
  • Starting server
  • Last transaction was partial
  • Snapshotting: 0x19a3d to /opt/zookeeper-3.4.3/data/version-2/snapshot.19a3d
  • No space left on device

5. I thought you said it does not recover when the disk was full, but it looks like the disk is still full, no?

6. Here is the full sequence of events (sorry for the confusion):

  • Noticed disk was full
  • Cleaned up disk space
  • Tried zkCli.sh, got errors
  • Checked ZK log, loop of:

2013-01-16 15:01:35,194 - ERROR [main:Util@239] - Last transaction was partial.
2013-01-16 15:01:35,196 - ERROR [main:ZooKeeperServerMain@63] - Unexpected exception, exiting abnormally
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

  • Stopped ZK
  • Listed ZK data directory

ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ ls -lat
total 18096
drwxr-xr-x 2 zookeeper zookeeper     4096 Jan 16 06:41 .
-rw-r--r-- 1 zookeeper zookeeper        0 Jan 16 06:41 log.19a3e
-rw-r--r-- 1 zookeeper zookeeper   585377 Jan 16 06:41 snapshot.19a3d
-rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 16 03:11 log.19a2a
-rw-r--r-- 1 zookeeper zookeeper   585911 Jan 16 03:11 snapshot.19a29
-rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 16 03:11 log.11549
-rw-r--r-- 1 zookeeper zookeeper   585190 Jan 15 17:28 snapshot.11547
-rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 15 17:28 log.1
-rw-r--r-- 1 zookeeper zookeeper      296 Jan 14 16:44 snapshot.0
drwxr-xr-x 3 zookeeper zookeeper     4096 Jan 14 16:44 ..

  • Removed log.19a3e and snapshot.19a3d

ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ sudo rm log.19a3e
ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ sudo rm snapshot.19a3d

  • Started ZK
  • Back to normal

1. Attaching zookeeper.log

2. FYI, this issue is a duplicate of ZOOKEEPER-1612 (curiously, a permutation of the last two digits, heh). I'd suggest closing 1612 as the duplicate instead, if possible.

3. I'll mark 1612 as a duplicate. Thanks for pointing that out, Edward.

4. Looks like the header was incomplete. Unfortunately, we do not handle a corrupt header, but we do handle corrupt txns later. I'm surprised that this happened twice in a row for two users. I'll upload a patch and test case.

5. Should FileTxnIterator.goToNextLog() return false if the header is corrupted/incomplete, or should it skip the log file and go to the next log file if it exists?
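For reference, the txn log header is a fixed 16 bytes (4-byte magic "ZKLG", 4-byte version, 8-byte dbid). Below is a simplified, hypothetical illustration of the check being debated, detecting a truncated header so the iterator could either stop or skip that file; the class and method names are illustrative, not ZooKeeper internals.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative helper only; not the actual FileTxnIterator code.
public class HeaderCheckSketch {

    // Returns true if the file contains a complete 16-byte txn log header.
    static boolean hasCompleteHeader(File logFile) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(logFile))) {
            in.readInt();   // magic ("ZKLG")
            in.readInt();   // version
            in.readLong();  // dbid
            return true;
        } catch (EOFException e) {
            // Header was only partially written, e.g. because the disk filled up.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            File f = new File(path);
            System.out.println(f + " -> complete header: " + hasCompleteHeader(f));
        }
    }
}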

6. -1 overall. Here are the results of testing the latest attachment

http://issues.apache.org/jira/secure/attachment/12645856/ZOOKEEPER-1621.patch
against trunk revision 1596284.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2105//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2105//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2105//console

This message is automatically generated.

1. Here's a different option: intuitively, once ZooKeeper fails to write to disk, continuing to operate normally violates its promise to users (that if a majority acked a write, the data is always there, even across reboots). Once we realize the promise can't be kept, it may be better to crash the server at that point and give up liveness (no availability) rather than continue and risk coming up with a partial log later, violating safety (inconsistent state, lost transactions, etc.).

2. I'm fine with Alex's suggestion. We should document how to manually recover when the server doesn't start because the log file doesn't contain a complete header.

3. I actually like Alexander Shraer's suggestion. However, if this is going to be the recommended way of recovering from a corrupt log file, there should be a script that does it for users: zk-recover.sh or some such. In this world of deployment automation, it's not nice to say "go delete the most recent log segment from ZK's data dir". Much better for the application to handle it through a script or command.

4. Apart from these corrective measures, there should be some preventive measures as well.

Can we have a disk space availability checker that periodically checks whether enough disk space is available and, if not, shuts ZooKeeper down gracefully?

5. You mean, like a ZK thread dedicated to this? What would the behavior be, only shut down if it's the leader?

6. Yes, a dedicated thread for this, like org.apache.zookeeper.server.DatadirCleanupManager.

  • Shut down in every case, because without disk space ZooKeeper cannot serve any purpose.
  • The idea is as follows (see the sketch after this list):
    • Add two new ZooKeeper properties:
      diskspace.min.threshold=5% (values can be a percentage of the data directory's available space or an amount in GB)
      diskspace.check.interval=5 seconds (default: 5, min: 1, max: Long.MAX_VALUE)
    • Add a dedicated disk check thread
      • which runs every diskspace.check.interval seconds
      • and shuts down the ZooKeeper instance if the available disk space drops below diskspace.min.threshold
  • Some clarifications:
    • Query: Suppose diskspace.check.interval=5 and the disk can be filled within 5 seconds by ZooKeeper or by another process. How is that handled?
      Ans: Users should know their usage scenario and which other processes share the same disk, and tune diskspace.check.interval accordingly.
    • Query: Say diskspace.check.interval=1, but the disk can be filled even within 1 second by ZooKeeper and other processes.
      Ans: Yes, it can still be filled if diskspace.min.threshold is too low; again, based on disk space usage, the user needs to tune diskspace.min.threshold.
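A minimal sketch of such a checker thread, assuming the hypothetical property names above (diskspace.min.threshold, diskspace.check.interval) and a caller-supplied shutdown callback; this is not an existing ZooKeeper class.

import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposed checker thread; not part of ZooKeeper.
public class DiskSpaceChecker {
    private final File dataDir;
    private final long minFreeBytes;     // threshold derived from diskspace.min.threshold
    private final long checkIntervalSec; // diskspace.check.interval
    private final Runnable shutdownHook; // e.g. a callback that stops the server
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public DiskSpaceChecker(File dataDir, long minFreeBytes,
                            long checkIntervalSec, Runnable shutdownHook) {
        this.dataDir = dataDir;
        this.minFreeBytes = minFreeBytes;
        this.checkIntervalSec = checkIntervalSec;
        this.shutdownHook = shutdownHook;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            // Check free space on the volume holding the data directory.
            if (dataDir.getUsableSpace() < minFreeBytes) {
                // Shut down gracefully instead of failing mid-write later.
                shutdownHook.run();
            }
        }, checkIntervalSec, checkIntervalSec, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}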

7. Reviving this old thread. Alexander Shraer has a valid concern about trading off consistency for availability. However, for the specific issue being addressed here, we can have both.

The patch skips transaction logs with an incomplete header (the first 16 bytes). Skipping such files should not cause any loss of data, as the header is an internal bookkeeping write from ZooKeeper and does not contain any user data. This avoids the current behavior of ZooKeeper crashing on encountering an incomplete header, which compromises availability.

This has been a recurring problem for us in production because our app's operating environment occasionally causes a ZooKeeper server's disk to become full. After that, the server invariably runs into this problem, perhaps because something else deterministically triggers a log rotation when the previous txn log throws an IOException due to a full disk?

That said, we can tighten the exception being caught in Michi Mutsuzaki's patch to EOFException instead of IOException to make sure that the log we are skipping indeed only has a partially written header and nothing else (in FileTxnLog.goToNextLog).

Additionally, I have written a test to verify that EOFException is thrown if and only if the header is truncated. ZooKeeper already ignores any other partially written transactions in the txn log. If that's useful, I can upload the test; thanks.
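A rough sketch of such a check (not the actual test from the patch), assuming the pre-fix behavior described in this thread, where reading a txn log whose 16-byte header is truncated surfaces an EOFException from FileTxnLog.read():

import java.io.EOFException;
import java.io.File;
import java.io.FileOutputStream;
import java.nio.file.Files;

import org.apache.zookeeper.server.persistence.FileTxnLog;

public class TruncatedHeaderCheck {
    public static void main(String[] args) throws Exception {
        // Create a txn log file whose header is truncated (fewer than 16 bytes).
        File logDir = Files.createTempDirectory("txnlog-test").toFile();
        try (FileOutputStream out = new FileOutputStream(new File(logDir, "log.1"))) {
            out.write(new byte[8]);
        }
        try {
            // Pre-fix, reading the log directory hits the truncated header and fails.
            new FileTxnLog(logDir).read(1);
            System.out.println("no exception: header was readable");
        } catch (EOFException e) {
            System.out.println("EOFException, as expected for a truncated header");
        }
    }
}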

8. Agreed. Forcing users to manually clean up the partial/empty header in this scenario seems undesirable, and if we only catch EOFException instead of IOException, we shouldn't run into any problems with correctness. Additionally, since this issue should only occur "legitimately" in the most recent txn log file, we can be even more conservative and only continue in that case.

9. Thanks, Meyer Kizner. Your suggestion of doing this only for the most recent txn log file is sound. Are you also suggesting that we delete the truncated txn log file? Because if we skip it and don't delete it, newer txn log files will get created in the future, so the truncated txn log file will no longer be the latest txn log when we do a purge afterwards. Deletion seems consistent with this approach as well as with PurgeTxnLog's behavior.

10. Yes, we would have to delete such a log file upon encountering it. I don't believe this would cause any problems, and it seems desirable to have the extra check this enables.

11. The proposal of the fix makes sense to me.

Is it feasible to make a stronger guarantee for the ZooKeeper serialization semantics, namely that under no circumstances (disk full, power failure, hardware failure) would ZooKeeper generate invalid persistent files (for both snapshots and txn logs)? This might be possible by serializing to a swap file first and then, at some point, doing an atomic rename of the file. With that guarantee about the sanity of the on-disk format, the deserialization logic would be simplified, as there would not be many corner cases to consider beyond the existing basic checksum check.
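A minimal sketch of that write-to-a-swap-file-then-atomic-rename idea, using standard java.nio APIs; writeSnapshotAtomically is an illustrative helper, not an existing ZooKeeper method, and ZooKeeper's actual snapshot writer works differently.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class AtomicSnapshotWrite {

    static void writeSnapshotAtomically(Path finalPath, byte[] serializedSnapshot)
            throws IOException {
        // Write the full snapshot to a temp file in the same directory so the
        // rename stays on one filesystem and can be atomic.
        Path tmp = Files.createTempFile(finalPath.getParent(), "snapshot", ".tmp");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(serializedSnapshot));
            ch.force(true); // flush the data to disk before publishing the file
        }
        // Publish under the final name in one atomic step, so a reader never
        // observes a partially written snapshot, even after a crash or full disk.
        Files.move(tmp, finalPath, StandardCopyOption.ATOMIC_MOVE);
    }
}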

I can think of two potential drawbacks of this approach:

  • Performance: if we write to a swap file and then rename for every write, we will be making more syscalls per write. That might impact write performance/latency.
  • Potential data loss during recovery: to improve performance, we could batch writes and only do the rename at certain points (e.g. every 1000 writes). In case of a failure, part of the data might be lost, as the data (possibly corrupted or partially serialized) living in the swap file will not be parsed by ZK during startup (we only load and parse renamed files).

My feeling is that the best approach might be a mix of efforts on both the serialization and deserialization sides:

  • When serializing, we make our best effort to avoid generating corrupted files (e.g. through atomic writes to files).
  • When deserializing, we make our best effort to detect corrupt files and recover conservatively. The success of recovery might be case by case: for this disk-full case the proposed fix sounds pretty safe to perform, while in other cases it might not be straightforward to tell which data is good and which is bad.
  • As a result, the expectation is that when things crash and files get corrupted, ZK should be able to recover later without manual intervention. That would be good for users.

For details, see: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
