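While testing Elasticsearch, I ran into a translog corruption exception. This post records how I fixed it.
1. Problem
A single machine held 800 million+ documents in one index with 20+ fields. I was writing data continuously with bulk requests (bulk.size=50000) when the machine unexpectedly lost power and went down.
After the machine was repaired and ES was restarted, a TranslogCorruptedException was thrown.
The log reported that four shards failed to start, and bulk writes to the index failed.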
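2. Solution
I looked into several ways to repair this, including Lucene's CheckIndex repair tool.
The official description of CheckIndex: Basic tool and API to check the health of an index and write a new segments file that removes reference to problematic segments.
In other words, the data in the damaged segments is lost.
I wanted a fix that loses as little data as possible, and found a similar problem on the Google group: ES failed to recover after crash
The solution given by Motov: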
- shut down elasticsearch cluster
- find all shards that cannot recover by searching log file
- for each shard move its non-zero length translog file into a temporary directory (see explanation below)
- start elasticsearch cluster
- if you see messages for other shards - repeat
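In other words:
Shut down the cluster --> find the shards that cannot start --> remove the translog of those shards (make a backup first) --> restart the ES cluster
If that still does not work, repeat the steps above.
I tried removing the translog of the problem shards, and sure enough all ES shards started successfully.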
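The "move the non-zero length translog file into a temporary directory" step can be sketched in Python as follows. This is only a minimal sketch, not an official tool: the function name `move_nonempty_translogs` and the `backup_dir` argument are made up for illustration, and the shard directory is assumed to contain a `translog/` subdirectory with files named `translog-*`, as in the layout quoted in this post. It should only be run while ES is shut down.

```python
import shutil
from pathlib import Path


def move_nonempty_translogs(shard_dir, backup_dir):
    """Move the non-zero-length translog files of one shard into backup_dir.

    shard_dir is assumed to look like .../indices/<index>/<shard> and to
    contain a translog/ subdirectory (an assumption based on the quoted path).
    Returns the list of destination paths of the files that were moved.
    """
    translog_dir = Path(shard_dir) / "translog"
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)

    moved = []
    for f in sorted(translog_dir.glob("translog-*")):
        # Zero-length files are left in place; only files with content move.
        if f.is_file() and f.stat().st_size > 0:
            dest = backup / f.name
            shutil.move(str(f), str(dest))
            moved.append(dest)
    return moved
```

Using a separate `backup_dir` per shard avoids name collisions between shards and keeps the moved files available in case they are needed later.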
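3. Analysis and summary
The ES translog contains all changes made to ES and is an important component for data backup and recovery.
If the machine goes down while the translog is being written, the write does not finish normally and the translog file does not end with a proper end marker,
which leads to an EOFException.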
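To illustrate why an interrupted write shows up as an end-of-file error during recovery, here is a toy example. It does not use the real translog format: the length-prefixed record layout and the names `append_record` and `replay` are assumptions made up purely for this illustration.

```python
import io
import struct


def append_record(buf, payload):
    """Append one record: a 4-byte big-endian length, then the payload."""
    buf.write(struct.pack(">I", len(payload)))
    buf.write(payload)


def replay(data):
    """Return all records in data; raise EOFError if the tail is truncated."""
    stream = io.BytesIO(data)
    records = []
    while True:
        header = stream.read(4)
        if not header:
            return records  # clean end of the log
        if len(header) < 4:
            raise EOFError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            # The writer died before the whole record reached the file.
            raise EOFError(
                "truncated record: expected %d bytes, got %d" % (length, len(payload))
            )
        records.append(payload)


log = io.BytesIO()
append_record(log, b"index doc 1")
append_record(log, b"index doc 2")
complete = log.getvalue()
truncated = complete[:-3]  # simulate a crash in the middle of the last write
```

Calling `replay(complete)` returns both records, while `replay(truncated)` raises `EOFError` on the last record. This is the same kind of failure as described above: everything before the interrupted write is intact, but the reader cannot get past the damaged tail.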
Also, Motov's full answer:
In nel's case it was corrupted transaction log. When you run out of disk space sometimes the last transaction cannot be fully written into transaction log and then it fails on recovery. If you see exactly the same error messages, you can try the following:
- shut down elasticsearch cluster
- find all shards that cannot recover by searching log file
- for each shard move its non-zero length translog file into a temporary directory (see explanation below)
- start elasticsearch cluster
- if you see messages for other shards - repeat
If you see message like this:
[2012-06-22 17:36:17,165][WARN ][indices.cluster ] [Cat-Man] [myindex][1] failed to start shard
It means that it cannot recover shard 1 of the index myindex on the node Cat-Man. If you take a look at data/elasticsearch/nodes/0/indices/myindex/1/translog directory, you will find files like this: translog-123456677899 or translog-123456677899.recovering. One of them will have non-zero length. Move it to a temporary directory and try starting the server.
The transaction log files that you will be moving out contain your most recently updated and indexed documents. So, these updates will be lost as a result of this operations, but you should be able to recover the rest of your data.
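The "find all shards that cannot recover by searching log file" step could be sketched like this. The regular expression is an assumption based only on the single sample message quoted above (node name, index name and shard number in brackets, followed by "failed to start shard"); other ES versions may log a different format. The function name `find_failed_shards` is hypothetical.

```python
import re

# Assumed pattern, derived from the one sample line quoted above:
# ... [node] [index][shard] failed to start shard
FAILED_SHARD = re.compile(
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\s*"
    r"failed to start shard"
)


def find_failed_shards(log_text):
    """Return a sorted list of unique (node, index, shard) tuples from log_text."""
    found = set()
    for line in log_text.splitlines():
        m = FAILED_SHARD.search(line)
        if m:
            found.add((m.group("node"), m.group("index"), int(m.group("shard"))))
    return sorted(found)
```

Each returned tuple identifies one shard directory whose translog needs to be inspected; the same shard reported several times in the log is listed only once.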