Background
Our ES cluster is read-heavy and write-light. A recent requirement called for reprocessing most of the production data, roughly 5/6 of it, somewhere around 3.5 TB to 4 TB. So I kicked off the job with update_by_query from Kibana, running 4 processes against the live data.
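For reference, the reprocessing used the update-by-query API; a minimal sketch of such a job (the index name, query, and script below are placeholders, not the actual job):

```console
POST index/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "query": { "range": { "created_at": { "lt": "2020-01-01" } } },
  "script": { "source": "ctx._source.processed = true" }
}
```

With wait_for_completion=false the request returns a task ID immediately, so a long-running rewrite like this can be tracked (or cancelled) through the tasks API instead of holding the connection open.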
The Incident
Alerts started firing at 18:55: one of the ES machines had gone down, breaking live writes and affecting real-time data. I restarted it immediately. It turned out that because the node had been down for so long, the cluster had evicted it and rebalanced the data onto the remaining nodes. On top of that, at some point while loading data I had, inexplicably, set the replica count of three indices to 0. So after this machine came back up, 5 shards were nowhere to be found. It was all historical data, but still alarming, so I scrambled to deal with it.
The Fix
The first step was to find out why the shards were unassigned. In Kibana, run:
GET _cluster/allocation/explain
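The explain API can also be pointed at one specific shard via a request body, which is handy when several shards are unassigned (the index name and shard number here are illustrative):

```console
GET _cluster/allocation/explain
{
  "index": "index",
  "shard": 0,
  "primary": true
}
```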
This showed that the shards could not be assigned because:
cannot allocate because all found copies of the shard are either stale or corrupt
Some searching suggested this happens when the surviving copies of a shard are stale, i.e. older than the cluster's last known good copy. The fix is the allocate_stale_primary command of _cluster/reroute, which force-promotes a stale copy to primary. allocate_stale_primary can lose data: any writes the stale copy never received are gone. But since these indices held only historical data that was no longer being updated, that trade-off was acceptable in my scenario, so I went ahead in Kibana:
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "index",
        "shard": 0,
        "node": "node-01",
        "accept_data_loss": true
      }
    }
  ]
}
The node name can be looked up with GET _nodes/process?pretty. After running this for each shard, 3 of them recovered, but 2 did not, and the logs showed the following error:
[2020-04-06T14:22:08,988][WARN ][o.e.i.c.IndicesClusterStateService] [node-01] [[index][9]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [index][9]: Recovery failed on {node-01}{mHXkGfZCSwqtPA0Qq8mDXA}{x9amFJOHRrCb8G2Ua8bz-Q}{ml.machine_memory=65844264960, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2059) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0.jar:6.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to fetch index version after copying it over
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:388) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1603) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2055) ~[elasticsearch-6.4.0.jar:6.4.0]
... 4 more
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: shard allocated for local recovery (post api), should exist, but doesn't, current files: []
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:373) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1603) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2055) ~[elasticsearch-6.4.0.jar:6.4.0]
... 4 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/data/software/elasticsearch-6.4.0/data/nodes/0/indices/rBobrStOTTmDLHSfQuHLEw/9/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@774cb153)): files: []
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:122) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:204) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:189) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:363) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1603) ~[elasticsearch-6.4.0.jar:6.4.0]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2055) ~[elasticsearch-6.4.0.jar:6.4.0]
... 4 more
I then looked at the path from the error, and sure enough the index directory was empty; even the translog was gone. Panic again. After more frantic searching I also tried POST _cluster/reroute?retry_failed=true, but the shards still did not recover. The last resort was allocate_empty_primary, which would mean losing the data outright. Unwilling to accept that, I kept digging and finally found the answer on Stack Overflow: https://stackoverflow.com/questions/49005638/healthy-elasticsearch-cluster-turns-red-after-opening-a-closed-index. Following that answer, I first ran GET _shard_stores?pretty to find the nodes that still held on-disk copies of the shards, then ran allocate_stale_primary again with the node taken from that output. After doing this once for each of the two shards, they both recovered.
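The two-step sequence from that answer looks like this (the index name, shard number, and node name below are placeholders; the node must be the one listed in the _shard_stores output, not the one the shard was originally assigned to):

```console
GET _shard_stores?pretty

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "index",
        "shard": 9,
        "node": "node-02",
        "accept_data_loss": true
      }
    }
  ]
}
```

The stores section of the _shard_stores response lists, per shard, every node that still holds a physical copy on disk; pointing allocate_stale_primary at one of those nodes is what made the second attempt succeed.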
Takeaways
- When production goes down, restore the service as fast as possible first; try hard to recover the data, and only fall back to other options when recovery proves impossible.
- Always configure replicas for every index; without them, recovery after a machine failure is very hard.
- The root cause of the incident was memory exhaustion; we need to review the code carefully to keep this from happening again.
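The replica rule in the second point can be applied, or fixed after the fact, with a single settings call; a sketch assuming the index is named index:

```console
PUT index/_settings
{
  "index": { "number_of_replicas": 1 }
}
```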