A Production Elasticsearch Cluster Incident

Background

      Our Elasticsearch cluster is read-heavy and write-light. A recent requirement meant reprocessing most of the production data, roughly five sixths of it, somewhere around 3.5 to 4 TB. I ran the job against the live cluster with the update_by_query API (issued from Kibana), split across four parallel processes.
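      For reference, each process issued a call along these lines (a sketch only; the index name, query, and script here are illustrative placeholders, not the actual job):

POST index/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "query": { "range": { "created_at": { "lt": "2020-01-01" } } },
  "script": {
    "lang": "painless",
    "source": "ctx._source.processed = true"
  }
}

      With wait_for_completion=false the call returns a task id immediately, which can then be polled through the tasks API.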

The Incident

      At 18:55 the alerts started firing: one of the ES machines had gone down, breaking production writes and affecting real-time data. I restarted it immediately. After the restart I found that, because the node had been down for so long, the cluster had already evicted it and rebalanced the data onto the remaining nodes. On top of that, at some point during my data loading the replica count of three indexes had somehow been set to 0. Once the machine came back up, five shards could not be found. They held only historical data, but it was still alarming, so I set about recovering them right away.
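      A quick way to see which shards are missing is the _cat API (a sketch; the column selection is just what I find useful):

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason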

Resolution

      The first step was to find out why the shards were unassigned. In Kibana, run:

GET _cluster/allocation/explain
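      Called with no body, this explains an arbitrary unassigned shard; it can also be pointed at a specific one (the index name and shard number below are placeholders):

GET _cluster/allocation/explain
{
  "index": "index",
  "shard": 0,
  "primary": true
}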

      The explanation for the failed shards showed that they could not be assigned because:

cannot allocate because all found copies of the shard are either stale or corrupt
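      For context, the relevant part of the 6.x explain response looks roughly like this (abbreviated and illustrative, not the verbatim output from our cluster):

{
  "index": "index",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt"
}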

      Some searching suggested this happens when the surviving on-disk copies of the shard are stale, i.e. their version is older than the state the master expects. Searching further for a fix, I found that the allocate_stale_primary command of _cluster/reroute can force the primary to be reallocated from such a stale copy. allocate_stale_primary can lose some data, but since these indexes held only old data that would never be updated again, it suited my scenario well. So I went ahead in Kibana:

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "index",
        "shard": 0,
        "node": "node-01",
        "accept_data_loss" : true
      }
    }
  ]
}
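      The node value must be a node name or id. A compact way to list them (a sketch; the column selection is illustrative):

GET _cat/nodes?v&h=name,ip,node.role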

      Node names are also available via _nodes/process?pretty. After running the reroute, three of the five shards recovered, but the remaining two did not. Checking the logs turned up the following error:

[2020-04-06T14:22:08,988][WARN ][o.e.i.c.IndicesClusterStateService] [node-01] [[index][9]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [index][9]: Recovery failed on {node-01}{mHXkGfZCSwqtPA0Qq8mDXA}{x9amFJOHRrCb8G2Ua8bz-Q}{ml.machine_memory=65844264960, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2059) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0.jar:6.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to fetch index version after copying it over
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:388) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1603) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2055) ~[elasticsearch-6.4.0.jar:6.4.0]
        ... 4 more
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: shard allocated for local recovery (post api), should exist, but doesn't, current files: []
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:373) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1603) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2055) ~[elasticsearch-6.4.0.jar:6.4.0]
        ... 4 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(MMapDirectory@/data/software/elasticsearch-6.4.0/data/nodes/0/indices/rBobrStOTTmDLHSfQuHLEw/9/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@774cb153)): files: []
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
        at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:442) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
        at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:122) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:204) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:189) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:363) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:301) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1603) ~[elasticsearch-6.4.0.jar:6.4.0]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$4(IndexShard.java:2055) ~[elasticsearch-6.4.0.jar:6.4.0]
        ... 4 more

      I then went to look at that path on disk, and sure enough the index directory was empty; even the translog was gone, which left me at a loss. After more frantic searching I tried POST _cluster/reroute?retry_failed=true, but the two shards still did not recover. The last resort would have been allocate_empty_primary, which recreates the shard empty and loses the data outright.

      Unwilling to accept that, I went back to searching and finally found a solution on Stack Overflow: https://stackoverflow.com/questions/49005638/healthy-elasticsearch-cluster-turns-red-after-opening-a-closed-index. Following that answer, I first ran GET _shard_stores?pretty to find which nodes actually held copies of the missing shards, then reran allocate_stale_primary with the node swapped for the one just found. After doing this once for each of the two shards, they both recovered.
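      Pieced together, the final sequence looked roughly like this (the index name and node name are placeholders; shard 9 is taken from the log above, and in practice the node comes from the _shard_stores output):

# Find which nodes still hold an on-disk copy of each unassigned shard
GET _shard_stores?pretty

# Reallocate the stale primary onto a node that _shard_stores reported
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "index",
        "shard": 9,
        "node": "node-02",
        "accept_data_loss": true
      }
    }
  ]
}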

Summary

  1. When production goes down, restore service first. Try to recover the data; only fall back to other options once recovery proves impossible.
  2. Always configure replicas for every index; without them, recovery after a machine failure is very hard (see the sketch after this list).
  3. The root cause of the outage was memory exhaustion on the node. The follow-up is to audit the processing code carefully to prevent a recurrence.
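      For point 2, checking and restoring the replica setting is straightforward (a sketch; the index name and replica count are placeholders):

# Check the current replica count
GET index/_settings?filter_path=*.settings.index.number_of_replicas

# Restore at least one replica
PUT index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}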