Elasticsearch stuck in yellow: recovering unassigned replica shards after a node restart

starcwang

Category: elasticsearch. Tags: elasticsearch, unassigned, yellow, replica, cluster

Article link: https://blog.csdn.net/u012546526/article/details/78598608

Problem

A three-node Elasticsearch 5.1 cluster lost two of its nodes. They were restarted promptly, and the cluster health went from red to yellow, but then it stayed yellow.

Analysis

Call the cluster health API:

GET /_cluster/health
----------------------
{
  "cluster_name": "es",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 37,
  "active_shards": 71,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 95.94594594594594
}

The response shows three unassigned shards.
Next, list the shards:
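
As a quick sanity check on the health response, the percentage can be reproduced from the shard counts. This is a minimal sketch that assumes the value is active shards divided by active plus initializing plus unassigned shards; that assumption matches the numbers in this response, but it is not taken from the Elasticsearch source. The helper name active_shards_percent is made up for illustration.

```python
# Relevant fields copied from the GET /_cluster/health response above.
HEALTH = {
    "active_shards": 71,
    "initializing_shards": 0,
    "unassigned_shards": 3,
}


def active_shards_percent(health: dict) -> float:
    """Percentage of shards that are active.

    Assumption: total = active + initializing + unassigned.
    """
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return 100.0 * health["active_shards"] / total


if __name__ == "__main__":
    # 71 of 74 shards are active, roughly 95.95 percent.
    print(active_shards_percent(HEALTH))
```
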

GET /_cat/shards
my_index             4 p STARTED    7795827  10.7gb 192.168.0.205 node-205-1
my_index             4 r UNASSIGNED                               
my_index             3 p STARTED    7801305  12.4gb 192.168.0.205 node-205-1
my_index             3 r UNASSIGNED                               
my_index             2 r STARTED    7797142  10.6gb 192.168.0.149 node-149-1
my_index             2 p STARTED    7797211  10.6gb 192.168.0.173 node-173-1
my_index             1 p STARTED    7801554  11.4gb 192.168.0.205 node-205-1
my_index             1 r UNASSIGNED                               
my_index             0 r STARTED    7795061  10.8gb 192.168.0.149 node-149-1
my_index             0 p STARTED    7795107  10.8gb 192.168.0.173 node-173-1

The primary shards are all fine, but the replicas of shards 1, 3, and 4 are unassigned.
Elasticsearch normally allocates shards automatically, so first check whether allocation is enabled:
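
With many indices, picking the unassigned rows out by eye gets tedious. The sketch below filters the plain-text _cat/shards output, assuming the default column order seen above (index, shard number, p or r, state). The function name and the shortened sample are illustrative only.

```python
# A few rows copied from the GET /_cat/shards output above.
SAMPLE = """\
my_index             4 p STARTED    7795827  10.7gb 192.168.0.205 node-205-1
my_index             4 r UNASSIGNED
my_index             3 r UNASSIGNED
my_index             2 r STARTED    7797142  10.6gb 192.168.0.149 node-149-1
my_index             1 r UNASSIGNED
"""


def unassigned_shards(cat_output: str):
    """Return (index, shard, 'p' or 'r') for every UNASSIGNED row."""
    result = []
    for line in cat_output.splitlines():
        cols = line.split()
        # Expected columns: index, shard, prirep, state, then optional details.
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            result.append((cols[0], int(cols[1]), cols[2]))
    return result


if __name__ == "__main__":
    for index, shard, kind in unassigned_shards(SAMPLE):
        print(index, shard, kind)
```
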

GET /_cluster/settings
-----------------------
{
    "persistent": {},
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "enable": "all"
                }
            }
        }
    }
}
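
The same check can be scripted. This is a rough sketch, assuming the nested response shape shown above (not the flat_settings form), that transient settings take precedence over persistent ones, and that the setting defaults to "all" when neither scope sets it. The function name is hypothetical.

```python
def allocation_enable(settings: dict) -> str:
    """Effective cluster.routing.allocation.enable from a cluster settings response."""
    for scope in ("transient", "persistent"):  # transient is checked first
        value = (
            settings.get(scope, {})
            .get("cluster", {})
            .get("routing", {})
            .get("allocation", {})
            .get("enable")
        )
        if value is not None:
            return value
    # Neither scope sets it explicitly; assume the default.
    return "all"


if __name__ == "__main__":
    response = {
        "persistent": {},
        "transient": {"cluster": {"routing": {"allocation": {"enable": "all"}}}},
    }
    print(allocation_enable(response))
```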

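Automatic allocation is enabled, so it is odd that these three replicas were never allocated.
Logging in to the machine and checking the Elasticsearch log turned up this error: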

[2017-11-21T15:43:54,799][WARN ][o.e.i.c.IndicesClusterStateService] [node-149-1] [[my_index][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [my_index][4]: Recovery failed from {node-205-1}{WPC32CtxTtOiTPuCseqF8g}{AyjHnVtwSnik2Rcu_SQg8A}{192.168.0.205}{192.168.0.205:9300} into {node-149-1}{fa9ZVqyXSHKhYHvAhr8x6w}{ECrBVQS_QPOXtc9E0is9Tw}{192.168.0.149}{192.168.0.149:9300}
......
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
......
Caused by: java.lang.IllegalStateException: try to recover [my_index][4] from primary shard with sync id but number of docs differ: 7728586 (node-205-1, primary) vs 7728583(node-149-1)
......

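So Elasticsearch did try to recover the replicas; the recovery itself failed.
I had no idea what had gone wrong, so I searched Google.
I tried several approaches, for example: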
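
The last "Caused by" line carries the useful detail: the primary and the replica copy share a sync id, yet their document counts differ, so the recovery is rejected. A small sketch for pulling those numbers out of the log follows. The regular expression is written against the exact message above and is an assumption about its format, not a documented log schema.

```python
import re

# The IllegalStateException message copied from the log above.
LOG_LINE = (
    "Caused by: java.lang.IllegalStateException: try to recover [my_index][4] "
    "from primary shard with sync id but number of docs differ: "
    "7728586 (node-205-1, primary) vs 7728583(node-149-1)"
)

DOCS_DIFFER = re.compile(
    r"number of docs differ: (\d+) \(([^,]+), primary\) vs (\d+)\s*\(([^)]+)\)"
)


def parse_doc_mismatch(line: str):
    """Return the doc counts on both copies, or None if the line does not match."""
    match = DOCS_DIFFER.search(line)
    if match is None:
        return None
    primary_docs, primary_node, replica_docs, replica_node = match.groups()
    return {
        "primary_node": primary_node,
        "primary_docs": int(primary_docs),
        "replica_node": replica_node,
        "replica_docs": int(replica_docs),
        "difference": int(primary_docs) - int(replica_docs),
    }


if __name__ == "__main__":
    print(parse_doc_mismatch(LOG_LINE))
```

Here the replica copy on node-149-1 is three documents behind the primary.
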

A Stack Overflow answer suggested recovering with /_cluster/reroute. My situation was different from the asker's, though: all of my primary shards were healthy, so /_cluster/reroute returned an error:

POST /_cluster/reroute
{
    "commands" : [ {
        "allocate_empty_primary" : {
            "index" : "my_index",
            "shard" : 1,
            "node" : "node-149-1",
            "accept_data_loss":true
        }
    }]
}
-----------------------------
{
    "error": {
        "root_cause": [
            {
                "type": "remote_transport_exception",
                "reason": "[node-205-1][192.168.0.205:9300][cluster:admin/reroute]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "[allocate_empty_primary] primary [my_index][1] is already assigned"
    },
    "status": 400
}

The error says the primary of shard 1 is already assigned, so this approach does not work here.

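I then found an article, 解决elasticsearch集群Unassigned Shards 无法reroute的问题 ("Fixing unassigned shards that cannot be rerouted in an Elasticsearch cluster"), which described a similar situation. Its author solved the problem by reindexing, but that was not an option for me because this is a hot index.
After that I came across another article, "It's RED!!! How do I recover unassigned elasticsearch cluster shards?", which pointed me toward the fix.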

Solution

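All primary shards are healthy and only replica shards are broken, so dropping the replicas and letting Elasticsearch rebuild them from the primaries should fix the problem.
First, set the number of replicas for the affected index to 0: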

PUT /my_index/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}
--------------------------
{
    "acknowledged": true
}

Now check the shards again:

GET /_cat/shards
my_index             4 p STARTED    7795827  10.7gb 192.168.0.205 node-205-1
my_index             3 p STARTED    7801305  12.4gb 192.168.0.205 node-205-1
my_index             2 p STARTED    7797211  10.6gb 192.168.0.173 node-173-1
my_index             1 p STARTED    7801554  11.4gb 192.168.0.205 node-205-1
my_index             0 p STARTED    7795107  10.8gb 192.168.0.173 node-173-1

The replica shards are gone.
Next, set the replica count back to 1:

PUT /my_index/_settings
{
    "index" : {
        "number_of_replicas" : 1
    }
}
--------------------------
{
    "acknowledged": true
}

Once the replicas had been allocated automatically, the cluster went back to green.
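
The whole procedure can be written down as a short script. The sketch below only builds the three requests: the two settings updates from above, plus a health call with wait_for_status=green that is not part of the original steps. The base URL http://localhost:9200, the 60s timeout, and the helper names are assumptions for illustration. The send helper uses only the Python standard library and is defined but not called here.

```python
import json
import urllib.request


def replica_reset_steps(index: str, replicas: int = 1, timeout: str = "60s"):
    """Requests that drop an index's replicas and then recreate them."""
    settings_path = f"/{index}/_settings"
    return [
        # 1. Remove all replica copies, including the ones that fail recovery.
        ("PUT", settings_path, {"index": {"number_of_replicas": 0}}),
        # 2. Ask for replicas again so they are rebuilt from the primaries.
        ("PUT", settings_path, {"index": {"number_of_replicas": replicas}}),
        # 3. Block until the index is green or the timeout expires.
        ("GET", f"/_cluster/health/{index}?wait_for_status=green&timeout={timeout}", None),
    ]


def send(base_url: str, method: str, path: str, body=None):
    """Send one request to Elasticsearch and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the plan only; nothing is sent to a cluster.
    for method, path, body in replica_reset_steps("my_index"):
        print(method, path, body)
```

To run it for real, each step could be passed to send("http://localhost:9200", method, path, body). Rebuilding shards of around 10 GB each will likely take longer than 60 seconds, in which case the health call returns with "timed_out": true and should be repeated. The index has no redundancy between steps 1 and 2, so the two settings calls are best issued back to back.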
