Fixing an Elasticsearch Cluster in Yellow (Degraded) Health Status

Latest recommended article published on 2024-08-14 14:41:56

GottdesKrieges

Category: Redis+ES+ELK. Tags: elasticsearch, big data, search engine

Article link: https://blog.csdn.net/Sebastien23/article/details/129188905

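This article describes how to troubleshoot and fix an Elasticsearch cluster that reports Yellow health status. The Kibana Console API is used to inspect index, shard, and node information, which reveals unassigned replica shards. Running the /_cluster/reroute command to allocate a replica manually or to retry failed allocations, while making sure that disk space is sufficient and that a primary shard and its replica are not on the same node, restores cluster health.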

Problem Background

The Elasticsearch cluster health status is Yellow, and multiple indices are affected.

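Troubleshooting Procedure

Open the Kibana Console in a browser to investigate. The console address is:

http://{Kibana_IP}:5601/app/dev_tools#/console

Run the following API commands in the console to collect basic information: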

GET _cat/health?v
GET _cat/master?v
GET _cat/nodes?v
GET _cat/indices?v

GET _cat/shards?v
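# Columns in the output:
# shard: shard number; prirep: primary (p) or replica (r)
# state: shard state, one of INITIALIZING | RELOCATING | STARTED | UNASSIGNED
# docs: number of documents in the shard; store: disk space used by the shard

GET _cat/allocation?v
# Shows the number of shards allocated to each node and the disk space they use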

List the indices whose health status is Yellow:

GET _cat/indices?v&health=yellow

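The output columns are health, status (index status), index (index name), uuid, pri (number of primary shards), rep (number of replicas), docs.count, docs.deleted, store.size, and pri.store.size.

Pick any one of the unhealthy indices returned above (say ftimes_infra_migrad_2022-09) and look at its shards: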

GET _cat/shards/ftimes_infra_migrad_2022-09?v

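The output columns are index, shard (shard number), prirep (primary or replica), state, docs, store (shard size), ip, and node (the node that holds the shard).

Look at how each shard of the target index is allocated. With Yellow health you will usually find replica shards that have not been allocated: rows with prirep=r whose state is UNASSIGNED.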
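When an index has many shards, scanning the table by eye gets tedious. The following is a minimal Python sketch of the same check, assuming the text of a `GET _cat/shards/<index>?v` response has been copied into a string. The function names and the sample rows (document count, size, IP, node es_data_20) are made up for illustration, and the parser assumes node names contain no spaces and that no shard is relocating.

```python
def parse_cat_table(text):
    """Parse whitespace-separated `_cat` output (with the ?v header row) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        # UNASSIGNED rows leave the trailing columns (docs, store, ip, node)
        # blank, so pad the row with empty strings up to the header width.
        cells += [""] * (len(header) - len(cells))
        rows.append(dict(zip(header, cells)))
    return rows


def unassigned_replicas(rows):
    """Return (index, shard number) for every replica row in state UNASSIGNED."""
    return [
        (row["index"], int(row["shard"]))
        for row in rows
        if row["prirep"] == "r" and row["state"] == "UNASSIGNED"
    ]


# Illustrative sample only; the values are not taken from a real cluster.
SAMPLE = """\
index                       shard prirep state      docs store ip       node
ftimes_infra_migrad_2022-09 0     p      STARTED    1200 3.1mb 10.0.0.1 es_data_20
ftimes_infra_migrad_2022-09 0     r      UNASSIGNED
"""

if __name__ == "__main__":
    print(unassigned_replicas(parse_cat_table(SAMPLE)))
```

The same parser works for other `_cat` tables with a header row, as long as no cell in the middle of a row is blank.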

Suppose the unassigned replica is shard 0. Check why its allocation failed:

GET _cluster/allocation/explain
{
  "index": "ftimes_infra_migrad_2022-09",
  "shard": 0,
  "primary": false
}

Look at the explanation field in the output:

...
"explanation": "shard has exceeded the maximum number of retries [5] on failed
allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry,
..."
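If the explain response is long, the relevant hint can be located programmatically. The sketch below is one possible approach, assuming the response has already been loaded into a Python dict (for example with `json.loads`). It makes no assumption about where the "explanation" strings sit in the response and simply searches every nesting level. The function names and the trimmed sample are hypothetical stand-ins, not real cluster output.

```python
def iter_explanations(node):
    """Yield every string stored under an "explanation" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "explanation" and isinstance(value, str):
                yield value
            else:
                yield from iter_explanations(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_explanations(item)


def needs_retry_failed(response):
    """True if any explanation suggests calling reroute with retry_failed=true."""
    return any("retry_failed=true" in text for text in iter_explanations(response))


# Hypothetical, heavily trimmed stand-in for an allocation explain response.
SAMPLE_RESPONSE = {
    "index": "ftimes_infra_migrad_2022-09",
    "shard": 0,
    "primary": False,
    "decisions": [
        {
            "explanation": (
                "shard has exceeded the maximum number of retries [5] on failed "
                "allocation attempts - manually call "
                "[/_cluster/reroute?retry_failed=true] to retry"
            )
        }
    ],
}

if __name__ == "__main__":
    print(needs_retry_failed(SAMPLE_RESPONSE))
```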

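Solution

Next, we try to allocate the replica shard manually. Make sure the target node has enough free disk space and does not already hold the primary copy of the same shard, since Elasticsearch will not place a primary and its replica on the same node.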

# Check the shard size and the node that holds the primary shard
GET _cat/shards/ftimes_infra_migrad_2022-09?v

# Check disk usage on each node
GET _cat/allocation?v

# Manually allocate the replica shard to the node es_data_21
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "ftimes_infra_migrad_2022-09",
        "shard": 0,
        "node": "es_data_21"
      }
    }
  ]
}
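The two preconditions (enough disk space, and no copy of the same shard already on the node) can be expressed as a small helper. This is only a sketch under simplifying assumptions: free space and shard size are supplied as plain byte counts read off the `_cat` output by hand, disk watermarks and other allocation rules are ignored, and the function names and numbers are invented for illustration.

```python
import json


def pick_target_node(free_bytes_by_node, nodes_with_copy, shard_bytes):
    """Return the node with the most free space that can take the replica, or None.

    free_bytes_by_node: node name -> free disk space in bytes
    nodes_with_copy:    nodes that already hold a copy of this shard
    shard_bytes:        size of the shard in bytes
    """
    candidates = {
        node: free
        for node, free in free_bytes_by_node.items()
        if node not in nodes_with_copy and free > shard_bytes
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def allocate_replica_body(index, shard, node):
    """Build the request body for POST /_cluster/reroute.

    The reroute API expects the top-level key "commands" and the command
    name "allocate_replica".
    """
    return {
        "commands": [
            {"allocate_replica": {"index": index, "shard": shard, "node": node}}
        ]
    }


if __name__ == "__main__":
    # Made-up numbers: es_data_20 holds the primary, es_data_22 is nearly full.
    free = {"es_data_20": 500, "es_data_21": 300, "es_data_22": 50}
    node = pick_target_node(free, {"es_data_20"}, shard_bytes=100)
    print(json.dumps(allocate_replica_body("ftimes_infra_migrad_2022-09", 0, node), indent=2))
```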

The reroute request fails with the following error:

...
"type": "illegal_argument_exception",
"reason": "[allocate_replica] allocation of [ftimes_infra_migrad_2022-09][0] on
node {es_data_21}{...}{...} is not allowed, reason: [NO(shard has exceeded the
maximum number of retries [5] on failed allocation attempts - manually call
[/_cluster/reroute?retry_failed=true] to retry, ... )]"

Following the hint in the error message, run:

POST /_cluster/reroute?retry_failed=true

Elasticsearch then automatically retries allocation of the replica shards whose earlier allocation attempts failed.

After a short while, check index health again. If the output contains no index rows, no index is still Yellow:

GET _cat/indices?v&health=yellow
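Instead of rerunning the check by hand, the wait can be scripted. The sketch below shows one way to poll until no Yellow index remains. It deliberately takes the fetch step as a callable (the hypothetical `fetch_yellow_indices`, which should return the list of index names still reported as Yellow), so no particular HTTP client or authentication setup is assumed. The timeout and interval defaults are arbitrary.

```python
import time


def wait_for_no_yellow(fetch_yellow_indices, timeout_s=300.0, interval_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until fetch_yellow_indices() returns an empty list.

    Returns True once no Yellow index remains, or False if timeout_s elapses
    first. `sleep` and `clock` are injectable so the loop can be tested
    without actually waiting.
    """
    deadline = clock() + timeout_s
    while True:
        remaining = fetch_yellow_indices()
        if not remaining:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated responses: two Yellow indices, then one, then none.
    responses = iter([["idx_a", "idx_b"], ["idx_a"], []])
    print(wait_for_no_yellow(lambda: next(responses), sleep=lambda s: None))
```

In real use, `fetch_yellow_indices` would wrap a call to `GET _cat/indices?health=yellow` and return the index column of the result.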

🐟MORE …

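In Kibana Console API commands, you can use the s parameter to sort the results by a given column, and the wildcard * to match any string.

# List all indices in the cluster, sorted by the index column
GET _cat/indices?v&s=index

# List all indices whose names start with ftimes, sorted by the index column
GET _cat/indices/ftimes*?v&s=index

# List all shards of the indices whose names start with gzone
GET _cat/shards/gzone*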
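When these requests are issued from a script rather than typed into the console, the path can be assembled from its parts. The helper below is a small sketch. The name `cat_path` and its parameters are invented here, and it emits `v=true`, which Elasticsearch treats the same as a bare `v`.

```python
from urllib.parse import quote, urlencode


def cat_path(resource, pattern=None, sort=None, verbose=True):
    """Build a _cat request path such as "_cat/indices/ftimes*?v=true&s=index".

    resource: the _cat endpoint, e.g. "indices" or "shards"
    pattern:  optional index name or wildcard pattern, e.g. "ftimes*"
    sort:     optional value for the s parameter, e.g. "index"
    verbose:  include the header row (v=true)
    """
    path = f"_cat/{resource}"
    if pattern:
        # Keep "*" and "," literal so wildcards and comma-separated lists survive.
        path += "/" + quote(pattern, safe="*,")
    params = {}
    if verbose:
        params["v"] = "true"
    if sort:
        params["s"] = sort
    if params:
        path += "?" + urlencode(params, safe=",:")
    return path


if __name__ == "__main__":
    print(cat_path("indices", sort="index"))
    print(cat_path("indices", pattern="ftimes*", sort="index"))
    print(cat_path("shards", pattern="gzone*", verbose=False))
```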