Shard loss after restarting nodes in an Elasticsearch cluster

Latest recommended article published on 2025-02-10 09:03:57

技术菜逼

Category: elasticsearch

Article link: https://blog.csdn.net/w1346561235/article/details/105852936


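This post records a shard-loss incident on an Elasticsearch 5.4.3 cluster: after the nodes were restarted, shards 0 and 3 were lost, which the summary attributes to improper shard configuration. After the node memory settings were changed, the cluster did not recover correctly and the shards could not be allocated, because the files of the original shard copies could not be found. Checking `_cluster/allocation/explain` and `_cluster/state` pointed to data loss related to the in_sync_allocations feature. The temporary fix was to reallocate the shards with the reroute API, at the cost of losing their data. The root cause still needs a deeper investigation so that the problem does not recur.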
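
As a rough illustration of the temporary fix mentioned above, here is a minimal sketch that builds a request body for `POST /_cluster/reroute`. It assumes the `allocate_empty_primary` command was the one used; the exact command the author ran is not visible in this copy of the post. The helper name `build_empty_primary_reroute` is hypothetical, and the index, shard, and node values are the ones that appear in the explain output further down.

```python
import json


def build_empty_primary_reroute(index, shard, node):
    """Build a body for POST /_cluster/reroute (hypothetical helper).

    Assumption: the fix used the allocate_empty_primary command.
    This allocates a brand-new, EMPTY primary on the given node, so any
    data that the lost shard copy held is gone for good. Elasticsearch
    requires accept_data_loss to be set explicitly for this command.
    """
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Values taken from the explain output in this post.
body = build_empty_primary_reroute("dcvs_nonmotorvehicle", 3, "MYSQL2")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the reroute request. Since the command discards whatever the shard used to contain, it only makes sense once no node is known to still hold a usable copy.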


A record of an incident in which Elasticsearch lost shards.

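The cluster runs Elasticsearch 5.4.3, with 3 nodes spread across three hosts, and indices configured with 5 primary shards and 1 replica. For a stress test I needed to raise each node's Java heap from 2G to 6G. Because this is a test cluster, after editing jvm.options (in my setup only the copy under the Elasticsearch installation directory took effect; the one under the configuration directory did not), I simply restarted the nodes one after another, only a few seconds apart.

After that flurry of work the cluster came back up, everyone was happy, and testing continued. About ten days later a developer came to me: a refresh operation was taking more than 10s, and the problem reproduced every time. Once we had identified the affected index, GET /_cluster/state showed that shards 0 and 3 were unassigned. I then dug further with the allocation explain API: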
GET /_cluster/allocation/explain
{
"index": "dcvs_nonmotorvehicle",
"shard": 3,
"primary": true
}
The result:
{
  "index": "dcvs_nonmotorvehicle",
  "shard": 3,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2020-04-10T03:40:41.127Z",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "TklXzLKySf-czdu8zZ5hyQ",
      "node_name": "MYSQL2",
      "transport_address": "10.45.156.202:9300",
      "node_attribute
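
The explain output is cut off at this point in this copy of the post, but the fields that are visible already tell the story. As a minimal sketch (the function name `summarize_explain` and the shape of the returned dict are my own illustration, not part of any Elasticsearch client), the following pulls out the fields that matter when deciding whether a primary can still be recovered:

```python
def summarize_explain(resp):
    """Condense an allocation explain response (hypothetical helper).

    Only reads fields that appear in the output shown above; missing
    fields simply come back as None.
    """
    info = resp.get("unassigned_info", {})
    return {
        "shard": "{}[{}]".format(resp.get("index"), resp.get("shard")),
        "primary": resp.get("primary"),
        "state": resp.get("current_state"),
        "reason": info.get("reason"),
        "since": info.get("at"),
        # True when no node reports a usable on-disk copy of the shard.
        "no_valid_copy": resp.get("can_allocate") == "no_valid_shard_copy",
    }


# The visible part of the response from this post.
SAMPLE = {
    "index": "dcvs_nonmotorvehicle",
    "shard": 3,
    "primary": True,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "CLUSTER_RECOVERED",
        "at": "2020-04-10T03:40:41.127Z",
        "last_allocation_status": "no_valid_shard_copy",
    },
    "can_allocate": "no_valid_shard_copy",
}

summary = summarize_explain(SAMPLE)
for key, value in summary.items():
    print(key, "=", value)
```

Here `reason` is `CLUSTER_RECOVERED` and `no_valid_copy` is true: the shard has been unassigned since the cluster restart, and the cluster knows a primary copy used to exist but cannot find it on any node.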
