Linux Operations Summary: Index Rebuilding and Data Migration in elasticsearch with reindex (Approach 4)

东城绝神

Last modified on 2023-05-25 17:02:17

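Category: 《Linux运维实战总结》 (Linux Operations in Practice) column. Tags: elasticsearch

First published on 2023-05-20 16:35:33

This is an original article by the author and may not be reproduced without the author's permission.

Article link: https://blog.csdn.net/m0_37814112/article/details/130783056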

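I. Use Cases

To migrate data from a remote elasticsearch cluster into a local elasticsearch cluster, you can rebuild the indices with the reindex API. This article walks through how to do it.

Typical use cases for reindex:

1. Migrating data between elasticsearch clusters.
2. Fixing a poor shard layout, for example an index with a large data volume but too few shards, by rebuilding the index with reindex.
3. Changing the mapping of an index that already holds a large amount of data, by copying the data into a new index with reindex.

Note: in elasticsearch, once an index mapping has been defined and data has been indexed, the type of an existing field can no longer be changed (new fields can still be added).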

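II. Notes

1. The Reindex API was introduced in elasticsearch 2.3.0.
2. The destination ES cluster must set the reindex.remote.whitelist parameter to list the remote hosts it is allowed to reindex from.
3. reindex requires _source to be enabled for all documents in the source index.
4. reindex does not attempt to set up the destination index, and it does not copy the settings of the source index. Set up the destination index before running _reindex, including its mappings, shard count, replicas, and so on.

In the elasticsearch.yml configuration file of the destination es cluster, add the remote es cluster to the whitelist and restart the es service, as shown below: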

reindex.remote.whitelist: ["192.168.1.62:19202", "192.168.1.63:19202"]
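
Note 4 above means the destination index should exist with the intended settings before _reindex runs. The following is a minimal sketch of that step, not the author's exact procedure: build_index_settings_body is a hypothetical helper, the shard and replica counts are illustrative, and mappings are left out because they depend on your data.

```shell
#!/bin/sh
# Hypothetical helper: prints a create-index body with explicit shard and replica counts.
# $1 = number_of_shards, $2 = number_of_replicas
build_index_settings_body() {
  printf '{"settings":{"number_of_shards":%d,"number_of_replicas":%d}}\n' "$1" "$2"
}

# Usage (not run here): create test2 with 3 shards and 1 replica before reindexing into it.
# Host, credentials, index name and counts are placeholders.
# curl -s -u elastic:elastic -X PUT "http://192.168.1.62:19203/test2" \
#   -H 'Content-Type: application/json' -d "$(build_index_settings_body 3 1)"
```

Creating the index explicitly keeps elasticsearch from auto-creating it with default settings and dynamically guessed mappings during the reindex.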

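III. Migration Scenarios

3.1 reindex within the local ES cluster

The source index and the destination index are on the same es host or in the same es cluster.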

curl -s -u elastic:elastic -X POST http://192.168.1.62:19203/_reindex -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "test1"
  },
  "dest": {
    "index": "test2"
  }
}'

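3.2 reindex from a remote ES cluster

The source index and the destination index are on hosts in different es clusters.

When to use: to reindex from a remote elasticsearch server, put the connection details in the remote parameter of the request body.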

# $es_source_server must include the scheme and port, for example http://192.168.1.62:19202
curl -s -u "$es_dest_user:$es_dest_passwd" -X POST "$es_dest_server/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "'"$es_source_server"'",
      "username": "'"$es_source_user"'",
      "password": "'"$es_source_pass"'"
    },
    "index": "'"$es_source_index"'"
  },
  "dest": {
    "index": "'"$es_dest_index"'"
  }
}'
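
Splicing shell variables into a single-quoted JSON body is easy to get wrong once the command grows. As an alternative sketch, assuming the same five values as above, the body can be built by a small function and passed to curl with -d. build_remote_reindex_body is a hypothetical helper name, and it performs no JSON escaping, so it assumes the values contain no double quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: prints a _reindex request body for a remote source.
# Assumption: the values contain no double quotes or backslashes (no JSON escaping is done).
# $1 = source URL (scheme://host:port), $2 = source user, $3 = source password,
# $4 = source index, $5 = destination index
build_remote_reindex_body() {
  printf '{"source":{"remote":{"host":"%s","username":"%s","password":"%s"},"index":"%s"},"dest":{"index":"%s"}}\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Usage (not run here), with the variables from the command above:
# body=$(build_remote_reindex_body "$es_source_server" "$es_source_user" \
#   "$es_source_pass" "$es_source_index" "$es_dest_index")
# curl -s -u "$es_dest_user:$es_dest_passwd" -X POST "$es_dest_server/_reindex" \
#   -H 'Content-Type: application/json' -d "$body"
```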

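3.3 reindex with slicing

When to use: the source index holds a large number of documents. To speed up reindex, set the slices parameter so that the documents are processed in parallel. With slices=auto, ES picks a reasonable number of slices by itself; auto is the recommended value.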

curl -s -u elastic:elastic -X POST 'http://192.168.1.62:19203/_reindex?slices=auto' -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "test1"
  },
  "dest": {
    "index": "test2"
  }
}'

Or run it from the head plugin page:

POST _reindex?slices=auto&refresh
{
  "source": {
    "index": "test1"
  },
  "dest": {
    "index": "test2"
  }
}
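
A sliced reindex over a large index can run longer than an HTTP client is willing to wait. As a minimal sketch that goes slightly beyond the original steps, the request can be submitted with wait_for_completion=false; elasticsearch then answers with a task id that can be polled through the _tasks API. The helper names extract_task_id, start_reindex_async and reindex_task_status are hypothetical, and the host, credentials and index names are the placeholders used above.

```shell
#!/bin/sh
# Assumed placeholders, taken from the examples above.
ES_URL="http://192.168.1.62:19203"
ES_AUTH="elastic:elastic"

# Reads a response such as {"task":"nodeId:123"} on stdin and prints the task id.
extract_task_id() {
  sed -n 's/.*"task"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Submits the sliced reindex without waiting for it and prints its task id.
start_reindex_async() {
  curl -s -u "$ES_AUTH" -X POST \
    "$ES_URL/_reindex?slices=auto&wait_for_completion=false" \
    -H 'Content-Type: application/json' \
    -d '{"source":{"index":"test1"},"dest":{"index":"test2"}}' | extract_task_id
}

# Prints the current status document of a task. $1 = task id
reindex_task_status() {
  curl -s -u "$ES_AUTH" "$ES_URL/_tasks/$1"
}

# Usage (not run here):
# task_id=$(start_reindex_async)
# reindex_task_status "$task_id"
```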

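3.4 reindex selected fields only

When to use: if only some fields of the source index need to be copied to the destination index, list them in the _source parameter inside source in the request body.

Below is an original document in the source index. Its fields are id, rulesId, userId and createTime.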

{
  "_index": "test1",
  "_type": "integralOperatorRecord",
  "_id": "163128",
  "_version": 1,
  "_score": 1,
  "_source": {
    "id": "163128",
    "rulesId": 24,
    "userId": 43070,
    "createTime": 1561001338000
  }
}

Now only the id and createTime fields need to be migrated to the destination index:

curl -s -u elastic:eKVSEne3Re7yrWOOXVYg http://192.168.1.62:19203/_reindex -XPOST -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "test1",
    "_source":["id","createTime"]
  },
  "dest": {
    "index": "test2"
  }
}'

After the migration, the document in the destination index looks like this:

{
  "_index": "test2",
  "_type": "integralOperatorRecord",
  "_id": "163128",
  "_version": 1,
  "_score": 1,
  "_source": {
    "createTime": 1561001338000,
    "id": "163128"
  }
}

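3.4 reindex a subset of documents

When to use: select a set of documents with a query DSL statement and reindex only those. In this example max_docs caps the run at 50,000 documents. max_docs is optional: without it, every document matched by the query is reindexed.

Run on the head plugin page: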

POST _reindex
{
  "max_docs": 50000,
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "my-index-000001",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

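3.5 reindex at a limited rate

When to use: a reindex that runs too fast can put heavy write pressure on the ES cluster and, in severe cases, bring the cluster down. The requests_per_second request parameter throttles the processing rate. size sets the number of documents per read/write batch and is optional. The buffer is at most 200 MB and defaults to 100 MB.

Run on the head plugin page: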

POST _reindex?requests_per_second=500
{
  "source": {
    "index": "test1",
    "size": 600
  },
  "dest": {
    "index": "test2"
  }
}
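
The combined effect of requests_per_second and size can be estimated by hand. elasticsearch throttles by pausing between batches: the target time for a batch is size divided by requests_per_second, and the pause is that target minus the time the batch took to write. The sketch below only reproduces this arithmetic; throttle_wait_seconds is a hypothetical helper, the 0.5 s write time is an assumed figure, and clamping at zero is this sketch's simplification for batches that already took longer than the target.

```shell
#!/bin/sh
# Hypothetical helper: estimates the pause after one batch, in seconds.
# $1 = batch size (source.size), $2 = requests_per_second, $3 = seconds the batch write took
throttle_wait_seconds() {
  awk -v size="$1" -v rps="$2" -v took="$3" 'BEGIN {
    pause = size / rps - took     # target time per batch minus actual write time
    if (pause < 0) pause = 0      # a slow batch is followed by no pause
    printf "%.2f\n", pause
  }'
}

# With the request above (size 600, requests_per_second 500) and an assumed 0.5 s write:
throttle_wait_seconds 600 500 0.5
```

Under these assumptions each batch of 600 documents is spread over 1.2 s, so the request settles at roughly 500 documents per second.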

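3.6 reindex with a script

When to use: ES scripts are powerful and make many kinds of document modification easy, for example renaming the flag field of a document to tag, or adding a new field with a default value.

Run on the head plugin page: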

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  },
  "script": {
    "source": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}

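3.7 reindex several source indices into one destination index

When to use: several source indices are reindexed into the same destination index. Be aware that documents in different source indices may share the same id. In that case there is no guarantee which source index's document ends up in the destination index: later writes overwrite earlier ones, and only one document per id is kept.

Run on the head plugin page: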

POST _reindex?refresh
{
  "source": {
    "index": ["test1","test2","test3"]
  },
  "dest": {
    "index": "test4"
  }
}
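
If documents with colliding ids must all be kept, one possible workaround, which is not part of the original request, is to make the ids unique with a script that prefixes each id with the name of the index it came from. The sketch below only prints such a request body. build_multi_source_body is a hypothetical helper, and the id scheme (source index name, a hyphen, the original id) is an assumption for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints a _reindex body whose script rewrites each document id
# to "<source index>-<original id>" so ids from different source indices cannot collide.
build_multi_source_body() {
  cat <<'EOF'
{
  "source": {
    "index": ["test1", "test2", "test3"]
  },
  "dest": {
    "index": "test4"
  },
  "script": {
    "source": "ctx._id = ctx._index + '-' + ctx._id"
  }
}
EOF
}

# Usage (not run here); host and credentials are the placeholders used above.
# curl -s -u elastic:elastic -X POST 'http://192.168.1.62:19203/_reindex?refresh' \
#   -H 'Content-Type: application/json' -d "$(build_multi_source_body)"
```

The trade-off is that ids in test4 no longer match the ids in the source indices, so any client that looks documents up by id has to apply the same prefix.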

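IV. Common Problems

1. Running the curl command fails with {"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

Solution: add -H "Content-Type: application/json" to the curl command and retry.

2. Problem: a single index holds a lot of data and the synchronization is slow. What can be done?

1. Add resources: giving elasticsearch more resources (memory, CPU and so on) makes reindex faster.
2. Avoid a disk I/O bottleneck: disk I/O can become the bottleneck during reindex. Put the source index and the destination index on different disks, or use faster disks.
3. Avoid too many shards: too many shards on the source or destination index can slow reindex down, so reducing the shard count can help.
4. Disable index refresh: set the refresh interval of the destination index to -1 during reindex to avoid unnecessary refreshes.
5. reindex is implemented on top of scroll, so you can raise the scroll size or configure scroll slices to benefit from scroll parallelism. See the reindex API for details.
6. If a single index holds a lot of data, set the replica count of the destination index to 0 before the migration to speed up synchronization, and restore it once the migration is complete.

In short, reindex can be made faster by tuning the API parameters, adding resources, avoiding disk I/O bottlenecks, avoiding too many shards, and disabling index refresh.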

1. Before migrating with reindex

# Create the index
curl -u user:password -XPUT 'http://<host:port>/indexName'

# Before migrating, set the replica count to 0 and disable refresh to speed up the migration
curl -u user:password -XPUT 'http://<host:port>/indexName/_settings' -H 'Content-Type: application/json' -d' {
        "number_of_replicas" : 0,
        "refresh_interval" : "-1"
}'

2. While migrating with reindex

# By default _reindex processes batches of 1000 documents; set size inside source to change the batch size (5000 here)
# slices is left out because reindex from a remote cluster does not support slicing; use slices=auto only when the source index is in the same cluster
curl -s -u "elastic:$es_dest_passwd" -X POST "$es_dest_server/_reindex?refresh" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "'"$es_source_server"'",
      "username": "user",
      "password": "pass"
    },
    "index": "college",
    "size": 5000
  },
  "dest": {
    "index": "college"
  }
}'

3. After migrating with reindex

# After the migration completes, reset the replica count to 1 and the refresh interval to 1s (1s is the default)
curl -u user:password -XPUT 'http://<host:port>/indexName/_settings' -H 'Content-Type: application/json' -d' {
        "number_of_replicas" : 1,
        "refresh_interval" : "1s"
}'
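
Once the settings are restored, a simple sanity check is to compare document counts between the source and destination indices with the _count API. The following is a minimal sketch under stated assumptions: extract_count, index_doc_count and counts_match are hypothetical helper names, the variables are the ones used in step 2, and equal counts only hold if the source index was not written to during the migration.

```shell
#!/bin/sh
# Reads a _count response such as {"count":42,"_shards":{...}} on stdin and prints the number.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Fetches the document count of an index.
# $1 = base URL, $2 = user:password, $3 = index name
index_doc_count() {
  curl -s -u "$2" "$1/$3/_count" | extract_count
}

# Succeeds only when both counts are non-empty and equal.
# $1 = source count, $2 = destination count
counts_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (not run here):
# src=$(index_doc_count "$es_source_server" "user:pass" college)
# dst=$(index_doc_count "$es_dest_server" "elastic:$es_dest_passwd" college)
# counts_match "$src" "$dst" && echo "counts match: $dst" || echo "mismatch: $src vs $dst"
```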