Elasticsearch REST APIs: Index Management (Part 2): shrink, split, clone

aben_sky

Article link: https://blog.csdn.net/aben_sky/article/details/121514486

Based on ES 7.7. Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/indices.html#indices

Part 1: https://my.oschina.net/abensky/blog/5280911

0x06 Shrink index

Shrinks an index, reducing its number of primary shards.

POST|PUT /<index>/_shrink/<target_index>

Before a shrink can run, the following must hold:

The index must be read-only.
All primary shards for the index must reside on the same node.
The index must have a green health status.
The target index must not exist.

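To make shard allocation easier, it is recommended to remove the index's replica shards first and add them back afterwards.

The update index settings API call below removes the replicas, relocates the remaining shards to a single node, and makes the index read-only:

# Remove replicas, relocate the remaining shards to one node, and block writes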
PUT /<source_index_name>/_settings
{
  "settings":{
    "index.number_of_replicas":0,
    "index.routing.allocation.require._name":"shrink_node_name",
    "index.blocks.write":true
  }
}

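index.number_of_replicas = 0: removes the replica shards.
index.routing.allocation.require._name = shrink_node_name: relocates the index's shards to the node named shrink_node_name.
index.blocks.write = true: blocks write operations on the index. Metadata changes, such as deleting the index, are still allowed.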

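The target number of primary shards must be a factor of the source number. For example, an index with 8 primary shards can be shrunk to 4, 2, or 1; one with 15 can be shrunk to 5, 3, or 1; and if the source count is a prime number, the index can only be shrunk to 1.

If the target index has a single primary shard, the source index must not contain more documents than fit into one shard, which is 2,147,483,519.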
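
The two constraints above are simple arithmetic, so they can be checked before sending a shrink request. The following is a minimal sketch; the function names are made up for illustration and are not part of any Elasticsearch client.

```python
from typing import List

# Per-shard document limit quoted in the text above.
MAX_DOCS_PER_SHARD = 2_147_483_519


def valid_shrink_targets(source_shards: int) -> List[int]:
    """Return every primary-shard count an index can be shrunk to (hypothetical helper)."""
    if source_shards < 1:
        raise ValueError("source_shards must be >= 1")
    # A valid target is a divisor of the source count that is strictly smaller.
    return [n for n in range(source_shards - 1, 0, -1) if source_shards % n == 0]


def fits_in_one_shard(doc_count: int) -> bool:
    """True if an index with doc_count documents may be shrunk to a single shard."""
    return doc_count <= MAX_DOCS_PER_SHARD


print(valid_shrink_targets(8))   # [4, 2, 1]
print(valid_shrink_targets(15))  # [5, 3, 1]
print(valid_shrink_targets(7))   # [1]
```

An index that already has one primary shard yields an empty list, since there is nothing left to shrink.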

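How a shrink works

1. A new index is created with the same definition as the source index, but with fewer primary shards.
2. Segments are hard-linked from the source index into the new index. (If the file system does not support hard links, all segments are copied instead, which takes much longer. Hard links also do not work across disks, so shards spread over multiple data paths are copied as well.)
3. The new index is recovered as though it were a closed index that had just been reopened.

This feels similar to "optimize table xxx" in MySQL, which rebuilds the whole table to shrink its tablespace.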
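
The "hard link if possible, otherwise copy" behaviour in step 2 can be illustrated at the file-system level. This is only a sketch of the general technique using the Python standard library, not Elasticsearch's own code; link_or_copy is a hypothetical helper name.

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> str:
    """Hard-link src to dst, falling back to a full copy when linking fails.

    Returns "hardlink" or "copy" so the caller can see which path was taken.
    """
    try:
        # os.link raises OSError across devices or on file systems
        # that do not support hard links.
        os.link(src, dst)
        return "hardlink"
    except OSError:
        # The fallback copies every byte, which is the slow path.
        shutil.copy2(src, dst)
        return "copy"
```

A hard link only adds a second directory entry for the same data, which is why it is so much cheaper than copying the segment files.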

Running the shrink

# Shrink (primary and replica shard counts both default to 1) and clear the settings inherited from the source index
POST /<source_index_name>/_shrink/<target_index_name>
{
  "settings": {
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

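Both the primary and the replica shard count currently default to 1.
index.routing.allocation.require._name = null clears the allocation requirement copied from the source index.
index.blocks.write = null clears the write block copied from the source index.
Mappings are not specified in the request; they are copied from the source index.

Aliases can be set just as with the create index API.

The request returns as soon as the target index has been added to the cluster state; it does not wait for the shrink to actually start. The progress therefore has to be monitored, for example with GET /_cat/recovery/<target_index_name>

See the official documentation for more.

A complete shrink example: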
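
Because the request returns early, a small polling loop is one way to wait for the target index. The sketch below assumes a cluster reachable at http://localhost:9200 without authentication (an assumption, adjust as needed) and relies on the format=json option of the cat APIs and the stage column of _cat/recovery. The helper names are hypothetical.

```python
import json
import time
import urllib.request


def recovery_done(rows):
    """True when every shard row from _cat/recovery reports stage == "done"."""
    return bool(rows) and all(row.get("stage") == "done" for row in rows)


def fetch_recovery(index, base_url="http://localhost:9200"):
    """Fetch GET /_cat/recovery/<index>?format=json (base_url is an assumption)."""
    url = f"{base_url}/_cat/recovery/{index}?format=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_recovery(index, fetch=fetch_recovery, interval=2.0, attempts=30):
    """Poll until all shards are done; return False if the attempts run out."""
    for _ in range(attempts):
        if recovery_done(fetch(index)):
            return True
        time.sleep(interval)
    return False
```

Calling wait_for_recovery("test4_new") after the shrink request would block until every shard of the target index has finished recovering, or until the attempts are used up.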

# Shrink an index
DELETE /test4,test4_new
## Create the source index
PUT /test4
{
  "settings": {
    "index.number_of_shards": 8,
    "index.number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}
GET /test4
## Prepare
PUT /test4/_settings
{
  "settings": {
    "index.number_of_replicas": 0,
    "index.routing.allocation.require._name": "shrink_node_name",
    "index.blocks.write": true
  }
}
## Run the shrink
# > Primary and replica shard counts both default to 1
# > Aliases can be specified just as with the create index API
POST /test4/_shrink/test4_new
{
  "settings": {
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null,
    "index.number_of_replicas": 1,
    "index.number_of_shards": 1,
    "index.codec": "best_compression"
  },
  "aliases": {
    "test4_new_alias": {}
  }
}

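So how do you rename the new index? The _reindex API can copy an index, but with large data volumes it is very expensive in time and resources, so using an alias is recommended instead. In addition, in my own test on Windows 10 the _reindex operation failed to copy the index.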

POST /_reindex
{
  "source": {"index": "test4_new"},
  "dest": {"index": "test4_new2"}
}
# After this copy, test4_new2 could not be found. Using _reindex for this is not recommended anyway.
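
As an alternative to copying, the alias approach can be sketched as a request body for POST /_aliases, which applies all listed actions in one atomic step. The alias name test4_alias is hypothetical, and the sketch assumes clients already read through that alias and that it currently points at the old index.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a POST /_aliases body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# test4_alias is a made-up alias name used only for illustration.
print(json.dumps(alias_swap_actions("test4_alias", "test4", "test4_new"), indent=2))
```

Note that an alias cannot share its name with an existing index, so this only gives the new index the old name once the old index has been deleted.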

0x07 Split index

Official documentation

Splits an index into a new index with more primary shards.

POST|PUT /<source_index_name>/_split/<target_index_name>
{
  "settings":{
    "index.number_of_shards":2
  }
}

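Requirements:

The source index must be read-only.
The cluster health status must be green.
The target index must not exist.

Make the source index non-writable:

# Block writes on the index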
PUT /<source_index_name>/_settings
{
  "settings":{
    "index.blocks.write": true
  }
}

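This blocks write operations on the index. Metadata changes, such as deleting the index, are still allowed.

Description

Splitting an index means splitting each original primary shard into two or more primary shards.

How many times an index can be split (and how many shards each original shard can be split into) is determined by the index.number_of_routing_shards setting. The number of routing shards specifies the hashing space that is used internally to distribute documents across shards with consistent hashing. For example, a 5-shard index with number_of_routing_shards set to 30 (5 x 2 x 3) can be split by a factor of 2 or 3. In other words, it can be split as follows:

5 → 10 → 30 (split by 2, then by 3)
5 → 15 → 30 (split by 3, then by 2)
5 → 30 (split by 6)

index.number_of_routing_shards is a static setting. It can only be set at index creation time or on a closed index.

The default value of index.number_of_routing_shards depends on the number of primary shards in the original index. The default is designed to allow splitting by factors of 2 up to a maximum of 1024 shards. The original number of primary shards must be taken into account, however. For example, an index with 5 primary shards can be split into 10, 20, 40, 80, 160, 320, or at most 640 shards (with a single split or with several).

If the original index contains one primary shard (or a multi-shard index has been shrunk down to a single primary shard), the index can be split into any number of shards greater than 1. The properties of the default number of routing shards then apply to the newly split index.

For the single-shard index test4_new, index.number_of_routing_shards is reported as 1. It can be found with GET /test4_new?include_defaults=true (under the defaults node).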
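
The split paths listed above follow from two divisibility rules: the target count must be a multiple of the source count, and the number of routing shards must be a multiple of the target count. The sketch below encodes just those two rules; the function name is hypothetical and it does not model any other check Elasticsearch performs.

```python
def valid_split_targets(source_shards, routing_shards):
    """List the shard counts reachable from source_shards in ONE split."""
    return [
        target
        for target in range(source_shards + 1, routing_shards + 1)
        # target must be a multiple of the source count, and the routing
        # shard count must in turn be a multiple of the target.
        if target % source_shards == 0 and routing_shards % target == 0
    ]


print(valid_split_targets(5, 30))   # [10, 15, 30]
print(valid_split_targets(10, 30))  # [30]
print(valid_split_targets(15, 30))  # [30]
```

Chaining the results reproduces the three paths in the text: 5 to 10 to 30, 5 to 15 to 30, and 5 directly to 30.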
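
The 5-shard example can be reproduced with a simple doubling loop. This is a simplified model of the rule as described in the text (repeated doubling capped at 1024), not Elasticsearch's actual implementation, and it ignores edge cases such as very large source shard counts.

```python
def default_split_targets(source_shards, max_shards=1024):
    """Shard counts reachable by repeated doubling without exceeding max_shards."""
    targets = []
    n = source_shards * 2
    while n <= max_shards:
        targets.append(n)
        n *= 2
    return targets


print(default_split_targets(5))  # [10, 20, 40, 80, 160, 320, 640]
```

The last element of the list, 640 for a 5-shard index, is the largest shard count this model allows under the default.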

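How a split works

The difference from shrink is that the target index has more primary shards, whereas shrink produces fewer.

1. A new index is created with the same definition as the source index, but with more primary shards.
2. Segments are hard-linked from the source index into the new index. (If the file system does not support hard links, all segments are copied instead, which takes much longer. Hard links also do not work across disks, so shards spread over multiple data paths are copied as well.)
3. The new index is recovered as though it were a closed index that had just been reopened.

Why doesn't ES support incremental resharding, and why must the shard count grow by a whole-number factor?

Many key-value stores support growing from N shards to N+1 shards (also known as incremental resharding). In ES, adding a new shard and pushing only new data to it is not an option: it would likely become an indexing bottleneck, and working out which shard a document belongs to from its _id, which get, delete, and update requests all require, would become quite complex. This means existing data has to be rebalanced with a different hashing scheme.

The most common way for key-value stores to do this efficiently is consistent hashing, which only requires 1/N of the keys to be relocated when the number of shards grows from N to N+1. However, the unit of storage in ES, the shard, is a Lucene index. Because of its search-oriented data structures, taking a significant portion of a Lucene index, even if it is only 5% of the documents, deleting those documents, and indexing them on another shard usually costs far more than it would in a key-value store. That cost stays reasonable when the number of shards grows by a multiplicative factor as described above: ES can then perform the split locally, which allows it to split at the index level instead of reindexing the documents that need to move, and to use hard links for efficient file copying.

With append-only data, more flexibility is available by creating a new index and pushing new data to it, while adding an alias that covers both the old and the new index for reads. Assuming the old and new indices have M and N shards respectively, this adds no overhead compared to searching a single index with M+N shards.

A complete split example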
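
Why a multiplicative split can be done locally is easiest to see with a toy routing model. The sketch below uses integer keys and plain key % shards routing, which is an assumption made for illustration; real routing in ES hashes the routing value and involves number_of_routing_shards. The function name is hypothetical.

```python
def moved_fraction(old_shards, new_shards, keys=range(30_000)):
    """Fraction of keys whose shard changes under plain `key % shards` routing."""
    moved = sum(1 for k in keys if k % old_shards != k % new_shards)
    return moved / len(keys)


# Growing 5 -> 6 with naive modulo routing scatters most keys
# across unrelated shards (roughly 0.83 of them in this toy model).
print(moved_fraction(5, 6))

# Growing 5 -> 10 keeps things local: every key that lands in target
# shard t came from source shard t % 5, so each source shard can be
# split on its own without looking at the others.
assert all((k % 10) % 5 == k % 5 for k in range(30_000))
```

In this model each target shard draws its documents from exactly one source shard, which is the property that lets the split reuse the source shard's files through hard links.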

# Split example
## Make the index read-only
PUT /test4_new/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}
## Split the single primary shard into 2, and clear the write block on the new index
POST /test4_new/_split/test4_new_splited
{
  "settings": {
    "index.blocks.write": null,
    "index.number_of_shards": 2
  },
  "aliases": {
    "test4_new_splited_alias": {}
  }
}

To monitor the progress of the split, use GET /_cat/recovery/<target_index_name>

0x08 Clone index

Clones an index.

POST|PUT /<source_index_name>/_clone/<target_index_name>

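Requirements:

The index must be read-only.
The cluster health status must be green.

The clone operation copies most settings of the source index. It does not copy index templates, index metadata (including aliases, ILM phase definitions, and CCR follower information), or index.number_of_replicas and index.auto_expand_replicas; these have to be specified explicitly in the clone request.

The clone process works very much like shrink and split.

A complete clone example

# Clone example
## Make the index read-only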
PUT /test4_new/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}
## Clone, and clear the write block on the new index
## Note: number_of_shards must match the source index
POST /test4_new/_clone/test4_new_cloned
{
  "settings": {
    "index.blocks.write": null,
    "index.number_of_shards": 1,
    "index.number_of_replicas": 2,
    "index.auto_expand_replicas": false
  },
  "aliases": {
    "test4_new_cloned_alias": {}
  }
}
## Check progress
GET /_cat/recovery/test4_new_cloned
## Inspect the index
GET /test4_new_cloned?include_defaults=true
## Make the source index writable again
PUT /test4_new/_settings
{
  "settings": {
    "index.blocks.write": null
  }
}

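Mappings do not need to be set; they are copied directly from the source index.

last updated at: 2021/10/21 14:18. Changed the version number in the links from 7.x to the specific version.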
