Elasticsearch: custom hash function (routing)

湛蓝鹭白

Category: elasticsearch. Tags: elasticsearch, routing, hash

Article link: https://blog.csdn.net/zhanlanlubai/article/details/51546698

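This investigation is based on Elasticsearch version 2.1.1.

Why customize the hash function?

The original goal was to speed up Elasticsearch indexing.

The idea was:

1. When generating the data, partition it in advance according to a predefined routing logic.

2. Then, when running the large bulk imports, each bulk request contains data for a single shard only, so it can be written straight to that shard instead of being redistributed across different shards.

3. This removes a large amount of network communication overhead.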
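
The plan above can be sketched roughly as follows. This is only an illustration of the pre-partitioning idea: `placeholder_hash`, `group_by_shard`, and `bulks_per_shard` are hypothetical names, and the CRC32 hash is a stand-in that does not match the hash Elasticsearch uses, so the shard numbers it produces would not line up with a real index. The sketch also assumes the routing value is the document `_id`.

```python
import zlib
from collections import defaultdict


def placeholder_hash(routing: str) -> int:
    # Stand-in hash for illustration only; NOT the function Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8"))


def group_by_shard(docs, num_shards, hash_fn=placeholder_hash):
    """Group documents by target shard, assuming shard = hash(routing) mod num_shards."""
    groups = defaultdict(list)
    for doc in docs:
        shard = hash_fn(doc["_id"]) % num_shards
        groups[shard].append(doc)
    return dict(groups)


def bulks_per_shard(groups, bulk_size):
    """Yield (shard, batch) pairs so that every batch targets exactly one shard."""
    for shard, docs in sorted(groups.items()):
        for start in range(0, len(docs), bulk_size):
            yield shard, docs[start:start + bulk_size]


if __name__ == "__main__":
    docs = [{"_id": str(i), "value": i} for i in range(20)]
    groups = group_by_shard(docs, num_shards=5)
    for shard, batch in bulks_per_shard(groups, bulk_size=3):
        print(shard, [d["_id"] for d in batch])
```

For this to help in practice, the client-side hash would have to produce exactly the same shard number as the cluster does, which is why the choice of hash function matters.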

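Unfortunately, after looking into it, I found that Elasticsearch no longer recommends configuring a custom hash function:

The original text is here: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/breaking_20_crud_and_routing_changes.html#_routing_hash_function

The key part is as follows: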

In addition, the following routing-related node settings have been deprecated:

cluster.routing.operation.hash.type

This was an undocumented setting that allowed to configure which hash function to use for routing. murmur3 is now enforced on new indices.

cluster.routing.operation.use_type

This was an undocumented setting that allowed to take the _type of the document into account when computing its shard (default: false). false is now enforced on new indices.

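Although the conclusion is that the hash function cannot be customized, a few findings from the investigation are still worth listing:

1. The default hash function in Elasticsearch is Murmur3HashFunction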

The default hash function that is used for routing has been changed from djb2 to murmur3. This change should be transparent unless you relied on very specific properties of djb2. This will help ensure a better balance of the document counts between shards.
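
To make the murmur3 routing concrete, here is a minimal sketch of how a client could try to reproduce the shard computation. It rests on assumptions that should be checked against the 2.x source before relying on it: that the routing string is hashed as UTF-16 little-endian bytes with MurmurHash3 x86 32-bit and seed 0, and that the shard is the non-negative remainder of that hash divided by the number of primary shards. The function names are hypothetical.

```python
def _rotl32(x: int, r: int) -> int:
    # Rotate a 32-bit value left by r bits.
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as a signed 32-bit int (like a Java int)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    nblocks = n // 4
    # Body: process the input four bytes at a time, little-endian.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: the remaining one to three bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h - 0x100000000 if h >= 0x80000000 else h


def routing_hash(routing: str) -> int:
    # Assumption: each Java char contributes its low byte, then its high byte.
    return murmur3_x86_32(routing.encode("utf-16-le"), seed=0)


def shard_for(routing: str, num_primary_shards: int) -> int:
    # Python's % already yields a non-negative result for a positive modulus.
    return routing_hash(routing) % num_primary_shards
```

A quick sanity check would be to index a handful of documents into a test index and compare the shard each one lands on with the value `shard_for` returns for its `_id`.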

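2. The Elasticsearch source code also contains two other hash functions:

a. simple hash function: the simplest string hash, which uses Java's default implementation

b. djb2 hash function: presumably the hash function that ES used all along before version 2.0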
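
For reference, both older functions are small enough to sketch. The version below is an approximation written from the well-known definitions (Java's `String.hashCode` and the classic djb2 recurrence `hash * 33 + c` starting from 5381), emulating 32-bit Java `int` overflow; it has not been checked line by line against the Elasticsearch classes, and the function names are hypothetical.

```python
def _to_int32(x: int) -> int:
    # Interpret the low 32 bits as a signed Java int.
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def _utf16_units(s: str):
    # Yield UTF-16 code units, which is what Java's charAt() returns.
    data = s.encode("utf-16-le")
    for i in range(0, len(data), 2):
        yield data[i] | (data[i + 1] << 8)


def java_string_hash(s: str) -> int:
    """Equivalent of Java's String.hashCode(): h = 31 * h + c for each char."""
    h = 0
    for unit in _utf16_units(s):
        h = (31 * h + unit) & 0xFFFFFFFF
    return _to_int32(h)


def djb2_hash(s: str) -> int:
    """Classic djb2: start at 5381, then h = h * 33 + c for each char."""
    h = 5381
    for unit in _utf16_units(s):
        h = ((h << 5) + h + unit) & 0xFFFFFFFF
    return _to_int32(h)


if __name__ == "__main__":
    print(java_string_hash("abc"))  # 96354
    print(djb2_hash("a"))           # 177670
```

Both functions are simple multiplicative hashes over the characters of the routing string.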

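3. Why does ES discourage changing the default hash function?

I can only guess. Taking my own goal as an example: if I really did customize the hash function, so that each bulk request kept writing to a single shard, the load on that shard would inevitably become very high. It would then produce a large number of segments, which in turn would cause a merge bottleneck, and the end result might well be slower than simply distributing the data across all shards.

Of course, this still needs to be tested and verified in practice.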
