Person-name search: how to improve the results

衣舞晨风

Article link: https://blog.csdn.net/jiankunking/article/details/130692466

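This article explores two ways to implement user-name search in Elasticsearch: IKAnalyzer and Ngram. IKAnalyzer requires a custom dictionary, while Ngram handles partial input but can also match data that is merely similar. Because the data volume is small, the Ngram approach was chosen in the end, with the query statement tuned for better results.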

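How do you build a user-name search on Elasticsearch that actually works well?

Some background first: we need to provide a search endpoint that takes the user's input and searches across name, name in pinyin, employee ID, nickname, and so on.

Versions we iterated through:

Standard tokenization
- Poor support for Chinese
wildcard
- For security reasons the total number of returned hits is capped at 10, so an exact match was not always among the first 10
- Another pitfall: Huawei Cloud's Elasticsearch is based on OpenSearch and does not support the wildcard type

As for names: most users' names are written in Chinese characters, so the first thing that came to mind was IK.

IK approach

Building the Dockerfile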

FROM registry.jiankunking.com/library/elasticsearch:7.13.4

# RUN ./bin/elasticsearch-plugin install --batch https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.13.4/elasticsearch-analysis-ik-7.13.4.zip
ADD ik/elasticsearch-analysis-ik-7.13.4.zip /usr/share/elasticsearch
RUN ./bin/elasticsearch-plugin install --batch file:///usr/share/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip

ADD pinyin/elasticsearch-analysis-pinyin-7.13.4.zip /usr/share/elasticsearch
RUN ./bin/elasticsearch-plugin install --batch file:///usr/share/elasticsearch/elasticsearch-analysis-pinyin-7.13.4.zip

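# Remove the plugin installation packages (both the IK and the pinyin archive)
RUN rm -f /usr/share/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip \
    /usr/share/elasticsearch/elasticsearch-analysis-pinyin-7.13.4.zip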

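The IK plugin archive is checked into the repository, so the online installation method is not used.
The name dictionary is placed at elasticsearch-analysis-ik-7.13.4/config/custom_user_name.dic

IKAnalyzer.cfg.xml has been modified accordingly: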

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionary here -->
	<entry key="ext_dict">custom_user_name.dic</entry>
	<!-- Configure your own extension stopword dictionary here -->
	<!-- <entry key="ext_stopwords"></entry> -->
	<!-- Configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Verification

Template

POST /_template/jiankunking-attr
{
	"order": 0,
	"index_patterns": [
		"jiankunking-attrs",
		"jiankunking-attrs-dev"
	],
	"settings": {
		"index": {
			"number_of_shards": "6",
			"number_of_replicas": "1",
			"refresh_interval": "200ms"
		}
	},
	"mappings": {
		"dynamic_templates": [
			{
				"strings": {
					"mapping": {
						"type": "keyword"
					},
					"match_mapping_type": "string"
				}
			}
		],
		"properties": {
			"id": {
				"type": "keyword"
			},
			"attrs.user_name_ik.attrValue": {
				"type": "text",
				"analyzer": "ik_max_word",
				"search_analyzer": "ik_smart"
			},
			"creator": {
				"type": "keyword"
			},
			"updater": {
				"type": "keyword"
			},
			"createdAt": {
				"format": "epoch_second",
				"type": "date"
			},
			"updatedAt": {
				"format": "epoch_second",
				"type": "date"
			}
		}
	}
}

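The query verification steps are omitted here.

Conclusions

1. A custom dictionary is required
Without a custom dictionary, the IK plugin cannot tokenize names correctly;
with a dictionary, you have to enumerate every possible search term.
For example, if the dictionary only contains full names, searching by the latter part of a name fails: for 孙新伟, searching 新伟 finds nothing, so the results do not match expectations.
2. Tokenization of English names is limited
For example, searching dam will not retrieve Adam Dean
3. The IK plugin has to be installed, which requires restarting the cluster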
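
The first two limitations share one cause: a term only matches when it equals an indexed token. The sketch below is a toy illustration in Python, not IK's actual segmentation algorithm. `word_tokens` and `term_hit` are hypothetical helpers, and the token set for 孙新伟 is an assumption (the custom dictionary emitting the full name as a single token), chosen to be consistent with the behaviour described above.

```python
def word_tokens(text):
    """Toy word-level analysis: lowercase, then split on whitespace."""
    return set(text.lower().split())


def term_hit(indexed_tokens, query_tokens):
    """A document is a hit only if some query token equals an indexed token."""
    return not set(indexed_tokens).isdisjoint(query_tokens)


# English name: indexed as whole words, so a substring never equals a token.
adam = word_tokens("Adam Dean")          # {"adam", "dean"}
print(term_hit(adam, word_tokens("dam")))   # False
print(term_hit(adam, word_tokens("adam")))  # True

# Chinese name. ASSUMPTION: the dictionary yields the full name as one token.
sun = {"孙新伟"}
print(term_hit(sun, {"新伟"}))    # False: "新伟" is not itself a token
print(term_hit(sun, {"孙新伟"}))  # True
```

Making 新伟 searchable under this model means adding it to the dictionary as well, which is the enumeration cost described in point 1.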

Ngram approach

Verification

Template

POST /_template/jiankunking-attr-ngram
{
	"order": 0,
	"index_patterns": [
		"jiankunking-attrs-v3-*"
	],
	"settings": {
		"index": {
			"max_ngram_diff": "9",
			"refresh_interval": "200ms",
			"analysis": {
				"analyzer": {
					"ngram_analyzer": {
						"tokenizer": "ngram"
					}
				},
				"tokenizer": {
					"ngram": {
						"token_chars": [
							"letter",
							"digit"
						],
						"min_gram": "1",
						"type": "ngram",
						"max_gram": "10"
					}
				}
			},
			"number_of_shards": "10",
			"number_of_replicas": "1"
		}
	},
	"mappings": {
		"dynamic_templates": [
			{
				"strings": {
					"mapping": {
						"type": "keyword"
					},
					"match_mapping_type": "string"
				}
			}
		],
		"properties": {
			"attrs.pinyin.attrValue": {
				"search_analyzer": "ngram_analyzer",
				"analyzer": "ngram_analyzer",
				"type": "text",
				"fields": {
					"keyword": {
						"type": "keyword"
					}
				}
			},
			"createdAt": {
				"format": "epoch_second",
				"type": "date"
			},
			"creator": {
				"type": "keyword"
			},
			"attrs.nickname.attrValue": {
				"search_analyzer": "ngram_analyzer",
				"analyzer": "ngram_analyzer",
				"type": "text",
				"fields": {
					"keyword": {
						"type": "keyword"
					}
				}
			},
			"attrs.username.attrValue": {
				"search_analyzer": "ngram_analyzer",
				"analyzer": "ngram_analyzer",
				"type": "text",
				"fields": {
					"keyword": {
						"type": "keyword"
					}
				}
			},
			"attrs.user_id.attrValue": {
				"type": "keyword",
				"fields": {
					"text": {
						"search_analyzer": "ngram_analyzer",
						"analyzer": "ngram_analyzer",
						"type": "text"
					}
				}
			},
			"id": {
				"type": "keyword"
			},
			"attrs": {
				"type": "object"
			},
			"updater": {
				"type": "keyword"
			},
			"updatedAt": {
				"format": "epoch_second",
				"type": "date"
			}
		}
	},
	"aliases": {}
}

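The query verification steps are omitted here.

Conclusions

The returned results include data that is only similar to the search term.
For example, searching "土豆儿" also matches "王豆豆", "田豆豆", and so on.
Ngram tokenization consumes a lot of resources (disk in particular), and a reindex may time out.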
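
To see why similar names show up, it helps to list the grams each value produces. The sketch below approximates in Python what an n-gram tokenizer with min_gram 1, max_gram 10 and token_chars letter/digit emits; `ngram_tokens` and `shared_grams` are hypothetical helpers written for illustration, not Elasticsearch code. Because the same analyzer is applied at search time and a match query by default only needs one term to match, any shared gram is enough for a hit.

```python
def ngram_tokens(text, min_gram=1, max_gram=10):
    """Approximate an n-gram tokenizer restricted to letters and digits."""
    tokens = []
    run = []
    # A trailing space acts as a sentinel so the last run is flushed too.
    for ch in text + " ":
        if ch.isalpha() or ch.isdigit():
            run.append(ch)
            continue
        # Non-token character: emit every gram of the run collected so far.
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(run):
                    break
                tokens.append("".join(run[start:start + size]))
        run = []
    return tokens


def shared_grams(query, value):
    """Grams produced by both the search input and an indexed value."""
    return set(ngram_tokens(query)) & set(ngram_tokens(value))


print(ngram_tokens("土豆儿"))            # ['土', '土豆', '土豆儿', '豆', '豆儿', '儿']
print(shared_grams("土豆儿", "王豆豆"))  # {'豆'}
print(shared_grams("土豆儿", "田豆豆"))  # {'豆'}
```

The single shared gram 豆 is what pulls 王豆豆 and 田豆豆 into the result set; values that share longer grams with the input share more terms and generally score higher.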

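Final conclusion

Since there are fewer than 1 million user records, somewhat higher resource consumption is still acceptable, so the Ngram approach was chosen in the end.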
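
As a rough feel for that resource cost, the number of grams per value can be counted directly. This is a back-of-the-envelope sketch in Python, assuming each value is a single run of letters or digits and using the min_gram 1 / max_gram 10 settings from the template; `gram_count` is a hypothetical helper, and the figure counts emitted tokens, not bytes on disk.

```python
def gram_count(length, min_gram=1, max_gram=10):
    """Number of n-grams emitted for one run of `length` characters."""
    total = 0
    for size in range(min_gram, min(max_gram, length) + 1):
        # A run of `length` characters contains length - size + 1 grams of this size.
        total += length - size + 1
    return total


for value in ("孙新伟", "jiankunking"):
    print(value, len(value), gram_count(len(value)))
# 孙新伟 3 6
# jiankunking 11 65
```

A 3-character name becomes 6 tokens and an 11-character pinyin string becomes 65, which is why index size grows much faster than the raw data.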

The query that gave the best results is:

{
	"size": 10,
	"_source": [
		"attrs.username.attrValue",
		"attrs.pinyin.attrValue",
		"attrs.user_id.attrValue",
		"attrs.nickname.attrValue"
	],
	"query": {
		"multi_match": {
			"query": "jiankunking",
			"type": "best_fields",
			"fields": [
				"attrs.user_id.attrValue.text",
				"attrs.username.attrValue",
				"attrs.nickname.attrValue",
				"attrs.pinyin.attrValue"
			]
		}
	}
}
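
For callers that assemble this request in code, the body can be built from the user's input. The Python sketch below reproduces the query above when called with defaults. The `exact_boost` option is an assumption that the article did not test: it wraps the multi_match in a bool query and adds term clauses on the keyword fields defined in the template, as one possible way to push exact matches toward the top of the 10 returned hits. `build_user_search` and the constant names are hypothetical.

```python
import json

SEARCH_FIELDS = [
    "attrs.user_id.attrValue.text",
    "attrs.username.attrValue",
    "attrs.nickname.attrValue",
    "attrs.pinyin.attrValue",
]
SOURCE_FIELDS = [
    "attrs.username.attrValue",
    "attrs.pinyin.attrValue",
    "attrs.user_id.attrValue",
    "attrs.nickname.attrValue",
]
# Keyword (un-analyzed) counterparts taken from the Ngram template mappings.
EXACT_FIELDS = [
    "attrs.user_id.attrValue",
    "attrs.username.attrValue.keyword",
    "attrs.nickname.attrValue.keyword",
    "attrs.pinyin.attrValue.keyword",
]


def build_user_search(keyword, size=10, exact_boost=None):
    """Build the search body; exact_boost=None yields the article's query."""
    multi_match = {
        "multi_match": {
            "query": keyword,
            "type": "best_fields",
            "fields": list(SEARCH_FIELDS),
        }
    }
    if exact_boost is None:
        query = multi_match
    else:
        # ASSUMPTION (untested variant): reward values equal to the input.
        query = {
            "bool": {
                "must": [multi_match],
                "should": [
                    {"term": {field: {"value": keyword, "boost": exact_boost}}}
                    for field in EXACT_FIELDS
                ],
            }
        }
    return {"size": size, "_source": list(SOURCE_FIELDS), "query": query}


print(json.dumps(build_user_search("jiankunking"), ensure_ascii=False, indent=2))
```

Calling `build_user_search("jiankunking", exact_boost=2.0)` produces the boosted variant; whether it actually improves ranking would need to be checked against real data.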