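Installing the pinyin analyzer
Download the plugin, unzip it, and place it in the Elasticsearch plugins directory.
Restart Elasticsearch.
Custom analyzers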
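Pinyin analyzer: optional settings
1. First-letter settings
keep_first_letter (default: true)
Whether to emit the combined first letters of all characters, which enables searching by abbreviation.
Enabled: 刘德华 → [ldh]
Disabled: 刘德华 → [] (no first-letter term is generated)
Use case: finding "刘德华" by typing "ldh".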
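keep_separate_first_letter (default: false)
Whether to emit the first letter of each character as a separate term.
Enabled: 刘德华 → [l,d,h]
Disabled: 刘德华 → [ldh]
Note: enabling it increases index size but allows more flexible queries such as "l d h".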
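limit_first_letter_length (default: 16)
Maximum length of the combined first-letter term.
Example: 中华人民共和国 → [zhrmghg] by default (7 characters); with the limit set to 3 → [zhr]
Purpose: keeps the first-letter term short for long text.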
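The three first-letter options above can be pictured with a small Java sketch. It is a toy model, not the plugin's code: FirstLetterSketch and firstLetterTerms are hypothetical names, and the input is assumed to be already converted to pinyin syllables (for example liu, de, hua for 刘德华), so no real pinyin lookup happens here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the first-letter options; not the plugin's implementation.
class FirstLetterSketch {
    // syllables: the pinyin of each character, e.g. ["liu", "de", "hua"]
    static List<String> firstLetterTerms(List<String> syllables,
                                         boolean keepFirstLetter,
                                         boolean keepSeparateFirstLetter,
                                         int limitFirstLetterLength) {
        List<String> terms = new ArrayList<>();
        StringBuilder joined = new StringBuilder();
        for (String syllable : syllables) {
            String initial = syllable.substring(0, 1);
            if (keepSeparateFirstLetter) {
                terms.add(initial);              // one term per character
            }
            joined.append(initial);
        }
        if (keepFirstLetter && joined.length() > 0) {
            int end = Math.min(joined.length(), limitFirstLetterLength);
            terms.add(joined.substring(0, end)); // combined abbreviation, cut at the limit
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> liuDeHua = List.of("liu", "de", "hua");
        System.out.println(firstLetterTerms(liuDeHua, true, false, 16));  // [ldh]
        System.out.println(firstLetterTerms(liuDeHua, false, true, 16));  // [l, d, h]
        System.out.println(firstLetterTerms(
            List.of("zhong", "hua", "ren", "min", "gong", "he", "guo"), true, false, 3)); // [zhr]
    }
}
```

With both flags enabled, this sketch emits the separate letters followed by the combined abbreviation; the real plugin may order or position its tokens differently.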
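2. Full-pinyin settings
keep_full_pinyin (default: true)
Whether to emit the full pinyin of each character as its own term.
Enabled: 刘德华 → [liu,de,hua]
Disabled: 刘德华 → []
Why it matters: this is the basis for matching by exact pinyin syllables.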
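keep_joined_full_pinyin (default: false)
Whether to emit the full pinyin of all characters joined into a single term.
Enabled: 刘德华 → [liudehua]
Disabled: 刘德华 → [liu,de,hua] (assuming keep_full_pinyin is still enabled)
Trade-off: using only the joined form (with keep_full_pinyin disabled) reduces the number of indexed terms, but single-character pinyin searches no longer match.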
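How the two full-pinyin options combine can be sketched as follows. This is only an illustration under the same assumption that the text has already been converted to syllables; FullPinyinSketch and fullPinyinTerms are hypothetical names, and the token order of the real plugin is not claimed here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of keep_full_pinyin and keep_joined_full_pinyin; not the plugin's implementation.
class FullPinyinSketch {
    static List<String> fullPinyinTerms(List<String> syllables,
                                        boolean keepFullPinyin,
                                        boolean keepJoinedFullPinyin) {
        List<String> terms = new ArrayList<>();
        if (keepFullPinyin) {
            terms.addAll(syllables);                // one term per character
        }
        if (keepJoinedFullPinyin && !syllables.isEmpty()) {
            terms.add(String.join("", syllables));  // a single joined term
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> liuDeHua = List.of("liu", "de", "hua");
        System.out.println(fullPinyinTerms(liuDeHua, true, false));  // [liu, de, hua]
        System.out.println(fullPinyinTerms(liuDeHua, false, true));  // [liudehua]
        System.out.println(fullPinyinTerms(liuDeHua, true, true));   // [liu, de, hua, liudehua]
    }
}
```

The second call shows the trade-off: with only the joined term in the index, a query for the single syllable de finds nothing.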
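3. Non-Chinese text settings
keep_none_chinese (default: true)
Whether to keep non-Chinese letters and digits from the original text.
Enabled: 刘德华AT2016 → [liu,de,hua,AT2016]
Disabled: 刘德华AT2016 → [liu,de,hua]
Why it matters: a key parameter for text that mixes Chinese with letters or digits.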
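keep_none_chinese_together (default: true)
Whether to keep consecutive non-Chinese characters together as one term.
Enabled: DJ音乐家 → [DJ,yin,yue,jia]
Disabled: DJ音乐家 → [D,J,yin,yue,jia]
Impact: disabling it noticeably increases the number of indexed terms.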
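4. Advanced settings
none_chinese_pinyin_tokenize (default: true)
Whether to split non-Chinese letters into pinyin syllables when they spell valid pinyin. keep_none_chinese and keep_none_chinese_together must be enabled for this to take effect.
Enabled: liudehua2016 → [liu,de,hua,2016]
Disabled: liudehua2016 → [liudehua2016]
Use case: input that mixes typed pinyin with digits.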
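remove_duplicated_term (default: false)
Whether to remove duplicated terms.
Enabled: de的 → [de]
Disabled: de的 → [de,de] (的 is also pronounced de, so the same term is produced twice)
Trade-off: can save roughly 30-50% of index space depending on the data, but position-dependent queries and highlighting may become less accurate.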
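What remove_duplicated_term does can be approximated as order-preserving deduplication. The sketch below assumes the terms have already been produced as a list of strings; DedupSketch and removeDuplicatedTerms are hypothetical names, and the plugin's real filter also has to deal with token positions, which this ignores.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Toy model of remove_duplicated_term; not the plugin's implementation.
class DedupSketch {
    // LinkedHashSet drops repeats while keeping the first occurrence in its original place.
    static List<String> removeDuplicatedTerms(List<String> terms) {
        return new ArrayList<>(new LinkedHashSet<>(terms));
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicatedTerms(List.of("de", "de")));              // [de]
        System.out.println(removeDuplicatedTerms(List.of("shi", "zi", "shi", "s"))); // [shi, zi, s]
    }
}
```

Because the second occurrence disappears entirely, anything that relies on where a term occurred (phrase matching, highlighting) has less information to work with, which is the accuracy cost mentioned above.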
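keep_original (default: false)
Whether to keep the original text as a term.
Enabled: "北京" → ["北京", "beijing", "bj"]
Disabled: "北京" → ["beijing", "bj"]
(The example assumes keep_joined_full_pinyin is enabled and keep_full_pinyin is disabled.)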
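5. System behavior settings
ignore_pinyin_offset (default: true)
Whether to ignore the offsets of the pinyin tokens.
Enabled: overlapping tokens are allowed, but position-dependent queries and highlighting may be inaccurate.
Disabled: offsets are strictly constrained, which keeps highlighting accurate.
Version note: this parameter matters from Elasticsearch 6.0 onward, where token offsets are strictly validated.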
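How a custom analyzer works
An analyzer in Elasticsearch consists of three parts:
- character filter: processes the text before it reaches the tokenizer, for example by deleting or replacing characters
- tokenizer: splits the text into terms according to some rule. For example, keyword emits the whole text as a single term, and ik_smart segments Chinese text into words
- token filter: post-processes the terms emitted by the tokenizer, for example case conversion, synonym handling, or pinyin conversion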
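The order of the three stages can be illustrated with a minimal Java sketch. All three stage implementations here are stand-ins chosen for brevity (tag stripping, whitespace splitting, lowercasing), and AnalyzerPipelineSketch is a hypothetical name; Elasticsearch's own analyzers are built on Lucene and are considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Toy model of the three analyzer stages; the stage implementations are stand-ins.
class AnalyzerPipelineSketch {
    // Character filter: works on the raw text, here by stripping HTML-like tags.
    static final Function<String, String> STRIP_TAGS = s -> s.replaceAll("<[^>]*>", "");

    // Tokenizer: cuts the text into terms, here on whitespace (a stand-in for ik_smart).
    static final Function<String, List<String>> WHITESPACE =
        s -> Arrays.asList(s.trim().split("\\s+"));

    // Token filter: post-processes each term, here by lowercasing.
    static final Function<List<String>, List<String>> LOWERCASE = terms -> {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(term.toLowerCase(Locale.ROOT));
        }
        return out;
    };

    // Runs the stages in the fixed order: character filter, tokenizer, token filter.
    static List<String> analyze(String text) {
        String filtered = STRIP_TAGS.apply(text);
        List<String> terms = WHITESPACE.apply(filtered);
        return LOWERCASE.apply(terms);
    }

    public static void main(String[] args) {
        System.out.println(analyze("<b>Hello</b> Elastic Search")); // [hello, elastic, search]
    }
}
```

A pinyin token filter sits in the same slot as LOWERCASE here: it receives the terms that the tokenizer produced and may emit additional terms for each of them.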
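Example
Create an index named test for trying out a custom analyzer:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "words": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}
Settings of the py filter used above:
- keep_full_pinyin = false: do not emit the full pinyin of each character
- keep_joined_full_pinyin = true: join the full pinyin into one long term
- keep_original = true: keep the original text
- limit_first_letter_length = 16: limit the first-letter term to 16 characters
- remove_duplicated_term = true: remove duplicated terms
- none_chinese_pinyin_tokenize = false: do not split non-Chinese text by pinyin rules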
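The inverted index is built with my_analyzer.
Queries are analyzed with ik_max_word.
As a result, a search for "狮子" (lion) no longer returns entries about "虱子" (louse), even though both words share the pinyin shizi.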
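Why the search analyzer matters can be shown with a toy inverted index. This sketch is not how Elasticsearch stores its index: SearchAnalyzerSketch and its methods are hypothetical, and to keep the code ASCII-only the two original words are represented by the placeholders louse (虱子) and lion (狮子), both with the pinyin terms shizi and sz.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index showing why the query should not be expanded to pinyin.
class SearchAnalyzerSketch {
    // Stand-in for my_analyzer: the original word plus joined pinyin and initials.
    static List<String> pinyinAnalyzer(String word, String pinyin, String initials) {
        return List.of(word, pinyin, initials);
    }

    static void addDoc(Map<String, Set<Integer>> index, int docId, List<String> terms) {
        for (String term : terms) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    static Map<String, Set<Integer>> buildIndex() {
        Map<String, Set<Integer>> index = new HashMap<>();
        addDoc(index, 1, pinyinAnalyzer("louse", "shizi", "sz")); // document 1
        addDoc(index, 2, pinyinAnalyzer("lion", "shizi", "sz"));  // document 2
        return index;
    }

    // A match-style lookup: a document is a hit if it contains any query term.
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(index.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex();
        // Query analyzed with the pinyin analyzer: the shared term shizi hits both documents.
        System.out.println(search(index, pinyinAnalyzer("louse", "shizi", "sz"))); // [1, 2]
        // Query analyzed without pinyin: only the original word is looked up.
        System.out.println(search(index, List.of("louse")));                       // [1]
    }
}
```

A user who types pinyin directly still finds both documents, because the query term shizi is present in the index for each of them.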
Test
Analyze a sample text with my_analyzer:
POST /test/_analyze
{
  "text": ["了却君王天下事junwang天下事"],
  "analyzer": "my_analyzer"
}
Response:
{
"tokens" : [
{
"token" : "了却",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "leque",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "lq",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "君王",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "junwang",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "jw",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "天下事",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "tianxiashi",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "txs",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "天下",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "tianxia",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "tx",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "事",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "shi",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "s",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "junwang",
"start_offset" : 7,
"end_offset" : 14,
"type" : "ENGLISH",
"position" : 5
},
{
"token" : "天下事",
"start_offset" : 14,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "tianxiashi",
"start_offset" : 14,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "txs",
"start_offset" : 14,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "天下",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "tianxia",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "tx",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "事",
"start_offset" : 16,
"end_offset" : 17,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "shi",
"start_offset" : 16,
"end_offset" : 17,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "s",
"start_offset" : 16,
"end_offset" : 17,
"type" : "CN_CHAR",
"position" : 8
}
]
}
Index two documents whose key words are homophones:
PUT /test/_doc/1
{
  "words": "身上有虱子"
}
PUT /test/_doc/2
{
  "words": "山里有狮子"
}
Run the query:
GET /test/_search
{
  "query": {
    "match": {
      "words": "虱子"
    }
  }
}
Result before search_analyzer was set to ik_max_word (the query is analyzed with my_analyzer as well):
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.33425623,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.33425623,
"_source" : {
"words" : "身上有虱子"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.3085442,
"_source" : {
"words" : "山里有狮子"
}
}
]
}
}
Result after setting search_analyzer to ik_max_word:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.9530773,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.9530773,
"_source" : {
"words" : "身上有虱子"
}
}
]
}
}
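Clearly, the second result is the one we want.
Autocompletion
Elasticsearch provides the completion suggester for autocompletion. It returns the entries that start with the user's input.
A field used for completion must have the type completion, and its value holds the entries offered as completions, usually as an array.
Autocompletion (DSL)
Create a game index with a single completion field, title: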
PUT /game
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "completion",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}
POST /game/_bulk
{"index":{"_id":1}}
{"title":["原神","开放世界","角色扮演","动作冒险","多平台","米哈游"]}
{"index":{"_id":2}}
{"title":["王者荣耀","MOBA","5v5","竞技","手游"]}
{"index":{"_id":3}}
{"title":["绝地求生","大逃杀","FPS","射击","Steam"]}
{"index":{"_id":4}}
{"title":["英雄联盟","MOBA","PC","竞技","团队合作"]}
{"index":{"_id":5}}
{"title":["崩坏:星穹铁道","角色扮演","回合制","科幻","米哈游"]}
Test case 1: the input "mi" is a prefix of mihayou (米哈游).
GET /game/_search
{
  "suggest": {
    "game_suggest": {
      "text": "mi",
      "completion": {
        "field": "title",
        "skip_duplicates": false,
        "size": 5
      }
    }
  }
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"game_suggest" : [
{
"text" : "mi",
"offset" : 0,
"length" : 2,
"options" : [
{
"text" : "米哈游",
"_index" : "game",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : [
"原神",
"开放世界",
"角色扮演",
"动作冒险",
"多平台",
"米哈游"
]
}
},
{
"text" : "米哈游",
"_index" : "game",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"title" : [
"崩坏:星穹铁道",
"角色扮演",
"回合制",
"科幻",
"米哈游"
]
}
}
]
}
]
}
}
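Test case 2: the input "ha" matches no entry, because the completion suggester only matches from the beginning of an entry and "ha" occurs only in the middle of mihayou (米哈游).
GET /game/_search
{
  "suggest": {
    "game_suggest": {
      "text": "ha",
      "completion": {
        "field": "title",
        "skip_duplicates": false,
        "size": 5
      }
    }
  }
}
Response: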
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"game_suggest" : [
{
"text" : "ha",
"offset" : 0,
"length" : 2,
"options" : [ ]
}
]
}
}
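The difference between the two test cases comes down to prefix matching, which the following sketch imitates with a linear scan. It is only an illustration: the real completion suggester uses a specialized in-memory structure rather than a loop, PrefixSuggestSketch and its methods are hypothetical, and the display texts are written in English (miHoYo for 米哈游, Genshin Impact for 原神) to keep the code ASCII-only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy prefix matcher; each entry maps a display text to its analyzed forms.
class PrefixSuggestSketch {
    static Map<String, List<String>> sampleEntries() {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        entries.put("miHoYo", List.of("mihayou", "mhy"));         // joined pinyin, initials
        entries.put("Genshin Impact", List.of("yuanshen", "ys"));
        entries.put("MOBA", List.of("moba"));
        return entries;
    }

    // An entry is suggested only if one of its forms STARTS with the input.
    static List<String> suggest(Map<String, List<String>> entries, String prefix) {
        List<String> options = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : entries.entrySet()) {
            for (String form : entry.getValue()) {
                if (form.startsWith(prefix)) {
                    options.add(entry.getKey());
                    break; // one matching form is enough for this entry
                }
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, List<String>> entries = sampleEntries();
        System.out.println(suggest(entries, "mi"));  // [miHoYo]
        System.out.println(suggest(entries, "ha"));  // []
        System.out.println(suggest(entries, "mhy")); // [miHoYo]
    }
}
```

If matching in the middle of an entry is needed, a completion field is the wrong tool; a regular text field with a suitable analyzer is the usual alternative.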
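Autocompletion with the Java REST client
import java.util.Map;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import org.junit.jupiter.api.Test;

// "client" is assumed to be an initialized RestHighLevelClient field of the test class.
@Test
void testSuggest() throws Exception {
    // Build the same completion suggest request as test case 1.
    SearchRequest request = new SearchRequest("game");
    request.source().suggest(new SuggestBuilder()
        .addSuggestion("game_suggest",
            SuggestBuilders.completionSuggestion("title").prefix("mi").skipDuplicates(false).size(5)));
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("game_suggest");
    for (CompletionSuggestion.Entry entry : completionSuggestion.getEntries()) {
        for (CompletionSuggestion.Entry.Option option : entry) {
            // The completed text
            String suggestedText = option.getText().string();
            // The _source of the associated document, if there is one
            Map<String, Object> source = option.getHit() != null ? option.getHit().getSourceAsMap() : null;
            System.out.println("Suggestion: " + suggestedText);
            System.out.println("Source document: " + source);
        }
    }
}
Output:
Suggestion: 米哈游
Source document: {title=[原神, 开放世界, 角色扮演, 动作冒险, 多平台, 米哈游]}
Suggestion: 米哈游
Source document: {title=[崩坏:星穹铁道, 角色扮演, 回合制, 科幻, 米哈游]}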