Elasticsearch autocomplete

Latest recommended article published on 2025-04-25 18:07:04

c1tenj2

Tags: java elasticsearch

Article link: https://blog.csdn.net/c1tenj2/article/details/147457600

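Installing the pinyin analysis plugin

Choose the pinyin plugin version that matches your ES version.

Download it, unzip it, and place it in the ES plugins directory.

Restart ES.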

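Custom analyzer

Pinyin analyzer: optional settings

1. First-letter settings

keep_first_letter (default: true)

Explanation: whether to emit the combined first letters of the characters, which supports searching by initials
Enabled: 刘德华 → [ldh]
Disabled: 刘德华 → [] (no first-letter term)
Use case: finding "刘德华" by typing "ldh"

keep_separate_first_letter (default: false)

Explanation: whether to emit the first letter of each character as a separate term
Enabled: 刘德华 → [l,d,h]
Disabled: 刘德华 → [ldh]
Note: enabling it increases index size, but supports looser searches (such as "l d h")

limit_first_letter_length (default: 16)

Explanation: maximum length of the first-letter term
Example:
中华人民共和国 → with the default, [zhrmghg] (7 characters)
Set to 3 → [zhr]
Purpose: caps the length of the first-letter term for long text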
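
To make the three first-letter settings concrete, here is a minimal Java sketch. It is a toy model, not the plugin's implementation: the class FirstLetterSketch, the method firstLetterTokens, and the pre-split syllable lists are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the first-letter settings; NOT the pinyin plugin's real code.
class FirstLetterSketch {

    // Builds first-letter terms from the (non-empty) pinyin syllable of each character.
    static List<String> firstLetterTokens(List<String> syllables,
                                          boolean keepFirstLetter,
                                          boolean keepSeparateFirstLetter,
                                          int limitFirstLetterLength) {
        List<String> tokens = new ArrayList<>();
        StringBuilder joined = new StringBuilder();
        for (String syllable : syllables) {
            char initial = syllable.charAt(0);
            if (keepSeparateFirstLetter) {
                tokens.add(String.valueOf(initial)); // one term per character
            }
            if (joined.length() < limitFirstLetterLength) {
                joined.append(initial); // stop growing once the cap is reached
            }
        }
        if (keepFirstLetter && joined.length() > 0) {
            tokens.add(joined.toString()); // the combined initials, e.g. ldh
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> liuDeHua = List.of("liu", "de", "hua");
        System.out.println(firstLetterTokens(liuDeHua, true, false, 16)); // [ldh]
        System.out.println(firstLetterTokens(liuDeHua, false, true, 16)); // [l, d, h]

        List<String> prc = List.of("zhong", "hua", "ren", "min", "gong", "he", "guo");
        System.out.println(firstLetterTokens(prc, true, false, 3));       // [zhr]
    }
}
```

The syllables are passed in already split because converting Chinese characters to pinyin is the plugin's job; the sketch only shows how the three switches shape the resulting terms.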

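2. Full pinyin settings

keep_full_pinyin (default: true)

Explanation: whether to emit the full pinyin of each character
Enabled: 刘德华 → [liu,de,hua]
Disabled: 刘德华 → []
Why it matters: this is the basis for exact search by pinyin syllable

keep_joined_full_pinyin (default: false)

Explanation: whether to join the full pinyin into a single term
Enabled: 刘德华 → [liudehua]
Disabled: 刘德华 → [liu,de,hua]
Trade-off: joining produces fewer index terms, but on its own it loses the ability to match a single character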
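
How the two full-pinyin switches combine can be sketched in a few lines of Java. This is only an illustration under stated assumptions: FullPinyinSketch and pinyinTokens are hypothetical names, the syllables are supplied by hand, and the order of the emitted terms is not meant to mirror the plugin.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of keep_full_pinyin and keep_joined_full_pinyin; NOT the plugin's code.
class FullPinyinSketch {

    static List<String> pinyinTokens(List<String> syllables,
                                     boolean keepFullPinyin,
                                     boolean keepJoinedFullPinyin) {
        List<String> tokens = new ArrayList<>();
        if (keepFullPinyin) {
            tokens.addAll(syllables);               // one term per character
        }
        if (keepJoinedFullPinyin) {
            tokens.add(String.join("", syllables)); // a single joined term
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> liuDeHua = List.of("liu", "de", "hua");
        System.out.println(pinyinTokens(liuDeHua, true, false)); // [liu, de, hua]
        System.out.println(pinyinTokens(liuDeHua, false, true)); // [liudehua]
        System.out.println(pinyinTokens(liuDeHua, true, true));  // [liu, de, hua, liudehua]
    }
}
```

With only the joined term indexed, a query for de alone has nothing to match, which is the trade-off described above.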

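3. Non-Chinese text settings

keep_none_chinese (default: true)

Explanation: whether to keep the non-Chinese characters of the original text
Enabled: 刘德华AT2016 → [liu,de,hua,AT2016]
Disabled: 刘德华AT2016 → [liu,de,hua]
Why it matters: this is the key setting for mixed-language text

keep_none_chinese_together (default: true)

Explanation: whether to keep runs of consecutive non-Chinese characters together
Enabled: DJ音乐家 → [DJ,yin,yue,jia]
Disabled: DJ音乐家 → [D,J,yin,yue,jia]
Effect: disabling it increases the number of index terms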

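4. Advanced settings

none_chinese_pinyin_tokenize (default: true)

Explanation: whether to split non-Chinese letters into separate pinyin terms when they spell pinyin
Enabled: liudehua2016 → [liu,de,hua,2016]
Disabled: liudehua2016 → [liudehua2016]
Use case: text that mixes typed pinyin with digits

remove_duplicated_term (default: false)

Explanation: whether to remove duplicate terms
Enabled: de的 → [de]
Disabled: de的 → [de,de]
Trade-off: saves index space, but can affect position-related queries and highlighting accuracy

keep_original (default: false)

Explanation: whether to keep the original text as a term
Enabled: "北京" → ["北京", "beijing", "bj"]
Disabled: "北京" → ["beijing", "bj"]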

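5. System behavior settings

ignore_pinyin_offset (default: true)

Explanation: whether to ignore the offsets of pinyin terms
Enabled: overlapping terms are allowed, but position-related queries and highlighting may become inaccurate
Disabled: offsets are strictly enforced, which keeps highlighting accurate
Version note: Elasticsearch 6.0 and later enforce offsets strictly, so pay attention to this setting on those versions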

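How a custom analyzer works

An analyzer in Elasticsearch consists of three parts:

character filter: processes the text before the tokenizer, for example by deleting or replacing characters
tokenizer: splits the text into terms according to some rule, for example keyword (no splitting) or ik_smart
token filter: further processes the terms emitted by the tokenizer, for example case conversion, synonym handling, or pinyin conversion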
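
The order of the three stages can be sketched in plain Java. This is a conceptual model only: AnalyzerPipelineSketch, analyze, and the three lambdas in demo are hypothetical stand-ins, and none of it uses Elasticsearch or Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Conceptual model of an analyzer: character filter -> tokenizer -> token filters.
class AnalyzerPipelineSketch {

    static List<String> analyze(String text,
                                UnaryOperator<String> charFilter,
                                Function<String, List<String>> tokenizer,
                                List<UnaryOperator<List<String>>> tokenFilters) {
        String filtered = charFilter.apply(text);        // 1. rewrite the raw characters
        List<String> tokens = tokenizer.apply(filtered); // 2. split the text into terms
        for (UnaryOperator<List<String>> filter : tokenFilters) {
            tokens = filter.apply(tokens);               // 3. post-process the terms in order
        }
        return tokens;
    }

    static List<String> demo() {
        // Stand-ins for a real character filter, tokenizer, and token filter.
        UnaryOperator<String> stripTags = s -> s.replaceAll("<[^>]+>", "");
        Function<String, List<String>> whitespace = s -> Arrays.asList(s.trim().split("\\s+"));
        UnaryOperator<List<String>> lowercase = tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
        return analyze("<b>Hello</b> Elastic Search", stripTags, whitespace, List.of(lowercase));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [hello, elastic, search]
    }
}
```

In this picture, a pinyin filter sits in the third stage: it receives the terms the tokenizer produced and adds or replaces terms.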

Example

Create an index named test for trying out the custom analyzer. The py filter drops the per-character pinyin, emits the joined pinyin as a single term, keeps the original text, caps first-letter terms at 16 characters, removes duplicate terms, and does not split non-Chinese text by pinyin rules.

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "words": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

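my_analyzer is used when the inverted index is built.

ik_max_word is used as the analyzer at query time.

This way, a search for "狮子" (lion) no longer returns entries about 虱子 (louse), which has the same pinyin.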
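
Why the choice of query-time analyzer matters can be shown with a small Java sketch. It assumes hand-written term sets in place of real analysis output; the class SearchAnalyzerSketch, the ids doc1 and doc2, and the terms themselves are illustrative assumptions rather than what Elasticsearch stores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy inverted-index lookup showing why the query side should not emit pinyin.
class SearchAnalyzerSketch {

    // Hypothetical index-time terms for the key word of each document.
    static final Map<String, Set<String>> INDEXED = new LinkedHashMap<>();
    static {
        INDEXED.put("doc1", Set.of("虱子", "shizi", "sz")); // louse
        INDEXED.put("doc2", Set.of("狮子", "shizi", "sz")); // lion
    }

    // A document matches when it contains at least one of the query terms.
    static List<String> match(Set<String> queryTerms) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> doc : INDEXED.entrySet()) {
            for (String term : queryTerms) {
                if (doc.getValue().contains(term)) {
                    hits.add(doc.getKey());
                    break; // one matching term is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Query analyzed with pinyin: the shared pinyin terms match both documents.
        System.out.println(match(Set.of("狮子", "shizi", "sz"))); // [doc1, doc2]
        // Query analyzed without pinyin: only the document with the same characters matches.
        System.out.println(match(Set.of("狮子")));                // [doc2]
    }
}
```

Pinyin terms are still useful on the index side, because a user who types shizi should find both documents; the point is only that Chinese input should not be expanded to pinyin at query time.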

Test

POST /test/_analyze
{
  "text": ["了却君王天下事junwang天下事"],
  "analyzer": "my_analyzer"
}

{
  "tokens" : [
    {
      "token" : "了却",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "leque",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "lq",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "君王",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "junwang",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "jw",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "天下事",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "tianxiashi",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "txs",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "天下",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "tianxia",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "tx",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "事",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "shi",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "s",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "junwang",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "天下事",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "tianxiashi",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "txs",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "天下",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "tianxia",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "tx",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "事",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "shi",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "s",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "CN_CHAR",
      "position" : 8
    }
  ]
}

PUT /test/_doc/1
{
  "words":"身上有虱子"
}

PUT /test/_doc/2
{
  "words":"山里有狮子"
}

Run the DSL query

GET /test/_search
{
  "query": {
    "match": {
      "words": "虱子"
    }
  }
}

Result before setting search_analyzer to ik_max_word

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.33425623,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.33425623,
        "_source" : {
          "words" : "身上有虱子"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.3085442,
        "_source" : {
          "words" : "山里有狮子"
        }
      }
    ]
  }
}

Result after setting search_analyzer to ik_max_word

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9530773,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9530773,
        "_source" : {
          "words" : "身上有虱子"
        }
      }
    ]
  }
}

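Clearly, the second result is the one we want.

Autocomplete

ES provides the completion suggester query for autocomplete. It matches the terms that start with the user's input and returns them.

A field used for completion must be of type completion, and its content is the list of terms that take part in completion.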
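
The "starts with the user's input" behavior can be modeled with a sorted map in Java. This is a simplified sketch, not how the completion suggester is implemented internally: PrefixCompletionSketch is a hypothetical class, and the analyzed forms passed to add (original text, joined pinyin, first letters) are written by hand as an assumption about what a pinyin analyzer would emit.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Simplified prefix completion over a sorted map of analyzed forms.
class PrefixCompletionSketch {

    private final TreeMap<String, Set<String>> index = new TreeMap<>();

    // Registers a suggestion under each of its (hypothetical) analyzed forms.
    void add(String suggestion, String... analyzedForms) {
        for (String form : analyzedForms) {
            index.computeIfAbsent(form, k -> new LinkedHashSet<>()).add(suggestion);
        }
    }

    // Returns the suggestions that have at least one analyzed form starting with the prefix.
    List<String> suggest(String prefix) {
        Set<String> found = new LinkedHashSet<>();
        // Keys in [prefix, prefix + '\uffff') are the keys that start with the prefix.
        for (Set<String> s : index.subMap(prefix, prefix + Character.MAX_VALUE).values()) {
            found.addAll(s);
        }
        return new ArrayList<>(found);
    }

    static PrefixCompletionSketch demo() {
        PrefixCompletionSketch completion = new PrefixCompletionSketch();
        completion.add("米哈游", "米哈游", "mihayou", "mhy");
        completion.add("原神", "原神", "yuanshen", "ys");
        completion.add("竞技", "竞技", "jingji", "jj");
        return completion;
    }

    public static void main(String[] args) {
        PrefixCompletionSketch completion = demo();
        System.out.println(completion.suggest("mi")); // [米哈游]
        System.out.println(completion.suggest("ha")); // []
        System.out.println(completion.suggest("y"));  // [原神]
    }
}
```

Matching is anchored at the start of each form, so ha finds nothing even though it occurs in the middle of mihayou.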

Autocomplete (DSL implementation)

Create a game index that has a single completion-type field, title.

PUT /game
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"  
        }
      },
      "filter": { 
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "completion",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

POST /game/_bulk
{"index":{"_id":1}}
{"title":["原神","开放世界","角色扮演","动作冒险","多平台","米哈游"]}
{"index":{"_id":2}}
{"title":["王者荣耀","MOBA","5v5","竞技","手游"]}
{"index":{"_id":3}}
{"title":["绝地求生","大逃杀","FPS","射击","Steam"]}
{"index":{"_id":4}}
{"title":["英雄联盟","MOBA","PC","竞技","团队合作"]}
{"index":{"_id":5}}
{"title":["崩坏：星穹铁道","角色扮演","回合制","科幻","米哈游"]}

Test case 1

GET /game/_search
{
  "suggest": {
    "game_suggest": {
      "text": "mi",
      "completion": {
        "field": "title",
        "skip_duplicates":false,
        "size": 5
      }
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "game_suggest" : [
      {
        "text" : "mi",
        "offset" : 0,
        "length" : 2,
        "options" : [
          {
            "text" : "米哈游",
            "_index" : "game",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "title" : [
                "原神",
                "开放世界",
                "角色扮演",
                "动作冒险",
                "多平台",
                "米哈游"
              ]
            }
          },
          {
            "text" : "米哈游",
            "_index" : "game",
            "_type" : "_doc",
            "_id" : "5",
            "_score" : 1.0,
            "_source" : {
              "title" : [
                "崩坏：星穹铁道",
                "角色扮演",
                "回合制",
                "科幻",
                "米哈游"
              ]
            }
          }
        ]
      }
    ]
  }
}

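Test case 2

The completion suggester matches from the start of a term, so the input ha, which does not begin any indexed term, returns no options.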

GET /game/_search
{
  "suggest": {
    "game_suggest": {
      "text": "ha",
      "completion": {
        "field": "title",
        "skip_duplicates":false,
        "size": 5
      }
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "game_suggest" : [
      {
        "text" : "ha",
        "offset" : 0,
        "length" : 2,
        "options" : [ ]
      }
    ]
  }
}

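Autocomplete with the REST API

import java.util.Map;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import org.junit.jupiter.api.Test; // JUnit 5 is assumed here

// `client` is assumed to be an initialized RestHighLevelClient field of the test class.
@Test
void testSuggest() throws Exception {
    // Build a completion suggestion named game_suggest on the title field.
    SearchRequest request = new SearchRequest("game");
    request.source().suggest(new SuggestBuilder()
            .addSuggestion("game_suggest",
                    SuggestBuilders.completionSuggestion("title")
                            .prefix("mi")
                            .skipDuplicates(false)
                            .size(5)));

    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("game_suggest");

    for (CompletionSuggestion.Entry entry : completionSuggestion.getEntries()) {
        for (CompletionSuggestion.Entry.Option option : entry) {
            // The suggested completion text
            String suggestedText = option.getText().string();

            // The _source of the matching document (if any)
            Map<String, Object> source = option.getHit().getSourceAsMap();

            System.out.println("Hit: " + suggestedText);
            System.out.println("Document: " + source);
        }
    }
}

Hit: 米哈游
Document: {title=[原神, 开放世界, 角色扮演, 动作冒险, 多平台, 米哈游]}
Hit: 米哈游
Document: {title=[崩坏：星穹铁道, 角色扮演, 回合制, 科幻, 米哈游]}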
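
Both hits above carry the same text, because duplicates were not skipped and the term 米哈游 appears in two documents. If a drop-down only needs unique texts, one option is to de-duplicate on the client. The sketch below assumes the option texts have already been collected into a list; SuggestionTextsSketch and distinctTexts are hypothetical names. Setting skip_duplicates to true in the request instead asks Elasticsearch to filter duplicate suggestions on the server side.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Client-side de-duplication of suggestion texts, keeping the original order.
class SuggestionTextsSketch {

    static List<String> distinctTexts(List<String> optionTexts) {
        // LinkedHashSet drops repeats while preserving first-seen order.
        return new ArrayList<>(new LinkedHashSet<>(optionTexts));
    }

    public static void main(String[] args) {
        List<String> texts = List.of("米哈游", "米哈游");
        System.out.println(distinctTexts(texts)); // [米哈游]
    }
}
```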