Configuring IK in ElasticSearch to Flexibly Match Single Chinese Characters and Phrases

LittleMagics

Article link: https://blog.csdn.net/nazeniwaresakini/article/details/104220237

At work we use ElasticSearch as a full-text search engine that serves both end users and our back-office systems. The version is a rather dated 2.3.2.

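We recently received an optimization request: a search for a single Chinese character should match every document containing that character, while a search for a word should not be matched character by character. Taking products as an example, searching for the character “酒” (alcohol) should match all documents about “啤酒” (beer), “白酒” (baijiu), “红酒” (red wine) and so on, whereas searching for the word “啤酒” should match only “啤酒”. In addition, results that match the whole keyword should rank ahead of results that match only through segmented tokens, and the results must be sorted by relevance and then by sales volume.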

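The first idea was to use different IK analyzers at index time and at search time for the fields with this requirement: finest-grained segmentation (ik_max_word) when indexing, and smart segmentation (ik_smart) when searching. The mapping is set up as follows: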

~ curl -s -H 'Content-Type:application/json' \
-XPUT 'es0:9200/index/_mapping/type?pretty=true' -d '{
"properties": {
  "productTitle": {
    "type": "string",
    "analyzer": "ik_max_word",
    "search_analyzer": "ik_smart"
  }
}
}'

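However, even with ik_max_word, many single characters are not emitted as separate tokens. We therefore added a single-character dictionary of roughly 12,000 characters to the custom dictionaries. With it in place, ik_max_word splits out the single characters as well, for example: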

~ curl -s -H 'Content-Type: application/json' \
-XGET 'es0:9200/_analyze?pretty' -d '{
  "analyzer" : "ik_max_word",
  "text": "中华人民共和国"
}'

{
  "tokens" : [ {
    "token" : "中华人民共和国",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "中华人民",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "中华",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "华人",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "华",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人民共和国",
    "start_offset" : 2,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "人民",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "人",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 8
  }, {
    "token" : "民",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "共和国",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 10
  }, {
    "token" : "共和",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 11
  }, {
    "token" : "共",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 12
  }, {
    "token" : "和",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 13
  }, {
    "token" : "国",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 14
  } ]
}
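To see why the analyzer pair from the mapping produces the requested behavior, here is a minimal Java sketch of the matching logic. It does not call IK or ElasticSearch: the token sets are hand-written stand-ins for what ik_max_word (with the single-character dictionary) would index and what ik_smart would emit for these short keywords, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AnalyzerAsymmetrySketch {
    // Assumed index-time tokens per product title (stand-in for ik_max_word
    // plus the single-character dictionary; not real IK output).
    static final Map<String, Set<String>> INDEXED = new LinkedHashMap<>();
    static {
        INDEXED.put("啤酒", new HashSet<>(Arrays.asList("啤酒", "啤", "酒")));
        INDEXED.put("白酒", new HashSet<>(Arrays.asList("白酒", "白", "酒")));
        INDEXED.put("红酒", new HashSet<>(Arrays.asList("红酒", "红", "酒")));
    }

    // Assumed search-time tokens (stand-in for ik_smart): a dictionary word
    // or a single character is kept as one token and is not split further.
    static List<String> searchTokens(String keyword) {
        return Collections.singletonList(keyword);
    }

    // A title matches when it contains every search-time token.
    static List<String> search(String keyword) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> doc : INDEXED.entrySet()) {
            if (doc.getValue().containsAll(searchTokens(keyword))) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }
}
```

Under these assumptions, `search("酒")` returns all three titles, while `search("啤酒")` returns only “啤酒”, because the search side never breaks the word back into single characters.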

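A single-character dictionary actually ships with the ES IK plugin: https://github.com/medcl/elasticsearch-analysis-ik/blob/master/config/extra_single_word_full.dic
After adding the single-character dictionary to IKAnalyzer.cfg.xml, restart ES for it to take effect (we have not set up hot reloading, embarrassingly).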
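IKAnalyzer.cfg.xml uses the Java XML properties format, so a quick way to check which extension dictionaries a given config declares is to load it with `java.util.Properties`. The sketch below embeds an example config as a string. The dictionary paths in it are hypothetical and depend on the plugin version and install layout; the `ext_dict` key with semicolon-separated entries reflects my understanding of the IK config format and should be checked against your own file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class IkConfigSketch {
    // Example config content. Both dictionary paths are hypothetical.
    static final String CONFIG_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>IK Analyzer extension configuration</comment>\n"
        + "  <entry key=\"ext_dict\">custom/mydict.dic;custom/extra_single_word_full.dic</entry>\n"
        + "</properties>\n";

    // Returns the dictionary paths listed under ext_dict, in declaration order.
    static List<String> extDicts(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> dicts = new ArrayList<>();
        for (String path : props.getProperty("ext_dict", "").split(";")) {
            if (!path.trim().isEmpty()) {
                dicts.add(path.trim());
            }
        }
        return dicts;
    }
}
```

Calling `IkConfigSketch.extDicts(IkConfigSketch.CONFIG_XML)` returns the two paths from the example, with the single-character dictionary listed second.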

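The ranking rule mentioned earlier is comparatively simple: the term query only needs to carry more weight than the match-phrase query, which is easy to achieve with boost and slop: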

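BoolQueryBuilder currentBuilder = QueryBuilders.boolQuery();
// Whole-keyword hit: the term query is not analyzed, so it matches only when the keyword itself is an indexed token. It gets the higher boost.
currentBuilder.should(QueryBuilders.termQuery("productTitle", keyword).boost(6.5f));
// Segmented hit: the analyzed keyword matches as a phrase with a slop of 4, at a lower boost.
currentBuilder.should(QueryBuilders.matchPhraseQuery("productTitle", keyword).slop(4).boost(2.5f));
// ......
requestBuilder.setFrom(start).setSize(limit);
// Sort by relevance score first, then by sales volume, both descending.
requestBuilder.addSort(SortBuilders.scoreSort().order(SortOrder.DESC));
requestBuilder.addSort(SortBuilders.fieldSort("soldNum").order(SortOrder.DESC));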
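The effect of the two boosts and the two sort keys can be illustrated with a small standalone Java sketch. It is deliberately simplified: the score here is just the sum of the boosts of the clauses that matched, whereas the real Lucene score also depends on term statistics, so the numbers only show the relative ordering. The `Hit` class and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RankingSketch {
    static final class Hit {
        final String title;
        final double score;
        final long soldNum;

        Hit(String title, double score, long soldNum) {
            this.title = title;
            this.score = score;
            this.soldNum = soldNum;
        }
    }

    // Simplified score: sum of the boosts of the should clauses that matched.
    static double score(boolean termHit, boolean phraseHit) {
        return (termHit ? 6.5 : 0.0) + (phraseHit ? 2.5 : 0.0);
    }

    // Order by score descending, then by soldNum descending, and return the titles.
    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        Comparator<Hit> byScoreDesc = Comparator.comparingDouble((Hit h) -> h.score).reversed();
        Comparator<Hit> bySoldDesc = Comparator.comparingLong((Hit h) -> h.soldNum).reversed();
        sorted.sort(byScoreDesc.thenComparing(bySoldDesc));
        List<String> titles = new ArrayList<>();
        for (Hit h : sorted) {
            titles.add(h.title);
        }
        return titles;
    }

    static List<String> demo() {
        return rank(Arrays.asList(
            new Hit("A", score(true, true), 10),     // whole-keyword hit, low sales
            new Hit("B", score(false, true), 500),   // phrase-only hit, high sales
            new Hit("C", score(true, true), 300)));  // whole-keyword hit, higher sales
    }
}
```

`RankingSketch.demo()` returns C, A, B: both whole-keyword hits outrank the phrase-only hit despite its higher sales, and sales volume only decides the order between A and C because their scores are equal. The same holds for the real request: the `soldNum` sort key only comes into play when two documents have exactly the same score.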