Configuring the IK and pinyin analyzers in Elasticsearch so that search supports both Chinese and pinyin

const伐伐

Category: ELK

Article link: https://blog.csdn.net/u013905744/article/details/80935846

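The requirement is as follows:

A Chinese field should be segmented with the IK analyzer, and the resulting words should then be converted to pinyin, so that the field can be searched with either Chinese or pinyin.

For example, "道路挖掘" is segmented into 道路 and 挖掘, whose pinyin is daolu and wajue, so searching for either daolu or 道路 should find this record.

How can this be done?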

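1. Download the IK and pinyin analysis plugins and place them in their own directories under the Elasticsearch plugins directory

Check the installation from Kibana: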

GET /_cat/plugins?v&s=component&h=name,component,version,description

Result

name    component          version description
WPhvS8c analysis-ik        6.2.4   IK Analyzer for Elasticsearch
WPhvS8c analysis-pinyin    6.2.4   Pinyin Analysis for Elasticsearch

Both analysis plugins are installed.
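If this check needs to be scripted rather than read by eye, the text output can be parsed. The following is a minimal sketch, assuming the response has the same column layout as the output shown above; the function name `installed_plugins` and the `SAMPLE` text are illustrative and not part of any Elasticsearch client.

```python
def installed_plugins(cat_output):
    """Return the set of plugin component names found in `_cat/plugins?v` text output."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return set()
    # The header row names the columns; find where "component" sits.
    header = lines[0].split()
    idx = header.index("component")
    return {ln.split()[idx] for ln in lines[1:]}


# Sample text copied from the output above (assumed layout).
SAMPLE = """\
name component version description
WPhvS8c analysis-ik        6.2.4   IK Analyzer for Elasticsearch
WPhvS8c analysis-pinyin    6.2.4   Pinyin Analysis for Elasticsearch
"""

REQUIRED = {"analysis-ik", "analysis-pinyin"}
missing = REQUIRED - installed_plugins(SAMPLE)
print("missing plugins:", sorted(missing))
```

An empty list means both plugins were reported by the node.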

2. Define an analyzer that applies pinyin conversion after IK segmentation, that is, a custom analyzer named ik_pinyin_analyzer

PUT test_index
{
  "settings":{
    "number_of_shards":"1",
    "index.refresh_interval":"15s",
    "index":{
      "analysis":{
        "analyzer":{
          "ik_pinyin_analyzer":{
            "type":"custom",
            "tokenizer":"ik_smart",
            "filter":["pinyin_filter"]
          }
        },
        "filter":{
          "pinyin_filter":{
            "type":"pinyin",
            "keep_first_letter": false
          }
        }
      }
    }
  }
}

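Some background on how a custom analyzer is assembled: the text passes through optional character filters, then exactly one tokenizer, then zero or more token filters.

Here the tokenizer is IK (ik_smart), and the tokens it produces are passed through the pinyin token filter; nothing more is needed. Setting keep_first_letter to false keeps first-letter abbreviations out of the index.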
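The tokenizer-then-filter chain can be imitated in a few lines of plain Python. This is only a toy sketch: `TOY_WORDS` and `TOY_PINYIN` are hypothetical stand-ins for the IK dictionary and the pinyin plugin's tables, and the greedy segmentation is much simpler than what ik_smart actually does.

```python
# Toy stand-ins (assumptions): real deployments use the IK dictionary and the pinyin plugin.
TOY_WORDS = ["道路", "挖掘", "施工"]
TOY_PINYIN = {"道": "dao", "路": "lu", "挖": "wa", "掘": "jue", "施": "shi", "工": "gong"}


def toy_tokenizer(text):
    """Greedy longest-match segmentation; returns (word, start, end) tuples."""
    tokens, i = [], 0
    candidates = sorted(TOY_WORDS, key=len, reverse=True)
    while i < len(text):
        # Fall back to a single character when no dictionary word starts here.
        word = next((w for w in candidates if text.startswith(w, i)), text[i])
        tokens.append((word, i, i + len(word)))
        i += len(word)
    return tokens


def toy_pinyin_filter(tokens):
    """Emit one pinyin token per character, keeping the offsets of the source word."""
    out = []
    for word, start, end in tokens:
        for ch in word:
            out.append({
                "token": TOY_PINYIN.get(ch, ch),
                "start_offset": start,
                "end_offset": end,
                "position": len(out),
            })
    return out


def toy_analyze(text):
    """Tokenizer first, token filter second, mirroring the custom analyzer's order."""
    return toy_pinyin_filter(toy_tokenizer(text))


for tok in toy_analyze("道路挖掘"):
    print(tok)
```

Because the filter runs on whole words, every syllable of a word carries that word's offsets, which is why two consecutive tokens can share the same start and end offsets.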

Test it:

POST test_index/_analyze
{
  "analyzer": "ik_pinyin_analyzer",
  "text":"道路挖掘"
}

Result

{
  "tokens": [
    {
      "token": "dao",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "lu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "wa",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "jue",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

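3. With this in place, ik_pinyin_analyzer can be used in an index mapping just like the ik_smart analyzer

For example, the mapping for the lawbasis field could look like this (the request uses the typed mapping endpoint of Elasticsearch 6.x, the version shown in the plugin list):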

PUT test_index/_mapping/test_type
{
  "properties": {
    "lawbasis":{
      "type": "text",
      "analyzer": "ik_smart",
      "search_analyzer": "ik_smart",
      "fields": {
        "my_pinyin":{
          "type":"text",
          "analyzer": "ik_pinyin_analyzer",
          "search_analyzer": "ik_pinyin_analyzer"
        }
      }
    }
  }
}

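The fields parameter (multi-fields) lets the same field be indexed in different ways for different purposes. Here lawbasis is indexed with ik_smart Chinese segmentation and, in the lawbasis.my_pinyin sub-field, with the pinyin of the segmented words, so it supports both Chinese and pinyin search.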
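Since the two forms live in separate fields, a single search box can query both at once with a `multi_match` query, letting each field apply its own search analyzer. The sketch below only builds the request body; the helper name `build_search_body` is hypothetical, and sending the body to Elasticsearch is left out.

```python
import json


def build_search_body(user_input, field="lawbasis", pinyin_subfield="my_pinyin"):
    """Build a multi_match body covering the Chinese field and its pinyin sub-field."""
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                # Each field analyzes the input with its own search_analyzer.
                "fields": [field, f"{field}.{pinyin_subfield}"],
            }
        }
    }


# The same body shape serves both kinds of input.
print(json.dumps(build_search_body("daolu"), ensure_ascii=False, indent=2))
print(json.dumps(build_search_body("道路"), ensure_ascii=False, indent=2))
```

The printed JSON can be pasted into Kibana as the body of a `_search` request against test_index.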

4. Test it

Add two documents:

POST test_index/test_type
{
  "lawbasis":"道路挖掘"
}
POST test_index/test_type
{
  "lawbasis":"道路施工"
}

Search using pinyin:

GET test_index/test_type/_search
{
  "query":{
    "match": {
      "lawbasis.my_pinyin": "daolu"
    }
  }
}

This returns two results.

Searching for "shigong" alone returns only one result.
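Why the hit counts differ can be seen with a toy inverted index. This is a rough sketch under two assumptions: the query string is split into pinyin syllables at search time (consistent with the results above), and the `match` query combines terms with its default OR operator. `TOY_PINYIN` and all function names are illustrative only.

```python
from collections import defaultdict

# Toy pinyin table (assumption); the real plugin ships its own dictionary.
TOY_PINYIN = {"道": "dao", "路": "lu", "挖": "wa", "掘": "jue", "施": "shi", "工": "gong"}
SYLLABLES = sorted(set(TOY_PINYIN.values()), key=len, reverse=True)


def index_terms(text):
    """Pinyin terms stored for a document: one syllable per known character."""
    return [TOY_PINYIN[ch] for ch in text if ch in TOY_PINYIN]


def query_terms(latin):
    """Greedy split of a Latin query into known syllables; unknown letters are skipped."""
    terms, i = [], 0
    while i < len(latin):
        syl = next((s for s in SYLLABLES if latin.startswith(s, i)), None)
        if syl is None:
            i += 1
            continue
        terms.append(syl)
        i += len(syl)
    return terms


def build_index(docs):
    """Map each term to the set of document ids containing it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            inverted[term].add(doc_id)
    return inverted


def search(inverted, query):
    """OR semantics: a document matches if it contains any query term."""
    hits = set()
    for term in query_terms(query):
        hits |= inverted.get(term, set())
    return hits


DOCS = {1: "道路挖掘", 2: "道路施工"}
INDEX = build_index(DOCS)
print(sorted(search(INDEX, "daolu")))    # [1, 2]
print(sorted(search(INDEX, "shigong")))  # [2]
```

Both documents contain dao and lu, while only the second contains shi and gong.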
