Installing the IK Analyzer Plugin for Elasticsearch

AnimalsD

Article link: https://blog.csdn.net/qq_36964872/article/details/117361484

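1. What is the IK analyzer?

Tokenization splits a piece of text into individual words or phrases according to a set of rules.
The IK analyzer makes up for ES's weak handling of Chinese: by default, ES treats every character of a Chinese text as a separate token. IK also supports custom dictionaries.
The IK analyzer provides two segmentation modes: ik_smart (fewest splits) and ik_max_word (finest-grained splits).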
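To make the difference between the three behaviours concrete, here is a minimal Python sketch. It is not IK's actual algorithm: it uses a tiny hand-written dictionary (an assumption for illustration), forward maximum matching as a stand-in for "fewest splits", and a plain enumeration of every dictionary word as a stand-in for "finest-grained splits".

```python
# Toy dictionary, made up for this example; IK ships a much larger one.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "万岁",
}


def per_character(text):
    """Default-like behaviour: every character becomes its own token."""
    return list(text)


def fewest_splits(text, dictionary):
    """Forward maximum matching: always take the longest dictionary word."""
    longest = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def finest_splits(text, dictionary):
    """Emit every dictionary word found at every start position."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


if __name__ == "__main__":
    sample = "中华人民共和国万岁"
    print(per_character(sample))
    print(fewest_splits(sample, DICTIONARY))
    print(finest_splits(sample, DICTIONARY))
```

With this toy dictionary, fewest_splits yields two tokens for the sample sentence, while finest_splits yields nine overlapping ones. The real plugin additionally emits some single characters and assigns token types, which this sketch does not attempt.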

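2. Installing the IK analyzer

Download the IK analyzer; its version must match your Elasticsearch version.
Place the archive in Elasticsearch's plugins folder and unzip it, then create a folder named ik under plugins and move the extracted contents into it (\elasticsearch-7.11.2\plugins\ik).
Restart ES and confirm that the startup output lists the plugin among those loaded.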
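The manual steps above can also be scripted. The following is a minimal sketch, assuming the download is a zip archive and that the extracted plugin contains a plugin-descriptor.properties file with an elasticsearch.version entry; the function names and paths are hypothetical and only meant to illustrate the idea.

```python
import zipfile
from pathlib import Path


def install_ik(archive, es_home):
    """Extract the plugin archive into <es_home>/plugins/ik and return that folder."""
    target = Path(es_home) / "plugins" / "ik"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(target)
    return target


def declared_es_version(plugin_dir):
    """Read the Elasticsearch version the plugin says it was built for.

    Assumes a plugin-descriptor.properties file with a line such as
    elasticsearch.version=7.11.2; returns None if the key is missing.
    """
    descriptor = Path(plugin_dir) / "plugin-descriptor.properties"
    for line in descriptor.read_text(encoding="utf-8").splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "elasticsearch.version":
            return value.strip()
    return None
```

Comparing the value returned by declared_es_version with the version of your ES installation before restarting is a quick way to catch a mismatched download.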

3. Using the IK analyzer

Use Kibana's Dev Tools to test the IK analyzer.

1) Tokenizing with ik_smart

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国万岁"
}

The tokenization result is:

{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "万岁",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
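
The same request can be sent without Kibana. Below is a minimal Python sketch using only the standard library; the base URL http://localhost:9200 and the helper names are assumptions, and an unsecured local node is assumed (no authentication or TLS handling).

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build the HTTP request for the _analyze endpoint (base_url is an assumption)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response):
    """Pull just the token text out of a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request to a running node and return the token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as reply:
        return token_strings(json.load(reply))
```

Calling analyze("ik_smart", "中华人民共和国万岁") against a node with the plugin installed should return the two tokens shown in the response above.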

2) Tokenizing with ik_max_word (finest-grained splits, exhausting every combination the dictionary allows)

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国万岁"
}

The result is:

{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和国",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和国",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "万岁",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "万",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "TYPE_CNUM",
      "position" : 10
    },
    {
      "token" : "岁",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "COUNT",
      "position" : 11
    }
  ]
}
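
The start_offset and end_offset fields show that ik_max_word tokens overlap one another. A small Python sketch, using a handful of triples copied from the response above, checks that each token equals the slice of the input between its offsets and lists which tokens overlap. It assumes the offsets line up with Python string indices, which holds for this sample text.

```python
TEXT = "中华人民共和国万岁"

# (token, start_offset, end_offset) triples copied from the response above.
SAMPLE = [
    ("中华人民共和国", 0, 7),
    ("华人", 1, 3),
    ("人民共和国", 2, 7),
    ("国", 6, 7),
    ("万岁", 7, 9),
]


def offsets_match(text, tokens):
    """True when every token equals text[start_offset:end_offset]."""
    return all(text[start:end] == token for token, start, end in tokens)


def overlapping_pairs(tokens):
    """Return pairs of tokens whose offset ranges share at least one character."""
    pairs = []
    for i, (tok_a, start_a, end_a) in enumerate(tokens):
        for tok_b, start_b, end_b in tokens[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((tok_a, tok_b))
    return pairs


if __name__ == "__main__":
    print(offsets_match(TEXT, SAMPLE))
    for pair in overlapping_pairs(SAMPLE):
        print(pair)
```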

3) Custom words

Create a dictionary file myword.dic in the folder \elasticsearch-7.11.2\plugins\ik\config and add your own words or phrases to it.
Register your extension dictionary in IKAnalyzer.cfg.xml:
```
<entry key="ext_dict">myword.dic</entry>
```
Restart ES and run the IK tests again; both modes will now split the text into the words and phrases defined in myword.dic.
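
If the word list is maintained elsewhere, the dictionary file can be generated rather than edited by hand. This is a minimal Python sketch, assuming the dictionary format is one entry per line in UTF-8; the function name and the sample words are made up for illustration.

```python
from pathlib import Path


def write_dictionary(config_dir, words, filename="myword.dic"):
    """Write one word per line in UTF-8, skipping blanks and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    path = Path(config_dir) / filename
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return path
```

For example, write_dictionary(r"\elasticsearch-7.11.2\plugins\ik\config", ["奥力给", "绝绝子"]) would produce a two-line myword.dic; ES still needs a restart to pick it up.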
