ElasticSearch Learning (11) -- Installing the IK Analysis Plugin

dicklong91

Category: java. Tags: elasticsearch, es

Original link: https://blog.csdn.net/qq_23536449/article/details/91048333

Reposted from: https://blog.csdn.net/chengyuqiang/article/details/78991570 (ES version 6.3.0)
Reposted from: https://blog.csdn.net/qq_23536449/article/details/91048333

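Installing the plugin

Offline installation
Download the release package from https://github.com/medcl/elasticsearch-analysis-ik/releases, choosing the release that matches your ES version (6.3.0 here).
Go to F:\elkStudy\elasticsearch\elasticsearch-6.3.0\plugins and create a directory named ik.
Unzip the downloaded archive into F:\elkStudy\elasticsearch\elasticsearch-6.3.0\plugins\ik, then restart ES.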
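
The offline steps above can also be scripted. The following is only a minimal sketch: the function name `install_ik_offline` is made up for this example, and it assumes the release zip has already been downloaded by hand.

```python
import zipfile
from pathlib import Path


def install_ik_offline(zip_path, es_home):
    """Unpack a downloaded IK release zip into <es_home>/plugins/ik.

    Hypothetical helper for illustration; it only extracts files and
    does not restart Elasticsearch.
    """
    target = Path(es_home) / "plugins" / "ik"
    # Equivalent to creating the ik directory by hand.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

After extraction, check that plugin-descriptor.properties sits directly inside the ik directory. If the archive unpacked into an extra top-level folder, the files need to be moved up one level before restarting ES.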

Online installation
Go to F:\elkStudy\elasticsearch\elasticsearch-6.3.0\bin\.
In a DOS (command prompt) window, run the following as a single line:

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip

Note: when the plugin is installed online, its configuration files are under F:\elkStudy\elasticsearch\elasticsearch-6.3.0\config.
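
Since the plugin version has to match the ES version, the download URL can be derived from the version string. This sketch assumes that other releases follow the same URL pattern as the 6.3.0 link used above, which is not guaranteed; `ik_release_url` and `install_command` are hypothetical helper names.

```python
def ik_release_url(es_version):
    """Build a release URL, assuming the pattern of the 6.3.0 link holds."""
    return (
        "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
        f"v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"
    )


def install_command(es_version):
    """Return the install command as an argument list (not executed here)."""
    return ["elasticsearch-plugin", "install", ik_release_url(es_version)]
```

Calling `" ".join(install_command("6.3.0"))` gives back the one-line command used in this post.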

Testing the IK Chinese analyzer
(1) ik_smart

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text":"安徽省长江流域"
}

{
  "tokens": [
    {
      "token": "安徽省",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "长江流域",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
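
The same request can be sent from a script instead of the Kibana console. This is a rough sketch using only the standard library; it assumes a node listening on http://localhost:9200 (not stated in the original post), and the helper names are made up for illustration.

```python
import json
from urllib import request

ES_URL = "http://localhost:9200"  # assumption: default local node


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a request equivalent to the _analyze call."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return request.Request(
        url=f"{base_url}/_analyze?pretty",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response):
    """Extract only the token text from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the token strings (needs a running node)."""
    with request.urlopen(build_analyze_request(analyzer, text, base_url)) as resp:
        return token_strings(json.load(resp))
```

With ES running and the plugin installed, `analyze("ik_smart", "安徽省长江流域")` should return the two tokens shown in the response above.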

(2) ik_max_word

GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text":"安徽省长江流域"
}

{
  "tokens": [
    {
      "token": "安徽省",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "安徽",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "省长",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "长江流域",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "长江",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "江流",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "流域",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    }
  ]
}
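
Unlike ik_smart, the ik_max_word tokens overlap. A small sketch makes this visible: it checks that each token equals the slice of the input given by its offsets, and lists which tokens cover a given character. The token data is copied from the response above; the helper names are made up, and the check assumes the offsets count characters the same way Python string indices do, which holds for this text.

```python
TEXT = "安徽省长江流域"
# (token, start_offset, end_offset) copied from the ik_max_word response
TOKENS = [
    ("安徽省", 0, 3), ("安徽", 0, 2), ("省长", 2, 4), ("长江流域", 3, 7),
    ("长江", 3, 5), ("江流", 4, 6), ("流域", 5, 7),
]


def mismatched(text, tokens):
    """Return tokens that differ from text[start_offset:end_offset]."""
    return [tok for tok, start, end in tokens if text[start:end] != tok]


def covering(tokens, index):
    """Return tokens whose span includes the character at `index`."""
    return [tok for tok, start, end in tokens if start <= index < end]


print(mismatched(TEXT, TOKENS))  # empty list: every offset pair is consistent
print(covering(TOKENS, 3))       # tokens that contain the character "长"
```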
(3) Segmentation result for a new word

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者荣耀"
}

{
  "tokens": [
    {
      "token": "王者",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "荣耀",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

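Extending the built-in dictionary
step1. Go to F:\elkStudy\elasticsearch\elasticsearch-6.3.0\plugins\ik\config and create a folder named custom.
step2. Go to F:\elkStudy\elasticsearch\elasticsearch-6.3.0\plugins\ik\config\custom, create the file my_word.dic, and add the content below. Note that the file must be encoded as UTF-8 without a BOM; I was stuck on this for quite a while.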

王者荣耀
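
Because the BOM issue is easy to hit with Windows editors, here is a minimal sketch for writing and checking the dictionary file from a script. The helper names are hypothetical; the only fact relied on is that Python's "utf-8" codec writes no BOM (the "utf-8-sig" codec would).

```python
import codecs
from pathlib import Path


def write_ik_dictionary(path, words):
    """Write one word per line as UTF-8 without a BOM."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    return Path(path).read_bytes().startswith(codecs.BOM_UTF8)


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM in place, if there is one."""
    data = Path(path).read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        Path(path).write_bytes(data[len(codecs.BOM_UTF8):])
```

For an existing my_word.dic saved by an editor, `has_utf8_bom` tells whether the file needs fixing, and `strip_utf8_bom` fixes it.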

step3. Edit the file F:\elkStudy\elasticsearch\elasticsearch-6.3.0\plugins\ik\config\IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
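    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict">custom/my_word.dic</entry>
    <!-- Users can configure their own extension stopword dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->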
</properties>
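
To double-check what the edited file actually configures, the entries can be read back with the standard library XML parser. This is a sketch with hypothetical helper names. Splitting ext_dict on ";" reflects my understanding that IK accepts several dictionary paths separated by semicolons; treat that as an assumption.

```python
import xml.etree.ElementTree as ET


def read_ik_config(xml_bytes):
    """Map each <entry key="..."> to its text; commented-out entries are skipped."""
    root = ET.fromstring(xml_bytes)
    return {e.get("key"): (e.text or "").strip() for e in root.iter("entry")}


def ext_dict_paths(config):
    """Split the ext_dict value into individual paths (assumes ';' separators)."""
    return [p.strip() for p in config.get("ext_dict", "").split(";") if p.strip()]
```

For the file above, `read_ik_config` would report ext_dict as custom/my_word.dic and ext_stopwords as empty. The paths are relative to the directory containing IKAnalyzer.cfg.xml, as in step2.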

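step4. Restart ES and Kibana.

[Screenshot not preserved: ES startup log]

If the startup log shows the custom dictionary file being loaded, the custom dictionary has taken effect.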

step5. Test the segmentation

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者荣耀"
}

{
  "tokens": [
    {
      "token": "王者荣耀",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

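About the IK analyzer modes (choose whichever fits your needs):

ik_max_word: splits the text at the finest granularity, exhausting every possible combination. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”.

ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”. Another ik_smart example: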

POST book_v6/_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}

Result:

{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "中国人",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}
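
To get a feel for the difference between the finest and the coarsest granularity, here is a toy illustration. It is not IK's actual algorithm: it uses a tiny hand-picked dictionary, lists every dictionary word found in the text for the fine-grained case, and uses greedy forward longest match for the coarse-grained case.

```python
# Hand-picked toy dictionary; the real IK dictionaries are far larger.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
    "人民", "共和国", "共和", "国歌", "中国人",
}


def all_dictionary_matches(text, dictionary):
    """Fine-grained: every dictionary word found at any position."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def longest_match_first(text, dictionary):
    """Coarse-grained: greedy longest match, single characters as fallback."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens


print(all_dictionary_matches("中华人民共和国国歌", DICTIONARY))
print(longest_match_first("中华人民共和国国歌", DICTIONARY))
print(longest_match_first("我是中国人", DICTIONARY))
```

With this dictionary, the greedy version reproduces the two ik_smart results quoted above. The exhaustive version finds only the multi-character dictionary words, so it lacks the single characters and “国国” that the real ik_max_word output contains.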

Next: ElasticSearch Learning (12) -- Metadata Overview
