ElasticSearch——IK分词器的下载及使用

最新推荐文章于 2024-11-08 12:28:19 发布

万里顾—程

最新推荐文章于 2024-11-08 12:28:19 发布

阅读量2.6w

点赞数 28

CC 4.0 BY-SA版权

分类专栏： ElasticSearch 文章标签： elasticsearch 大数据 Ik

本文链接：https://blog.csdn.net/wpc2018/article/details/121156880

ElasticSearch 专栏收录该内容

11 篇文章

订阅专栏

ElasticSearch——IK分词器的下载及使用

1、什么是IK分词器

ElasticSearch 几种常用分词器如下：

分词器	分词方式
StandardAnalyzer	单字分词
CJKAnalyzer	二分法
IKAnalyzer	词库分词

分词∶即把一段中文或者别的划分成一个个的关键字，我们在搜索时候会把自己的信息进行分词，会把数据库中或者索引库中的数据进行分词，然后进行一个匹配操作，默认的中文分词是将每个字看成一个词，比如“我爱中国"会被分为"我"“爱”“中”"国”，这显然是不符合要求的，所以我们需要安装中文分词器ik来解决这个问题。

IK提供了两个分词算法：ik_smart和ik_max_word，其中ik smart为最少切分，ik_max_word为最细粒度划分!

ik_max_word: 会将文本做最细粒度的拆分，比如会将"中华人民共和国国歌"拆分为"中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌"，会穷尽各种可能的组合；

ik_smart: 会做最粗粒度的拆分，比如会将"中华人民共和国国歌"拆分为"中华人民共和国,国歌"。

2、Ik分词器的下载安装

下载地址： https://github.com/medcl/elasticsearch-analysis-ik

注意：IK分词器插件的版本要和ElasticSearch的版本一致

在这里插入图片描述

下载完后，解压安装包到 ElasticSearch 所在文件夹中的plugins目录中：

在这里插入图片描述

再启动ElasticSearch，查看IK分词器插件是否安装成功：

在这里插入图片描述

安装成功！

3、使用Kibana测试IK

1、启动Kibana

在这里插入图片描述

2、访问请求：http://localhost:5601/

在这里插入图片描述

3、选择开发工具Dev Tools，点击控制台

在这里插入图片描述

4、在控制台编写分词请求，进行测试

IK提供了两个分词算法：ik_smart和ik_max_word，其中ik smart为最少切分，ik_max_word为最细粒度划分!

测试 ik_smart 分词算法，最少切分：

在这里插入图片描述

测试 ik_max_word 分词算法，最细粒度划分：

分词请求：

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱中国共产党"
}

分词结果：

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国共产党",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国共",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "共产党",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共产",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "党",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 7
    }
  ]
}

比较两个分词算法对同一句中文的分词结果，ik_max_word比ik_smart得到的中文词更多（从两者的英文名含义就可看出来），但这样也带来一个问题，使用ik_max_word会占用更多的存储空间。

4、扩展字典

我们用分词器对 “万里顾一程” 进行分词：先使用 ik_smart 分词算法

在这里插入图片描述

在使用 ik_max_word分词算法，进行细粒度的划分：

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "万里顾一程"
}

分词结果：

{
  "tokens" : [
    {
      "token" : "万里",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "万",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "TYPE_CNUM",
      "position" : 1
    },
    {
      "token" : "里",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "COUNT",
      "position" : 2
    },
    {
      "token" : "顾",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "一程",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "一",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "TYPE_CNUM",
      "position" : 5
    },
    {
      "token" : "程",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 6
    }
  ]
}