[ElasticSearch] IK Analyzer

国服冰

Article link: https://blog.csdn.net/qq_43442335/article/details/115608572

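IK Analyzer

What is the IK analyzer?

Tokenization means splitting a piece of Chinese (or other) text into individual keywords. When we search, the query is tokenized, the data in the database or index is tokenized, and the two are then matched. The default analyzer treats every Chinese character as a separate token: for example, "我爱可星" is split into "我", "爱", "可", "星". That is clearly not what we want, so we install the Chinese analyzer IK to solve the problem.

IK provides two segmentation modes: ik_smart and ik_max_word.

ik_smart performs the coarsest segmentation (the fewest splits), while ik_max_word performs the finest-grained segmentation.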

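Download and installation

The IK version must match the ES version. After downloading, create a new ik folder under the ES plugins directory, extract the archive into it, and restart ES.

ik_smart: fewest splits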
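The version requirement can be checked before restarting. The following is a minimal sketch, assuming the extracted plugin folder contains the standard `plugin-descriptor.properties` file with an `elasticsearch.version` key; the helper names and the version numbers in the usage note are illustrative and not taken from the original post.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def plugin_matches_es(descriptor_text, es_version):
    """Return True if the descriptor targets exactly the given ES version."""
    props = parse_properties(descriptor_text)
    return props.get("elasticsearch.version") == es_version
```

For example, `plugin_matches_es(open("plugins/ik/plugin-descriptor.properties").read(), "7.6.1")` would report whether a plugin extracted to `plugins/ik` was built for a hypothetical ES 7.6.1 installation.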

GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["和可星的一个约定"]
}

Result:

{
  "tokens" : [
    {
      "token" : "和",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "可",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "星",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "一个",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "约定",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
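The console request above can also be sent from a script. This is a sketch only, assuming ES is reachable at `http://localhost:9200` without authentication (an assumption, not stated in the post); the function names are hypothetical. It uses POST, which the `_analyze` endpoint accepts in addition to GET.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) an HTTP request for the _analyze endpoint."""
    body = json.dumps(
        {"analyzer": analyzer, "text": [text]}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]
```

Calling `analyze("ik_smart", "和可星的一个约定")` against a running node with IK installed should return the same tokens as the console response.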

ik_max_word: finest-grained segmentation, exhausting every combination the dictionary allows

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["和可星的一个约定"]
}

Result:

{
  "tokens" : [
    {
      "token" : "和",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "可",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "星",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "一个",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "一",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "TYPE_CNUM",
      "position" : 5
    },
    {
      "token" : "个",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "COUNT",
      "position" : 6
    },
    {
      "token" : "约定",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}
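To see exactly what ik_max_word adds over ik_smart, the two token lists can be compared. The sketch below copies the token strings from the two responses shown in this post; `extra_tokens` is a hypothetical helper, not part of IK or Elasticsearch.

```python
from collections import Counter


def extra_tokens(coarse, fine):
    """Return tokens that occur in `fine` more often than in `coarse`, in order."""
    remaining = Counter(coarse)
    extras = []
    for token in fine:
        if remaining[token] > 0:
            remaining[token] -= 1  # already produced by the coarse analyzer
        else:
            extras.append(token)
    return extras


# Token strings taken from the ik_smart and ik_max_word responses above.
smart = ["和", "可", "星", "的", "一个", "约定"]
max_word = ["和", "可", "星", "的", "一个", "一", "个", "约定"]

print(extra_tokens(smart, max_word))  # ['一', '个']
```

The only difference for this sentence is that ik_max_word additionally emits 一 and 个, the sub-words of 一个.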

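Observation: IK split 可星 into separate characters because its built-in dictionary does not contain this word. We therefore need to add our own dictionary so that IK recognizes it as a single word.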

In IK's config directory, create a dictionary file named kexing.dic that contains the custom word 可星.
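An IK dictionary file is plain UTF-8 text with one word per line. A minimal sketch for generating such a file follows; `write_dictionary` is a hypothetical helper, and the target path in the usage note assumes the plugin was extracted to `plugins/ik`.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write words to a UTF-8 file, one per line, dropping blanks and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

For this post, `write_dictionary("plugins/ik/config/kexing.dic", ["可星"])` would produce a one-line dictionary. Saving the file as UTF-8 matters, because a dictionary in another encoding may not match the indexed text.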

Edit IK's configuration file (IKAnalyzer.cfg.xml) to register the new dictionary, then restart ES:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
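	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">kexing.dic</entry>
	<!-- Users can configure their own extension stopword dictionaries here -->
	<entry key="ext_stopwords"></entry>
	<!-- Users can configure remote extension dictionaries here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure remote extension stopword dictionaries here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->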
</properties>
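Before restarting, it can help to confirm that the configuration file really lists the dictionary. The sketch below parses the XML with the Python standard library; `configured_dictionaries` is a hypothetical helper, and splitting on `;` assumes that multiple dictionaries are separated by semicolons, as in IK's own examples.

```python
import xml.etree.ElementTree as ET


def configured_dictionaries(cfg_xml, key="ext_dict"):
    """Return the dictionary paths listed under the given <entry key="...">."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(cfg_xml.encode("utf-8"))
    for entry in root.findall("entry"):
        if entry.get("key") == key:
            parts = (entry.text or "").split(";")
            return [p.strip() for p in parts if p.strip()]
    return []
```

Applied to the file above, `configured_dictionaries(text)` should return `["kexing.dic"]`, while the commented-out remote entries are ignored because XML comments are not parsed as elements.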

Test

GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["和可星的一个约定"]
}

{
  "tokens" : [
    {
      "token" : "和",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "可星",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "一个",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "约定",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

可星 is now recognized as a single word!
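The same check can be automated against a saved `_analyze` response. This is a minimal sketch; the helper names are hypothetical, and the response is assumed to have the `tokens` structure shown in this post.

```python
import json


def tokens_of(response_json):
    """Extract the token strings from an _analyze response body."""
    return [t["token"] for t in json.loads(response_json)["tokens"]]


def is_single_token(response_json, word):
    """Return True if `word` appears as one whole token in the response."""
    return word in tokens_of(response_json)
```

With the response from before the dictionary was added, `is_single_token(body, "可星")` is false because only 可 and 星 appear; with the response above it is true.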
