4. Distributed Search Engine ElasticSearch: The IK Analyzer

若黑不择明

Category: ElasticSearch

Article link: https://blog.csdn.net/LQM1528490339/article/details/90598098

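Distributed Search Engine ElasticSearch

What is the IK analyzer

By default, elasticsearch treats every Chinese character as a separate word, which is clearly not what we want, so we need to install a Chinese analyzer to solve this problem.
IK is a relatively simple Chinese analyzer developed in China. Although its original developer stopped maintaining it after 2012, it remains one of the more popular choices in engineering practice. This article shows how to use the IK Chinese analyzer.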

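Installing the IK analyzer

Download address: https://github.com/medcl/elasticsearch-analysis-ik/releases
Download version 5.6.8. The plugin version must be the same as your elasticsearch version.

First unzip the archive and rename the extracted elasticsearch folder to ik.
Copy the ik folder into the elasticsearch/plugins directory.
Restart elasticsearch to load the IK analyzer.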
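
The unzip, rename, and copy steps can also be scripted. The following is only a minimal sketch: the function name install_ik and both path arguments are made up for illustration, and it assumes the release zip unpacks to a top-level elasticsearch folder as described above.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Unpack the IK release zip and place it at <es_home>/plugins/ik.

    install_ik is a hypothetical helper, not part of elasticsearch or IK.
    """
    target = Path(es_home) / "plugins" / "ik"
    if target.exists():
        raise FileExistsError(f"{target} already exists; remove it first")

    with tempfile.TemporaryDirectory() as tmp:
        # Step 1: unzip the release archive into a scratch directory.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        extracted = Path(tmp) / "elasticsearch"
        if not extracted.is_dir():
            raise FileNotFoundError("archive has no top-level 'elasticsearch' folder")
        # Steps 2 and 3: rename the folder to 'ik' while moving it into plugins/.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(extracted), str(target))
    return target
```

After running something like install_ik("elasticsearch-analysis-ik-5.6.8.zip", "/path/to/elasticsearch") (both values are placeholders), elasticsearch still has to be restarted by hand.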

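Testing the IK analyzer

IK provides two segmentation modes, ik_smart and ik_max_word.
ik_smart produces the coarsest segmentation (the fewest tokens), while ik_max_word produces the finest-grained segmentation.
Let's try each in turn.
(1) Coarsest segmentation: enter the following address in the browser's address bar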

http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=IK分词器测试

The output is:

{
 "tokens": [
     {
         "token": "ik",
         "start_offset": 0,
         "end_offset": 2,
         "type": "ENGLISH",
         "position": 0
     },
     {
         "token": "分词器",
         "start_offset": 2,
         "end_offset": 5,
         "type": "CN_WORD",
         "position": 1
     },
     {
         "token": "测试",
         "start_offset": 5,
         "end_offset": 7,
         "type": "CN_WORD",
         "position": 2
     }
 ]
}
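
When the address is typed into a browser, the browser takes care of percent-encoding the Chinese text. Outside a browser you have to do that yourself. Below is a small Python sketch of the same request. The helper names build_analyze_url and analyze are my own, and it assumes the query-string form of _analyze used in this article (elasticsearch 5.6.8); as far as I know, later versions expect the analyzer and text in a JSON request body instead.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_analyze_url(text, analyzer, base_url="http://127.0.0.1:9200"):
    """Build the _analyze URL with the text percent-encoded as UTF-8."""
    query = urlencode({"analyzer": analyzer, "pretty": "true", "text": text})
    return f"{base_url}/_analyze?{query}"


def analyze(text, analyzer, base_url="http://127.0.0.1:9200"):
    """Call _analyze and return the parsed JSON (needs a running node)."""
    with urlopen(build_analyze_url(text, analyzer, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Only builds the URL, so this line runs without elasticsearch.
print(build_analyze_url("IK分词器测试", "ik_smart"))
```

Calling analyze("IK分词器测试", "ik_smart") against a node with IK installed should return the same structure as the JSON above.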

(2) Finest-grained segmentation: enter the following address in the browser's address bar

http://127.0.0.1:9200/_analyze?analyzer=ik_max_word&pretty=true&text=IK分词器测试

{
 "tokens": [
     {
         "token": "ik",
         "start_offset": 0,
         "end_offset": 2,
         "type": "ENGLISH",
         "position": 0
     },
     {
         "token": "分词器",
         "start_offset": 2,
         "end_offset": 5,
         "type": "CN_WORD",
         "position": 1
     },
     {
         "token": "分词",
         "start_offset": 2,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 2
     },
     {
         "token": "器",
         "start_offset": 4,
         "end_offset": 5,
         "type": "CN_CHAR",
         "position": 3
     },
     {
         "token": "测试",
         "start_offset": 5,
         "end_offset": 7,
         "type": "CN_WORD",
         "position": 4
     }
 ]
}
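
To make the difference between the two modes easier to see, the two responses above can be compared in a few lines of Python. This is just an illustrative sketch: the dictionaries are trimmed copies of the JSON shown above, and the helper names token_texts and offsets_match are my own.

```python
def token_texts(response):
    """Return the token strings from an _analyze response, in order."""
    return [entry["token"] for entry in response["tokens"]]


def offsets_match(text, response):
    """Check that every token equals the slice of text given by its offsets.

    The responses above show "IK" lowercased to "ik", so compare in lowercase.
    """
    return all(
        text[e["start_offset"]:e["end_offset"]].lower() == e["token"]
        for e in response["tokens"]
    )


# Trimmed copies of the two responses above (only the fields used here).
smart_response = {"tokens": [
    {"token": "ik", "start_offset": 0, "end_offset": 2},
    {"token": "分词器", "start_offset": 2, "end_offset": 5},
    {"token": "测试", "start_offset": 5, "end_offset": 7},
]}
max_word_response = {"tokens": [
    {"token": "ik", "start_offset": 0, "end_offset": 2},
    {"token": "分词器", "start_offset": 2, "end_offset": 5},
    {"token": "分词", "start_offset": 2, "end_offset": 4},
    {"token": "器", "start_offset": 4, "end_offset": 5},
    {"token": "测试", "start_offset": 5, "end_offset": 7},
]}

smart = token_texts(smart_response)
max_word = token_texts(max_word_response)
# Tokens that only the finest-grained mode produces.
extra = [token for token in max_word if token not in smart]
print(smart)
print(max_word)
print(extra)
```

For this input, ik_max_word returns everything ik_smart returns plus the shorter pieces of 分词器, which is why it is usually described as the finest-grained mode.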

Custom dictionary

Now let's test "廖权名博客". The result in the browser is as follows:
http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=廖权名博客

{
 "tokens": [
     {
         "token": "廖",
         "start_offset": 0,
         "end_offset": 1,
         "type": "CN_CHAR",
         "position": 0
     },
     {
         "token": "权",
         "start_offset": 1,
         "end_offset": 2,
         "type": "CN_CHAR",
         "position": 1
     },
     {
         "token": "名",
         "start_offset": 2,
         "end_offset": 3,
         "type": "CN_CHAR",
         "position": 2
     },
     {
         "token": "博客",
         "start_offset": 3,
         "end_offset": 5,
         "type": "CN_WORD",
         "position": 3
     }
 ]
}

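The default segmentation does not recognize "廖权名" as a word. To make the system treat "廖权名" as a single word, we need to edit a custom dictionary.
Steps:
(1) Go to the elasticsearch/plugins/ik/config directory
(2) Create a new file my.dic (saved as UTF-8) with the following content: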

廖权名

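(3) Edit IKAnalyzer.cfg.xml (in the ik/config directory):

<properties>
 <comment>IK Analyzer extension configuration</comment>
 <!-- users can configure their own extension dictionary here -->
 <entry key="ext_dict">my.dic</entry>
 <!-- users can configure their own extension stopword dictionary here -->
 <entry key="ext_stopwords"></entry>
</properties>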
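
An easy mistake with a custom dictionary is saving the file in the wrong encoding or pointing ext_dict at the wrong file name. Here is a small sketch that writes the dictionary as UTF-8 and reads the ext_dict entry back from the config. The helper names are hypothetical, and splitting the entry on ";" is my assumption about how several dictionary files are listed.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_dictionary(config_dir, words, filename="my.dic"):
    """Write one word per line to <config_dir>/<filename>, encoded as UTF-8."""
    path = Path(config_dir) / filename
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")
    return path


def configured_dictionaries(config_dir):
    """Return the file names listed under ext_dict in IKAnalyzer.cfg.xml.

    Assumption: multiple dictionary files are separated by ';'.
    """
    root = ET.parse(str(Path(config_dir) / "IKAnalyzer.cfg.xml")).getroot()
    for entry in root.iter("entry"):
        if entry.get("key") == "ext_dict":
            names = (entry.text or "").split(";")
            return [name.strip() for name in names if name.strip()]
    return []
```

For example, write_dictionary("elasticsearch/plugins/ik/config", ["廖权名"]) creates my.dic, and configured_dictionaries on the same directory should then include "my.dic" if the config was edited as shown above.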

Restart elasticsearch and test the segmentation again in the browser. The token for "廖权名" now looks like this:

{
  "tokens" : [
    {
      "token" : "廖权名",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}
