Installing the IK Chinese Analyzer on Elasticsearch 7

冰上浮云

Article link: https://blog.csdn.net/clj198606061111/article/details/112156902

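So far we have created indices and queried data with the default analyzer, which handles Chinese poorly: it splits a text field into individual Chinese characters, and the search query is tokenized the same way. A smarter analyzer is needed here, namely the IK analyzer.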

Test environment

Operating system: CentOS 7
ES version: 7.10.0
IK: elasticsearch-analysis-ik-7.10.0.zip

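Downloading, installing, and testing the IK analyzer

Download

Download address: https://github.com/medcl/elasticsearch-analysis-ik/releases
Pick the IK release that matches your ES version. This article uses ES 7.10.0, so the file to download is elasticsearch-analysis-ik-7.10.0.zip.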

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.10.0
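
Because the IK version has to match the ES version, it can help to derive the download URL from a single version string. The following is a minimal sketch; the `ik_release_url` helper is hypothetical, and the `releases/download/v<version>/...` layout is an assumption based on the usual GitHub release-asset pattern, so check it against the release page before relying on it.

```shell
#!/bin/sh
# ik_release_url VERSION: print the assumed release-asset URL for that version.
# The URL layout is an assumption, not something stated on the release page.
ik_release_url() {
  version="$1"
  printf 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v%s/elasticsearch-analysis-ik-%s.zip\n' \
    "$version" "$version"
}

# Print the URL for ES 7.10.0; it can then be passed to a downloader,
# for example: curl -L -O "$(ik_release_url 7.10.0)"
ik_release_url 7.10.0
```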

Extract

Extract the archive and copy its contents into the plugins/ik directory under the ES installation directory.

[es@localhost ik]$ pwd
/home/es/elasticsearch-7.10.0/plugins/ik
[es@localhost ik]$ ll
total 1432
-rw-r--r--. 1 es es 263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r--. 1 es es  61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x. 2 es es   4096 Dec 25  2019 config
-rw-r--r--. 1 es es  54625 Nov 12 10:01 elasticsearch-analysis-ik-7.10.0.jar
-rw-r--r--. 1 es es 736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r--. 1 es es 326724 May  6  2018 httpcore-4.4.4.jar
-rw-r--r--. 1 es es   1807 Nov 12 10:01 plugin-descriptor.properties
-rw-r--r--. 1 es es    125 Nov 12 10:01 plugin-security.policy
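
The extract-and-copy step can be sketched as a small shell function. This is an illustration only: `install_ik` is a hypothetical helper, it assumes the `unzip` command is available, and the paths in the usage comment are the ones from the listing above.

```shell
#!/bin/sh
# install_ik ES_HOME IK_ZIP: unpack the IK archive into ES_HOME/plugins/ik.
install_ik() {
  es_home="$1"
  ik_zip="$2"
  plugin_dir="$es_home/plugins/ik"

  # Refuse to continue if ES_HOME does not look like an ES installation.
  if [ ! -d "$es_home/plugins" ]; then
    echo "no plugins directory under $es_home" >&2
    return 1
  fi

  # Create plugins/ik and extract the archive contents directly into it.
  mkdir -p "$plugin_dir" && unzip -q "$ik_zip" -d "$plugin_dir"
}

# Example (run as the user that owns the ES installation):
# install_ik /home/es/elasticsearch-7.10.0 elasticsearch-analysis-ik-7.10.0.zip
```

Elasticsearch also ships a `bin/elasticsearch-plugin install` command that accepts a plugin URL; manual extraction as shown here is simply the approach this article follows.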

That completes the installation. No changes to Elasticsearch's elasticsearch.yml file are needed.

Restart

Restart Elasticsearch.
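
After the restart, one way to confirm that the plugin was picked up is to look for it in the output of the `_cat/plugins` API. The sketch below assumes the plugin registers under the name `analysis-ik` and that the node listens on `localhost:9200`; `has_ik_plugin` is a hypothetical helper that only filters text on stdin.

```shell
#!/bin/sh
# has_ik_plugin: read a plugin listing on stdin and succeed if
# a line mentions analysis-ik (assumed plugin name).
has_ik_plugin() {
  grep -q 'analysis-ik'
}

# Example against a running node (host and port are assumptions):
# curl -s 'http://localhost:9200/_cat/plugins?v' | has_ik_plugin && echo "IK loaded"
```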

Test

Tokenization result without the IK analyzer

POST book/_analyze
{
  "text": "我是中国人"
}

The result is

{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "中",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "国",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "人",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}

With the IK analyzer, the result is as follows

POST book/_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}

Result:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    }
  ]
}
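
The same kind of request can be sent from the command line instead of the Kibana console. The sketch below builds the request body with a hypothetical `analyze_body` helper and uses `ik_smart`, the coarser-grained analyzer that IK provides alongside `ik_max_word`, so the two can be compared. The host and port in the commented `curl` call are assumptions, and the helper does no JSON escaping, so it only suits simple text.

```shell
#!/bin/sh
# analyze_body ANALYZER TEXT: print the JSON body for an _analyze request.
# Plain string interpolation: TEXT must not contain quotes or backslashes.
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}\n' "$1" "$2"
}

# Show the body that would be sent for ik_smart.
analyze_body ik_smart "中华人民共和国"

# Send it to the "book" index used above (host and port are assumptions):
# curl -s -X POST 'http://localhost:9200/book/_analyze' \
#   -H 'Content-Type: application/json' \
#   -d "$(analyze_body ik_smart '中华人民共和国')"
```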

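Explanation of the two results above:

If the IK analyzer is not installed and you specify "analyzer": "ik_max_word", the request fails with an error, because that analyzer does not exist.
If the IK analyzer is installed but you do not specify it, that is, you omit "analyzer": "ik_max_word", the index's default analyzer is still used, so the result is the same as without IK: the text is split into individual Chinese characters.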
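
Since IK is only used when it is named explicitly, a text field has to declare it in the index mapping for it to apply at index and search time. The following is a minimal sketch of such a mapping: `ik_mapping` is a hypothetical helper, the field name `content` and index name `book_ik` are made up for illustration, and pairing `ik_max_word` for indexing with `ik_smart` for searching is one common choice rather than a requirement.

```shell
#!/bin/sh
# ik_mapping FIELD: print an index-creation body whose text field FIELD
# is analyzed with ik_max_word at index time and ik_smart at search time.
ik_mapping() {
  cat <<EOF
{
  "mappings": {
    "properties": {
      "$1": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
EOF
}

# Show the body for a field named "content" (hypothetical field name).
ik_mapping content

# Create a new index with it (index name, host, and port are hypothetical):
# curl -s -X PUT 'http://localhost:9200/book_ik' \
#   -H 'Content-Type: application/json' -d "$(ik_mapping content)"
```

The analyzer of an existing field cannot simply be swapped in place, which is why the example creates a new index.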