Installing the IK Analyzer in Elasticsearch and Adding a Custom Dictionary

冒险的梦想家

Category: ELK. Tags: elasticsearch, ik

Article link: https://blog.csdn.net/weixin_43831049/article/details/119361530

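Analyzer configuration

Download the matching version

Download the release you need (the plugin version should match your Elasticsearch version):
https://github.com/medcl/elasticsearch-analysis-ik/releases

Configure the IK analyzer and restart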

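Upload the downloaded analyzer .zip file to the mounted plugins directory,
unzip it there,
then create an ik directory and move the extracted files into it.

# Create the ik directory
mkdir ik

# Move everything else in plugins into ik (mv reports that it cannot move ik into itself; that message is expected)
mv * ik

Restart the Elasticsearch service.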
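
As an alternative to unzipping and then moving the files, the archive can be extracted straight into the ik directory. The following is only a sketch under stated assumptions: the function name install_ik and both paths in the example call are hypothetical, and the restart command depends on how your Elasticsearch instance is run.

```shell
# Extract an IK release archive directly into <plugins_dir>/ik.
# install_ik is a hypothetical helper name, not part of Elasticsearch or IK.
install_ik() {
  zip_file="$1"      # path to the downloaded IK .zip
  plugins_dir="$2"   # the (mounted) Elasticsearch plugins directory
  mkdir -p "$plugins_dir/ik"
  # -o overwrites existing files without prompting, -q keeps output quiet,
  # -d selects the directory to extract into
  unzip -o -q "$zip_file" -d "$plugins_dir/ik"
}

# Example call (both paths are placeholders, adjust them to your setup):
# install_ik ./elasticsearch-analysis-ik.zip /path/to/plugins
# Afterwards restart Elasticsearch, for example with
# `docker restart <container>` if it runs in Docker.
```

Extracting with -d avoids the mv warning and leaves the .zip file outside the plugins directory.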

Test the IK analyzer

The IK analyzer provides two segmentation modes: ik_max_word and ik_smart.

Default analyzer

GET _analyze
{
   "text":"我是中国人"
}

// Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

IK analyzer

ik_smart

Performs the coarsest-grained segmentation. For example, it splits “中华人民共和国人民大会堂” into 中华人民共和国 and 人民大会堂.

GET _analyze
{
   "analyzer": "ik_smart", 
   "text":"我是中国人"
}

// Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

ik_max_word

Performs the finest-grained segmentation. For example, it splits “中华人民共和国人民大会堂” into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 大会堂, 大会, 会堂, and other words.

GET _analyze
{
   "analyzer": "ik_max_word", 
   "text":"我是中国人"
}

// Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
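
The requests above appear to be written in Kibana Dev Tools console syntax. To run the same comparison from a terminal, something like the following curl wrapper could be used. This is a minimal sketch: the helper names analyze_body and analyze are hypothetical, and the default URL http://localhost:9200 is an assumption about where the cluster listens.

```shell
# Base URL of the cluster; http://localhost:9200 is an assumed default.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the JSON body for an _analyze request.
# $1 = text to analyze, $2 = optional analyzer name.
# Note: no JSON escaping is done, so the text must not contain
# double quotes or backslashes.
analyze_body() {
  if [ -n "${2:-}" ]; then
    printf '{"analyzer":"%s","text":"%s"}' "$2" "$1"
  else
    printf '{"text":"%s"}' "$1"
  fi
}

# Send the request to the _analyze endpoint and print the raw JSON response.
analyze() {
  curl -s -X POST "$ES_URL/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$@")"
}

# Usage (requires a running cluster with the IK plugin installed):
# analyze '我是中国人' ik_smart
# analyze '我是中国人' ik_max_word
```

Calling analyze with only the text omits the analyzer field, which corresponds to the default-analyzer request shown earlier.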

Custom dictionary

Create the dictionary file

Go to the plugins → ik → config directory, create a file named my.dic, and write the following content into it:

新增测试分词
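
The line above is the entire content of my.dic in this example. IK dictionary files hold one word per line and should be saved as UTF-8. A minimal sketch for creating the file from the shell follows; the variable IK_CONFIG_DIR and its default path are assumptions, so point it at your real plugins/ik/config directory.

```shell
# Assumed location of the IK config directory; override IK_CONFIG_DIR as needed.
IK_CONFIG_DIR="${IK_CONFIG_DIR:-./plugins/ik/config}"

# In a real installation this directory already exists; mkdir -p only
# makes the sketch runnable on its own.
mkdir -p "$IK_CONFIG_DIR"

# Write one word per line. printf emits the bytes unchanged, so the file is
# UTF-8 as long as the shell session itself uses UTF-8.
printf '%s\n' '新增测试分词' > "$IK_CONFIG_DIR/my.dic"

# Show the file for a quick visual check.
cat "$IK_CONFIG_DIR/my.dic"
```

Further words can be appended with additional printf arguments or with `>>`.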

Register the custom dictionary

Go to the plugins → ik → config directory and edit IKAnalyzer.cfg.xml.
Set the extension dictionary path to
my.dic

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
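        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionaries here -->
        <entry key="ext_dict">my.dic</entry>
        <!-- Users can configure their own extension stopword dictionaries here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!--<entry key="remote_ext_dict">http://114.115.218.80/es/fenci.txt</entry> -->
        <!-- Users can configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->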
</properties>
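
Before restarting, it can help to confirm that every dictionary named in ext_dict actually exists. The sketch below is illustrative only: check_ext_dict is a hypothetical helper, it assumes the entry sits on a single line as in the file above, that multiple dictionaries are separated by semicolons, and that paths are relative to the config directory.

```shell
# Report whether each file listed in the ext_dict entry exists.
# $1 = the IK config directory containing IKAnalyzer.cfg.xml.
# Returns 0 if all listed dictionaries are present, 1 otherwise.
check_ext_dict() {
  config_dir="$1"
  cfg="$config_dir/IKAnalyzer.cfg.xml"
  missing=0
  # Extract the text between the ext_dict entry tags. The pattern requires
  # key="ext_dict" exactly, so the remote_ext_dict line is not matched.
  value="$(sed -n 's:.*<entry key="ext_dict">\(.*\)</entry>.*:\1:p' "$cfg")"
  old_ifs="$IFS"
  IFS=';'
  for dict in $value; do
    if [ -f "$config_dir/$dict" ]; then
      echo "found: $dict"
    else
      echo "missing: $dict"
      missing=1
    fi
  done
  IFS="$old_ifs"
  return "$missing"
}

# Example call (the path is a placeholder):
# check_ext_dict /path/to/plugins/ik/config
```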

Restart Elasticsearch and test the result

GET _analyze
{
   "analyzer": "ik_max_word", 
   "text":"新增测试分词"
}


// Result
{
  "tokens" : [
    {
      "token" : "新增测试分词",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "新增",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "测试",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "分词",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
