Elasticsearch in Practice (Part 4: Tokenization and the IK Analyzer)

Latest recommended article published on 2024-05-07 11:00:51

绿水本无忧d


Category: Development Tools. Tags: elasticsearch

Article link: https://blog.csdn.net/Freedomer3/article/details/119251035


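This article shows how to tokenize Chinese and English text in Elasticsearch with the built-in analyzers and the IK analyzer. The built-in standard analyzer handles English well but offers little support for Chinese. To address this, the article walks through installing the IK analyzer, shows its results on Chinese text, and explains how a custom dictionary enables more precise segmentation.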


Contents

- Using the built-in analyzers
- Using the IK analyzer

Using the built-in analyzers

An analyzer is invoked through the _analyze API in the following form:

POST _analyze
{
  "analyzer": "standard",
  "text": "i am a good-boy"
}

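The analyzer parameter specifies which analyzer to use, and text holds the sentence to be analyzed.

The analyzer section of the official ES documentation lists the available analyzers; this example uses the standard analyzer.

The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation.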

Result:

{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "boy",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

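The result is reasonable: the sentence is split into words, and the hyphen in good-boy is treated as a boundary and dropped. Each token in the response carries several fields:

token: the word or character that was produced
start_offset: the offset in the original text at which the token starts
end_offset: the offset in the original text at which the token ends (exclusive, so "i" spans 0 to 1)
type: the type of the token; <ALPHANUM> here means alphanumeric
position: the index of the token within the sequence of tokens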
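The request format and the offset fields above can also be exercised from a script. What follows is a minimal sketch rather than part of the original walkthrough: it assumes an ES node reachable at http://localhost:9200 without authentication, and the helper names build_analyze_request, analyze, and token_texts are made up for illustration. The sample offsets are copied from the response above, so the slicing part runs without a live cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local ES node, no authentication


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build the POST _analyze request shown above without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the parsed JSON response (needs a running ES)."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def token_texts(text, response):
    """Recover each token's source text by slicing with start_offset/end_offset."""
    return [text[t["start_offset"]:t["end_offset"]] for t in response["tokens"]]


# Offsets copied from the response above, so this part runs without ES.
SAMPLE_TEXT = "i am a good-boy"
SAMPLE_RESPONSE = {"tokens": [
    {"token": "i", "start_offset": 0, "end_offset": 1},
    {"token": "am", "start_offset": 2, "end_offset": 4},
    {"token": "a", "start_offset": 5, "end_offset": 6},
    {"token": "good", "start_offset": 7, "end_offset": 11},
    {"token": "boy", "start_offset": 12, "end_offset": 15},
]}

slices = token_texts(SAMPLE_TEXT, SAMPLE_RESPONSE)
print(slices)
```

The slices and the tokens coincide here because the input is already lowercase. The standard analyzer lowercases its tokens, so for mixed-case input the slice and the token can differ in case.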

Now try the standard analyzer on Chinese text:

POST _analyze
{
  "analyzer": "standard",
  "text": "青山原不老为雪白头"
}

Result:

{
  "tokens" : [
    {
      "token" : "青",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "山",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "原",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "不",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "老",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "为",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "雪",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "白",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "头",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    }
  ]
}

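This is clearly unsuitable: every character becomes its own token, so no words are recognized.

To fix this, ES needs a plugin that supports Chinese segmentation: the IK analyzer.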

Using the IK analyzer

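The IK release you download must match your ES version; this walkthrough uses 7.4.2.

Go to the plugins folder under the ES root directory, create an ik folder there, and download the plugin into it: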

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

Unzip it:

unzip elasticsearch-analysis-ik-7.4.2.zip

Then restart ES. Running elasticsearch-plugin list from the bin directory under the ES root shows the installed plugins:

[root@0a45b28f87db bin]# elasticsearch-plugin list
ik
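The rule that the plugin version follows the ES version can be written down as a small helper. This is only a sketch: the URL template is taken from the wget command above, and the assumption that other releases follow the same naming pattern is mine, not something the download page guarantees. The name ik_download_url is hypothetical.

```python
# URL pattern copied from the wget command above.
# Assumption: other releases are published under the same naming scheme.
IK_RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/"
    "download/v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_download_url(es_version):
    """Return the IK download URL whose version matches the given ES version."""
    parts = es_version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like 7.4.2, got {es_version!r}")
    return IK_RELEASE_URL.format(version=es_version)


url = ik_download_url("7.4.2")
print(url)
```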

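IK provides two analyzers, ik_smart and ik_max_word. ik_smart produces the coarsest split (the fewest tokens), while ik_max_word produces the finest-grained split.

Analyze the same sentence with IK: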

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "青山原不老为雪白头"
}

Result:

{
  "tokens" : [
    {
      "token" : "青山",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "原",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "不老",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "为",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "雪白",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "白头",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

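The result is much better, and the token type is now CN_WORD (a word) or CN_CHAR (a single character). Note that 雪白 (offsets 6 to 8) and 白头 (offsets 7 to 9) overlap: ik_max_word emits both.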
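To make the difference between a fine-grained split with overlaps and a coarse split without them more concrete, here is a toy dictionary-matching sketch. It is not IK's actual algorithm: the four-word dictionary is invented for this sentence, max_word_tokens simply emits every dictionary word it finds plus any uncovered character, and greedy_tokens uses plain forward maximum matching as a stand-in for a coarse split. Neither function claims to reproduce what ik_max_word or ik_smart return in general.

```python
# Toy dictionary, invented for this one sentence (IK ships far larger ones).
TOY_DICT = {"青山", "不老", "雪白", "白头"}
TEXT = "青山原不老为雪白头"


def max_word_tokens(text, dictionary):
    """Emit every dictionary word found in text, plus uncovered single characters."""
    spans = []
    covered = set()
    for start in range(len(text)):
        for end in range(start + 2, len(text) + 1):
            if text[start:end] in dictionary:
                spans.append((start, end))
                covered.update(range(start, end))
    # Characters that no dictionary word covers are emitted on their own.
    spans += [(i, i + 1) for i in range(len(text)) if i not in covered]
    # Order by start offset; at the same start, longer words come first.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    return [text[s:e] for s, e in spans]


def greedy_tokens(text, dictionary):
    """Forward maximum matching: longest word at each position, no overlaps."""
    tokens, i = [], 0
    while i < len(text):
        end = i + 1  # fall back to a single character
        for candidate in range(len(text), i + 1, -1):
            if text[i:candidate] in dictionary:
                end = candidate
                break
        tokens.append(text[i:end])
        i = end
    return tokens


fine = max_word_tokens(TEXT, TOY_DICT)
coarse = greedy_tokens(TEXT, TOY_DICT)
print(fine)
print(coarse)
```

With this toy dictionary the fine-grained list keeps both 雪白 and 白头, while the greedy split has to choose one and leaves 头 as a single character.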

But this is a line of verse. Can each half, "青山原不老" and "为雪白头", be kept as a single token? For that we need a custom dictionary.

The ik folder created earlier contains a config directory. It holds the dictionary files used for segmentation and a configuration file named IKAnalyzer.cfg.xml. Open that file with vi:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
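        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extension stopword dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">remote URL location</entry> -->
        <!-- Users can configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->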
</properties>

Here we configure a remote extension dictionary.

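I set up nginx as a web server and created a file named fenci.txt on it to hold the custom words, one per line. Add these two entries to the file:

青山原不老
为雪白头

Then put the URL of that txt file in the remote_ext_dict entry shown above, uncommenting the entry.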
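The dictionary file itself is plain text with one term per line. As a hedged sketch of how such a file could be generated rather than typed by hand, the helper below writes the terms as UTF-8 and skips blanks and duplicates. The function name write_ik_dictionary is hypothetical, UTF-8 is an assumption about what the plugin expects, and the temporary directory stands in for the nginx document root.

```python
import tempfile
from pathlib import Path


def write_ik_dictionary(path, terms):
    """Write terms one per line as UTF-8, skipping blanks and duplicates."""
    # dict.fromkeys removes duplicates while keeping the original order.
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Assumption: in a real setup this path would sit under the nginx web root.
dict_path = Path(tempfile.gettempdir()) / "fenci.txt"
written = write_ik_dictionary(dict_path, ["青山原不老", "为雪白头", "青山原不老", " "])
print(written)
```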

Restart ES and run the ik_max_word request again. The result:

{
  "tokens" : [
    {
      "token" : "青山原不老",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "青山",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "原",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "不老",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "为雪白头",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "雪白",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "白头",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
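A quick way to confirm that the custom dictionary took effect is to check that every custom term shows up as a token in the response. The sketch below does this against the token list copied from the result above; the helper name missing_terms is made up for illustration, and only the token field is kept for brevity.

```python
def missing_terms(response, expected_terms):
    """Return the expected terms that do not appear as tokens in the response."""
    produced = {t["token"] for t in response["tokens"]}
    return [term for term in expected_terms if term not in produced]


# Tokens copied from the ik_max_word result above (other fields omitted).
RESPONSE = {"tokens": [
    {"token": "青山原不老"},
    {"token": "青山"},
    {"token": "原"},
    {"token": "不老"},
    {"token": "为雪白头"},
    {"token": "雪白"},
    {"token": "白头"},
]}

missing = missing_terms(RESPONSE, ["青山原不老", "为雪白头"])
print(missing)
```

An empty list means both halves of the verse are now emitted as single tokens. The shorter words 青山, 不老, 雪白, and 白头 are still present because ik_max_word keeps the finer-grained matches alongside the new entries.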