ElasticSearch - Tokenization with Built-in and Custom Analyzers

Latest recommended article published 2024-05-15 05:59:27

sunywz

Category: Elasticsearch

Article link: https://blog.csdn.net/qq_31776219/article/details/114588739

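What is tokenization?
Tokenization breaks text into individual terms; in Elasticsearch this process is called analysis. By default, ES only segments English-style text into words. Chinese is not supported at the word level: every Chinese character is split off as a separate token.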

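Built-in analyzers in ES
standard: the default analyzer. Splits text into words and converts uppercase to lowercase.

simple: splits on any non-letter character and converts uppercase to lowercase.

whitespace: splits on whitespace only. It does not lowercase, so the original case is kept.

stop: removes low-value stop words such as the/a/an/is…

keyword: no tokenization. The whole text is kept as a single term.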
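
To make the differences between these analyzers concrete, here is a rough pure-Python approximation of four of them. It is only a sketch for ASCII-style text, not how Elasticsearch implements them; the function names and the tiny `STOP_WORDS` set are made up for this illustration.

```python
import re

# A tiny stop-word list for illustration only; it is not the list Elasticsearch ships.
STOP_WORDS = {"the", "a", "an", "is"}


def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyze(text):
    """Split on anything that is not a letter, then lowercase each piece."""
    return [piece.lower() for piece in re.findall(r"[^\W\d_]+", text)]


def stop_analyze(text):
    """Same as simple_analyze, but also drops the stop words."""
    return [term for term in simple_analyze(text) if term not in STOP_WORDS]


def keyword_analyze(text):
    """Return the whole input as one single term."""
    return [text]


sample = "The QUICK brown-fox is 2 fast"
for analyze in (whitespace_analyze, simple_analyze, stop_analyze, keyword_analyze):
    print(analyze.__name__, analyze(sample))
```

Running the same sentence through each function shows where they differ: only `whitespace_analyze` and `keyword_analyze` keep the capital letters, and only `stop_analyze` drops "the" and "is".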

Send a piece of text to ES and view the tokens the standard analyzer produces

POST     http://10.0.0.220:9200/_analyze

{
    "analyzer": "standard",
    "text": "text文本"
}
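
The same request can be issued from a script. The sketch below uses only the Python standard library and assumes the node address used in this article; `build_analyze_request` and `analyze` are hypothetical helper names, and no authentication is handled.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer="standard", index=None):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, text, analyzer="standard", index=None, timeout=5):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(base_url, text, analyzer, index)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Example call (needs a reachable node, so it is left commented out):
# print(analyze("http://10.0.0.220:9200", "text文本"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without a running cluster.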

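Send text against a specific field of an index to see the tokens that field would produce for that text. Leave "analyzer" out of this request: when both are given, an explicit "analyzer" overrides the analyzer derived from "field".

POST     http://10.0.0.220:9200/index003/_analyze

{
    "field":"name",
    "text": "text 文本"
}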

More on ES analysis: https://www.cnblogs.com/cjsblog/p/10171695.html

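Setting up the Chinese IK analyzer
Download the IK analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases (note: download the release that matches your ES version)
Upload the IK archive to the Linux server
Unzip it into the plugins directory under the elasticsearch directory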

[root@elasticsearch tools]# unzip elasticsearch-analysis-ik-7.6.1.zip -d /application/elasticsearch-7.6.1/plugins/ik

Restart elasticsearch and test the Chinese analyzer.

POST    http://10.0.0.220:9200/_analyze

{
	"analyzer":"ik_max_word",
	"text":"小朋友都喜欢玩泥巴。"
}

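ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌". It exhausts every possible combination and is suitable for Term Query.

ik_smart: splits the text at the coarsest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国,国歌". It is suitable for Phrase queries.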
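
The contrast between the two modes can be illustrated with a toy dictionary-based segmenter. This is only a sketch of the idea (every dictionary hit versus greedy longest match); it is not IK's actual algorithm, which also emits some single characters and resolves ambiguities differently. The function names and the small dictionary are assumptions made for the example.

```python
def all_dictionary_matches(text, dictionary):
    """Fine-grained: emit every substring of `text` that is a dictionary word."""
    tokens = []
    for start in range(len(text)):
        # Try the longest candidate first at each start position.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def longest_match(text, dictionary):
    """Coarse-grained: greedy forward longest match.

    A character that starts no dictionary word becomes a one-character token.
    """
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens


# Toy dictionary for illustration; the real IK dictionaries are far larger.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}

print(all_dictionary_matches("中华人民共和国国歌", DICTIONARY))
print(longest_match("中华人民共和国国歌", DICTIONARY))
```

With this dictionary, the greedy version yields just "中华人民共和国" and "国歌", while the exhaustive version also reports the overlapping shorter words.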

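Custom tokenization

In most cases, common words are segmented correctly by the Chinese analyzer. But internet slang keeps emerging, and many of these colloquial terms are not in the analyzer's dictionary. For example: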

POST    http://10.0.0.220:9200/_analyze
{
	"analyzer":"ik_max_word",
	"text":"隔壁住着一个可爱的小盆友。"
}

Result: "小盆友" is not kept as one word as we hoped; it is split into the single characters 小, 盆 and 友.

{
    "tokens": [
        {
            "token": "隔壁",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "住着",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "一个",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "一",
            "start_offset": 4,
            "end_offset": 5,
            "type": "TYPE_CNUM",
            "position": 3
        },
        {
            "token": "个",
            "start_offset": 5,
            "end_offset": 6,
            "type": "COUNT",
            "position": 4
        },
        {
            "token": "可爱",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "的",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "小",
            "start_offset": 9,
            "end_offset": 10,
            "type": "CN_CHAR",
            "position": 7
        },
        {
            "token": "盆",
            "start_offset": 10,
            "end_offset": 11,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "友",
            "start_offset": 11,
            "end_offset": 12,
            "type": "CN_CHAR",
            "position": 9
        }
    ]
}
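
When checking many words, reading the full response by eye gets tedious. The following sketch pulls the token strings out of an `_analyze` response and checks whether a word came back as one token. The helper names are hypothetical, and `sample_response` is an abbreviated copy of the last three tokens of the response above.

```python
def token_texts(response):
    """Return the token strings from an _analyze response, ordered by position."""
    tokens = sorted(response["tokens"], key=lambda token: token["position"])
    return [token["token"] for token in tokens]


def is_single_token(response, word):
    """True if `word` appears as one token instead of being split up."""
    return word in token_texts(response)


# Abbreviated copy of the response above: only its last three tokens.
sample_response = {
    "tokens": [
        {"token": "小", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 7},
        {"token": "盆", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR", "position": 8},
        {"token": "友", "start_offset": 11, "end_offset": 12, "type": "CN_CHAR", "position": 9},
    ]
}

print(token_texts(sample_response))
print(is_single_token(sample_response, "小盆友"))
```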

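1. At this point we need to define the segmentation ourselves, which means editing the IK analyzer's configuration file.
(Note: perform the following steps as root; restart ES as the esuser user.)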

[root@elasticsearch tools]# cd /application/elasticsearch-7.6.1/plugins/ik/config/
[root@elasticsearch config]# 
[root@elasticsearch config]# vim IKAnalyzer.cfg.xml 

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
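        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict">custom.dict</entry>              <!-- custom.dict is the entry I added -->
        <!-- Users can configure their own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Users can configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->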
</properties>
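
Before restarting, it can help to confirm that the edited file is still well-formed XML and that the entry really points at the new dictionary. The sketch below does this with Python's standard XML parser. The helper name is hypothetical, and treating the value as a semicolon-separated list is an assumption based on the plugin's README examples. `SAMPLE` is a shortened ASCII stand-in for the file; with a real file, pass its bytes instead.

```python
import xml.etree.ElementTree as ET


def read_ik_entries(xml_bytes):
    """Map each <entry key="..."> to the list of file names it contains."""
    root = ET.fromstring(xml_bytes)  # raises ParseError if the XML is malformed
    entries = {}
    for entry in root.findall("entry"):
        raw = (entry.text or "").strip()
        # Assumption: several dictionary files may be listed, separated by ";".
        entries[entry.get("key")] = [part.strip() for part in raw.split(";") if part.strip()]
    return entries


# Shortened stand-in for IKAnalyzer.cfg.xml (ASCII only, for the example).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer</comment>
    <entry key="ext_dict">custom.dict</entry>
    <entry key="ext_stopwords"></entry>
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
</properties>
"""

print(read_ik_entries(SAMPLE))
```

Entries that are commented out, such as `remote_ext_dict` here, do not show up in the result, which is a quick way to see which settings are actually active.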

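2. Create the file and add the word "小盆友" to it, one word per line (the IK README asks for the dictionary to be saved as UTF-8).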

[root@elasticsearch config]# vim custom.dict

小盆友
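
If words are added regularly, editing the file by hand invites duplicates and encoding mistakes. Here is a minimal sketch of a helper that appends new words to a one-word-per-line UTF-8 file. `add_words` is a hypothetical name, and the path in the commented example is the one from this article.

```python
from pathlib import Path


def add_words(dict_path, words):
    """Append words not yet in the dictionary file; return the words actually added."""
    path = Path(dict_path)
    existing = []
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
        existing = [line.strip() for line in lines if line.strip()]
    known = set(existing)
    added = []
    for word in words:
        word = word.strip()
        if word and word not in known:
            known.add(word)
            added.append(word)
    if added:
        # Rewrite the file as UTF-8 with exactly one word per line.
        path.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added


# Example call (path taken from this article; run with write permission):
# add_words("/application/elasticsearch-7.6.1/plugins/ik/config/custom.dict", ["小盆友"])
```

The dictionary is still only read after the restart described in the next step, so the helper changes nothing in a running node by itself.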

3. Restart the ES service

4. Test again: "小盆友" is now recognized as a single word.