I. Tokenization and built-in analyzers
- standard, the default analyzer: splits English text into individual words and automatically lowercases uppercase letters.
URL (POST):
http://47.107.41.60:9200/_analyze
Request body (JSON):
{
  "analyzer": "standard",
  "text": "My name is Xiaohei"
}
Response:
{
  "tokens": [
    {
      "token": "my",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "name",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 8,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "xiaohei",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
- simple: strips digits and symbols, and also lowercases uppercase letters.
- whitespace: splits on whitespace only; it does not lowercase.
- stop: removes meaningless stopwords such as "a", "the", and "is".
- keyword: performs no splitting at all; the entire input becomes a single token.
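Any of these analyzer names can be swapped into the same `_analyze` request shown above. For example, a sketch of a request running the whitespace analyzer over the same sentence:

```json
{
  "analyzer": "whitespace",
  "text": "My name is Xiaohei"
}
```

Since whitespace does not lowercase, the tokens come back with their original case ("My", "Xiaohei"), unlike the standard analyzer's output above.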
II. The ik Chinese analyzer
1. Installation
- Search GitHub for ik (the elasticsearch-analysis-ik plugin), download the zip archive matching your Elasticsearch version, and upload it to the Linux server.
- Unzip the archive into the plugins directory of Elasticsearch.
- Restart Elasticsearch and the analyzer is ready to use.
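The steps above can be sketched as shell commands; ES_HOME and the zip file name are placeholders for your actual install path and the release you downloaded:

```shell
# Placeholder path to the Elasticsearch install; adjust to your setup
ES_HOME=/usr/local/elasticsearch
# Create a directory for the plugin and unzip the release into it
mkdir -p "$ES_HOME/plugins/ik"
unzip elasticsearch-analysis-ik.zip -d "$ES_HOME/plugins/ik"
# Restart Elasticsearch (daemon mode) so the plugin is loaded
"$ES_HOME/bin/elasticsearch" -d
```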
2. ik_max_word
Splits the text at the finest granularity.
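A request like the following produces the ik_max_word output shown; the input text is inferred from the token offsets, so treat it as an assumption:

```json
{
  "analyzer": "ik_max_word",
  "text": "今天天气很好"
}
```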
Result:
{
"tokens": [
{
"token": "今天天气",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "今天",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "天天",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "天气",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "很好",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
}
]
}
3. ik_smart
Splits the text at the coarsest granularity.
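The corresponding request, with the same inferred input text as before:

```json
{
  "analyzer": "ik_smart",
  "text": "今天天气很好"
}
```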
Result:
{
"tokens": [
{
"token": "今天天气",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "很好",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
}
]
}
4. Difference
ik_max_word exhaustively emits every word it can recognize, including overlapping tokens (今天天气, 今天, 天天, 天气 above), which makes it suitable for indexing; ik_smart keeps only the coarsest non-overlapping split (今天天气, 很好), which is typically used when analyzing search queries.
III. Custom Chinese dictionary
Sometimes we need to add vocabulary of our own:
- Edit IKAnalyzer.cfg.xml to register an extension dictionary.
- Create a wy.dic file and write the new words into it, one per line.
- Restart Elasticsearch.
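The registration in IKAnalyzer.cfg.xml looks roughly like this; the dictionary path is resolved relative to the config file, so adjust it to wherever wy.dic actually lives:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- register the custom dictionary file -->
    <entry key="ext_dict">wy.dic</entry>
</properties>
```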
Result:
{
"tokens": [
{
"token": "骚年",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "年在",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "慕课网",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 2
},
{
"token": "学习",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 3
}
]
}