ES-09: Elasticsearch Analyzers

Latest recommended article published on 2023-05-27 10:29:46

csdn_yasin

Categories: ElasticSearch, CentOS, Linux. Tags: elasticsearch, search engine, big data, analyzer

Article link: https://blog.csdn.net/csdn_yasin/article/details/123081240

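Overview

Elasticsearch analyzers
Covers the default (standard) analyzer, the ik analyzer, and extending the ik dictionary with custom words
Keywords: keyword, text, ik_max_word, ik_smart, term, term dictionary, inverted index
Official documentation: https://www.elastic.co/cn/
ik analyzer documentation: https://github.com/medcl/elasticsearch-analysis-ik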

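Core concepts

》Data types

keyword: an exact-value field; the whole value is indexed as a single term and is not tokenized
text: ordinary full-text content; it is passed through an analyzer and split into terms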

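》Analyzer concepts

Term: the smallest unit that is stored in and queried from an index
Term dictionary: the collection of all terms, organized for fast lookup with a structure such as a B+ tree or a hash map
Inverted index (posting table): the table that maps each term to the IDs of the documents containing it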
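
As a rough illustration of how these three pieces relate, the sketch below builds a tiny inverted index in Python. The documents and their terms are made up for the example, the function name `build_inverted_index` is hypothetical, and a plain dict stands in for the term dictionary; this is not how Lucene stores its index internally.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # The keys play the role of the term dictionary;
    # each value is that term's row in the inverted (posting) table.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical, already-tokenized documents (document ID -> terms).
docs = {
    1: ["一句", "中文"],
    2: ["中文", "分词"],
}
index = build_inverted_index(docs)
print(index["中文"])  # [1, 2]
print(index["一句"])  # [1]
```

Looking up a term is then a single dictionary access, which is why search engines index terms rather than scanning documents.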

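》Default analyzer

Default analyzer: standard (the standard analyzer)
The standard analyzer handles Chinese poorly: by default it splits Chinese text into individual characters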
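
To make the per-character behaviour concrete, here is a minimal Python sketch that mimics it for Chinese text only. It rests on a stated assumption (one token per character in the CJK Unified Ideographs block, everything else dropped) and is not the standard analyzer's real algorithm. The function name `single_char_tokens` is hypothetical.

```python
def single_char_tokens(text):
    """Emit one token per CJK ideograph, skipping every other character."""
    tokens = []
    for offset, ch in enumerate(text):
        # Assumption: only the CJK Unified Ideographs block counts as a token.
        if "\u4e00" <= ch <= "\u9fff":
            tokens.append({
                "token": ch,
                "start_offset": offset,
                "end_offset": offset + 1,
                "position": len(tokens),
            })
    return tokens

tokens = single_char_tokens("一句中文。")
print([t["token"] for t in tokens])  # ['一', '句', '中', '文']
```

Because every character becomes its own term, a search for a two-character word can only be matched character by character, which is the limitation the ik analyzer addresses.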

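》ik analyzer

Handles Chinese well: it splits Chinese text into words and phrases instead of single characters
Offers two modes: fine-grained (ik_max_word) and coarse-grained (ik_smart)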
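
The difference between the two granularities can be sketched with a toy dictionary. The code below is only an analogy: `fine_grained` lists every dictionary word found in the text, while `coarse_grained` uses greedy forward maximum matching. The dictionary, the function names, and both strategies are assumptions chosen for illustration; IK's real dictionary and disambiguation logic are different.

```python
def fine_grained(text, dictionary):
    """Return every dictionary word found anywhere in the text."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found

def coarse_grained(text, dictionary):
    """Greedy forward maximum matching: take the longest word at each position."""
    result, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                result.append(text[start:end])
                start = end
                break
        else:
            start += 1  # no dictionary word starts here; skip this character
    return result

# Hypothetical toy dictionary, not IK's bundled one.
toy_dict = {"中国", "中国人民", "国人", "人民", "万岁"}
text = "中国人民万岁"
print(fine_grained(text, toy_dict))    # ['中国', '中国人民', '国人', '人民', '万岁']
print(coarse_grained(text, toy_dict))  # ['中国人民', '万岁']
```

The fine-grained output contains overlapping words, which produces more terms to match against; the coarse-grained output is a single non-overlapping segmentation.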

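》ik analyzer extension dictionary

Some domain-specific or made-up words cannot be segmented correctly by the analyzer out of the box. For those cases the ik analyzer lets us register our own extension dictionary.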

Steps

》Tokenizing with the default analyzer

Try tokenizing a Chinese sentence with the default analyzer

Request example

Method: GET

Send the request:

```
curl -X GET http://192.168.3.201:9200/_analyze -H 'Content-Type:application/json' -d '
{
    "analyzer": "standard",
    "text": "一句中文。"
}'
```

analyzer: the analyzer to use; if omitted, it defaults to standard (the standard analyzer)

Response:

```
{
    "tokens": [
        {
            "token": "一",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "句",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "中",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "文",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        }
    ]
}
```
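
For readers who would rather script this call than type curl, the following Python sketch builds the same `_analyze` request with the standard library. It only constructs the request object and does not contact a server; the helper name `build_analyze_request` is hypothetical, and the host is the one used in this article.

```python
import json
from urllib.request import Request

def build_analyze_request(base_url, text, analyzer="standard"):
    """Build (but do not send) a GET request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return Request(
        url=f"{base_url}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )

req = build_analyze_request("http://192.168.3.201:9200", "一句中文。")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it against a running node:
#   from urllib.request import urlopen
#   print(urlopen(req).read().decode("utf-8"))
```

Passing a different `analyzer` value, such as `ik_smart` once the plugin is installed, reuses the same helper.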

》Installing the ik analyzer

Download the plugin: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.9.3
```
[root@192 ES]# ll
total 5124
-rw-r--r--. 1 501 games 4504423 Jan 23 15:40 elasticsearch-analysis-ik-7.9.3.zip
```
- Note: the plugin version you download must match the version of your local ES installation
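
A quick way to guard against a version mismatch is to compare the version embedded in the zip file name with the ES version before unpacking. The sketch below assumes the file follows the `elasticsearch-analysis-ik-<version>.zip` naming seen in the listing above; the helper name `plugin_matches_es` is hypothetical.

```python
import re

def plugin_matches_es(plugin_zip_name, es_version):
    """Return True when the version embedded in the zip name equals es_version."""
    match = re.search(r"elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip$", plugin_zip_name)
    return match is not None and match.group(1) == es_version

print(plugin_matches_es("elasticsearch-analysis-ik-7.9.3.zip", "7.9.3"))   # True
print(plugin_matches_es("elasticsearch-analysis-ik-7.9.3.zip", "7.10.0"))  # False
```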

Unzip it into the plugins directory under your ES root directory, then change its owner and group to es

```
[root@192 plugins]# pwd
/usr/local/es/7.9.3/plugins

[root@192 plugins]# chown es:es elasticsearch-analysis-ik-7.9.3 -R

[root@192 plugins]# ll
total 0
drwx------. 3 es es 243 Jan 23 15:41 elasticsearch-analysis-ik-7.9.3
```

Switch to the es user and restart the ES service (I had already stopped it, so here I simply start it)

```
[es@192 7.9.3]$ pwd
/usr/local/es/7.9.3

[es@192 7.9.3]$ bin/elasticsearch
...
[2099-01-23T15:46:02,970][INFO ][o.e.p.PluginsService     ] [node-1] loaded plugin [analysis-ik]
...
```

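The startup log shows that analysis-ik was loaded successfully

》Tokenizing with the ik analyzer

Try tokenizing a Chinese sentence with the ik analyzer

Request example

Method: GET

Send the request: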

```
curl -X GET http://192.168.3.201:9200/_analyze -H 'Content-Type:application/json' -d '
{
    "analyzer": "ik_smart",
    "text": "一句中文。"
}'
```

Response:

```
{
    "tokens": [
        {
            "token": "一句",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中文",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
```

》Tokenizing custom words with the ik extension dictionary

Custom word: 粉奶方配儿幼

Change to the ik analyzer's config directory

```
[es@192 config]$ pwd
/usr/local/es/7.9.3/plugins/elasticsearch-analysis-ik-7.9.3/config
```

Create a new extension dictionary file and add the custom word to it

```
[es@192 config]$ vi custom.dic
粉奶方配儿幼
```

Register the extension dictionary

Open the configuration file (IKAnalyzer.cfg.xml in this config directory):

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extension stopword dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Users can configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
```

Content after the change:

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict">custom.dic</entry>
        <!-- Users can configure their own extension stopword dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Users can configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
```

Note that only this line changed: <entry key="ext_dict">custom.dic</entry>
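
Before restarting ES it can be handy to confirm that the edit took effect. The sketch below parses a trimmed copy of the configuration with Python's standard XML parser and reads back the `ext_dict` entry. The helper name `read_entry` is hypothetical, and the semicolon split reflects the IK convention of listing several dictionary files separated by `;`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the edited configuration, inlined so the sketch is self-contained.
CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <entry key="ext_dict">custom.dic</entry>
        <entry key="ext_stopwords"></entry>
</properties>
"""

def read_entry(xml_text, key):
    """Return the file names listed in the <entry> with the given key."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for entry in root.iter("entry"):
        if entry.get("key") == key:
            value = (entry.text or "").strip()
            return [part for part in value.split(";") if part]
    return []

print(read_entry(CONFIG, "ext_dict"))       # ['custom.dic']
print(read_entry(CONFIG, "ext_stopwords"))  # []
```

Against the real file, the same function could be fed the file's contents read with `open(path, encoding="utf-8").read()`.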

Switch to the es user and restart the ES service (I had already stopped it, so here I simply start it)
```
[es@192 7.9.3]$ bin/elasticsearch
```

Request example

Send a GET request (for example with Postman) to the following URL: http://192.168.3.201:9200/_analyze
Method: GET

Send the request:

```
curl -X GET http://192.168.3.201:9200/_analyze -H 'Content-Type:application/json' -d '
{
    "analyzer": "ik_smart",
    "text": "粉奶方配儿幼。"
}'
```

Response:

```
{
    "tokens": [
        {
            "token": "粉奶方配儿幼",
            "start_offset": 0,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
```

》Using the ik analyzer when indexing documents

Create an index:

```
curl -X PUT http://192.168.3.201:9200/index001
```

Create a mapping for the index:

```
curl -X POST http://192.168.3.201:9200/index001/_mapping -H 'Content-Type:application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'
```
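
The mapping indexes with the fine-grained analyzer and searches with the coarse-grained one. A small Python sketch can suggest why that pairing is useful. The token sets below are hard-coded, hypothetical stand-ins for analyzer output (not real IK results), and `matches` is a simplified model of a `match` query with its default OR operator.

```python
def matches(index_terms, query_terms):
    """Simplified match query (OR): hit when any query term was indexed."""
    return any(term in index_terms for term in query_terms)

# Hypothetical analyzer outputs for the same document text.
indexed_fine = {"中国人民", "中国", "国人", "人民", "万岁"}  # in the spirit of ik_max_word
indexed_coarse = {"中国人民", "万岁"}                       # in the spirit of ik_smart
query_terms = ["中国"]  # search-time tokens for the query text

print(matches(indexed_fine, query_terms))    # True
print(matches(indexed_coarse, query_terms))  # False
```

Indexing many overlapping terms gives short queries more chances to hit, while analyzing the query coarsely keeps the search side from expanding into loosely related fragments.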

Create a document:

```
curl -X POST http://192.168.3.201:9200/index001/_create/1 -H 'Content-Type:application/json' -d'
{
    "content": "中国人民万岁。"
}'
```

Search with highlighting:

```
curl -X POST http://192.168.3.201:9200/index001/_search -H 'Content-Type:application/json' -d'
{
    "query": {
        "match": {
            "content": "中国"
        }
    },
    "highlight": {
        "pre_tags": [
            "<tag1>",
            "<tag2>"
        ],
        "post_tags": [
            "</tag1>",
            "</tag2>"
        ],
        "fields": {
            "content": {}
        }
    }
}'
```

Response:

```
{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "index001",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "content": "中国人民万岁。"
                },
                "highlight": {
                    "content": [
                        "<tag1>中国</tag1>人民万岁。"
                    ]
                }
            }
        ]
    }
}
```
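
The highlighted fragment in the response can be understood as wrapping the matched token's character range in the configured tags. The Python sketch below reproduces that fragment from a list of (start, end) offsets. It is a simplified model that assumes non-overlapping spans, and the function name `highlight` is hypothetical; the real highlighter does considerably more, such as fragment selection and scoring.

```python
def highlight(text, spans, pre_tag="<tag1>", post_tag="</tag1>"):
    """Wrap each (start, end) character span of text in the given tags."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])                     # text before the match
        out.append(pre_tag + text[start:end] + post_tag)   # the tagged match
        cursor = end
    out.append(text[cursor:])                              # trailing text
    return "".join(out)

# Offsets (0, 2) cover the matched term "中国" in the stored content.
print(highlight("中国人民万岁。", [(0, 2)]))  # <tag1>中国</tag1>人民万岁。
```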
