Elastic Stack Study Notes (3)

LoveG_G

Category: Big Data

Article link: https://blog.csdn.net/LoveG_G/article/details/115120000

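Chapter 3 Elasticsearch

3.1 Tokenization

3.1.1 Introduction to Tokenization

Tokenization is the process of splitting a sentence into individual words. Elasticsearch uses the standard analyzer by default.

An analyzer takes a string as input, splits it into individual terms, or tokens (possibly discarding characters such as punctuation), and outputs a token stream.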

POST http://172.31.132.130:9200/_analyze
{
    "analyzer":"standard",
    "text": "hell word"
}

{
    "tokens": [
        {
            "token": "hell",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "word",
            "start_offset": 5,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
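
The same request can be sent from a script. Below is a minimal Python sketch that uses only the standard library and assumes the host shown above is reachable. The function names build_analyze_request, extract_tokens, and analyze are illustrative and do not belong to any Elasticsearch client library.

```python
import json
import urllib.request

# Host copied from the request above; change it for your own cluster.
ES_URL = "http://172.31.132.130:9200"


def build_analyze_request(text, analyzer="standard", base_url=ES_URL):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_json):
    """Return only the token strings from an _analyze response."""
    return [t["token"] for t in response_json["tokens"]]


def analyze(text, analyzer="standard", base_url=ES_URL):
    """Send the request and return the token strings (needs a live cluster)."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        return extract_tokens(json.load(response))
```

Building the request separately from sending it keeps the payload easy to inspect without a running cluster. Calling analyze("hell word") against the server above should return the two tokens listed in the response.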

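The standard analyzer works well for English but poorly for Chinese. For 我是中国人 ("I am Chinese"), a good segmentation would be 我 / 是 / 中国人, yet the standard analyzer splits the text into single characters, which is rarely what a search needs.

POST http://172.31.132.130:9200/_analyze
{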
    "analyzer":"standard",
    "text": "我是中国人"
}

# Result
{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "中",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "国",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "人",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        }
    ]
}
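
To make the difference between the two results concrete, here is a toy tokenizer in Python that imitates the shape of the output. It is only an approximation under a simplifying assumption: a run of ASCII letters and digits forms one token, and each CJK ideograph in the range U+4E00 to U+9FFF becomes a token of its own. The real standard analyzer follows the Unicode text segmentation rules and is considerably more involved. The name toy_analyze is hypothetical.

```python
import re

# Simplifying assumption: ASCII alphanumeric runs, or a single CJK ideograph.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def toy_analyze(text):
    """Return a dict shaped like an _analyze response for the given text."""
    tokens = []
    for position, match in enumerate(_TOKEN_RE.finditer(text)):
        word = match.group()
        is_ideograph = "\u4e00" <= word[0] <= "\u9fff"
        tokens.append({
            "token": word.lower(),           # lowercase, as the standard analyzer does
            "start_offset": match.start(),   # offset of the first character
            "end_offset": match.end(),       # offset just past the last character
            "type": "<IDEOGRAPHIC>" if is_ideograph else "<ALPHANUM>",
            "position": position,            # ordinal of the token in the stream
        })
    return {"tokens": tokens}
```

Whitespace and punctuation match neither alternative, so they are dropped. Because Chinese text has no spaces between words, a rule of this kind has nothing to group characters by, which is why each ideograph ends up as a separate token.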

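Analyzing text against a specific index: put the index name in the URL. When both analyzer and field are given, analyzer takes precedence; drop analyzer to use the analyzer mapped to the hobby field.

POST http://172.31.132.130:9200/index_test_02/_analyze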
{
    "analyzer":"standard",
    "field":"hobby",
    "text": "音乐"
}
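
The index-scoped form can be sketched as a small Python helper that returns the URL path and JSON body. It assumes the behaviour described in the _analyze API documentation: field requires an index, and an explicit analyzer overrides field. The helper name build_analyze_call is hypothetical.

```python
def build_analyze_call(text, index=None, analyzer=None, field=None):
    """Return (path, body) for an _analyze request.

    An explicit analyzer wins over field, so field is left out of the body
    whenever analyzer is supplied.
    """
    if field is not None and index is None:
        # The field's analyzer is looked up in an index mapping.
        raise ValueError("'field' requires an index")
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    elif field is not None:
        body["field"] = field
    return path, body
```

For example, build_analyze_call("音乐", index="index_test_02", field="hobby") yields the path /index_test_02/_analyze with a body that asks Elasticsearch to use whatever analyzer the hobby field is mapped with.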

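3.1.2 Chinese Tokenization

Chinese word segmentation is prone to ambiguity and hard to get right, so it needs a dedicated Chinese analyzer.

A commonly used Chinese analyzer is the IK analyzer.

Installing the IK analyzer: the plugin version must match the Elasticsearch version: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.3.0

Download the IK package, unzip it into its own subdirectory under the plugins directory (for example plugins/ik), and restart Elasticsearch.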
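
Since a version mismatch is the usual installation problem, a small Python sketch can check it before downloading. It assumes the JSON returned by GET / on the cluster, which carries the version under version.number. The URL template simply generalizes the v7.3.0 link above and is an assumption for other versions; the function names are hypothetical.

```python
# Assumption: other releases follow the same tag pattern as the v7.3.0 link.
IK_RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v{version}"
)


def es_version(root_info):
    """Extract the version string from the JSON returned by GET /."""
    return root_info["version"]["number"]


def ik_release_url(root_info):
    """Release page for the IK build matching this cluster's version."""
    return IK_RELEASE_URL.format(version=es_version(root_info))


def plugin_matches(root_info, plugin_version):
    """True when the plugin version equals the cluster version (leading 'v' ignored)."""
    return es_version(root_info) == plugin_version.lstrip("v")
```

For a cluster reporting 7.3.0, ik_release_url returns the link given above, and plugin_matches rejects any other plugin version.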

POST http://172.31.132.130:9200/index_test_02/_analyze

{
    "analyzer":"ik_smart",
    "text": "我是中国人"
}


{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "中国人",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}
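
To compare segmentations at a glance, the token list can be collapsed into a single line. The Python sketch below works on the response JSON only. The helper names segmentation and offsets_match are hypothetical, and the offset check assumes the text contains no characters outside the Basic Multilingual Plane, so that Python string indices line up with the reported offsets.

```python
def segmentation(response):
    """Join the tokens of an _analyze response in position order."""
    ordered = sorted(response["tokens"], key=lambda t: t["position"])
    return " / ".join(t["token"] for t in ordered)


def offsets_match(text, response):
    """Check that each token equals the slice of text its offsets point to.

    The comparison ignores case because analyzers may lowercase tokens.
    """
    return all(
        text[t["start_offset"]:t["end_offset"]].lower() == t["token"].lower()
        for t in response["tokens"]
    )


# Tokens copied from the ik_smart response above (type field omitted).
ik_smart = {"tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "position": 1},
    {"token": "中国人", "start_offset": 2, "end_offset": 5, "position": 2},
]}

print(segmentation(ik_smart))  # 我 / 是 / 中国人
```

Unlike the character-by-character output of the standard analyzer, ik_smart keeps 中国人 together as one word.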