Elasticsearch Analyzers

Built-in Analyzers


Standard


Chinese text is split into individual characters, English is split on whitespace and punctuation, and all tokens are lowercased.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "standard"
}
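The same request can be issued from a shell with curl (a minimal sketch; replace the host with your own cluster address):

curl -X GET "http://172.16.5.33:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "standard"
  }'

The response should contain one token per Chinese character (上, 海, 市, ...), the number 2451 as a single token, and the lowercased English words i, like, it, very, much.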

Whitespace


Splits only on whitespace; Chinese is not segmented further and English keeps its original case.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "whitespace"
}
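For the text above, the whitespace analyzer should split on spaces only, so the expected tokens are approximately:

["上海市长宁区虹桥路2451号格林东方酒店,", "I", "like", "it", "very", "much."]

Note that the trailing comma and period stay attached to their tokens and the capital "I" is preserved.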

Simple

Splits at every non-letter character (whitespace, digits, punctuation); Chinese is not segmented further and English is lowercased.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "simple"
}
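Since the simple analyzer cuts at every non-letter character, the digits and punctuation disappear from the output; the expected tokens are roughly:

["上海市长宁区虹桥路", "号格林东方酒店", "i", "like", "it", "very", "much"]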

Stop


Builds on the Simple analyzer and additionally removes English stop words such as "the" and "a".
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    "analyzer": "stop"
}
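The digits are dropped by the underlying tokenizer and the stop word "the" is filtered out, so the result should be approximately:

["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]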

The ik Chinese Analyzer


The analyzers built into ES do not handle Chinese well, so the third-party ik Chinese analysis plugin is used for Chinese retrieval.

Installation


From the Elasticsearch installation directory, run the following command:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip

Replace 7.10.1 with the installed Elasticsearch version; the plugin version must match the Elasticsearch version.
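After installation, restart Elasticsearch so the plugin is loaded. The installation can then be verified with the plugin CLI; the plugin should show up as analysis-ik:

bin/elasticsearch-plugin list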



Usage

The ik plugin provides two analyzers with different segmentation granularities: ik_smart and ik_max_word.

ik_smart

ik_smart performs the coarsest-grained segmentation of the text and is suitable for term queries. Example:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_smart"
}

Response:

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "2451号",
            "start_offset": 9,
            "end_offset": 14,
            "type": "TYPE_CQUAN",
            "position": 3
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 6
        }
    ]
}

ik_max_word

ik_max_word performs the finest-grained segmentation of the text, exhaustively emitting overlapping word combinations, and is suitable for phrase queries. Example:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_max_word"
}

Response:

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "上海",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "海市",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "市长",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "长宁",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "区",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "虹桥",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "路",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_CHAR",
            "position": 9
        },
        {
            "token": "2451",
            "start_offset": 9,
            "end_offset": 13,
            "type": "ARABIC",
            "position": 10
        },
        {
            "token": "号",
            "start_offset": 13,
            "end_offset": 14,
            "type": "COUNT",
            "position": 11
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 12
        },
        {
            "token": "林东",
            "start_offset": 15,
            "end_offset": 17,
            "type": "CN_WORD",
            "position": 13
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 14
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 15
        }
    ]
}
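A common pattern is to index a field with ik_max_word (so that all recognized word combinations become searchable) and to analyze search input with the coarser ik_smart. The sketch below assumes a hypothetical index named hotel with a text field address; adjust the names to your own schema:

PUT 172.16.5.33:9200/hotel
{
    "mappings": {
        "properties": {
            "address": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
    }
}

With this mapping, documents are segmented with ik_max_word at index time, while query strings are segmented with ik_smart at search time.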
