Elasticsearch Analyzers

Built-in Analyzers


Standard


Chinese text is split into individual characters, English is split on whitespace and punctuation, and all tokens are lowercased.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "standard"
}
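The same request can be issued from a shell with curl (a minimal sketch; replace the host with your own cluster address):

curl -X GET "http://172.16.5.33:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "standard"
  }'

The response should contain one token per Chinese character (上, 海, 市, ...), the number 2451 as a single token, and the lowercased English words i, like, it, very, much.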

Whitespace


Splits only on whitespace; Chinese is not segmented further and English keeps its original case.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "whitespace"
}
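For the text above, the whitespace analyzer should split on spaces only, so the expected tokens are approximately:

["上海市长宁区虹桥路2451号格林东方酒店,", "I", "like", "it", "very", "much."]

Note that the trailing comma and period stay attached to their tokens and the capital "I" is preserved.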

Simple

Splits at every non-letter character (whitespace, digits, punctuation); Chinese is not segmented further and English is lowercased.
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店, I like it very much.",
    "analyzer": "simple"
}
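Since the simple analyzer cuts at every non-letter character, the digits and punctuation disappear from the output; the expected tokens are roughly:

["上海市长宁区虹桥路", "号格林东方酒店", "i", "like", "it", "very", "much"]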

Stop


Builds on the Simple analyzer and additionally removes English stop words such as "the" and "a".
Example request:

GET 172.16.5.33:9200/_analyze
{
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    "analyzer": "stop"
}
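The digits are dropped by the underlying tokenizer and the stop word "the" is filtered out, so the result should be approximately:

["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]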

The ik Chinese Analyzer


The analyzers built into ES do not handle Chinese well, so the third-party ik Chinese analysis plugin is used for Chinese retrieval.

Installation


From the Elasticsearch installation directory, run the following command:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip

Replace 7.10.1 with the installed Elasticsearch version; the plugin version must match the Elasticsearch version.
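After installation, restart Elasticsearch so the plugin is loaded. The installation can then be verified with the plugin CLI; the plugin should show up as analysis-ik:

bin/elasticsearch-plugin list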



Usage

The ik plugin provides two analyzers with different segmentation granularities: ik_smart and ik_max_word.

ik_smart

ik_smart performs the coarsest-grained segmentation of the text and is suitable for term queries. Example:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_smart"
}

Response:

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "2451号",
            "start_offset": 9,
            "end_offset": 14,
            "type": "TYPE_CQUAN",
            "position": 3
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 6
        }
    ]
}

ik_max_word

ik_max_word performs the finest-grained segmentation of the text, exhaustively emitting overlapping word combinations, and is suitable for phrase queries. Example:

GET 172.16.5.33:9200/_analyze
{
    "text": "上海市长宁区虹桥路2451号格林东方酒店。",
    "analyzer": "ik_max_word"
}

Response:

{
    "tokens": [
        {
            "token": "上海市",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "上海",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "海市",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "市长",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "长宁区",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "长宁",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "区",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "虹桥路",
            "start_offset": 6,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "虹桥",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "路",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_CHAR",
            "position": 9
        },
        {
            "token": "2451",
            "start_offset": 9,
            "end_offset": 13,
            "type": "ARABIC",
            "position": 10
        },
        {
            "token": "号",
            "start_offset": 13,
            "end_offset": 14,
            "type": "COUNT",
            "position": 11
        },
        {
            "token": "格林",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 12
        },
        {
            "token": "林东",
            "start_offset": 15,
            "end_offset": 17,
            "type": "CN_WORD",
            "position": 13
        },
        {
            "token": "东方",
            "start_offset": 16,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 14
        },
        {
            "token": "酒店",
            "start_offset": 18,
            "end_offset": 20,
            "type": "CN_WORD",
            "position": 15
        }
    ]
}
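A common pattern is to index a field with ik_max_word (so that all recognized word combinations become searchable) and to analyze search input with the coarser ik_smart. The sketch below assumes a hypothetical index named hotel with a text field address; adjust the names to your own schema:

PUT 172.16.5.33:9200/hotel
{
    "mappings": {
        "properties": {
            "address": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
    }
}

With this mapping, documents are segmented with ik_max_word at index time, while query strings are segmented with ik_smart at search time.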
