Elasticsearch ships with a default analyzer called standard. For Chinese text, the standard analyzer splits the input into individual characters for full-text search, which is not the result we want!
Send the request
POST _analyze
{
"text":"乱世程咬金",
"analyzer":"standard"
}
Tokenization result
{
"tokens": [
{
"token": "乱",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "世",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "程",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "咬",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "金",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
}
]
}
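To see why character-level tokens hurt relevance, here is a minimal sketch (the index name test_index, type doc, and field title are hypothetical). With the standard analyzer, searching for the unrelated string "乱金" still matches, because sharing any single character with the title is enough:

PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "text", "analyzer": "standard" }
      }
    }
  }
}

PUT test_index/doc/1?refresh
{ "title": "乱世程咬金" }

POST test_index/_search
{
  "query": { "match": { "title": "乱金" } }
}

The match query analyzes "乱金" into the two tokens 乱 and 金, and by default any one matching token is enough, so the document is returned even though "乱金" never appears in the title.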
Elasticsearch-analysis-ik ships with two analyzers:
ik_max_word: splits the text at the finest granularity, producing as many words as possible
ik_smart: splits the text at the coarsest granularity; once a word has been split out, it will not be claimed again by another word
Note: elasticsearch-6.2.3 requires the matching plugin version, elasticsearch-analysis-ik-6.2.3
Installing and configuring elasticsearch-analysis-ik-6.2.3
1) Download the matching version of the ik plugin
2) Extract it into the elasticsearch-6.2.3\plugins directory (each plugin needs its own subdirectory, e.g. plugins\ik)
3) Restart Elasticsearch; you can verify the install with the commands after this list
4) Send the analyze requests from Kibana
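If you prefer not to extract the archive by hand, recent IK releases can also be installed with the elasticsearch-plugin command; the release URL below follows the project's usual GitHub naming and is worth double-checking against the releases page:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.3/elasticsearch-analysis-ik-6.2.3.zip

After restarting, confirm the plugin loaded:

GET _cat/plugins

The response should list analysis-ik at version 6.2.3.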
ik_max_word mode
POST _analyze
{
"text":"中华人名共和国",
"analyzer":"ik_max_word"
}
Result
{
"tokens": [
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "人名",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 5
}
]
}
ik_smart mode
POST _analyze
{
"text":"中华人名共和国",
"analyzer":"ik_smart"
}
Result
{
"tokens": [
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "人名",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
}
]
}
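In practice the two analyzers are often combined: index with ik_max_word so that as many word forms as possible become searchable, and query with ik_smart so that searches are not over-split. A minimal sketch of such a mapping (the index name news and field content are hypothetical):

PUT news
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}

With this mapping, documents are tokenized by ik_max_word at index time, while match queries against content are tokenized by ik_smart at search time.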