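Preface
When using Elasticsearch for search, the inverted index built from the text is critical. For Chinese passages, tokenizing the text correctly before it is indexed therefore matters most. This post shows how to install the ik Chinese analyzer in Elasticsearch (abbreviated as ES from here on).
Main text
Installation steps
Download the plugin:
Unpack and configure
Create a folder named ik under ES_HOME/plugins/
Extract the contents of the archive into the ik folder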
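The two unpack steps can also be scripted. The following is only a minimal sketch: the function name `install_ik` is made up for illustration, and it assumes the plugin files sit at the top level of the downloaded archive.

```python
import zipfile
from pathlib import Path


def install_ik(es_home, archive):
    """Extract the plugin archive into ES_HOME/plugins/ik and return that folder.

    Assumption: the archive keeps the plugin files at its top level.
    """
    target = Path(es_home) / "plugins" / "ik"
    # Step 1: create the ik folder under ES_HOME/plugins/
    target.mkdir(parents=True, exist_ok=True)
    # Step 2: unpack the archive contents into that folder
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

A call would look like `install_ik("/path/to/es", "ik.zip")`, where both paths are placeholders for your own ES_HOME and downloaded archive. ES has to be restarted afterwards so that the plugin is picked up.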
Project file structure
Start ES
When ES starts now, the startup log should show that the ik analyzer has been loaded
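Besides reading the startup log, the loaded plugins can be listed through the `_cat/plugins` endpoint. This is a rough sketch, not the only way to check: it assumes the plugin registers itself under the component name `analysis-ik`, and the helper names are invented for this example.

```python
import json
import urllib.request


def plugin_loaded(rows, component="analysis-ik"):
    """Return True if any row from _cat/plugins names the given component.

    Assumption: the ik plugin reports itself as "analysis-ik".
    """
    return any(row.get("component") == component for row in rows)


def fetch_plugins(base_url):
    """GET <base_url>/_cat/plugins?format=json and decode the JSON rows."""
    with urllib.request.urlopen(base_url + "/_cat/plugins?format=json") as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a node running locally, `plugin_loaded(fetch_plugins("http://localhost:9200"))` should return True once the plugin is installed; the address is an assumption, so substitute your own host and port.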
Testing the tokenization results
Ordinary tokenization (built-in english analyzer)
POST {{host}}:{{port}}/_analyze
{
"analyzer":"english",
"text":"使用搜索引擎"
}
Tokenization result:
{
"tokens": [
{
"token": "使",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "用",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "搜",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "索",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "引",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "擎",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
}
]
}
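The `_analyze` request above can be sent from a script as well, which makes it easier to try several analyzers on the same text. Here is a small sketch using only the Python standard library; the helper names are hypothetical, and the base URL passed in stands for whatever `{{host}}:{{port}}` is in your setup.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build the POST /_analyze request for one analyzer and one text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Pull just the token strings out of a decoded _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(base_url, analyzer, text):
    """Send the request and return the list of token strings."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as resp:
        return token_texts(json.loads(resp.read().decode("utf-8")))
```

For example, `analyze("http://localhost:9200", "ik_smart", "使用搜索引擎")` would run the same text through ik_smart, assuming ES listens on that address and the plugin is installed.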
ik_smart tokenization
POST {{host}}:{{port}}/_analyze
{
"analyzer":"ik_smart",
"text":"使用搜索引擎"
}
Tokenization result:
{
"tokens": [
{
"token": "使用",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "搜索引擎",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
}
]
}
ik_max_word tokenization
POST {{host}}:{{port}}/_analyze
{
"analyzer":"ik_max_word",
"text":"使用搜索引擎"
}
Tokenization result:
{
"tokens": [
{
"token": "使用",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "搜索引擎",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": "搜索",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "索引",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 3
},
{
"token": "引擎",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
}
]
}
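To get a feel for why the two ik analyzers return different token lists, here is a toy illustration. It is not ik's actual algorithm: it uses a made-up five-word dictionary and two simple strategies, one that lists every dictionary word found in the text (fine-grained) and one that greedily takes the longest word from left to right (coarse-grained).

```python
# Hypothetical mini dictionary for illustration only; real ik dictionaries are far larger.
DICTIONARY = {"使用", "搜索", "索引", "引擎", "搜索引擎"}


def all_words(text, dictionary=DICTIONARY):
    """Fine-grained: every dictionary word found at any position, as (word, start, end)."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.append((text[start:end], start, end))
    return found


def longest_words(text, dictionary=DICTIONARY):
    """Coarse-grained: greedy longest match, scanning left to right."""
    found = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                found.append((text[start:end], start, end))
                start = end
                break
        else:
            start += 1  # no dictionary word starts here, skip one character
    return found
```

On "使用搜索引擎", `longest_words` yields 使用 and 搜索引擎, while `all_words` yields the same five words and offsets as the ik_max_word result above, though in a different order. Real analyzers handle ambiguity and unknown words in ways this toy ignores.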
Search test with the analyzer
// Create the index
PUT {{host}}:{{port}}/news
// Create the mapping and set the analyzer
POST {{host}}:{{port}}/news/sports/_mapping
{
"properties":{
"content":{
"type":"text",
"analyzer":"ik_max_word",
"index":true
}
}
}
Import the data...
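The import step is not spelled out in this post. One possible way, sketched here under the assumption that the `_bulk` endpoint is used and that ids are left for ES to generate, is to build a newline-delimited body from the four sample sentences and post it to the news/sports type. The helper names are invented for the example.

```python
import json
import urllib.request

# The four sample documents used in this post.
DOCS = [
    "热火形势一片大好",
    "火箭98-99不敌凯尔特人,惨遭四连败",
    "曼城18连胜,英超无人能挡",
    "巴萨3-0击败皇马赢下国家德比,梅西一球一助再获满分",
]


def bulk_body(contents):
    """Build a _bulk body: one action line plus one source line per document."""
    lines = []
    for content in contents:
        lines.append(json.dumps({"index": {}}))  # let ES generate the id
        lines.append(json.dumps({"content": content}, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def build_bulk_request(base_url, contents):
    """Build the POST request against /news/sports/_bulk."""
    return urllib.request.Request(
        base_url + "/news/sports/_bulk",
        data=bulk_body(contents).encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
```

Passing `build_bulk_request("http://localhost:9200", DOCS)` to `urllib.request.urlopen` would send the data; the address is again an assumption.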
Data in the search engine
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "news",
"_type": "sports",
"_id": "AWCgyE7pGEKcCwwZuUe6",
"_score": 1,
"_source": {
"content": "热火形势一片大好"
}
},
{
"_index": "news",
"_type": "sports",
"_id": "AWCgx7fpGEKcCwwZuUe5",
"_score": 1,
"_source": {
"content": "火箭98-99不敌凯尔特人,惨遭四连败"
}
},
{
"_index": "news",
"_type": "sports",
"_id": "AWCgyOLYGEKcCwwZuUe7",
"_score": 1,
"_source": {
"content": "曼城18连胜,英超无人能挡"
}
},
{
"_index": "news",
"_type": "sports",
"_id": "AWCgxyxXGEKcCwwZuUe4",
"_score": 1,
"_source": {
"content": "巴萨3-0击败皇马赢下国家德比,梅西一球一助再获满分"
}
}
]
}
}
POST {{host}}:{{port}}/news/sports/_search
{
"query":{
"match":{
"content":"火箭队新闻"
}
}
}
Search result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.6099695,
"hits": [
{
"_index": "news",
"_type": "sports",
"_id": "AWCgx7fpGEKcCwwZuUe5",
"_score": 0.6099695,
"_source": {
"content": "火箭98-99不敌凯尔特人,惨遭四连败"
}
}
]
}
}
POST {{host}}:{{port}}/news/sports/_search
{
"query":{
"match":{
"content":"火焰"
}
}
}
Search result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
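When comparing several queries, it helps to reduce each response to just the scores and matched text. A minimal sketch follows, with hypothetical helper names, that builds the same match query body as above and summarizes a decoded search response.

```python
def match_query(field, text):
    """Build the body of a match query on a single field."""
    return {"query": {"match": {field: text}}}


def hit_summaries(response, field="content"):
    """Return (score, field value) pairs from a decoded search response, in the order returned."""
    return [
        (hit["_score"], hit["_source"][field])
        for hit in response["hits"]["hits"]
    ]
```

Applied to the two responses above, this gives one pair for the query "火箭队新闻" and an empty list for "火焰".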
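The tokenization tests show that the Chinese analyzer splits the text to be searched into tokens that carry meaning in Chinese, instead of splitting on every single character.
The search tests show that relevant results are kept while unrelated ones are filtered out, which makes the search smarter.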
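References
The articles below explain tokenization in more depth. Consult them if you want more detail; this post does not go further.
How to install Chinese analyzers (IK + pinyin) in Elasticsearch
Details of ik tokenization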