1. Installing the IK analysis plugin
In a browser, open: https://github.com/medcl/elasticsearch-analysis-ik
Pick a version from the Releases page; the releases are pre-built packages that only need to be unzipped before use.
Find version 5.4.1 and download it; make sure to pick the first asset, elasticsearch-analysis-ik-5.4.1.zip.
Go into the elasticsearch plugins directory, unzip the IK plugin and copy it in, then restart elasticsearch.
Then switch to the root user and re-grant ownership of the installation to the elastic user:
chown -R elastic /home/elasticsearch/elasticsearch-5.4.1
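A minimal shell sketch of the download-and-unzip steps, assuming Elasticsearch lives in /home/elasticsearch/elasticsearch-5.4.1 (the release URL and paths are examples taken from the project's Releases page; adjust them to your environment), with the chown above run as root afterwards:
cd /home/elasticsearch/elasticsearch-5.4.1/plugins
# download the pre-built 5.4.1 release of the IK plugin (example URL)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.1/elasticsearch-analysis-ik-5.4.1.zip
# unzip it into its own directory under plugins/ and remove the archive
unzip elasticsearch-analysis-ik-5.4.1.zip -d elasticsearch-analysis-ik-5.4.1
rm elasticsearch-analysis-ik-5.4.1.zip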
2. Restarting elasticsearch
ps -ef | grep elastic
kill -9 <the ES process id from the previous command>
Restart ES: ./bin/elasticsearch &
The startup log shows that the IK analyzer plugin has been loaded.
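Note that Elasticsearch refuses to start as the root user, so the restart has to be done as the elastic user; the -d option (a standard Elasticsearch flag) runs it as a daemon instead of backgrounding it with &. A sketch, assuming the same install path as above:
su - elastic
cd /home/elasticsearch/elasticsearch-5.4.1
./bin/elasticsearch -d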
3. Testing the IK analyzer
Create the test index:
PUT test
Fine-grained tokenization of the test index with ik_max_word:
GET test/_analyze?analyzer=ik_max_word
{
  "text": "武汉市长江大桥"
}
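If the query-string form of _analyze does not work on your version, the analyzer can also be passed in the request body instead (supported on 5.x):
GET test/_analyze
{
  "analyzer": "ik_max_word",
  "text": "武汉市长江大桥"
}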
Tokenization result:
{
  "tokens": [
    {
      "token": "武汉市",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "武汉",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "汉",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "市长",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "长江大桥",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "长江",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "江",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "大桥",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "桥",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 8
    }
  ]
}
Coarse-grained tokenization of the test index with ik_smart:
GET test/_analyze?analyzer=ik_smart
{
  "text": "武汉市长江大桥"
}
Tokenization result:
{
  "tokens": [
    {
      "token": "武汉市",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "长江大桥",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
4. Running the test case from the IK plugin documentation
Run the following commands:
// create an index named index
PUT /index
// create the _mapping for the fulltext type
POST /index/fulltext/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}
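To confirm that the mapping was applied, it can be read back with the standard mapping API:
GET /index/_mapping/fulltext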
// index four documents
POST /index/fulltext/1
{"content":"美国留给伊拉克的是个烂摊子吗"}
POST /index/fulltext/2
{"content":"公安部:各地校车将享最高路权"}
POST /index/fulltext/3
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
POST /index/fulltext/4
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
// search for documents containing 中国
POST /index/fulltext/_search
{
  "query": { "match": { "content": "中国" } },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}
Query result: two documents match.
{
  "took": 96,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5347766,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 0.5347766,    // relevance score
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.27638745,
        "_source": {
          "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        },
        "highlight": {
          "content": [
            "中韩渔警冲突调查:韩警平均每天扣1艘<tag1>中国</tag1>渔船"
          ]
        }
      }
    ]
  }
}
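The same search can also be run from the shell instead of Kibana, assuming Elasticsearch listens on localhost:9200 (a sketch, not tied to this setup):
curl -XPOST 'http://localhost:9200/index/fulltext/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "content": "中国" } }
}'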
5. Extending the dictionary
The steps: add a new dictionary file newdic.dic under the custom subdirectory of /home/elasticsearch/elasticsearch-5.4.1/plugins/elasticsearch-analysis-ik-5.4.1/config, point IKAnalyzer.cfg.xml at it (the entry is shown further below), and restart elasticsearch.
Before the extension, run:
GET test/_analyze?analyzer=ik_smart
{
  "text": "厉害了我的我的哥"
}
GET test/_analyze?analyzer=ik_smart
{
  "text": "蓝瘦香菇"
}
Output:
Result 1:
{
  "tokens": [
    {
      "token": "厉",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "害了",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "我",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "我",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "的哥",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
Result 2:
{
  "tokens": [
    {
      "token": "蓝",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "瘦",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "香菇",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}
To perform the extension: in the custom directory, create newdic.dic with vi.
Add the two new words, one per line, then save and exit; a sketch of the file is shown below.
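Contents of newdic.dic, inferred from the post-extension results further below (plain UTF-8 text, one word per line):
厉害了我的哥
蓝瘦香菇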
Register the new dictionary file in IKAnalyzer.cfg.xml:
<entry key="ext_dict">custom/newdic.dic;custom/mydict.dic;custom/single_word_low_freq.dic</entry>
Restart Elasticsearch and Kibana.
Tokenization results after the extension (re-running the _analyze requests):
Result 1:
{
  "tokens": [
    {
      "token": "厉害了我的哥",
      "start_offset": 0,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}
Result 2:
{
  "tokens": [
    {
      "token": "蓝瘦香菇",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}