I. Tokenization and built-in analyzers
- standard, the default analyzer: splits English text into individual words and automatically lowercases uppercase letters.
URL (POST):
http://47.107.41.60:9200/_analyze
Request body (JSON):
{
  "analyzer": "standard",
  "text": "My name is Xiaohei"
}
Response:
{
  "tokens": [
    {
      "token": "my",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "name",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 8,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "xiaohei",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
- simple: strips digits and symbols, and also lowercases uppercase letters.
- whitespace: splits on whitespace only; it does not lowercase.
- stop: removes meaningless stopwords such as "a", "the", and "is".
- keyword: performs no splitting at all; the entire input becomes a single token.
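Any of these analyzer names can be swapped into the same `_analyze` request shown above. For example, a sketch of a request running the whitespace analyzer over the same sentence:

```json
{
  "analyzer": "whitespace",
  "text": "My name is Xiaohei"
}
```

Since whitespace does not lowercase, the tokens come back with their original case ("My", "Xiaohei"), unlike the standard analyzer's output above.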
II. The ik Chinese analyzer
1. Installation
- Search GitHub for ik (the elasticsearch-analysis-ik plugin), download the zip archive matching your Elasticsearch version, and upload it to the Linux server.
- Unzip the archive into the plugins directory of Elasticsearch.
- Restart Elasticsearch and the analyzer is ready to use.
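The steps above can be sketched as shell commands; ES_HOME and the zip file name are placeholders for your actual install path and the release you downloaded:

```shell
# Placeholder path to the Elasticsearch install; adjust to your setup
ES_HOME=/usr/local/elasticsearch
# Create a directory for the plugin and unzip the release into it
mkdir -p "$ES_HOME/plugins/ik"
unzip elasticsearch-analysis-ik.zip -d "$ES_HOME/plugins/ik"
# Restart Elasticsearch (daemon mode) so the plugin is loaded
"$ES_HOME/bin/elasticsearch" -d
```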
2. ik_max_word
Splits the text at the finest granularity.
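A request like the following produces the ik_max_word output shown; the input text is inferred from the token offsets, so treat it as an assumption:

```json
{
  "analyzer": "ik_max_word",
  "text": "今天天气很好"
}
```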
Result:
{
"tokens": [
{
"token": "今天天气",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "今天",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "天天",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "天气",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "很好",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
}
]
}
3. ik_smart
Splits the text at the coarsest granularity.
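The corresponding request, with the same inferred input text as before:

```json
{
  "analyzer": "ik_smart",
  "text": "今天天气很好"
}
```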
Result:
{
"tokens": [
{
"token": "今天天气",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": "很好",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
}
]
}
4. Difference
ik_max_word exhaustively emits every word it can recognize, including overlapping tokens (今天天气, 今天, 天天, 天气 above), which makes it suitable for indexing; ik_smart keeps only the coarsest non-overlapping split (今天天气, 很好), which is typically used when analyzing search queries.
III. Custom Chinese dictionary
Sometimes we need to add vocabulary of our own:
- Edit IKAnalyzer.cfg.xml to register an extension dictionary.
- Create a wy.dic file and write the new words into it, one per line.
- Restart Elasticsearch.
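The registration in IKAnalyzer.cfg.xml looks roughly like this; the dictionary path is resolved relative to the config file, so adjust it to wherever wy.dic actually lives:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- register the custom dictionary file -->
    <entry key="ext_dict">wy.dic</entry>
</properties>
```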
Result:
{
"tokens": [
{
"token": "骚年",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "年在",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "慕课网",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 2
},
{
"token": "学习",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 3
}
]
}