Analyzers
An analyzer is a tool that splits a piece of user-supplied text into individual terms according to a given set of rules.
standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"standard",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "<NUM>",
"position": 2
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
}
]
}
simple analyzer
The simple analyzer splits text into terms at every character that is not a letter, and lowercases all terms (all non-letter characters are dropped).
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"simple",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 4
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 5
}
]
}
whitespace analyzer
The whitespace analyzer splits text into terms whenever it encounters a whitespace character (the sentence is simply split on spaces).
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"whitespace",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "The",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "3-points",
"start_offset": 9,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 4
},
{
"token": "Curry!",
"start_offset": 29,
"end_offset": 35,
"type": "word",
"position": 5
}
]
}
stop analyzer
The stop analyzer is very similar to the simple analyzer; the only difference is that it also removes stop words, using the english stop word list by default.
stopwords is a predefined list of stop words, for example the, an, a, this, of, at, and so on.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"stop",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 5
}
]
}
language analyzer
Language analyzers are tuned for a specific language (for example, english for English text). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"english",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document (note the stemmed tokens point and curri):
{
"tokens": [
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "<NUM>",
"position": 2
},
{
"token": "point",
"start_offset": 11,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "curri",
"start_offset": 29,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
}
]
}
pattern analyzer
The pattern analyzer splits text into terms using a regular expression; the default pattern is \W+ (one or more non-word characters).
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"pattern",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 2
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 4
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 5
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 6
}
]
}
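The default pattern can be overridden in a custom analyzer defined in the index settings. As a sketch (the index name my_pattern_index and the comma pattern are illustrative, not from the text above), an analyzer that splits on commas could be declared like this:

```shell
# Sketch: a custom pattern analyzer that splits terms on commas.
# "my_pattern_index" and "comma_analyzer" are hypothetical names.
curl -X PUT "localhost:9200/my_pattern_index" -H 'Content-Type:application/json' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
'
```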
Exercise
Create an index with a custom analyzer
curl -X PUT "localhost:9200/my_index" -H 'Content-Type:application/json' -d '
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"whitespace"
}
}
}
},
"mappings":{
"properties":{
"name":{
"type":"text"
},
"team_name":{
"type":"text"
},
"position":{
"type":"text"
},
"play_year":{
"type":"long"
},
"jerse_no":{
"type":"keyword"
},
"title":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
'
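Before indexing any documents, the custom analyzer can be checked with the index-scoped _analyze API. This request runs my_analyzer (defined above as type whitespace) against the sample sentence, so it should split on spaces only, keeping tokens like 3-points and Curry! intact:

```shell
# Verify the custom whitespace-based analyzer defined in my_index.
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type:application/json' -d '
{
  "analyzer":"my_analyzer",
  "text":"The best 3-points shooter is Curry!"
}
'
```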
Add a document
curl -X PUT 'localhost:9200/my_index/_doc/1' -H 'Content-Type:application/json' -d '
{
"name":"库里",
"team_name":"勇士",
"position":"控球后卫",
"play_year":10,
"jerse_no":"30",
"title":"The best 3-points shooter is Curry!"
}
'
Search
A document is only found when the query terms match the tokens the field's analyzer produced at index time. The title field uses the whitespace analyzer, so it was indexed with the token Curry! (punctuation kept); the match query below for Curry therefore returns no hits.
curl -X POST 'localhost:9200/my_index/_search' -H 'Content-Type:application/json' -d '
{
"query":{
"match":{
"title":"Curry"
}
}
}
'
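Because the whitespace analyzer kept the exclamation mark, querying for the exact indexed token Curry! should return the document (same index and field as above):

```shell
# The whitespace analyzer indexed the token "Curry!" (punctuation kept),
# so only a query matching that exact token hits the document.
curl -X POST 'localhost:9200/my_index/_search' -H 'Content-Type:application/json' -d '
{
  "query":{
    "match":{
      "title":"Curry!"
    }
  }
}
'
```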
Common Chinese analyzers
Using the default (standard) analyzer on Chinese text splits it into individual characters:
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"standard",
"text": "火箭明年总冠军"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "火",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "箭",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "明",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "年",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "总",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "冠",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "军",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
}
]
}
Common analyzers: smartCN
A simple analyzer for Chinese text and mixed Chinese-English text
Install the plugin
From the Elasticsearch bin directory, /Users/xxx/workspace/elasticsearch-7.2.0/bin, run:
sh elasticsearch-plugin install analysis-smartcn
Restart Elasticsearch after installing.
Usage
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"smartcn",
"text": "火箭明年总冠军"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "火箭",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "明年",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "总",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "冠军",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 3
}
]
}
Uninstall the plugin
sh elasticsearch-plugin remove analysis-smartcn
Common analyzers: the IK analyzer
A smarter and more flexible Chinese analyzer
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Download the release matching your Elasticsearch version (here ES 7.2).
Unzip the archive and move it into the plugins directory of the installation:
mv elasticsearch-analysis-ik-7.2.0 /Users/wuyihong/workspace/elasticsearch-7.2.0/plugins/
Restart Elasticsearch.
Usage
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"ik_max_word",
"text": "火箭明年总冠军"
}
'
The response is a JSON document (ik_max_word emits overlapping terms: both 总冠军 and 冠军 are produced):
{
"tokens": [
{
"token": "火箭",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "明年",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "总冠军",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "冠军",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
}
]
}
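The IK plugin ships two analyzers: ik_max_word, used above, exhaustively emits overlapping terms, while ik_smart produces the coarsest-grained split with no overlaps. The same text can be run through ik_smart for comparison:

```shell
# ik_smart: coarsest-grained segmentation, no overlapping terms.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
  "analyzer":"ik_smart",
  "text": "火箭明年总冠军"
}
'
```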