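What is an analyzer
An analyzer is a tool that takes a piece of input text and breaks it into multiple terms according to a set of rules.
Commonly used built-in analyzers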
standard analyzer
simple analyzer
whitespace analyzer
stop analyzer
language analyzer
pattern analyzer
standard analyzer
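The standard analyzer is the default analyzer; it is used whenever no analyzer is specified. It splits text on word boundaries, drops most punctuation, and lowercases every term.
POST /_analyze
{
  "analyzer": "standard",
  "text": "The best 3-points shooter is Curry!"
}
// Result. Note: to save space, everything except the token text has been removed from the response
{
  "tokens": [
    "the",
    "best",
    "3",
    "points",
    "shooter",
    "is",
    "curry"
  ]
}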
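To make the behavior above concrete, here is a rough Python sketch. It is only an approximation: the real standard analyzer uses Unicode text segmentation, while this sketch simply takes runs of word characters, so inputs such as contractions or e-mail addresses may tokenize differently. The function name `standard_analyze_approx` is made up for illustration.

```python
import re

def standard_analyze_approx(text: str) -> list[str]:
    """Rough stand-in for the standard analyzer (assumption: \\w+ is close enough)."""
    # Take runs of word characters and lowercase each one.
    return [token.lower() for token in re.findall(r"\w+", text)]

print(standard_analyze_approx("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```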
simple analyzer
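The simple analyzer splits text into terms at every character that is not a letter and lowercases all terms. Because digits are not letters, the "3" is discarded.
POST /_analyze
{
  "analyzer": "simple",
  "text": "The best 3-points shooter is Curry!"
}
// Result (token text only)
{
  "tokens": [
    "the",
    "best",
    "points",
    "shooter",
    "is",
    "curry"
  ]
}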
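The rule "split at anything that is not a letter, then lowercase" can be sketched in a few lines of Python. This is an illustration under the assumption that `[^\W\d_]` is a good enough stand-in for "letter"; the function name `simple_analyze` is hypothetical.

```python
import re

def simple_analyze(text: str) -> list[str]:
    """Approximate the simple analyzer: keep runs of letters, lowercase them."""
    # [^\W\d_] matches word characters that are neither digits nor underscore.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

print(simple_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```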
whitespace analyzer
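The whitespace analyzer splits text into terms at whitespace characters. It does not lowercase terms or remove punctuation.
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "The best 3-points shooter is Curry!"
}
// Result (token text only)
{
  "tokens": [
    "The",
    "best",
    "3-points",
    "shooter",
    "is",
    "Curry!"
  ]
}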
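Splitting on whitespace alone is simple enough to mimic with Python's `str.split()`. The sketch below is only an illustration (the function name `whitespace_analyze` is hypothetical), but it shows why case and punctuation survive.

```python
def whitespace_analyze(text: str) -> list[str]:
    """Approximate the whitespace analyzer: split on runs of whitespace only."""
    # No lowercasing and no punctuation removal happens here.
    return text.split()

print(whitespace_analyze("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```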
stop analyzer
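The stop analyzer is very similar to the simple analyzer. The only difference is that it also removes stop words; by default it uses the English stop word list.
stopwords: a predefined stop word list (default _english_), containing words such as the, a, an, this, of, at.
POST /_analyze
{
  "analyzer": "stop",
  "text": "The best 3-points shooter is Curry!"
}
// Result (token text only)
{
  "tokens": [
    "best",
    "points",
    "shooter",
    "curry"
  ]
}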
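The following Python sketch illustrates the two steps involved: tokenize like the simple analyzer, then drop stop words. The word set is Lucene's default English stop word list; the function name `stop_analyze` and the regex used for "letters" are assumptions made for this example.

```python
import re

# Lucene's default English stop word set.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def stop_analyze(text: str, stopwords=ENGLISH_STOP_WORDS) -> list[str]:
    """Approximate the stop analyzer: letter runs, lowercased, minus stop words."""
    tokens = [token.lower() for token in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in stopwords]

print(stop_analyze("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```

Passing a different `stopwords` set to the sketch mirrors what the analyzer's `stopwords` parameter does.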
language analyzer
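Language analyzers are tailored to a specific language, for example english, the English analyzer. In addition to removing English stop words, it stems each term, which is why "points" becomes "point" and "curry" becomes "curri" in the result.
POST /_analyze
{
  "analyzer": "english",
  "text": "The best 3-points shooter is Curry!"
}
// Result (token text only)
{
  "tokens": [
    "best",
    "3",
    "point",
    "shooter",
    "curri"
  ]
}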
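To show where "point" and "curri" come from, here is a heavily simplified Python sketch of the pipeline: tokenize, lowercase, remove stop words, stem. The helper `toy_stem` is hypothetical and applies only two rules from the Porter stemming algorithm (step 1a and step 1c); the real English analyzer uses a complete stemmer and will differ on many other words.

```python
import re

# Lucene's default English stop word set.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def toy_stem(word: str) -> str:
    """Hypothetical helper: Porter steps 1a and 1c only, not the full algorithm."""
    # Step 1a: strip plural endings.
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("ss"):
        pass
    elif word.endswith("s"):
        word = word[:-1]
    # Step 1c (simplified vowel check): trailing y becomes i
    # when the rest of the word contains a vowel.
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        word = word[:-1] + "i"
    return word

def english_analyze_sketch(text: str) -> list[str]:
    """Tokenize, lowercase, drop stop words, then stem (rough approximation)."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

print(english_analyze_sketch("The best 3-points shooter is Curry!"))
# ['best', '3', 'point', 'shooter', 'curri']
```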
pattern analyzer
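The pattern analyzer uses a regular expression to split text into terms. The default pattern is \W+ (one or more non-word characters), and terms are lowercased by default.
POST /_analyze
{
  "analyzer": "pattern",
  "text": "The best 3-points shooter is Curry!"
}
// Result (token text only)
{
  "tokens": [
    "the",
    "best",
    "3",
    "points",
    "shooter",
    "is",
    "curry"
  ]
}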
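Here the regular expression describes the separators, not the tokens. A small Python sketch of that idea follows; the function name `pattern_analyze` and its parameters are made up for illustration, and Python's `\W` is Unicode-aware, so results on non-ASCII text may differ from the Java regex Elasticsearch uses.

```python
import re

def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> list[str]:
    """Approximate the pattern analyzer: split on the separator regex."""
    # re.split can yield empty strings at the edges; drop them.
    tokens = [token for token in re.split(pattern, text) if token]
    return [token.lower() for token in tokens] if lowercase else tokens

print(pattern_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Supplying another separator, for example `pattern_analyze("a,b,,c", pattern=",")`, gives `['a', 'b', 'c']`.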
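Usage example
Create a new index and, when defining the mapping, assign a custom analyzer to a field. Here my_analyzer is a whitespace analyzer applied to the title field.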
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "team_name": {
        "type": "text"
      },
      "position": {
        "type": "text"
      },
      "play_year": {
        "type": "long"
      },
      "jerse_no": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
PUT /my_index/_doc/1
{
  "name": "库里",
  "team_name": "勇士",
  "position": "控球后卫",
  "play_year": 10,
  "jerse_no": "30",
  "title": "The best 3-points shooter is Curry!"
}
POST /my_index/_search
{
  "query": {
    "match": {
      "title": "Curry!"
    }
  }
}
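This search finds the document because a match query analyzes the query text with the field's analyzer, and the whitespace analyzer keeps "Curry!" intact, capital letter and exclamation mark included. The Python sketch below imitates that term comparison under simplifying assumptions (a single document, default OR matching); the names `whitespace_analyze` and `matches` are hypothetical.

```python
def whitespace_analyze(text: str) -> list[str]:
    """Approximate the whitespace analyzer: split on whitespace only."""
    return text.split()

# Terms stored for the title field at index time.
indexed_terms = set(whitespace_analyze("The best 3-points shooter is Curry!"))

def matches(query_text: str) -> bool:
    """A match query hits if any analyzed query term equals an indexed term."""
    return any(term in indexed_terms for term in whitespace_analyze(query_text))

print(matches("Curry!"))  # True: the exact term "Curry!" was indexed
print(matches("curry"))   # False: no lowercasing or punctuation removal happened
```

If title used the standard analyzer instead, both the indexed term and the query term would be normalized to "curry", so either spelling would match.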