Elasticsearch ships with eight built-in analyzers; different scenarios call for different analyzers.
1. keyword analyzer
1.1 keyword type and tokenization behavior
The keyword analyzer treats the whole string as a single unit and performs no tokenization.
//Test the default tokenization of the keyword analyzer
//Request
POST _analyze
{
  "analyzer": "keyword",
  "text": "The aggregations framework helps provide aggregated data based on a search query"
}
//Response
{
  "tokens" : [
    {
      "token" : "The aggregations framework helps provide aggregated data based on a search query",
      "start_offset" : 0,
      "end_offset" : 80,
      "type" : "word",
      "position" : 0
    }
  ]
}
After analysis, the sentence above produces a single term:
[The aggregations framework helps provide aggregated data based on a search query]
1.2 Composition of the keyword analyzer
No. | Sub-component | Description |
---|---|---|
1 | Tokenizer | keyword tokenizer |
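The behavior above can be sketched in a few lines of Python (an illustration only, not Elasticsearch's actual implementation): the keyword tokenizer emits the input unchanged as a single token, and no token filters run by default.

```python
# Sketch of the keyword analyzer: the keyword tokenizer emits the whole
# input string as a single token, and no token filters are applied.
def keyword_analyze(text: str) -> list[str]:
    return [text]

sentence = "The aggregations framework helps provide aggregated data based on a search query"
print(keyword_analyze(sentence))  # one term: the entire sentence
```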
If you want a custom analyzer that behaves like keyword, just set its tokenizer to keyword when defining the analyzer; everything else (char filters / token filters) can be configured as needed. For example:
//Custom keyword analyzer
PUT custom_rebuild_keyword_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_keyword_analyzer": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}
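You can confirm that the custom analyzer behaves like the built-in keyword analyzer by calling the index-scoped _analyze API on the new index (same request shape as before):

```
//Verify the custom analyzer against the index
POST custom_rebuild_keyword_analyzer_index/_analyze
{
  "analyzer": "rebuild_keyword_analyzer",
  "text": "It's a nice day"
}
```

The response should contain a single token holding the entire input string.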
2. pattern analyzer
2.1 pattern type and tokenization behavior
The pattern analyzer splits text using a regular expression that matches the token separators. If the separator itself contains regex metacharacters, escape them so they match literally instead of being interpreted as part of the pattern. The default pattern is \W+ (one or more non-word characters), and tokens are lowercased by default.
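The default rule can be mimicked with Python's re module (a rough sketch of the observable behavior, not Elasticsearch's internals, which use Java regular expressions):

```python
import re

# Sketch of the default pattern analyzer: split on \W+ (runs of non-word
# characters), lowercase each token, and drop empty strings.
def pattern_analyze(text: str, pattern: str = r"\W+") -> list[str]:
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze("It's a nice day"))  # ['it', 's', 'a', 'nice', 'day']
```

Note how the apostrophe in "It's" is itself a separator, which is why the Elasticsearch example below yields the tokens "it" and "s".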
//Test the default tokenization of the pattern analyzer
//Request
POST _analyze
{
  "analyzer": "pattern",
  "text": "It's a nice day"
}
//Response
{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },