The role of ES analyzers
Before indexing, an ES analyzer splits field values into tokens, which are used to build the inverted index; at query time, the query keywords are tokenized by the specified analyzer, and the resulting tokens are used to search the index data.
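As a quick illustration (a minimal sketch using the built-in standard analyzer; any analyzer name can be substituted), the _analyze API shows exactly which tokens an analyzer produces:
POST _analyze
{
  "analyzer": "standard",
  "text": "Quick Brown Foxes"
}
Result: [ quick, brown, foxes ]; the text is split on word boundaries and lowercased.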
The components of an ES analyzer
An ES analyzer consists of three parts:
- char_filter: filters the characters of the raw field value before tokenization
- tokenizer: processes the input text and splits it into individual tokens
- filter: a post-processor; after the tokenizer has produced tokens, token filters process them further and can add or remove tokens. Chinese pinyin analysis and synonym handling both work this way.
Accordingly, a custom analyzer definition looks like this:
{
  "analysis": {
    "filter": {
      "filter_a": {
        ...
      },
      "filter_b": {
        ...
      }
    },
    "analyzer": {
      "analyzer_a": {
        "tokenizer": "...",
        "filter": "...",
        "char_filter": "...",
        ... // other attribute settings
      },
      "analyzer_b": {
        "tokenizer": "...",
        "filter": "...",
        "char_filter": "...",
        ... // other attribute settings
      }
    },
    "tokenizer": {
      "tokenizer_a": {
        ...
      }
    },
    "char_filter": {
    }
  }
}
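As a concrete sketch of how these pieces fit together when creating an index (the names my_index, my_char_filter, my_tokenizer, my_stop_filter, and my_analyzer are illustrative, not from the original):
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 10
        }
      },
      "filter": {
        "my_stop_filter": {
          "type": "stop",
          "stopwords": ["the", "is"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase", "my_stop_filter"]
        }
      }
    }
  }
}
A field can then reference it in the mapping via "analyzer": "my_analyzer".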
char_filter
Filters the characters of the raw field value before tokenization. There are three main types:
- html_strip: strips HTML tags; tags that should be kept can be listed in the escaped_tags property.
html_strip is a built-in character filter and can be used directly:
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ]
}
It can also be customized to specify which tags to exclude from stripping:
"char_filter": {
  "my_char_filter": {
    "type": "html_strip",
    "escaped_tags": ["b"]
  }
}
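A usage sketch of the customized filter above (this follows the official docs example; the text is illustrative), defining the char_filter inline in _analyze:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
  ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
The <p> tags are stripped while the escaped <b> tags are kept, yielding the single token [ \nI'm so <b>happy</b>!\n ].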
- mapping: a character-mapping filter.
A mapping char_filter must be defined explicitly:
mappings: character mappings; each entry replaces one string with another (space characters must be escaped)
"char_filter": {
  "my_char_filter": {
    "type": "mapping",
    "mappings": [
      "٠ => 0",
      "١ => 1",
      "٢ => 2",
      "٣ => 3",
      "٤ => 4",
      "٥ => 5",
      "٦ => 6",
      "٧ => 7",
      "٨ => 8",
      "٩ => 9"
    ]
  }
}
mappings_path: path to a UTF-8 mapping file whose contents follow the same format as mappings above
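A usage sketch of the mapping above (this mirrors the official docs example for converting Arabic digits to Latin digits):
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
        "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}
Result:
[ My license plate is 25015 ]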
- pattern_replace: regular-expression replacement.
A pattern_replace char_filter must be defined explicitly:
pattern: a Java regular expression
replacement: the replacement string
flags: Java regex flags
Example from the official docs:
"char_filter": {
  "my_char_filter": {
    "type": "pattern_replace",
    "pattern": "(\\d+)-(?=\\d)",
    "replacement": "$1_"
  }
}
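A usage sketch for the example above (the text is from the official docs): a digit run followed by a hyphen and another digit has the hyphen replaced with an underscore, so the number is not split apart by the tokenizer:
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "My credit card is 123-456-789"
}
Result:
[ My, credit, card, is, 123_456_789 ]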
tokenizer
The tokenizer processes the input text and splits it into individual tokens.
The built-in tokenizers are as follows:
- standard: the default tokenizer; works well for English but is not suitable for Chinese.
Properties:
max_token_length: maximum token length
"tokenizer": {
  "my_tokenizer": {
    "type": "standard",
    "max_token_length": 5
  }
}
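A usage sketch for max_token_length (recent ES versions accept an inline tokenizer definition in _analyze; alternatively, register my_tokenizer in the index settings first):
POST _analyze
{
  "tokenizer": {
    "type": "standard",
    "max_token_length": 5
  },
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result (tokens longer than 5 characters are split):
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]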
- letter: splits on every non-letter character, so tokens contain only letters:
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result:
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
- lowercase: same splitting behavior as letter, but additionally lowercases each token:
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result:
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
- whitespace: splits the input text on whitespace characters:
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
- uax_url_email: like standard, but keeps each URL and email address as a single token.
Properties:
max_token_length: maximum token length; see standard
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
Result:
[ Email, me, at, john.smith@global-international.com ]
The standard tokenizer would instead produce:
[ Email, me, at, john.smith, global, international.com ]
- classic: skipped here
- thai: Thai-language tokenizer, skipped here
- ngram: splits the input into contiguous character sequences of the given lengths, ignoring semantics; an ngram tokenizer can serve as a substitute for fuzzy/wildcard queries (a customized example follows the default output below).
Properties:
min_gram: minimum gram length, default 1
max_gram: maximum gram length, default 2
token_chars: character classes to keep in tokens; possible values:
letter: letters, e.g. a, b, c, 京
digit: digits
whitespace: whitespace characters
punctuation: punctuation marks
symbol: symbols
POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}
Result:
[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
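A customized ngram sketch (values taken from the official docs example), emitting only 3-character grams made of letters and digits:
"tokenizer": {
  "my_tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": ["letter", "digit"]
  }
}
For the text "2 Quick Foxes." this produces:
[ Qui, uic, ick, Fox, oxe, xes ]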
- edge_ngram: splits the input into tokens, then emits n-grams anchored at the start of each token; edge_ngram is useful for search-as-you-type suggestions (see the sketch after the default output below).
Properties: same as ngram
POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}
Result (with the defaults min_gram=1, max_gram=2):
[ Q, Qu ]
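A customized edge_ngram sketch for prefix suggestions (values follow the official docs example):
"tokenizer": {
  "my_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": ["letter", "digit"]
  }
}
For the text "2 Quick Foxes." this produces:
[ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ]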
- keyword: the no-op tokenizer; the entire input is emitted as a single token.
Properties:
buffer_size: the term buffer size in characters, default 256; the buffer grows by this amount until all input is consumed, so changing it is rarely necessary
- pattern: regular-expression tokenizer.
Properties:
pattern: a Java regular expression, default \W+
flags: Java regex flags
group: which capture group to emit as the token; default -1 (split on the pattern instead)
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
Result:
[ The, foo_bar_size, s, default, is, 5 ]
A custom pattern tokenizer that splits the input on commas (usage sketch below):
"my_tokenizer": {
  "type": "pattern",
  "pattern": ","
}
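A usage sketch for the comma tokenizer above (text from the official docs; the tokenizer is defined inline here for brevity):
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "text": "comma,separated,values"
}
Result:
[ comma, separated, values ]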
- others: skipped here
filter
A post-processor: after the tokenizer has produced tokens, token filters process them further and can add or remove tokens; Chinese pinyin analysis and synonym handling both work this way. Only the commonly used filters are covered here.
- length: removes tokens that are too long or too short (see the sketch below);
Properties:
min: minimum length, default 0
max: maximum length, default Integer.MAX_VALUE
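A sketch (my_length is an illustrative name) that keeps only tokens between 2 and 10 characters long:
"filter": {
  "my_length": {
    "type": "length",
    "min": 2,
    "max": 10
  }
}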
- lowercase: converts uppercase characters in tokens to lowercase;
- uppercase: converts lowercase characters in tokens to uppercase;
- nGram: see the ngram tokenizer;
- edgeNGram: see the edge_ngram tokenizer;
- stop: removes stop words from the token stream:
Properties:
stopwords: the stop-word set, default _english_
stopwords_path: path to a stop-word file
ignore_case: ignore case when matching stop words
remove_trailing: whether to remove the last token if it is a stop word; default true
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["and", "is", "the"]
}
}
- word_delimiter: splits tokens at word boundaries (sketch below):
Properties:
generate_word_parts: emit the split word parts, "PowerShot" ⇒ "Power" "Shot"; default true
generate_number_parts: emit the split number parts, "500-42" ⇒ "500" "42"; default true
catenate_words: also emit the joined word parts, "wi-fi" ⇒ "wifi"; default false
catenate_numbers: also emit the joined number parts, "500-42" ⇒ "50042"; default false
catenate_all: also emit everything joined, "wi-fi-4000" ⇒ "wifi4000"; default false
split_on_case_change: split tokens on case transitions; default true
preserve_original: also keep the original token, "500-42" ⇒ "500-42" "500" "42"; default false
split_on_numerics: split on letter-number transitions, "j2se" ⇒ "j" "2" "se"; default true
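A sketch (my_word_delimiter is an illustrative name) that keeps the original token alongside the joined parts, so both "wi-fi" and "wifi" become searchable:
"filter": {
  "my_word_delimiter": {
    "type": "word_delimiter",
    "catenate_words": true,
    "preserve_original": true
  }
}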
- synonym: synonym filter:
Properties:
synonyms_path: path to a synonyms file
synonyms: inline synonym rules
"filter" : {
  "synonym" : {
    "type" : "synonym",
    "format" : "wordnet",
    "synonyms" : [
      "s(100000001,1,'abstain',v,1,0).",
      "s(100000001,2,'refrain',v,1,0).",
      "s(100000001,3,'desist',v,1,0)."
    ]
  }
}
Synonym file format rules:
UTF-8 encoded
Several words on one line, separated by commas, form a bidirectional synonym group
One-way synonyms use =>, e.g. i-pod, i pod => ipod
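A sketch using the default Solr-style format with the rules just described (my_synonym is an illustrative name):
"filter": {
  "my_synonym": {
    "type": "synonym",
    "synonyms": [
      "i-pod, i pod => ipod",
      "universe, cosmos"
    ]
  }
}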
- other filters: skipped here.
Custom analyzers
For Chinese text, the most commonly used analysis plugins are ik and pinyin;
# ik analyzer download:
https://github.com/medcl/elasticsearch-analysis-ik
# pinyin analyzer download:
https://github.com/medcl/elasticsearch-analysis-pinyin
- The ik plugin provides two analyzers:
ik_smart: coarse-grained segmentation
ik_max_word: fine-grained segmentation
Comparison (request sketch below):
text: 我是中国人
ik_smart: 我, 是, 中国人
ik_max_word: 我, 是, 中国人, 中国, 国人
Choose between ik_smart and ik_max_word according to your business needs.
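A quick way to reproduce the comparison above (assuming the ik plugin is installed; otherwise the request fails with an unknown-analyzer error):
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}
Expected tokens: [ 我, 是, 中国人, 中国, 国人 ]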
- The pinyin plugin provides a pinyin tokenizer, a pinyin analyzer, and a pinyin filter; for the parameters and usage, see its GitHub page.