The standard token filter currently does nothing; it passes the token stream through unchanged.
2. ASCII Folding Token Filter
A token filter of type asciifolding converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist.
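The original gives no example here; a minimal sketch following the same _analyze pattern used below (the text value is illustrative):
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["asciifolding"],
  "text" : "açaí à la carte"
}
The accented characters are folded to their ASCII equivalents, yielding the tokens acai, a, la, carte.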
3. Length Token Filter
The length token filter removes tokens that are too long or too short;
min defines the minimum length
max defines the maximum length
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "length", "min":1, "max":3 }],
"text" : "this is a test"
}
Output:
"tokens": [
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2
}
]
4. Lowercase Token Filter
A token filter of type lowercase that normalizes token text to lowercase.
5. Uppercase Token Filter
A token filter of type uppercase that normalizes token text to uppercase.
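Neither filter takes required parameters; a minimal sketch, same _analyze pattern as the other examples:
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "The Quick FOX"
}
This yields the, quick, fox; substituting uppercase for lowercase yields THE, QUICK, FOX instead.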
6. Porter Stem Token Filter
A token filter of type porter_stem that transforms the token stream according to the Porter stemming algorithm.
Note: the input must already be lowercase for the filter to work correctly.
What it does: returns the stem of an English word
- 1> Handles plurals and words ending in ed or ing
- 2> Reduces plurals to the singular form
- 3> If a word contains a vowel and ends in ys, the ys is changed to i (based on testing, it seems)
...and so on; the rules all revolve around extracting the stem.
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": ["porter_stem"],
"text" : ["I readed books", "eys"]
}
Output:
"tokens": [
{
"token": "I",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "read",
"start_offset": 2,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "book",
"start_offset": 9,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "ei",
"start_offset": 15,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
}
]
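Note that "I" passed through unstemmed above. Because porter_stem expects lowercase input (see the note earlier), it is normally chained after lowercase; a sketch:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": ["lowercase", "porter_stem"],
  "text" : "Reading Books"
}
This yields read and book.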
7. Shingle Token Filter
A token filter of type shingle that creates combinations of tokens as single tokens (word n-grams).
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "shingle", "output_unigrams": "false"}],
"text" : ["this is a test"]
}
Output:
# Note: if output_unigrams is set to true (and it defaults to true), the original single-word tokens are emitted as well
# "this is a test" becomes "this is", "is a", "a test" (pairwise combinations); the shingle size is configurable (see the sketch after this output)
[
{
"token": "this is",
"start_offset": 0,
"end_offset": 7,
"type": "shingle",
"position": 0
},
{
"token": "is a",
"start_offset": 5,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "a test",
"start_offset": 8,
"end_offset": 14,
"type": "shingle",
"position": 2
}
]
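As mentioned in the comment above, the shingle size is configurable, via the min_shingle_size and max_shingle_size parameters; a sketch that emits both bigrams and trigrams:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "shingle", "min_shingle_size": 2, "max_shingle_size": 3, "output_unigrams": false}],
  "text" : "this is a test"
}
The output contains "this is", "this is a", "is a", "is a test", and "a test".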
8. Stop Token Filter
A token filter of type stop that removes the words listed in stopwords from the token stream.
stopwords: a list of stop words; defaults to the _english_ stop words.
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "stop", "stopwords": ["this", "a"]}],
"text" : ["this is a test"]
}
Output:
# The stop words this and a are filtered out;
"tokens": [
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "test",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 3
}
]
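For regular use, the filter is usually declared in the index settings rather than inline; a sketch (my_index, my_analyzer, and my_stop are placeholder names), mirroring the pattern used for the stemmer in section 11:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["this", "a"]
        }
      }
    }
  }
}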
9. Word Delimiter Token Filter
The word_delimiter token filter splits words into subwords and performs optional transformations on subword groups. Words are split into subwords according to the following rules:
Split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"
Split on case transitions: "PowerShot" → "Power", "Shot"
Split on letter-number transitions: "SD500" → "SD", "500"
Leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"
The trailing possessive "'s" is removed from each subword: "O’Neil’s" → "O", "Neil"
Parameters include:
generate_word_parts
If true, word parts are generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.
generate_number_parts
If true, number subwords are generated: "500-42" ⇒ "500" "42". Defaults to true.
catenate_words
If true, maximal runs of word parts are catenated together: "wi-fi" ⇒ "wifi". Defaults to false.
catenate_numbers
If true, maximal runs of number parts are catenated together: "500-42" ⇒ "50042". Defaults to false.
catenate_all
If true, all subword parts are catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.
split_on_case_change
If true, "PowerShot" becomes two tokens ("Power-Shot" is treated as two parts). Defaults to true.
preserve_original
If true, the original word is preserved alongside the subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.
split_on_numerics
If true, "j2se" becomes three tokens: "j" "2" "se". Defaults to true.
stem_english_possessive
If true, the trailing "'s" is removed from each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.
Advanced settings:
protected_words
A list of words protected from being split. Either an array, or protected_words_path can be set to a file containing the protected words (one per line). If present, the path is resolved relative to the config/ location.
Input:
GET _analyze
{
"tokenizer" : "standard",
"filter": [{"type": "word_delimiter"}],
"text" : ["PowerShot", "219-230"]
}
Output:
# PowerShot and 219-230 are split into Power, Shot, 219, 230
"tokens": [
{
"token": "Power",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Shot",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "219",
"start_offset": 10,
"end_offset": 13,
"type": "<NUM>",
"position": 2
},
{
"token": "230",
"start_offset": 14,
"end_offset": 17,
"type": "<NUM>",
"position": 3
}
]
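A sketch combining several of the parameters above (the whitespace tokenizer is used so that "wi-fi" reaches the filter intact; the parameter values are illustrative):
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter": [{
    "type": "word_delimiter",
    "catenate_words": true,
    "preserve_original": true,
    "protected_words": ["Wi-Fi"]
  }],
  "text" : ["wi-fi", "Wi-Fi"]
}
For "wi-fi" this emits the original token plus wi, wifi, and fi; "Wi-Fi" is listed in protected_words and passes through unchanged.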
10. Word Delimiter Graph Token Filter
Skipped.
11. Stemmer Token Filter
The stemmer token filter is a generic interface through which almost all of the available stemmers can be configured.
Usage:
PUT /my_index
{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "light_german"
}
}
}
}
}
The language/name parameter controls which stemmer is used (light_german in the example above); the full list of available values is given in the Elasticsearch documentation.
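The stemmer can also be exercised inline; a quick sketch using the english value (one of the available language names):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [{"type": "stemmer", "name": "english"}],
  "text": "running"
}
This yields the single token run.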