There is no such thing as a perfect program, but that is no reason to be discouraged: writing programs is a continuous pursuit of perfection.
Custom analyzers (a minimal end-to-end sketch follows this list):
- Character filters:
  1. Role: add, delete, or transform characters
  2. Count: zero or more
  3. Built-in character filters:
     1. HTML Strip Character Filter: strips HTML tags
     2. Mapping Character Filter: replaces characters via a mapping table
     3. Pattern Replace Character Filter: replaces characters via a regex
- Tokenizer:
  1. Role:
     1. splits the text into tokens
     2. records the order and position of each token (used by phrase queries)
     3. records the start and end character offsets of each token (used by highlighting)
     4. records the type of each token (used for classification)
  2. Count: exactly one
  3. Categories:
     1. Word-oriented tokenizers:
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
     2. Partial-word tokenizers:
        1. N-Gram
        2. Edge N-Gram
     3. Structured-text tokenizers:
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path
- Token filters:
  1. Role: add, delete, or transform tokens
  2. Count: zero or more
  3. Filters demonstrated below:
     1. apostrophe
     2. asciifolding
     3. cjk bigram
     4. cjk width
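To make the three-part structure concrete, here is a minimal sketch of a custom analyzer that combines one built-in component of each kind. The index name my-index and the analyzer name my_custom_analyzer are placeholder names for illustration; the components themselves (html_strip, standard, lowercase, asciifolding) are built-ins from the lists above.

# A custom analyzer: one char filter, one tokenizer, two token filters
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

# Exercise it the same way as the built-in filters below
GET /my-index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": ["<p>Açaí bowls</p>"]
}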
Today we will walk through several interesting token filters:
# apostrophe token filter
# Strips the apostrophe and all characters after it
# Mainly used for Turkish
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["apostrophe"],
  "text": ["Istanbul'a veya Istanbul'dan"]
}
# Result
{
  "tokens" : [
    {
      "token" : "Istanbul",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "veya",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "Istanbul",
      "start_offset" : 16,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
# asciifolding token filter
# Folds characters outside basic Latin into their ASCII equivalents where one
# exists (açaí -> acai); characters with no equivalent, such as the Chinese
# below, pass through unchanged
# Options:
#   preserve_original: also emit the original token, defaults to false
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "asciifolding",
      "preserve_original": true
    }
  ],
  "text": ["hello good 我是中国人 açaí à"]
}
# Result
{
  "tokens" : [
    {
      "token" : "hello good 我是中国人 acai a",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hello good 我是中国人 açaí à",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
# cjk_bigram token filter
# Forms bigrams out of CJK (Chinese, Japanese, Korean) tokens
# Works on the tokenizer's output: the standard tokenizer first splits the
# Chinese text into single characters, which the filter then recombines
# Options:
#   1. ignored_scripts: scripts to exclude from bigramming
#      1. han: Han characters (Chinese)
#      2. hangul: Korean hangul
#      3. hiragana: Japanese hiragana
#      4. katakana: Japanese katakana
#   2. output_unigrams:
#      1. if true, output unigrams as well as bigrams
#      2. defaults to false
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["cjk_bigram"],
  "text": ["我是中国人"]
}
# Result
{
  "tokens" : [
    {
      "token" : "我是",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<DOUBLE>",
      "position" : 0
    },
    {
      "token" : "是中",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "<DOUBLE>",
      "position" : 1
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<DOUBLE>",
      "position" : 2
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<DOUBLE>",
      "position" : 3
    }
  ]
}
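The request above uses the filter's defaults. As a sketch of the output_unigrams option described earlier, a token filter can also be declared inline with its configuration in the _analyze request; with output_unigrams set to true the response should contain the single characters (我, 是, ...) in addition to the bigrams (我是, 是中, ...).

# Same input, but emit unigrams alongside the bigrams
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "cjk_bigram",
      "output_unigrams": true
    }
  ],
  "text": ["我是中国人"]
}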
# cjk_width token filter
# Normalizes width differences in CJK text:
#   folds full-width ASCII variants into the equivalent basic Latin characters
#   folds half-width katakana variants into the equivalent kana
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["cjk_width"],
  "text": ["我是中国人 fsdf"]
}
# Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "fsdf",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
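Note that the Latin letters in the input above are already half-width, so cjk_width leaves them unchanged. To actually see the folding, feed the filter a full-width variant; the input below is made up for illustration, and the expected output is a single token in the basic Latin form elastic.

# Full-width "ｅｌａｓｔｉｃ" should be folded to half-width "elastic"
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["cjk_width"],
  "text": ["ｅｌａｓｔｉｃ"]
}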
That's all for today; we'll continue tomorrow.