Based on Elasticsearch 7.6.1 and Kibana 7.6.1
I. Fundamentals
A complete analyzer consists of the following three components, applied in this order:
- Character Filters: preprocess the raw text, e.g. by deleting or replacing characters
- Tokenizer: splits the (filtered) text into tokens according to some rule
- Token Filters: post-process the token list produced by the tokenizer (transforming, filtering, deleting tokens, etc.)
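The three stages always run in that fixed order: character filters → tokenizer → token filters. A minimal pure-Python sketch of this composition (the helper functions are illustrative stand-ins, not ES's actual implementation):

```python
import re

# Sketch of the analyzer pipeline: char filters transform the raw string,
# the tokenizer splits it, and token filters transform the token list.
def analyze(text, char_filters, tokenizer, token_filters):
    for cf in char_filters:        # stage 1: character filters
        text = cf(text)
    tokens = tokenizer(text)       # stage 2: tokenizer
    for tf in token_filters:       # stage 3: token filters
        tokens = tf(tokens)
    return tokens

# Rough stand-ins for the built-in components used in this note
html_strip = lambda s: re.sub(r"<[^>]+>", "", s)
standard = lambda s: re.findall(r"\w+", s)
lowercase = lambda ts: [t.lower() for t in ts]

print(analyze("<b>You Know, for Search</b>", [html_strip], standard, [lowercase]))
# → ['you', 'know', 'for', 'search']
```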
1. Character Filters
Built-in character filters in ES include: html_strip (strips HTML tags), mapping (replaces strings according to a key=>value map), and pattern_replace (regex-based replacement).
Examples
# char_filter: "html_strip" strips HTML tags
POST _analyze
{
"char_filter": [
"html_strip"
],
"tokenizer": "standard",
"text": "<html><body><div><span>You Know, for Search</span></div></body></html>"
}
# char_filter: "mapping" replaces/maps strings
POST _analyze
{
"char_filter": [
{
"type": "mapping",
"mappings": [
"&=>and",
"&&=>and"
]
}
],
"tokenizer": "standard",
"text": "You & me, go && java"
}
# char_filter: "pattern_replace" regex-based replacement
POST _analyze
{
"char_filter": [
{
"type": "pattern_replace",
"pattern": "(v[0-9.]+)",
"replacement": "latest version"
}
],
"tokenizer": "standard",
"text": "kibana v7.6.1 and elasticsearch v7.6.1"
}
# char_filter: combining multiple character filters
POST _analyze
{
"char_filter": [
"html_strip",
{
"type": "mapping",
"mappings": [
"&=>and",
"&&=>and"
]
},
{
"type": "pattern_replace",
"pattern": "(v[0-9.]+)",
"replacement": "latest version"
}
],
"tokenizer": "standard",
"text": "<html><body><div><span>kibana v7.6.1 && elasticsearch v7.6.1</span></div></body></html>"
}
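The combined request above can be approximated in plain Python by applying stand-ins for the three character filters in the same order (a sketch, not ES's implementation; note that ES's mapping filter matches the longest key first, so in this naive string-replace version "&&" must be handled before "&"):

```python
import re

text = "<html><body><div><span>kibana v7.6.1 && elasticsearch v7.6.1</span></div></body></html>"

# 1) html_strip: drop anything that looks like a tag (crude approximation)
text = re.sub(r"<[^>]+>", "", text)

# 2) mapping: "&&" and "&" both map to "and"; replace the longer key first
text = text.replace("&&", "and").replace("&", "and")

# 3) pattern_replace: same regex and replacement as in the request above
text = re.sub(r"v[0-9.]+", "latest version", text)

print(text)
# → kibana latest version and elasticsearch latest version
```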
2. Token Filters
Token filters post-process the token list produced by the tokenizer, including transforming, filtering, and deleting tokens. Built-in token filters in ES include: lowercase (converts tokens to lowercase), stop (removes stop words), and synonym (adds synonyms).
Examples
# filter: "stop"
# The default stop-word list contains only lowercase words. 'A' is a stop word, but because it is uppercase here, the stop token filter alone cannot remove it.
POST _analyze
{
"tokenizer": "standard",
"filter": [
"stop"
],
"text": "A man sat alone on a stone bench"
}
# filter: "stop" and "lowercase"
# Note the order in which the two filters are applied: lowercase runs first, so 'A' becomes 'a' and can then be removed by stop.
POST _analyze
{
"tokenizer": "standard",
"filter": [
"lowercase",
"stop"
],
"text": "A man sat alone on a stone bench"
}
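The effect of the filter order can be reproduced with a small simulation (a sketch; the stop list below is a tiny subset of ES's default English list):

```python
# Tiny subset of ES's default English stop-word list, for illustration only
STOP = {"a", "an", "on", "the"}

def stop(tokens):
    # Removes tokens found in the stop list; matching is exact,
    # so an uppercase 'A' survives.
    return [t for t in tokens if t not in STOP]

def lowercase(tokens):
    return [t.lower() for t in tokens]

tokens = "A man sat alone on a stone bench".split()

# stop alone: 'A' is kept because the stop list is all lowercase
print(stop(tokens))
# → ['A', 'man', 'sat', 'alone', 'stone', 'bench']

# lowercase first, then stop: 'A' becomes 'a' and is removed
print(stop(lowercase(tokens)))
# → ['man', 'sat', 'alone', 'stone', 'bench']
```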
II. Defining a Custom Analyzer in an Index
Examples
Example 01
# Build a custom analyzer by combining character filters, a tokenizer, and token filters
# standard_custom is the name of the custom analyzer
PUT movies
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1,
"analysis": {
"analyzer": {
"standard_custom": {
"type": "custom",
"char_filter": [
"html_strip"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"stop"
]
}
}
}
}
}
# Run the custom analyzer and the standard analyzer (note the difference between their results)
GET movies/_analyze
{
"analyzer": "standard_custom",
"text": "<html><body><div><span>A man sat alone on a stone bench</span></div></body></html>"
}
GET movies/_analyze
{
"analyzer": "standard",
"text": "<html><body><div><span>A man sat alone on a stone bench</span></div></body></html>"
}
Example 02
PUT songs
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1,
"analysis": {
"char_filter": {
"CF1": {
"type": "pattern_replace",
"pattern": "(v[0-9.]+)",
"replacement": "latest version"
},
"CF2": {
"type": "mapping",
"mappings": [
"&=>and",
"&&=>and"
]
}
},
"analyzer": {
"standard_custom": {
"type": "custom",
"char_filter": [
"html_strip",
"CF1",
"CF2"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"stop"
]
}
}
}
}
}
# Run the custom analyzer
GET songs/_analyze
{
"analyzer": "standard_custom",
"text": "<html><body><div><span>A man sat alone on a stone bench & kibana v7.6.1 && elasticsearch v7.6.1</span></div></body></html>"
}
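End to end, this analyzer applies html_strip, CF1, CF2, the standard tokenizer, lowercase, and stop in sequence. A pure-Python approximation of the request above (a sketch with illustrative stand-ins; the stop list is a small subset of the default, which also contains "and"):

```python
import re

STOP = {"a", "an", "on", "the", "and"}  # small subset of the default English list

text = ("<html><body><div><span>A man sat alone on a stone bench & "
        "kibana v7.6.1 && elasticsearch v7.6.1</span></div></body></html>")

# Character filters, in the order declared in the analyzer
text = re.sub(r"<[^>]+>", "", text)                    # html_strip
text = re.sub(r"v[0-9.]+", "latest version", text)     # CF1: pattern_replace
text = text.replace("&&", "and").replace("&", "and")   # CF2: mapping

# Standard tokenizer (rough stand-in), then the token filters
tokens = re.findall(r"\w+", text)
tokens = [t.lower() for t in tokens]                   # lowercase
tokens = [t for t in tokens if t not in STOP]          # stop

print(tokens)
# → ['man', 'sat', 'alone', 'stone', 'bench', 'kibana', 'latest',
#    'version', 'elasticsearch', 'latest', 'version']
```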