【Elasticsearch】Custom Analyzers in Elasticsearch
- char_filter: a character filter preprocesses the character stream before it is passed to the tokenizer. See the official "Character filters reference" documentation; here are some simple examples of mine: char_filter usage.
- filter: a token filter receives the token stream from the tokenizer and can modify tokens (e.g. lowercasing), remove tokens (e.g. stopwords), or add tokens (e.g. synonyms). See the official documentation.
- tokenizer: a tokenizer receives a character stream, breaks it into individual tokens (usually single words), and outputs a token stream. For Chinese text, the ik_max_word tokenizer (from the IK analysis plugin) is the everyday choice. See the official documentation.
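Before building a full custom analyzer, each of the three components can be tried out on its own with the `_analyze` API, which accepts inline definitions and needs no index. A small sketch combining a mapping char filter, a pattern tokenizer, and a stop filter (the sample text is made up for illustration):

```
GET _analyze
{
  "char_filter": [
    { "type": "mapping", "mappings": ["& => and"] }
  ],
  "tokenizer": { "type": "pattern", "pattern": "[ ,.!?]" },
  "filter": [
    { "type": "stop", "stopwords": ["the"] }
  ],
  "text": "Tom & the cat"
}
```

The char filter first rewrites the text to "Tom and the cat", the pattern tokenizer splits it on spaces and punctuation, and the stop filter drops "the", leaving the tokens Tom, and, cat.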
# Custom analyzer
DELETE custom_analysis_index
PUT custom_analysis_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "my_stopword": {
          "type": "stop",
          "stopwords": [
            "who",
            "the",
            "are",
            "at"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "filter": ["my_stopword"],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}
GET custom_analysis_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["what is your name? I am kerry","who are you? i am green, i am a student at school"]
}
By combining the pieces above, we have defined a custom analyzer. Running the `_analyze` request shows the result: the stopwords have already been filtered out and the character mappings applied.
{
  "tokens": [
    { "token": "what",    "start_offset": 0,  "end_offset": 4,  "type": "word", "position": 0 },
    { "token": "is",      "start_offset": 5,  "end_offset": 7,  "type": "word", "position": 1 },
    { "token": "your",    "start_offset": 8,  "end_offset": 12, "type": "word", "position": 2 },
    { "token": "name",    "start_offset": 13, "end_offset": 17, "type": "word", "position": 3 },
    { "token": "I",       "start_offset": 19, "end_offset": 20, "type": "word", "position": 4 },
    { "token": "am",      "start_offset": 21, "end_offset": 23, "type": "word", "position": 5 },
    { "token": "kerry",   "start_offset": 24, "end_offset": 29, "type": "word", "position": 6 },
    { "token": "you",     "start_offset": 38, "end_offset": 41, "type": "word", "position": 109 },
    { "token": "i",       "start_offset": 43, "end_offset": 44, "type": "word", "position": 110 },
    { "token": "am",      "start_offset": 45, "end_offset": 47, "type": "word", "position": 111 },
    { "token": "green",   "start_offset": 48, "end_offset": 53, "type": "word", "position": 112 },
    { "token": "i",       "start_offset": 55, "end_offset": 56, "type": "word", "position": 113 },
    { "token": "am",      "start_offset": 57, "end_offset": 59, "type": "word", "position": 114 },
    { "token": "a",       "start_offset": 60, "end_offset": 61, "type": "word", "position": 115 },
    { "token": "student", "start_offset": 62, "end_offset": 69, "type": "word", "position": 116 },
    { "token": "school",  "start_offset": 73, "end_offset": 79, "type": "word", "position": 118 }
  ]
}
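Note the jump in position values: the first string's tokens end at position 6, while the second string's tokens start at 109. When `_analyze` receives an array of strings, it treats them as a multi-valued field and inserts the default `position_increment_gap` of 100 between values ("who" and "are" would sit at positions 107 and 108 but were removed as stopwords, leaving "you" at 109). The gap can be tuned per field in the mapping; a hypothetical sketch (index and field names are made up):

```
PUT gap_example_index
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}
```

With a gap of 0, a phrase query could match across the boundary between two values, which is usually unwanted, so the default of 100 is a sensible safeguard.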
One more thing to note: at search time, Elasticsearch decides which analyzer to use by checking the following, in order:
- the analyzer parameter specified in the search request
- the search_analyzer property set on the field in the mapping
- the analysis.analyzer.default_search setting specified at index creation
- the analyzer property set on the field at index creation

For more detail, see this blog post:
https://blog.csdn.net/u011250186/article/details/125704364
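Setting a field's search_analyzer in the mapping can be sketched like this (hypothetical index and field names; assumes the IK plugin is installed):

```
PUT my_search_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

This is a common pattern with the IK plugin: index with the fine-grained ik_max_word so more terms are searchable, but analyze queries with the coarser ik_smart to keep matches precise.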