Introduction to analyzers
https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-analyzers.html
Viewing analysis results (standard analyzer, the ES default)
https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-standard-analyzer.html
POST _analyze
{
  "analyzer": "standard",
  "text": "分词的数"
}
Chinese tokenization result
POST _analyze
{
  "analyzer": "standard",
  "text": "我喜欢看小说"
}
{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "喜",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "欢",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "看",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "小",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "说",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    }
  ]
}
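As the response shows, the standard analyzer has no Chinese word segmentation: it emits one token per CJK ideograph, with per-character offsets and positions. A minimal Python sketch of that behavior (an illustration only, not the Lucene implementation):

```python
def cjk_single_char_tokens(text):
    """Mimic the standard analyzer on pure CJK text: one token per character,
    with start/end offsets and positions counted per character."""
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]

tokens = cjk_single_char_tokens("我喜欢看小说")
print([t["token"] for t in tokens])  # ['我', '喜', '欢', '看', '小', '说']
```

This per-character splitting is exactly why the ik plugin (ik_smart / ik_max_word) is commonly installed for Chinese text, as in the examples further below.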
English tokenization result
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
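For English, the standard analyzer splits on non-alphanumeric boundaries (so "Brown-Foxes" becomes two tokens), lowercases everything, keeps the possessive "dog's" as one token, and drops the trailing period. A rough Python approximation, assuming a simplified `[a-z0-9']+` word pattern rather than Lucene's real Unicode (UAX#29) segmentation rules:

```python
import re

def approx_standard_analyze(text):
    # Lowercase first, then grab runs of letters/digits/apostrophes.
    # Only an approximation: the real standard tokenizer follows UAX#29.
    return re.findall(r"[a-z0-9']+", text.lower())

print(approx_standard_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over',
#  'the', 'lazy', "dog's", 'bone']
```

On this input the approximation reproduces the token stream returned by the _analyze response above.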
Custom analyzers
https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-custom-analyzer.html
A custom analyzer is not global; it belongs to the index in which it is defined.
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
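The three stages of my_custom_analyzer can be mimicked outside Elasticsearch to see what each one contributes. A hedged Python sketch, where a tag-stripping regex stands in for html_strip, a simple word regex stands in for the standard tokenizer, and NFD decomposition stands in for asciifolding:

```python
import re
import unicodedata

def ascii_fold(token):
    # Approximate the asciifolding filter: decompose accented characters
    # (NFD), then drop the combining marks, e.g. "déjà" -> "deja".
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def my_custom_analyzer(text):
    text = re.sub(r"<[^>]+>", "", text)           # char_filter: html_strip (approx.)
    tokens = re.findall(r"[\w']+", text.lower())  # tokenizer: standard (approx.) + lowercase
    return [ascii_fold(t) for t in tokens]        # filter: asciifolding (approx.)

print(my_custom_analyzer("Is this <b>déjà vu</b>?"))  # ['is', 'this', 'deja', 'vu']
```

The <b> tags are removed before tokenization, then lowercase and asciifolding run on each token, yielding [is, this, deja, vu].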
type
  The analyzer type. Accepts built-in analyzer types. For a custom analyzer, use custom or omit this parameter.
tokenizer
  A built-in or custom tokenizer (required). https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-tokenizers.html The tokenizer splits the field's text into tokens.
char_filter
  An optional array of built-in or custom character filters. https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-charfilters.html Character filters preprocess the text before it is tokenized, e.g. stripping HTML tags.
filter
  An optional array of built-in or custom token filters. https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-tokenfilters.html Token filters post-process the emitted tokens, e.g. lowercasing.
position_increment_gap
  When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next, so that phrase queries do not match terms coming from different array elements. Defaults to 100. See position_increment_gap for details.
Processing order: character filter -> tokenizer -> token filter
Allowed counts: char_filter (0 or more) + tokenizer (exactly 1) + filter (0 or more)
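The position_increment_gap behavior can be illustrated with a small position counter. This sketch assumes the gap is added on top of the normal +1 increment between array values (the field values and whitespace tokenization here are illustrative only):

```python
def positions_for_array(values, gap=100):
    """Assign token positions across an array of text values,
    inserting a fake gap between consecutive values."""
    pos, out = -1, []
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # the fake "gap" between array elements
        for token in value.lower().split():
            pos += 1    # normal position increment
            out.append((token, pos))
    return out

print(positions_for_array(["John Abraham", "Lincoln Smith"]))
# [('john', 0), ('abraham', 1), ('lincoln', 102), ('smith', 103)]
```

Because "abraham" and "lincoln" end up 101 positions apart rather than adjacent, a phrase query for "abraham lincoln" does not match across the two array elements.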
The following requests combine tokenizers, token filters, and character filters (the ik_max_word/ik_smart and pinyin examples assume the ik and pinyin analysis plugins are installed):
POST /_analyze
{
  "tokenizer": "ik_max_word",
  "filter": ["lowercase", "pinyin", "asciifolding"],
  "char_filter": ["html_strip"],
  "text": "你知道 Elasticsearch吗?"
}
POST /_analyze
{
  "tokenizer": "ik_max_word",
  "filter": ["lowercase"],
  "char_filter": ["html_strip"],
  "text": "你知道 Elasticsearch吗?"
}
POST /_analyze
{
  "tokenizer": "ik_max_word",
  "filter": ["pinyin"],
  "char_filter": ["html_strip"],
  "text": "你知道 Elasticsearch吗?"
}
# Note: the ik plugin provides ik_smart as an analyzer/tokenizer, not as a token filter, so this request is expected to fail:
POST /_analyze
{
  "tokenizer": "ik_max_word",
  "filter": ["pinyin", "ik_smart"],
  "char_filter": ["html_strip"],
  "text": "你知道 Elasticsearch吗?"
}
POST /_analyze
{
  "tokenizer": "pinyin",
  "char_filter": ["html_strip"],
  "text": "你知道 Elasticsearch吗?"
}
POST /_analyze
{
  "tokenizer": "pinyin",
  "char_filter": ["html_strip"],
  "text": "你知道 Elasticsearch吗?</html>"
}
POST /_analyze
{
  "tokenizer": "ik_smart",
  "char_filter": ["html_strip"],
  "text": "你知道 Elasticsearch吗?</html>"
}
# Using the mapping char_filter
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
        "١ => 1",
        "٢ => 2",
        "٣ => 3",
        "٤ => 4",
        "٥ => 5",
        "٦ => 6",
        "٧ => 7",
        "٨ => 8",
        "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}
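The mapping char filter above substitutes each Arabic-Indic digit before tokenization, so the keyword tokenizer sees plain ASCII digits. The same substitution can be reproduced in Python with a translation table (illustration only, not the plugin mechanism):

```python
# One-to-one character substitutions, mirroring the mapping char filter above.
ARABIC_TO_ASCII = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

text = "My license plate is ٢٥٠١٥"
print(text.translate(ARABIC_TO_ASCII))  # My license plate is 25015
```

With the keyword tokenizer, the whole rewritten string "My license plate is 25015" is emitted as a single token.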
GET /_analyze
{
  "tokenizer": "ik_smart",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0"
      ]
    }
  ],
  "text": "My license plate is ٠"
}
# pattern_replace char_filter: with "$1" as the replacement, only the dash is removed (123-456-789 -> 123456789); an empty replacement would delete the captured digit groups as well.
GET /_analyze
{
  "tokenizer": "ik_smart",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1"
    }
  ],
  "text": "My credit card is 123-456-789"
}
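The regex here matches a run of digits plus a dash only when another digit follows (a lookahead, so the following digits are not consumed). The same pattern behaves identically in Python's re; note that the group reference `\1` keeps the captured digits and drops only the dash, whereas an empty replacement would delete the digits too:

```python
import re

def strip_inner_dashes(text):
    # (\d+)-(?=\d): digits + dash, only when followed by another digit.
    # Replacing with the captured group removes just the dash.
    return re.sub(r"(\d+)-(?=\d)", r"\1", text)

print(strip_inner_dashes("My credit card is 123-456-789"))
# My credit card is 123456789
```

A trailing dash with no digit after it (e.g. "123-") is left untouched, because the lookahead fails.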