-
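Analysis process:
- character filter: pre-processes the text before tokenization, e.g. stripping HTML tags or replacing the & symbol with the word and
- tokenizer: splits the text into individual words or terms according to some rule
- token filter: normalizes the tokens, e.g. case conversion, synonym conversion, singular/plural conversion, tense conversion, and removal of stop words and other words that carry no meaning
- finally, the inverted index is built from the processed tokens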
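The four steps above can be sketched in plain Python. This is only an illustration of the data flow, not how Elasticsearch implements it; the function names, the regular expressions, and the tiny stop-word list are all assumptions made for the example.

```python
import re
from collections import defaultdict

def char_filter(text):
    # Pre-processing: strip HTML tags and spell out "&" as "and".
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")

def tokenize(text):
    # Tokenizer: pull out runs of word characters.
    return re.findall(r"\w+", text)

STOP_WORDS = {"a", "is", "the"}  # tiny illustrative list, not the real one

def token_filter(tokens):
    # Normalization: lowercase, then drop stop words.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text):
    return token_filter(tokenize(char_filter(text)))

def build_inverted_index(docs):
    # Map each term to the sorted list of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "<b>Tom</b> & Jim", 2: "Jim is a boy"}
index = build_inverted_index(docs)
print(index)
```

Looking up a term in `index` returns the ids of the documents that contain it, which is the basic idea behind the inverted index.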
-
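Built-in ES analyzers
standard analyzer: the default analyzer; splits on word boundaries and lowercases
simple analyzer: a simple analyzer; splits on non-letter characters and lowercases
whitespace analyzer: splits on whitespace only
language analyzer: analyzers for specific languages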
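The difference between the first three analyzers is easiest to see on a single input. The sketch below is a rough Python approximation, not Elasticsearch code: the function names are made up for illustration, and the real standard analyzer follows Unicode text segmentation rules rather than a simple regex.

```python
import re

def standard_like(text):
    # Rough stand-in for the standard analyzer: word tokens, lowercased.
    return [t.lower() for t in re.findall(r"\w+", text)]

def simple_like(text):
    # Rough stand-in for the simple analyzer: runs of letters only, lowercased.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def whitespace_like(text):
    # Rough stand-in for the whitespace analyzer: split on whitespace, keep case.
    return text.split()

text = "My age is 18"
results = {
    "standard": standard_like(text),
    "simple": simple_like(text),
    "whitespace": whitespace_like(text),
}
for name, tokens in results.items():
    print(name, tokens)
```

In this approximation the simple variant drops 18 because digits are not letters, and the whitespace variant keeps the capital M because it does not lowercase.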
-
Testing an analyzer
GET _analyze
{
  "analyzer": "standard",
  "text": "My age is 18"
}
Response:
{
  "tokens": [
    {
      "token": "my",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "age",
      "start_offset": 3,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 7,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "18",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
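To make the fields in the response concrete, here is a small Python sketch that produces the same shape of output for this input. It assumes a plain `\w+` regex as the tokenizer and a simple digits-only check for the type, which is a simplification of what the standard analyzer does; `analyze_with_offsets` is a hypothetical helper name.

```python
import re

def analyze_with_offsets(text):
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group()
        tokens.append({
            "token": word.lower(),
            # Character offsets into the original text (end is exclusive).
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "<NUM>" if word.isdigit() else "<ALPHANUM>",
            # Ordinal position of the token in the token stream.
            "position": position,
        })
    return tokens

tokens = analyze_with_offsets("My age is 18")
for t in tokens:
    print(t)
```

start_offset and end_offset point back into the original string, while position counts tokens.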
-
Configuring an analyzer
1 Create the index:
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
2 Test:
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "he is a boy"
}
Response:
{
  "tokens": [
    {
      "token": "he",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "boy",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
This example shows that the tokens no longer include the words is and a, which confirms that the stop words have taken effect.
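Note that boy keeps position 3 in the response even though only two tokens remain: the removed stop words still count toward positions. A minimal Python sketch of that behavior follows, assuming a regex tokenizer and a small hand-picked subset of English stop words (not the full `_english_` list); `analyze_with_stopwords` is a hypothetical helper.

```python
import re

# A small subset of English stop words, for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "is", "the"}

def analyze_with_stopwords(text):
    kept = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        token = match.group().lower()
        if token in STOP_WORDS:
            # Dropped from the output, but the position counter still advances.
            continue
        kept.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return kept

tokens = analyze_with_stopwords("he is a boy")
for t in tokens:
    print(t)
```

Keeping the gap in positions means the output still records how far apart he and boy were in the original text.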
-
Custom analyzer
1 Create the index (if test_index from the previous example still exists, delete it first with DELETE /test_index):
PUT test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&2and": {
          "type": "mapping",
          "mappings": ["&=> and "]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["are"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&2and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
2 Test:
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "TOM & Jim are boy"
}
Response:
{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jim",
      "start_offset": 6,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "boy",
      "start_offset": 14,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
The result shows that & was converted to and, the stop word are was removed, and TOM was lowercased to tom.
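The custom analyzer is just a named chain: char filters run first, then one tokenizer, then the token filters in the listed order. The Python sketch below mimics that composition. The three registries and `build_analyzer` are hypothetical stand-ins written for this example, and the regex-based components only approximate the real html_strip char filter and standard tokenizer.

```python
import re

# Hypothetical registries standing in for Elasticsearch's components.
CHAR_FILTERS = {
    "html_strip": lambda text: re.sub(r"<[^>]+>", " ", text),
    "&2and": lambda text: text.replace("&", " and "),
}
TOKENIZERS = {
    "standard": lambda text: re.findall(r"\w+", text),
}
TOKEN_FILTERS = {
    "lowercase": lambda tokens: [t.lower() for t in tokens],
    "my_stopwords": lambda tokens: [t for t in tokens if t != "are"],
}

def build_analyzer(config):
    def analyze(text):
        # 1. Character filters, in the configured order.
        for name in config["char_filter"]:
            text = CHAR_FILTERS[name](text)
        # 2. Exactly one tokenizer.
        tokens = TOKENIZERS[config["tokenizer"]](text)
        # 3. Token filters, in the configured order.
        for name in config["filter"]:
            tokens = TOKEN_FILTERS[name](tokens)
        return tokens
    return analyze

my_analyzer = build_analyzer({
    "char_filter": ["html_strip", "&2and"],
    "tokenizer": "standard",
    "filter": ["lowercase", "my_stopwords"],
})
tokens = my_analyzer("TOM & Jim are boy")
print(tokens)
```

Filter order matters in this sketch: the stand-in stop filter is case-sensitive, so running lowercase before my_stopwords is what lets an upper-case ARE be removed as well.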