analysis -- text analysis is the process of converting full text into a series of words (terms/tokens), also called tokenization
analyzer -- an analyzer is the component dedicated to text analysis; it consists of three parts:
Character Filters (operate on the raw text, e.g. stripping HTML tags)
Tokenizer -- splits the text into terms according to rules
Token Filter -- post-processes the split terms: lowercasing, removing stopwords, adding synonyms
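The three stages above can be sketched in plain Python (this only approximates what Elasticsearch does internally; the regexes and the tiny stopword list are illustrative, not the real implementations):

```python
import re

# Small illustrative stopword subset, not the full English list
STOPWORDS = {"the", "a", "is", "in"}

def char_filter(text):
    # Character filter: strip HTML tags from the raw text
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Tokenizer: split on non-word characters
    return [t for t in re.split(r"\W+", text) if t]

def token_filters(tokens):
    # Token filters: lowercase, then drop stopwords
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    # Full pipeline: character filter -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<p>The Quick Brown-Foxes</p>"))
# ['quick', 'brown', 'foxes']
```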
Built-in Elasticsearch analyzers:
Standard Analyzer -- the default; splits on word boundaries, lowercases
Stop Analyzer -- lowercases and filters out stopwords (the, a, is)
Simple Analyzer -- splits on non-letter characters (symbols are discarded), lowercases
Whitespace Analyzer -- splits on whitespace, does not lowercase
Keyword Analyzer -- no tokenization; the input is emitted unchanged as a single term
Pattern Analyzer -- splits by regular expression, default \W+ (non-word characters)
Language analyzers -- provided for 30+ common languages
Three common ways to test analysis with the _analyze API
1. Specify the analyzer directly
GET _analyze
{
"analyzer": "standard",
"text": "Mastering English"
}
Result:
{
"tokens": [
{
"token": "mastering",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "english",
"start_offset": 10,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 1
}
]
}
2. Test against a field of an existing index
POST /book/_analyze
{
  "field": "title",
  "text": "Mastering English"
}
3. Test a custom combination of tokenizer and token filters
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering English"
}
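Once a combination has been validated ad hoc, it can be registered as a custom analyzer in the index settings and referenced from a mapping. A sketch (the index name my_index and analyzer name my_analyzer are illustrative):

```console
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```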
Demonstrations of the different analyzers
GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Result:
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "<NUM>",
"position": 0
},
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 10
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 12
}
]
}
GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 1
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 2
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "word",
"position": 3
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 4
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 5
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 6
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 7
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "word",
"position": 8
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "word",
"position": 9
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 10
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "word",
"position": 11
}
]
}
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 1
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 2
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "word",
"position": 3
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 4
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 5
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 6
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 7
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 10
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "word",
"position": 11
}
]
}
#whitespace
GET _analyze
{
"analyzer": "whitespace",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "Quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "brown-foxes",
"start_offset": 16,
"end_offset": 27,
"type": "word",
"position": 3
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 4
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 5
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 6
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 7
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "word",
"position": 8
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "word",
"position": 9
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 10
},
{
"token": "evening.",
"start_offset": 62,
"end_offset": 70,
"type": "word",
"position": 11
}
]
}
#keyword
GET _analyze
{
"analyzer": "keyword",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2 running Quick brown-foxes leap over lazy dogs in the summer evening.",
"start_offset": 0,
"end_offset": 70,
"type": "word",
"position": 0
}
]
}
GET _analyze
{
"analyzer": "pattern",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 3
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "word",
"position": 4
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 5
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 6
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 7
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 8
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "word",
"position": 9
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "word",
"position": 10
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 11
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "word",
"position": 12
}
]
}
#english
GET _analyze
{
"analyzer": "english",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "<NUM>",
"position": 0
},
{
"token": "run",
"start_offset": 2,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "fox",
"start_offset": 22,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "lazi",
"start_offset": 38,
"end_offset": 42,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "dog",
"start_offset": 43,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "even",
"start_offset": 62,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 12
}
]
}
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "他说的确实在理"
}
This fails because the ICU analysis plugin is not installed.
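The icu_analyzer comes from the analysis-icu plugin, which can be installed from the Elasticsearch home directory (every node must install it and be restarted):

```shell
bin/elasticsearch-plugin install analysis-icu
```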
POST _analyze
{
"analyzer": "standard",
"text": "他说的确实在理"
}
{
"tokens": [
{
"token": "他",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "说",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "的",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "确",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "实",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "在",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "理",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
}
]
}
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "这个苹果不大好吃"
}
IK analyzer (provided by the IK Chinese-analysis plugin)
POST _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国国歌"
}
POST _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国国歌"
}
ik_max_word: performs the finest-grained segmentation; for example, it splits "中华人民共和国国歌" into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination. Suited to term queries.
ik_smart: performs the coarsest-grained segmentation; for example, it splits "中华人民共和国国歌" into "中华人民共和国, 国歌". Suited to phrase queries.
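A common pattern that follows from the two granularities (assuming the IK plugin is installed; the index and field names here are illustrative) is to index with ik_max_word for maximum recall and search with ik_smart for precision:

```console
PUT /news
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```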