analysis -- text analysis is the process of converting full text into a series of words (terms/tokens), also called tokenization
analyzer -- an analyzer is the component dedicated to text analysis; it consists of three parts:
Character Filters (operate on the raw text, e.g. stripping HTML tags)
Tokenizer -- splits the text into terms according to rules
Token Filter -- post-processes the split terms: lowercasing, removing stopwords, adding synonyms
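The three stages above can be sketched in plain Python (this only approximates what Elasticsearch does internally; the regexes and the tiny stopword list are illustrative, not the real implementations):

```python
import re

# Small illustrative stopword subset, not the full English list
STOPWORDS = {"the", "a", "is", "in"}

def char_filter(text):
    # Character filter: strip HTML tags from the raw text
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Tokenizer: split on non-word characters
    return [t for t in re.split(r"\W+", text) if t]

def token_filters(tokens):
    # Token filters: lowercase, then drop stopwords
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    # Full pipeline: character filter -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<p>The Quick Brown-Foxes</p>"))
# ['quick', 'brown', 'foxes']
```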
Built-in Elasticsearch analyzers:
Standard Analyzer -- the default; splits on word boundaries, lowercases
Stop Analyzer -- lowercases and filters out stopwords (the, a, is)
Simple Analyzer -- splits on non-letter characters (symbols are discarded), lowercases
Whitespace Analyzer -- splits on whitespace, does not lowercase
Keyword Analyzer -- no tokenization; the input is emitted unchanged as a single term
Pattern Analyzer -- splits by regular expression, default \W+ (non-word characters)
Language analyzers -- provided for 30+ common languages
Three common ways to test analysis with the _analyze API
1. Specify the analyzer directly
GET _analyze
{
"analyzer": "standard",
"text": "Mastering English"
}
Result:
{
"tokens": [
{
"token": "mastering",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "english",
"start_offset": 10,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 1
}
]
}
2. Test against a field of an existing index
POST /book/_analyze
{
  "field": "title",
  "text": "Mastering English"
}
3. Test a custom combination of tokenizer and token filters
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering English"
}
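Once a combination has been validated ad hoc, it can be registered as a custom analyzer in the index settings and referenced from a mapping. A sketch (the index name my_index and analyzer name my_analyzer are illustrative):

```console
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```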
Demonstrations of the different analyzers
GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Result:
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "<NUM>",
"position": 0
},
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 10
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 12
}
]
}
GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 1
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 2
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "word",
"position": 3
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 4
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 5
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 6
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 7
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "word",
"position": 8
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "word",
"position": 9
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 10
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "word",
"position": 11
}
]
}
GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 1
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 2
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "word",
"position": 3
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 4
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 5
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 6
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 7
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 10
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "word",
"position": 11
}
]
}
#whitespace
GET _analyze
{
"analyzer": "whitespace",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "Quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "brown-foxes",
"start_offset": 16,
"end_offset": 27,
"type": "word",
"position": 3
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 4
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 5
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 6
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 7
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "word",
"position": 8
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "word",
"position": 9
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 10
},
{
"token": "evening.",
"start_offset": 62,
"end_offset": 70,
"type": "word",
"position": 11
}
]
}
#keyword
GET _analyze
{
"analyzer": "keyword",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2 running Quick brown-foxes leap over lazy dogs in the summer evening.",
"start_offset": 0,
"end_offset": 70,
"type": "word",
"position": 0
}
]
}
GET _analyze
{
"analyzer": "pattern",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 2,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 3
},
{
"token": "foxes",
"start_offset": 22,
"end_offset": 27,
"type": "word",
"position": 4
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "word",
"position": 5
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "word",
"position": 6
},
{
"token": "lazy",
"start_offset": 38,
"end_offset": 42,
"type": "word",
"position": 7
},
{
"token": "dogs",
"start_offset": 43,
"end_offset": 47,
"type": "word",
"position": 8
},
{
"token": "in",
"start_offset": 48,
"end_offset": 50,
"type": "word",
"position": 9
},
{
"token": "the",
"start_offset": 51,
"end_offset": 54,
"type": "word",
"position": 10
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "word",
"position": 11
},
{
"token": "evening",
"start_offset": 62,
"end_offset": 69,
"type": "word",
"position": 12
}
]
}
#english
GET _analyze
{
"analyzer": "english",
"text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "<NUM>",
"position": 0
},
{
"token": "run",
"start_offset": 2,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "quick",
"start_offset": 10,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "brown",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "fox",
"start_offset": 22,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "leap",
"start_offset": 28,
"end_offset": 32,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "over",
"start_offset": 33,
"end_offset": 37,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "lazi",
"start_offset": 38,
"end_offset": 42,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "dog",
"start_offset": 43,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "summer",
"start_offset": 55,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "even",
"start_offset": 62,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 12
}
]
}
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "他说的确实在理"
}
This fails because the ICU analysis plugin is not installed.
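The icu_analyzer comes from the analysis-icu plugin, which can be installed from the Elasticsearch home directory (every node must install it and be restarted):

```shell
bin/elasticsearch-plugin install analysis-icu
```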
POST _analyze
{
"analyzer": "standard",
"text": "他说的确实在理"
}
{
"tokens": [
{
"token": "他",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "说",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "的",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "确",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "实",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "在",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "理",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
}
]
}
POST _analyze
{
"analyzer": "icu_analyzer",
"text": "这个苹果不大好吃"
}
IK analyzer (provided by the IK Chinese-analysis plugin)
POST _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国国歌"
}
POST _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国国歌"
}
ik_max_word: performs the finest-grained segmentation; for example, it splits "中华人民共和国国歌" into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination. Suited to term queries.
ik_smart: performs the coarsest-grained segmentation; for example, it splits "中华人民共和国国歌" into "中华人民共和国, 国歌". Suited to phrase queries.
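A common pattern that follows from the two granularities (assuming the IK plugin is installed; the index and field names here are illustrative) is to index with ik_max_word for maximum recall and search with ik_smart for precision:

```console
PUT /news
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```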