Elasticsearch study notes: built-in analyzers, extension tokenizers, and how to define a custom analyzer

Qazink

Article link: https://blog.csdn.net/weixin_43197795/article/details/107963272

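Built-in analyzers

When Elasticsearch indexes a document, it runs every text field through an analyzer, and different analyzers produce different tokens for the same input. The built-in analyzers are listed below. In practice none of them, language analyzers included, segments Chinese well; Chinese text needs an additional analyzer plugin.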

Analyzer	Description	Input text	Result
standard	The default analyzer, used when none is specified. It tokenizes on grammar-based word boundaries (the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.	The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.	[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog’s, bone ]
simple	Splits text at every non-letter character (digits, spaces, hyphens, apostrophes), discards those characters, and lowercases the tokens.	The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.	[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
whitespace	Splits text into terms wherever it encounters a whitespace character.	The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.	[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog’s, bone. ]
stop	Same as the simple analyzer, with added support for removing stop words. It uses the `_english_` stop words by default.	The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.	[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
keyword	Does no tokenization; returns the whole field value as a single token.	The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.	[The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.]
pattern	Uses a regular expression to split text into terms. The expression should match the token separators, not the tokens themselves. It defaults to `\W+` (all non-word characters).	The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.	[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Language analyzers: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, and so on	A set of analyzers aimed at analyzing text in a specific language.
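
To make the differences in the table concrete, here is a rough pure-Python approximation of four of the analyzers applied to the same sentence. It is only a sketch for intuition, not Elasticsearch's implementation: the function names are made up for this example, and `STOP_WORDS` is a small illustrative subset rather than the real `_english_` list.

```python
import re

# Illustrative subset only; the real `_english_` list is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone."


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Keep runs of letters, drop everything else, then lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text):
    """simple_analyzer plus stop word removal."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]


def pattern_analyzer(text, pattern=r"\W+"):
    """Split on the separator pattern, drop empty pieces, lowercase."""
    return [t.lower() for t in re.split(pattern, text) if t]


for analyzer in (whitespace_analyzer, simple_analyzer, stop_analyzer, pattern_analyzer):
    print(analyzer.__name__, analyzer(TEXT))
```

Note how the digit `2` survives the pattern split but not the letters-only simple analyzer, which matches the corresponding rows above.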

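Chinese analyzer extensions

The simplest Chinese tokenizer to get started with is IK; others include jieba and the Harbin Institute of Technology (HIT) tokenizer.

Analyzer	Description	Input text	Result
ik_smart	IK's simple (coarse-grained) mode; supports custom dictionaries and remote dictionaries.	学如逆水行舟，不进则退	[学如逆水行舟,不进则退]
ik_max_word	IK's exhaustive mode, which emits every word it can find; supports custom dictionaries and remote dictionaries.	学如逆水行舟，不进则退	[学如逆水行舟,学如逆水,逆水行舟,逆水,行舟,不进则退,不进,则,退]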

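Custom analyzers

If the existing analyzers do not meet your needs, you can define a custom analyzer on an index. A complete analyzer consists of three parts:

Zero or more character filters
Exactly one tokenizer
Zero or more token filters

The analyzer runs them in that order: the character filters first rewrite or strip characters that match their rules, then the tokenizer splits the text into tokens, and finally the token filters modify or drop tokens.

A simple demo

You can use the POST _analyze API to try out a combination of these parts: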

POST _analyze
{
  "char_filter":["html_strip"],
  "tokenizer":"whitespace",
  "filter": ["lowercase"], 
  "text": "<b>Hello World</b>"
}

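This analyzer uses html_strip to remove the HTML markup from the text, splits the text with the whitespace tokenizer, and finally applies lowercase to convert the tokens to lower case.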
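
The same three-stage flow can be mimicked in plain Python to see what each stage contributes. This is a sketch under simplifying assumptions: `analyze` and the helper functions are names invented here, and the regex-based `html_strip` is only a rough stand-in for the real character filter.

```python
import html
import re


def html_strip(text):
    """Rough stand-in for html_strip: drop tags, then decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def whitespace_tokenizer(text):
    """Split the filtered text on whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the single tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<b>Hello World</b>", [html_strip], whitespace_tokenizer, [lowercase])
print(tokens)  # ['hello', 'world']
```

Without the character filter, the whitespace tokenizer would emit `<b>Hello` and `World</b>`, which is why the filter has to run before tokenization.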

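Built-in char_filter

Character filter	Description	Demo	Effect
html_strip	Strips HTML elements such as `<b>` and decodes HTML entities such as `&amp;`.
mapping	Accepts a map of keys and values. Whenever it encounters a string equal to a key, it replaces it with the value associated with that key. Matching is greedy: the longest pattern matching at a given point wins. Replacing with the empty string is allowed. The mapping filter uses Lucene's MappingCharFilter.
pattern_replace	Uses a regular expression to match characters that should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.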
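
Since the demo columns above are empty, the following Python sketch illustrates the two rule-driven filters. It imitates the documented behaviour (greedy longest-match for mapping, capture-group references for pattern_replace) and is not the Lucene code; both function names are hypothetical, and Python writes group references as `\1` where Elasticsearch's Java regex uses `$1`.

```python
import re


def mapping_char_filter(text, mappings):
    """Replace keys with values; at each position the longest key wins.

    Assumes every key is a non-empty string.
    """
    keys = sorted(mappings, key=len, reverse=True)  # try longest keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matches here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def pattern_replace_char_filter(text, pattern, replacement):
    """Regex replacement; the replacement may reference capture groups."""
    return re.sub(pattern, replacement, text)


print(mapping_char_filter("我的苹果&香蕉", {"&": "和"}))
print(mapping_char_filter("abc", {"a": "1", "ab": "2"}))
print(pattern_replace_char_filter("123-456-789", r"(\d+)-(?=\d)", r"\1_"))
```

The second call shows the greedy rule: `ab` is preferred over `a`, so the result is `2c` rather than `1bc`.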

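Built-in tokenizer

Tokenizer	Description	Demo	Effect
char_group	Breaks text into terms whenever it encounters a character from a defined set. It is mostly useful when you need simple custom tokenization and the overhead of the pattern tokenizer is not acceptable.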
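
As a rough illustration of the char_group idea, the sketch below splits on a fixed set of characters with a single pass and no regular expression. The function name is hypothetical, and it only handles literal characters; the real tokenizer also accepts character classes such as whitespace, which are written out explicitly here.

```python
def char_group_tokenize(text, split_chars):
    """Split text at any character in split_chars; skip empty tokens."""
    tokens = []
    current = []
    for ch in text:
        if ch in split_chars:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # flush the last token
        tokens.append("".join(current))
    return tokens


print(char_group_tokenize("The QUICK brown-fox", {" ", "-", "\n"}))
# ['The', 'QUICK', 'brown', 'fox']
```

Case is left untouched because lowercasing is the job of a token filter, not the tokenizer.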

More tokenizers are listed in the Elasticsearch tokenizer reference.

A complete custom analyzer

PUT /test_index1
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type": "mapping",
                    "mappings": [
                        "& => 和"
                    ]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type": "stop",
                    "stopwords": [
                        "的",
                        "我",
                        "你"
                    ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "ik_max_word",
                    "type": "custom",
                    "char_filter": [
                        "html_strip",
                        "&_to_and"
                    ],
                    "filter": [
                        "my_stopwords"
                    ]
                }
            }
        }
    }
}

The custom char_filter &_to_and replaces & with "和" ("and").
The custom token filter my_stopwords removes the tokens 我, 你, and 的.

Result from the ik_max_word analyzer:

GET test_index1/_analyze
{
  "analyzer": "ik_max_word",
  "text": "我的苹果&香蕉"
}
// Response

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "苹果",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "香蕉",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

Result from the custom analyzer:

GET test_index1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "我的苹果&香蕉"
}
// Response
{
  "tokens" : [
    {
      "token" : "苹果",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "和",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "香蕉",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
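
One detail worth noticing in this response is that the positions start at 2 rather than 0: the stop filter drops tokens but leaves the remaining tokens' positions as the tokenizer assigned them. The Python sketch below reproduces that effect. The `Token` class and `stop_filter` function are invented for this example, and the five-token tokenizer output is an assumption inferred from the two responses above, not produced by running IK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int


def stop_filter(tokens, stopwords):
    """Drop stop words without renumbering the surviving tokens."""
    return [t for t in tokens if t.text not in stopwords]


# Assumed tokenizer output for "我的苹果和香蕉" (after & has become 和).
tokenizer_output = [
    Token("我", 0),
    Token("的", 1),
    Token("苹果", 2),
    Token("和", 3),
    Token("香蕉", 4),
]

kept = stop_filter(tokenizer_output, {"的", "我", "你"})
print([(t.text, t.position) for t in kept])
# [('苹果', 2), ('和', 3), ('香蕉', 4)]
```

Keeping the gaps means phrase queries still see the original distance between the surviving tokens.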