Elasticsearch in Depth: Analyzer Resolution Order and Custom Analyzer Configuration in Mappings

gmHappy

Article link: https://blog.csdn.net/ctwy291314/article/details/81391514

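Default Analyzers

Although we can specify an analyzer at the field level, how do we determine which analyzer a field uses when none is specified there?

Analyzers can be defined at three levels: per field, per index, or as the global default. Elasticsearch works through these levels in order until it finds an analyzer it can use. At index time, the order is:

The analyzer defined in the field mapping, else
The analyzer named default in the index settings, which defaults to
The standard analyzer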

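At search time, the order is slightly different:

The analyzer defined in the query itself, else
The analyzer defined in the field mapping, else
The analyzer named default in the index settings, which defaults to
The standard analyzer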

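Sometimes it makes sense to use different analyzers at index time and at search time. For example, we may want to index synonyms (wherever quick appears, also index fast, rapid, and speedy). At search time, however, we don't need to search for all of the synonyms. We only need to look up the single word the user typed, whether that is quick, fast, rapid, or speedy.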
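The idea of expanding synonyms only at index time can be illustrated with a small toy model. This is a sketch in plain Python, not Elasticsearch code: the `SYNONYMS` table, the whitespace tokenizer, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical synonym group, built from the example words in the text.
SYNONYMS = {"quick": ["quick", "fast", "rapid", "speedy"]}


def index_analyzer(text):
    """Toy index-time analyzer: lowercase, split on whitespace, expand synonyms."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(SYNONYMS.get(word, [word]))
    return tokens


def search_analyzer(text):
    """Toy search-time analyzer: lowercase and split only, no synonym expansion."""
    return text.lower().split()


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in index_analyzer(text):
            inverted[token].add(doc_id)
    return inverted


def search(inverted, query):
    """Return the ids of documents that contain any of the query tokens."""
    hits = set()
    for token in search_analyzer(query):
        hits |= inverted.get(token, set())
    return hits


docs = {1: "the quick brown fox", 2: "a lazy dog"}
inverted = build_index(docs)
print(search(inverted, "speedy"))  # document 1 is found via the indexed synonym
```

Because the synonyms are already stored in the index, the search side only has to look up the single word the user typed.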

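To support this distinction, Elasticsearch also offers an optional search_analyzer mapping parameter, which applies only at search time (analyzer is still used at index time). There is an equivalent index-level setting as well: an analyzer named default_search, which sets the search-time default for the whole index.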

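Taking these extra parameters into account, the full search-time order is:

The analyzer defined in the query itself, else
The search_analyzer defined in the field mapping, else
The analyzer defined in the field mapping, else
The analyzer named default_search in the index settings, which defaults to
The analyzer named default in the index settings, which defaults to
The standard analyzer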
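The two lookup chains above can be written down as a pair of small functions. This is only a sketch of the order described in this article, written in plain Python. It is not Elasticsearch's actual implementation, and the function names and dictionary shapes are assumptions.

```python
def resolve_index_analyzer(field_mapping, index_analyzers):
    """Pick the index-time analyzer name, following the order listed above."""
    if field_mapping.get("analyzer"):
        return field_mapping["analyzer"]
    if "default" in index_analyzers:
        return "default"
    return "standard"


def resolve_search_analyzer(query, field_mapping, index_analyzers):
    """Pick the search-time analyzer name, following the full order listed above."""
    # 1-3: explicit settings on the query, then on the field mapping.
    explicit = (
        query.get("analyzer"),
        field_mapping.get("search_analyzer"),
        field_mapping.get("analyzer"),
    )
    for name in explicit:
        if name:
            return name
    # 4-5: index-level defaults, if analyzers with these names are defined.
    for name in ("default_search", "default"):
        if name in index_analyzers:
            return name
    # 6: global fallback.
    return "standard"


# Hypothetical inputs: a field with only an index-time analyzer configured.
field = {"type": "text", "analyzer": "english"}
print(resolve_index_analyzer(field, {}))
print(resolve_search_analyzer({}, field, {"default_search": {}}))
```

In this model both calls print `english`, because the field-level analyzer is checked before the index-level default_search.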

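Analyzer Configuration in Practice

The simplest way to specify an analyzer for a particular field is to define it in the field mapping, as shown below: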

PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": { ①
          "type": "text",
          "fields": {
            "english": { ②
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

GET my_index/_analyze 
{
  "field": "text",
  "text": "The quick Brown Foxes." ③
}

GET my_index/_analyze 
{
  "field": "text.english",
  "text": "The quick Brown Foxes." ④
}

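	① The `text` field uses the default `standard` analyzer.
	② The `text.english` multi-field uses the `english` analyzer, which removes stop words and applies stemming.
	③ Returns the tokens [`the`, `quick`, `brown`, `foxes`].
	④ Returns the tokens [`quick`, `brown`, `fox`].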

`search_quote_analyzer`

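The search_quote_analyzer setting lets you specify an analyzer for phrases. This is particularly useful when you want stop words to be kept in phrase queries.

To keep stop words in phrase queries, the field needs three analyzer settings:

An analyzer setting for indexing all terms, including stop words
A search_analyzer setting for non-phrase queries, which removes stop words
A search_quote_analyzer setting for phrase queries, which does not remove stop words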

PUT my_index
{
   "settings":{
      "analysis":{
         "analyzer":{ ①
            "my_analyzer":{ 
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase"
               ]
            },
            "my_stop_analyzer":{ ②
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "english_stop"
               ]
            }
         },
         "filter":{
            "english_stop":{ 
               "type":"stop",
               "stopwords":"_english_"
            }
         }
      }
   },
   "mappings":{
      "_doc":{
         "properties":{
            "title": {
               "type":"text",
               "analyzer":"my_analyzer",  ③
               "search_analyzer":"my_stop_analyzer",  ④
               "search_quote_analyzer":"my_analyzer"  ⑤
            }
         }
      }
   }
}

PUT my_index/_doc/1
{
   "title":"The Quick Brown Fox"
}

PUT my_index/_doc/2
{
   "title":"A Quick Brown Fox"
}

GET my_index/_search
{
   "query":{
      "query_string":{
         "query":"\"the quick brown fox\""  ⑥
      }
   }
}

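	① The `my_analyzer` analyzer, which tokenizes all terms, including stop words.
	② The `my_stop_analyzer` analyzer, which removes stop words.
	③ The `analyzer` setting points to `my_analyzer`, which is used at index time.
	④ The `search_analyzer` setting points to `my_stop_analyzer`, which removes stop words from non-phrase queries.
	⑤ The `search_quote_analyzer` setting points to `my_analyzer`, which ensures that stop words are not removed from phrase queries.
	⑥ Because the query is wrapped in quotes, it is detected as a phrase query, so `search_quote_analyzer` kicks in and ensures that stop words are not removed from the query. The `my_analyzer` analyzer then returns the tokens [`the`, `quick`, `brown`, `fox`], which match only one of the documents. Non-phrase queries, meanwhile, are analyzed with `my_stop_analyzer`, which filters out stop words. A non-phrase search for either `The quick brown fox` or `A quick brown fox` therefore returns both documents, since both contain the tokens [`quick`, `brown`, `fox`]. Without `search_quote_analyzer`, exact matching for phrase queries would not be possible: the stop words would be removed from the phrase, and both documents would match.

Compare "query":"\"the quick brown fox\"" with "query":"the quick brown fox": the quoted form is a phrase query and matches only document 1, while the unquoted form matches both documents.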

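The difference between the quoted and unquoted queries can be reproduced with a toy model. This is a plain Python sketch under simplifying assumptions: whitespace tokenization, a three-word stop list standing in for `_english_`, and a hypothetical `search` helper. It imitates the behavior described above and does not call Elasticsearch.

```python
STOP_WORDS = {"a", "an", "the"}  # assumption: tiny stand-in for the _english_ list


def my_analyzer(text):
    """Toy version of my_analyzer: lowercase, keep stop words."""
    return text.lower().split()


def my_stop_analyzer(text):
    """Toy version of my_stop_analyzer: lowercase, drop stop words."""
    return [t for t in my_analyzer(text) if t not in STOP_WORDS]


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur consecutively inside doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def search(docs, query, quote_analyzer=my_analyzer):
    """Quoted queries are phrase queries and use quote_analyzer;
    other queries use my_stop_analyzer and match on any term."""
    indexed = {doc_id: my_analyzer(title) for doc_id, title in docs.items()}
    if query.startswith('"') and query.endswith('"'):
        phrase = quote_analyzer(query.strip('"'))
        return sorted(d for d, toks in indexed.items()
                      if contains_phrase(toks, phrase))
    terms = my_stop_analyzer(query)
    return sorted(d for d, toks in indexed.items()
                  if any(t in toks for t in terms))


docs = {1: "The Quick Brown Fox", 2: "A Quick Brown Fox"}
print(search(docs, '"the quick brown fox"'))  # phrase query
print(search(docs, 'the quick brown fox'))    # non-phrase query
# Simulate a mapping without search_quote_analyzer: phrases lose stop words.
print(search(docs, '"the quick brown fox"', quote_analyzer=my_stop_analyzer))
```

In this model the phrase query returns only document 1, while the other two calls return both documents, which mirrors the explanation in note ⑥.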
Other Custom Analyzer Configuration

Create the index and configure the analyzer

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "& => and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
        } 
    }
}

Create the type mapping and apply the analyzer

PUT /my_index/_mapping/_doc
{
	"_doc": {
		"properties": {
			"title": {
				"type": "text",
				"analyzer": "my_analyzer",
				"search_analyzer": "my_analyzer",
				"search_quote_analyzer": "my_analyzer"
			}
		}
	}
}

Index a document

POST /my_index/_doc/1
{
"title":"the a <a>你好</a> & "
}

Search

POST /my_index/_search
{
	"query": {
	    "match": {
	      "title": "你好"
	    }
	}
}

& is replaced with and

POST /my_index/_search
{
	"query": {
	    "match": {
	      "title": "and"
	    }
	}
}

the and a are filtered out as stop words, so this query is left with no terms and should return no hits

POST /my_index/_search
{
	"query": {
	    "match": {
	      "title": "the a"
	    }
	}
}
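To see why the three searches above behave as they do, it helps to trace the sample document through each stage of the custom analyzer. The following plain Python sketch imitates the configured pipeline (html_strip and &_to_and character filters, tokenizer, lowercase, my_stopwords). The regular expressions are rough approximations of the real components, chosen for this example only, and all function names are assumptions.

```python
import re

STOP_WORDS = {"the", "a"}  # same words as the my_stopwords filter above


def html_strip(text):
    """Rough stand-in for the html_strip char filter: drop anything tag-like."""
    return re.sub(r"<[^>]+>", "", text)


def amp_to_and(text):
    """Stand-in for the &_to_and mapping char filter ("& => and ")."""
    return text.replace("&", "and ")


def tokenize(text):
    """Approximation of the standard tokenizer: runs of ASCII letters or
    digits become one token, and each CJK ideograph becomes its own token."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)


def my_analyzer(text):
    """Char filters first, then the tokenizer, then the token filters."""
    for char_filter in (html_strip, amp_to_and):
        text = char_filter(text)
    tokens = [t.lower() for t in tokenize(text)]      # lowercase filter
    return [t for t in tokens if t not in STOP_WORDS]  # stop filter


print(my_analyzer("the a <a>你好</a> & "))  # tokens stored for the document
print(my_analyzer("你好"))                   # tokens of the first query
print(my_analyzer("the a"))                  # tokens of the last query
```

In this model the document is reduced to the tokens 你, 好 and and. The queries for 你好 and and share tokens with it, while the query the a is analyzed to an empty token list and has nothing left to match.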
