Analyzer Resolution Order and Custom Analyzer Configuration in Mappings

Latest recommended article published on 2022-09-15 23:40:09

coder麻雀

Tags: elasticsearch, analyzers

Default analyzers
We can specify an analyzer at the field level, but if no analyzer is specified there, how do we determine which analyzer the field actually uses?

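Analyzers can be defined at three levels: per field, per index, or as the global default. Elasticsearch works through the following order until it finds an analyzer it can use. At index time, the order is:

The analyzer defined in the field mapping, else
The analyzer named default in the index settings, which defaults to
The standard analyzer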

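At search time, the order is slightly different:

The analyzer defined in the query itself, else
The analyzer defined in the field mapping, else
The analyzer named default in the index settings, which defaults to
The standard analyzer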

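Sometimes it makes sense to use different analyzers at index time and at search time. For example, at index time we may want to index synonyms (for every occurrence of quick, we also index fast, rapid and speedy). At search time, however, we do not need to search for all of these synonyms. Instead, we can just look up the single word the user entered, whether that is quick, fast, rapid or speedy.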

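To support this distinction, Elasticsearch also accepts an optional search_analyzer mapping parameter, which applies only at search time (analyzer is still used at index time). Its index-level equivalent is an analyzer named default_search in the index settings, which sets the search-time default for the whole index.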
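As a rough sketch of where these settings live, the Python snippet below builds the body of an index-creation request. The index-level defaults are analyzers named default and default_search under settings.analysis.analyzer. The choice of the simple, whitespace, english and standard analyzers is arbitrary and only for illustration, and the title field is borrowed from the later examples in this article.

```python
import json

# Body for PUT /my_index (typed 6.x-style mapping, as in the rest of this article).
# "default" and "default_search" are the reserved index-level analyzer names;
# the analyzer types chosen below are arbitrary examples.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {"type": "simple"},             # index-level default
                "default_search": {"type": "whitespace"},  # index-level search-time default
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "english",          # used at index time
                    "search_analyzer": "standard",  # used at search time only
                }
            }
        }
    },
}

# Print the JSON that would be sent as the request body.
print(json.dumps(index_body, indent=2))
```

Fields that set neither analyzer nor search_analyzer would fall back to the two index-level defaults shown here.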

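Taking these extra parameters into account, the full order at search time is:

The analyzer defined in the query itself, else
The search_analyzer defined in the field mapping, else
The analyzer defined in the field mapping, else
The analyzer named default_search in the index settings, which defaults to
The analyzer named default in the index settings, which defaults to
The standard analyzer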
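The lookup order can be sketched as two small Python functions. This only illustrates the precedence described in this section and is not Elasticsearch's actual implementation; the function names and the dictionary shapes are made up for the example.

```python
from typing import Optional


def resolve_index_analyzer(field_mapping: Optional[dict] = None,
                           index_analyzers: Optional[dict] = None) -> str:
    """Return the name of the analyzer used at index time."""
    field_mapping = field_mapping or {}
    index_analyzers = index_analyzers or {}
    if field_mapping.get("analyzer"):
        return field_mapping["analyzer"]   # analyzer in the field mapping
    if "default" in index_analyzers:
        return "default"                   # index-level default analyzer
    return "standard"                      # global fallback


def resolve_search_analyzer(query_analyzer: Optional[str] = None,
                            field_mapping: Optional[dict] = None,
                            index_analyzers: Optional[dict] = None) -> str:
    """Return the name of the analyzer used at search time."""
    field_mapping = field_mapping or {}
    index_analyzers = index_analyzers or {}
    candidates = [
        query_analyzer,                        # analyzer given in the query
        field_mapping.get("search_analyzer"),  # search_analyzer in the mapping
        field_mapping.get("analyzer"),         # analyzer in the mapping
        "default_search" if "default_search" in index_analyzers else None,
        "default" if "default" in index_analyzers else None,
    ]
    for name in candidates:
        if name:
            return name
    return "standard"                          # global fallback


index_analyzers = {"default_search": {"type": "whitespace"}}
print(resolve_search_analyzer(field_mapping={"analyzer": "english"},
                              index_analyzers=index_analyzers))  # english
print(resolve_search_analyzer(index_analyzers=index_analyzers))  # default_search
print(resolve_index_analyzer(index_analyzers=index_analyzers))   # standard
```

Note that in this order a field-level analyzer wins over the index-level default_search, which is why the first call prints english.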

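Configuring analyzers in practice
The requests in this article use the typed mapping syntax of Elasticsearch 6.x, in which mappings are nested under the _doc mapping type. The simplest way to specify an analyzer for a particular field is to define it in the field mapping, as follows: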

PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": { ①
          "type": "text",
          "fields": {
            "english": { ②
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
 
GET my_index/_analyze 
{
  "field": "text",
  "text": "The quick Brown Foxes." ③
}
 
GET my_index/_analyze 
{
  "field": "text.english",
  "text": "The quick Brown Foxes." ④
}

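① The text field uses the default standard analyzer.
② The text.english multi-field uses the english analyzer, which removes stop words and applies stemming.
③ Returns the tokens [the, quick, brown, foxes].
④ Returns the tokens [quick, brown, fox].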

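search_quote_analyzer
The search_quote_analyzer setting lets you specify an analyzer for phrases, which is particularly useful for disabling stop word removal in phrase queries.

To disable stop word removal for phrases, a field needs three analyzer settings:

An analyzer setting for indexing all terms, including stop words
A search_analyzer setting for non-phrase queries, which removes stop words
A search_quote_analyzer setting for phrase queries, which does not remove stop words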

PUT my_index
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{ ①
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase"
               ]
            },
            "my_stop_analyzer":{ ②
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "english_stop"
               ]
            }
         },
         "filter":{
            "english_stop":{ 
               "type":"stop",
               "stopwords":"_english_"
            }
         }
      }
   },
   "mappings":{
      "_doc":{
         "properties":{
            "title": {
               "type":"text",
               "analyzer":"my_analyzer",  ③
               "search_analyzer":"my_stop_analyzer",  ④
               "search_quote_analyzer":"my_analyzer"  ⑤
            }
         }
      }
   }
}
 
PUT my_index/_doc/1
{
   "title":"The Quick Brown Fox"
}
 
PUT my_index/_doc/2
{
   "title":"A Quick Brown Fox"
}
 
GET my_index/_search
{
   "query":{
      "query_string":{
         "query":"\"the quick brown fox\""  ⑥
      }
   }
}

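① The my_analyzer analyzer tokenizes all terms, including stop words.
② The my_stop_analyzer analyzer removes stop words.
③ The analyzer setting points to my_analyzer, which is used at index time.
④ The search_analyzer setting points to my_stop_analyzer, which removes stop words from non-phrase queries.
⑤ The search_quote_analyzer setting points to my_analyzer, which ensures that stop words are not removed from phrase queries.

⑥ Because the query is wrapped in quotes, it is detected as a phrase query, so search_quote_analyzer kicks in and ensures that stop words are not removed from the query. The my_analyzer analyzer then returns the tokens [the, quick, brown, fox], which match only one of the documents. Term queries, by contrast, are analyzed with my_stop_analyzer, which filters out stop words. A search for either The quick brown fox or A quick brown fox without quotes therefore returns both documents, since both contain the tokens [quick, brown, fox]. Without search_quote_analyzer, exact matching for phrase queries would not be possible: the stop words would be removed from the phrase query, and both documents would match.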
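The difference between the quoted and unquoted query can be imitated with a toy model in Python. This is a simplified sketch, not how Elasticsearch works internally: the two functions only mimic my_analyzer and my_stop_analyzer, STOPWORDS is a tiny stand-in for the _english_ list, and token positions are ignored apart from requiring the phrase tokens to be adjacent.

```python
import re

STOPWORDS = {"a", "the"}  # tiny stand-in for the _english_ stop word list


def my_analyzer(text):
    """Toy version of my_analyzer: split into words and lowercase."""
    return re.findall(r"\w+", text.lower())


def my_stop_analyzer(text):
    """Toy version of my_stop_analyzer: same, then drop stop words."""
    return [t for t in my_analyzer(text) if t not in STOPWORDS]


DOCS = {1: "The Quick Brown Fox", 2: "A Quick Brown Fox"}
# Index time uses the `analyzer` setting (my_analyzer), so stop words are kept.
INDEX = {doc_id: my_analyzer(title) for doc_id, title in DOCS.items()}


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur as a consecutive run in doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def search(query, quote_analyzer=my_analyzer):
    """Quoted queries are phrase queries; others match on any single term."""
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        phrase = quote_analyzer(query[1:-1])   # search_quote_analyzer
        return sorted(d for d, toks in INDEX.items()
                      if contains_phrase(toks, phrase))
    terms = my_stop_analyzer(query)            # search_analyzer
    return sorted(d for d, toks in INDEX.items()
                  if any(t in toks for t in terms))


print(search('"the quick brown fox"'))  # [1]
print(search('the quick brown fox'))    # [1, 2]
# Without search_quote_analyzer, the phrase would lose its stop word too:
print(search('"the quick brown fox"', quote_analyzer=my_stop_analyzer))  # [1, 2]
```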
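
Compare "query":"\"the quick brown fox\"" (a phrase query, which matches only document 1) with "query":"the quick brown fox" (a non-phrase query, which matches both documents).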

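Another custom analyzer configuration
Create the index and configure the analyzer. If my_index from the earlier example still exists, delete it first with DELETE /my_index, because creating an index that already exists fails.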

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "& => and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
        } 
    }
}

Add the mapping for the _doc type and assign the analyzer to the title field

PUT /my_index/_mapping/_doc
{
    "_doc": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer",
                "search_quote_analyzer": "my_analyzer"
            }
        }
    }
    
}

Index a document

POST /my_index/_doc/1
{
    "title": "the a <a>你好</a> & "
}

Search for 你好. The html_strip character filter removes the <a> tags, so the document matches:

POST /my_index/_search
{
    "query": {
        "match": {
          "title": "你好"
        }
    }
}

The & character is replaced with and, so a search for and matches the document:

POST /my_index/_search 
{
    "query": {
        "match": {
          "title": "and"
        }
    }
}

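The stop words "the" and "a" are filtered out at both index time and search time, so this query is left with no terms and returns no hits: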

POST /my_index/_search 
{
    "query": {
        "match": {
          "title": "the a"
        }
    }
}
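To see why the three searches behave this way, here is a rough Python approximation of the my_analyzer pipeline (character filters, then tokenizer, then token filters). It is only a sketch under simplifying assumptions: the regular expressions are crude stand-ins for html_strip and the standard tokenizer, and the helper names are made up for this example.

```python
import re

MY_STOPWORDS = {"the", "a"}  # same list as the my_stopwords filter


def html_strip(text):
    """Crude stand-in for the html_strip char filter: drop anything tag-like."""
    return re.sub(r"<[^>]+>", "", text)


def amp_to_and(text):
    """Mirror the mapping rule "& => and " (including its trailing space)."""
    return text.replace("&", "and ")


def standard_tokenize(text):
    """Crude stand-in for the standard tokenizer: runs of ASCII letters or
    digits form one token, and each CJK ideograph becomes its own token."""
    return re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]", text)


def my_analyzer(text):
    # 1. character filters, in the configured order
    for char_filter in (html_strip, amp_to_and):
        text = char_filter(text)
    # 2. tokenizer
    tokens = standard_tokenize(text)
    # 3. token filters: lowercase, then stop word removal
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in MY_STOPWORDS]


print(my_analyzer("the a <a>你好</a> & "))  # ['你', '好', 'and']
print(my_analyzer("你好"))                   # ['你', '好']
print(my_analyzer("the a"))                  # []
```

The indexed tokens are 你, 好 and and, which is why the first two searches match the document, while the query the a is reduced to an empty token list.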
