Article link: https://blog.csdn.net/qq_22271479/article/details/88174201

Elasticsearch: partial matching

Reference:

https://www.elastic.co/guide/cn/elasticsearch/guide/current/scoring-theory.html

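Phrase matching

For a document to count as a match for the phrase quick brown fox, all of the following must hold:

quick, brown, and fox must all appear in the field.
The position of brown must be 1 greater than the position of quick.
The position of fox must be 2 greater than the position of quick.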

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}

Or, written as a match query with type phrase:

"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}
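The three position requirements listed earlier amount to "the phrase terms sit at consecutive positions". A minimal Python sketch of that check follows. It assumes the field has already been split into a list of tokens, and the function name phrase_matches is made up for illustration; it is not how Elasticsearch is implemented internally.

```python
def phrase_matches(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occur at consecutive positions in doc_tokens."""
    n = len(phrase_tokens)
    # Try every possible start position; term i of the phrase must sit at start + i.
    return any(
        doc_tokens[start:start + n] == phrase_tokens
        for start in range(len(doc_tokens) - n + 1)
    )


doc = "the quick brown fox jumps".split()
print(phrase_matches(doc, ["quick", "brown", "fox"]))  # True
print(phrase_matches(doc, ["quick", "fox"]))           # False: brown sits in between
```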

Relaxing positions with slop

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}

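The slop parameter controls how far apart, and how far out of order, the terms may be while still counting as a phrase match. It is the number of single-position moves allowed to make the query line up with the document.
For example, take the reversed query fox quick.
It needs a slop of at least 2 to match text containing quick fox, and at least 3 to match quick brown fox, as the diagram below shows.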

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      fox           quick
Slop 1:     fox|quick  ↵   // at this step fox and quick occupy the same position
Slop 2:     quick      ↳  fox
Slop 3:     quick                 ↳     fox
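The diagram can be turned into a small calculation. The sketch below is a simplified model of sloppy phrase matching, not Lucene's actual code: it assumes every query term occurs exactly once in the document, and the name required_slop is hypothetical. For each term it computes how far the document position is shifted from the query position; the spread between the largest and smallest shift is the slop needed.

```python
def required_slop(doc_tokens, query_tokens):
    """Smallest slop at which the query phrase matches, or None if a term is missing.

    Simplifying assumption: each query term occurs exactly once in doc_tokens.
    """
    try:
        # For each query term: (position in document) - (position in query).
        shifts = [doc_tokens.index(term) - q_pos
                  for q_pos, term in enumerate(query_tokens)]
    except ValueError:
        return None  # a query term does not appear in the document at all
    return max(shifts) - min(shifts)


doc = ["quick", "brown", "fox"]
print(required_slop(doc, ["quick", "fox"]))              # 1
print(required_slop(doc, ["fox", "quick"]))              # 3
print(required_slop(["quick", "fox"], ["fox", "quick"])) # 2
```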

Multivalue fields

PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

Phrase query:

GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}

When the array is indexed, its terms are given these positions:

position 1:John
position 2:Abraham
position 3:Lincoln
position 4:Smith

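Abraham and Lincoln sit at adjacent positions, so the phrase query matches this document even though the two words belong to different names. This is a false positive.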

Solution

Set position_increment_gap on the field in the mapping:

PUT /my_index/_mapping/groups
{
    "properties": {
        "names": {
            "type":                   "string",
            "position_increment_gap": 100
        }
    }
}

The positions are now:

position 1:John
position 2:Abraham
position 103:Lincoln
position 104:Smith

A phrase query for Abraham Lincoln now matches this document only if slop is at least 100.
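To see where positions 103 and 104 come from, here is a small Python sketch that assigns positions to the terms of a multivalue field. It is an illustration under assumptions: whitespace splitting stands in for the real analyzer, positions are numbered from 1 as in this article, and term_positions is a hypothetical helper name.

```python
def term_positions(values, position_increment_gap=0):
    """Assign 1-based positions to the terms of a multivalue field."""
    positions = []
    pos = 0
    for value in values:
        for token in value.split():
            pos += 1
            positions.append((pos, token))
        # After each value, skip ahead by the configured gap.
        pos += position_increment_gap
    return positions


names = ["John Abraham", "Lincoln Smith"]
print(term_positions(names))
# [(1, 'John'), (2, 'Abraham'), (3, 'Lincoln'), (4, 'Smith')]
print(term_positions(names, position_increment_gap=100))
# [(1, 'John'), (2, 'Abraham'), (103, 'Lincoln'), (104, 'Smith')]
```

With the gap in place, Abraham (position 2) and Lincoln (position 103) are 101 positions apart instead of 1, which is why the phrase needs a slop of 100 to match.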

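Slop and scoring (closer is better)

With a large slop value, documents whose query terms are far apart still match, but the closer together the terms appear, the higher the score.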

POST /my_index/my_type/_search
{
   "query": {
      "match_phrase": {
         "title": {
            "query": "quick dog",
            "slop":  50 
         }
      }
   }
}

{
  "hits": [
     {
        "_id":      "3",
        "_score":   0.75, 
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":      "2",
        "_score":   0.28347334, 
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}
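Why closer terms score higher can be sketched with the proximity weight used by Lucene's built-in similarities, which weight a sloppy match by 1 / (distance + 1). The snippet below only illustrates that weight. The final _score values shown in the response also depend on other factors (term frequency, inverse document frequency, field length), so these numbers will not reproduce the scores in the response.

```python
def sloppy_freq(distance):
    """Weight of one sloppy phrase match; distance is how far the terms were moved."""
    return 1.0 / (distance + 1)


# An exact phrase (distance 0) gets full weight; the weight drops as terms drift apart.
for distance in (0, 1, 6):
    print(distance, sloppy_freq(distance))
```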

Using phrase matching to boost relevance without narrowing the results

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { 
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { 
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}

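The must clause first finds documents that match at least 30% of the query terms.
The should clause then uses a phrase match to raise the score of documents in which the terms appear close together.

A phrase match on its own requires every term to be present and also constrains their positions, so it excludes many relevant documents. Wrapping it in a bool query keeps the broad result set from the match query and uses proximity only as a relevance signal.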

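Optimizing phrase and proximity queries

Rescoring results

In the previous section we used proximity queries only to adjust relevance, not to include documents in or exclude them from the result list. A query may match thousands of results, but users are most likely interested only in the first few pages.

A simple match query already ranks documents that contain all of the search terms near the top of the list. In practice, we only want to reorder those top documents, giving an extra relevance boost to the ones that also match the phrase query.

The search API supports this directly through rescoring. The rescore phase applies a more expensive scoring algorithm (a phrase query, for example) to only the top K results from each shard. Those results are then re-sorted according to their new scores.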

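The request looks like this:

GET /my_index/my_type/_search
{
    "query": {
        // The match query decides which documents are included in the final
        // result set and ranks them by TF/IDF.
        "match": {
            "title": {
                "query":                "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        // window_size is the number of top documents to rescore on each shard.
        "window_size": 50,
        "query": {
            // Currently the only supported rescoring algorithm is another query,
            // although there are plans to add more later.
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop":  50
                    }
                }
            }
        }
    }
}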
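The effect of window_size can be sketched in a few lines of Python. This is a simplified model of what happens on one shard, not the actual implementation: it assumes the rescore score is simply added to the original score, and the names rescore and phrase_bonus, along with the sample scores, are invented for the example.

```python
def rescore(hits, rescore_fn, window_size):
    """Re-rank the top window_size hits of one shard.

    hits: list of (doc_id, score) pairs from the first query.
    rescore_fn: returns the extra score the second query gives a document.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    # Only documents inside the window pay for the expensive second query.
    rescored = [(doc_id, score + rescore_fn(doc_id)) for doc_id, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest


hits = [("a", 1.0), ("b", 0.75), ("c", 0.25)]
phrase_bonus = {"b": 0.5, "c": 5.0}  # hypothetical phrase-query scores
print(rescore(hits, lambda d: phrase_bonus.get(d, 0.0), window_size=2))
# [('b', 1.25), ('a', 1.0), ('c', 0.25)]
```

Document c would have received a large bonus, but it falls outside the window and is never rescored. That is the trade-off: a small window is cheap but may miss documents that the phrase query would have promoted.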

Partial matching

Partial matching plays the role that a LIKE clause plays in SQL:

select * from t1 where t1.a like '%tom%'

The prefix query

GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W1"
        }
    }
}

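To support prefix matching, the query does the following:

1. Scans the term list to find the first term that begins with W1.
2. Collects the document IDs associated with that term.
3. Moves to the next term.
4. If that term also begins with W1, goes back to step 2 and repeats, until it reaches a term that does not begin with W1.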

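This is fine when the field contains only a small set of distinct terms, but it scales poorly and can put a lot of pressure on the cluster. Using a longer prefix limits the impact, because fewer terms have to be visited.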
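The scan described above can be sketched in Python over a sorted term list. The postcodes and document IDs below are sample data assumed for illustration, and prefix_query is a hypothetical name; a real inverted index is a more elaborate structure, but the walk over consecutive terms is the same idea.

```python
from bisect import bisect_left


def prefix_query(sorted_terms, postings, prefix):
    """Collect the IDs of all documents that contain a term starting with prefix."""
    doc_ids = set()
    # Step 1: jump to the first term that is >= the prefix.
    i = bisect_left(sorted_terms, prefix)
    # Step 4: keep going while the current term still starts with the prefix.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        doc_ids |= postings[sorted_terms[i]]  # Step 2: collect document IDs.
        i += 1                                # Step 3: move to the next term.
    return doc_ids


# Assumed sample data: term -> set of document IDs.
postings = {
    "SW5 0BE": {5},
    "W1F 7HW": {3},
    "W1V 3DG": {1},
    "W2F 8HW": {2},
    "WC1N 1LZ": {4},
}
terms = sorted(postings)
print(prefix_query(terms, postings, "W1"))  # {1, 3}
```

The cost grows with the number of terms that share the prefix: the prefix W would visit four terms here, while W1 visits only two.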

Wildcard and regexp queries

? matches any single character; * matches zero or more characters.

GET /my_index/address/_search
{
    "query": {
        "wildcard": {
            "postcode": "W?F*HW" 
        }
    }
}

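The regular expression in the following query requires the term to begin with W, followed by any digit from 0 to 9, followed by one or more other characters.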


GET /my_index/address/_search
{
    "query": {
        "regexp": {
            "postcode": "W[0-9].+" 
        }
    }
}

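Note: wildcard and regexp queries work in exactly the same way as the prefix query, so they carry the same performance concerns.

All of these partial-matching queries are term-level queries: they operate directly on the terms in the inverted index, and the query string is not analyzed.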
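The two patterns used above can be tried against a handful of terms in Python. This is only an approximation: fnmatchcase and Python's re module stand in for Lucene's wildcard and regular expression engines, whose syntax differs in places, and the postcodes are assumed sample data. The point is that the pattern is tested against whole terms, one term at a time.

```python
import re
from fnmatch import fnmatchcase

# Assumed sample terms from an exact-value postcode field.
terms = ["SW5 0BE", "W1F 7HW", "W1V 3DG", "W2F 8HW", "WC1N 1LZ"]


def wildcard_query(terms, pattern):
    """Return the terms that match a ?/* wildcard pattern in full."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def regexp_query(terms, pattern):
    """Return the terms that match the regular expression in full."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]


print(wildcard_query(terms, "W?F*HW"))  # ['W1F 7HW', 'W2F 8HW']
print(regexp_query(terms, "W[0-9].+"))  # ['W1F 7HW', 'W1V 3DG', 'W2F 8HW']
```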

Search as you type (suggesting related searches while the user types)

Phrase prefix matching

{
    "match_phrase_prefix" : {
        "brand" : {
            "query": "walker johnnie bl", 
            "slop":  10
        }
    }
}

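Without slop, a match_phrase_prefix query for johnnie walker bl matches:

johnnie
followed by walker
followed by a word that begins with bl

Written in query_string syntax, that is:
"johnnie walker bl*"

With slop set, as in the request above, the order and positions of the words become flexible, so walker johnnie bl still matches.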

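The max_expansions parameter limits how many terms the prefix may expand to. The query finds the first term that starts with bl, then keeps collecting matching terms in alphabetical order until there are no more matching terms or the count exceeds max_expansions.

Keep in mind that this query runs again every time the user types another character, so it needs to be fast. If the first set of results is not what the user wants, they will keep typing until the results are satisfactory.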

{
    "match_phrase_prefix" : {
        "brand" : {
            "query":          "johnnie walker bl",
            "max_expansions": 50
        }
    }
}
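A rough Python sketch of how max_expansions caps the prefix expansion is shown below. The term list is invented sample data and expand_prefix is a hypothetical helper; it models only the "collect in alphabetical order, stop at the limit" behaviour described above.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix, max_expansions=50):
    """Return at most max_expansions terms, in sorted order, that start with prefix."""
    expansions = []
    i = bisect_left(sorted_terms, prefix)
    while (i < len(sorted_terms)
           and sorted_terms[i].startswith(prefix)
           and len(expansions) < max_expansions):
        expansions.append(sorted_terms[i])
        i += 1
    return expansions


terms = sorted(["black", "blend", "blue", "bold", "label"])
print(expand_prefix(terms, "bl"))                    # ['black', 'blend', 'blue']
print(expand_prefix(terms, "bl", max_expansions=2))  # ['black', 'blend']
```

With a limit of 2 the term blue is never considered, so a user who has typed only bl may not see it yet. Typing one more character (blu) narrows the candidates and brings it back.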

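N-grams for partial matching

Sliding a window of each length across the word quick produces these n-grams:

Length 1 (unigram): [ q, u, i, c, k ]
Length 2 (bigram): [ qu, ui, ic, ck ]
Length 3 (trigram): [ qui, uic, ick ]
Length 4 (four-gram): [ quic, uick ]
Length 5 (five-gram): [ quick ]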
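The sliding window is easy to reproduce in Python. The function name ngrams is chosen for this sketch only.

```python
def ngrams(word, n):
    """Return all substrings of length n, in order of appearance."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in range(1, 6):
    print(n, ngrams("quick", n))
```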

Index-time search as you type

Token filter:
{
    "filter": {
        "autocomplete_filter": {
            "type":     "edge_ngram",
            "min_gram": 1,
            "max_gram": 20
        }
    }
}

Custom analyzer that uses the filter:
{
    "analyzer": {
        "autocomplete": {
            "type":      "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "autocomplete_filter" 
            ]
        }
    }
}

Create the index with both:
PUT /my_index
{
    "settings": {
        "number_of_shards": 1, 
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

Apply the analyzer to a field in the mapping:
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}


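Query (the standard analyzer is set explicitly so that the query string itself is not broken into edge n-grams):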
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard" 
            }
        }
    }
}
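The reason for overriding the analyzer at query time can be shown with a small simulation. In this sketch, lowercasing plus whitespace splitting stands in for the standard tokenizer and lowercase filter, and the function names and sample documents are invented for the example.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of token, from min_gram up to max_gram characters long."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def autocomplete_analyze(text):
    """Stand-in for the autocomplete analyzer: lowercase, split, edge n-grams."""
    grams = []
    for token in text.lower().split():
        grams.extend(edge_ngrams(token))
    return grams


def standard_analyze(text):
    """Stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches_any(indexed_terms, query_terms):
    """A match query with the default OR operator needs one shared term."""
    return any(term in indexed_terms for term in query_terms)


blue_fish = set(autocomplete_analyze("Blue fish"))
brown_foxes = set(autocomplete_analyze("Brown foxes"))

# Query analyzed with the standard analyzer: terms are brown and fo.
print(matches_any(brown_foxes, standard_analyze("brown fo")))    # True
print(matches_any(blue_fish, standard_analyze("brown fo")))      # False

# Query analyzed with the autocomplete analyzer: b and f are now query terms too.
print(matches_any(blue_fish, autocomplete_analyze("brown fo")))  # True (unwanted)
```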

N-grams for compound words

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type":     "string",
                    "analyzer": "trigrams" 
                }
            }
        }
    }
}

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": {
                "query":                "Gesundheit",
                "minimum_should_match": "80%"
            }
        }
    }
}
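To get a feel for what 80% means here, the sketch below breaks the query word into trigrams and counts how many of them a candidate word shares. It is a simplified model: the percentage is rounded down, as Elasticsearch does for minimum_should_match, scoring is ignored, and the candidate words and function names are assumptions made for the example.

```python
def trigrams(word):
    """Lowercased trigrams of a single word."""
    w = word.lower()
    return [w[i:i + 3] for i in range(len(w) - 2)]


def required_matches(term_count, percent):
    """Number of terms that must match; the percentage is rounded down."""
    return (term_count * percent) // 100


def trigram_match(query_word, indexed_word, percent=80):
    query_grams = trigrams(query_word)
    indexed_grams = set(trigrams(indexed_word))
    shared = sum(1 for gram in query_grams if gram in indexed_grams)
    return shared >= required_matches(len(query_grams), percent)


print(trigrams("Gesundheit"))
# ['ges', 'esu', 'sun', 'und', 'ndh', 'dhe', 'hei', 'eit']
print(required_matches(8, 80))                        # 6
print(trigram_match("Gesundheit", "Gesundheitsamt"))  # True: all 8 trigrams shared
print(trigram_match("Gesundheit", "Gesund"))          # False: only 4 shared
```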
