Elasticsearch Notes: Proximity Matching and Partial Matching (Part 7)

-yanhui-

Tags: elasticsearch, proximity matching, partial matching, fuzzy query

Article link: https://blog.csdn.net/xyh930929/article/details/72313439

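Phrase matching

For a document to be considered a match for the phrase quick brown fox, it must satisfy all of the following:

quick, brown, and fox must all appear in the field.
The position of brown must be 1 greater than the position of quick.
The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document is not a match.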
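
The three conditions above can be sketched in a few lines of Python. This is only an illustration of the positional check, not how Elasticsearch implements it: index_positions is a hypothetical stand-in for an analyzer (lowercase, split on whitespace, positions counted from 1), and phrase_matches is a name made up for this sketch.

```python
def index_positions(text):
    """Hypothetical analyzer: lowercase, split on whitespace, count from 1."""
    positions = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        positions.setdefault(token, []).append(pos)
    return positions


def phrase_matches(doc_positions, phrase_terms):
    """Return True if the terms occur at consecutive positions in the field.

    doc_positions maps each term to the list of positions where it occurs.
    phrase_terms must contain at least one term.
    """
    first, *rest = phrase_terms
    for start in doc_positions.get(first, []):
        # Every following term must sit exactly `offset` positions after the first.
        if all(start + offset in doc_positions.get(term, [])
               for offset, term in enumerate(rest, start=1)):
            return True
    return False


doc = index_positions("The quick brown fox jumps")
print(phrase_matches(doc, ["quick", "brown", "fox"]))  # True
print(phrase_matches(doc, ["quick", "fox"]))           # False: fox is 2 positions after quick
```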

The match_phrase query

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}

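Like the match query, the match_phrase query first analyzes the query string into a list of terms and then searches for those terms, but it keeps only the documents that contain all of the terms in the same relative positions as in the query string.

When a string is analyzed, the analyzer returns not only a list of terms but also the position, or order, of each term in the original string:
GET /_analyze?analyzer=standard
Quick brown fox
Response: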

{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1  // position of the term in the phrase
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 
      }
   ]
}
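
To make the fields in this response concrete, here is a rough Python approximation that produces the same token, offset, and position values for simple English text. The analyze function is hypothetical and only mimics lowercasing and splitting on word characters; the real standard analyzer does more, and positions are counted from 1 here only to mirror the response above (other Elasticsearch versions may count from 0).

```python
import re


def analyze(text):
    """Hypothetical stand-in for the standard analyzer on simple English text."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group().lower(),   # terms are lowercased
            "start_offset": match.start(),    # character offset where the token starts
            "end_offset": match.end(),        # character offset just past the token
            "position": position,             # order of the token in the string
        })
    return tokens


for token in analyze("Quick brown fox"):
    print(token)
```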

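The match_phrase query (slop parameter)

If a document containing "quick brown fox" should also match the query "quick fox", use the slop parameter. slop tells the match_phrase query how far apart terms are allowed to be while still treating the document as a match.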
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}
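
One way to think about slop is as the number of position moves needed to line the terms up with the query. The Python sketch below is a simplified model of that idea, assuming no term is repeated in the phrase; sloppy_phrase_matches is a hypothetical name and this is not the actual Lucene scorer. Each document position is shifted back by the term's offset in the query, so an exact phrase gives identical shifted values, and the spread between the largest and smallest shifted value is the slop required.

```python
from itertools import product


def sloppy_phrase_matches(doc_positions, phrase_terms, slop=0):
    """Simplified proximity check, assuming no repeated terms in the phrase."""
    candidates = [doc_positions.get(term, []) for term in phrase_terms]
    # Try every combination of one position per term.
    for combo in product(*candidates):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


# Positions of the terms in "quick brown fox"
doc = {"quick": [1], "brown": [2], "fox": [3]}
print(sloppy_phrase_matches(doc, ["quick", "fox"], slop=0))  # False
print(sloppy_phrase_matches(doc, ["quick", "fox"], slop=1))  # True
print(sloppy_phrase_matches(doc, ["fox", "quick"], slop=2))  # False
print(sloppy_phrase_matches(doc, ["fox", "quick"], slop=3))  # True
```

In this model, reversing the order of two terms costs extra moves, which is why the transposed query "fox quick" needs a larger slop than "quick fox".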

The match_phrase query (a quirk with multi-value fields)

Suppose we have the following document:
{
    "names": [ "John Abraham", "Lincoln Smith"]
}
Run the following query:
GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
Even though Abraham and Lincoln belong to two different people's names in the names array, the document still matches the query. The reason lies in how Elasticsearch indexes arrays.

Analyzing John Abraham produces:
Position 1: john
Position 2: abraham

Analyzing Lincoln Smith then produces:
Position 3: lincoln
Position 4: smith

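For this array, Elasticsearch produces almost exactly the same tokens as it would for the single string John Abraham Lincoln Smith. Our example query looks for abraham immediately followed by lincoln; both terms exist and they are adjacent, so the query matches. The way to avoid this is the position_increment_gap parameter.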

position_increment_gap

DELETE /my_index/groups/ 

PUT /my_index/_mapping/groups 
{
    "properties": {
        "names": {
            "type":                "string",
            "position_increment_gap": 100
        }
    }
}

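The position_increment_gap setting tells Elasticsearch to increase the current term position by the specified value for every new element in the array. When we index the names array again, the terms now end up at the following positions: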

Position 1: john
Position 2: abraham
Position 103: lincoln
Position 104: smith

Our phrase query no longer matches this document, because abraham and lincoln are now 100 positions apart. To match the document, you would have to add a slop of 100.
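
The effect of the gap on positions can be reproduced with a small Python sketch. index_array is a hypothetical helper that uses a lowercase whitespace split as its analyzer and counts positions from 1, as in the listings above; it only models the position arithmetic, not real indexing.

```python
def index_array(values, position_increment_gap=0):
    """Assign positions to the tokens of a multi-value field.

    Hypothetical analyzer: lowercase whitespace split, positions counted from 1.
    """
    positions = []
    last = 0
    for i, value in enumerate(values):
        if i > 0:
            # Every new array element pushes the position forward by the gap.
            last += position_increment_gap
        for token in value.lower().split():
            last += 1
            positions.append((last, token))
    return positions


names = ["John Abraham", "Lincoln Smith"]
print(index_array(names))
# [(1, 'john'), (2, 'abraham'), (3, 'lincoln'), (4, 'smith')]
print(index_array(names, position_increment_gap=100))
# [(1, 'john'), (2, 'abraham'), (103, 'lincoln'), (104, 'smith')]
```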

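The slop setting also affects scoring: the closer together the terms of the phrase are, the higher the score.

For example, a proximity query for quick dog matches the following two documents: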

{
  "hits": [
     {
        "_id":      "3",
        "_score":   0.75, 
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":      "2",
        "_score":   0.28347334, 
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}

Document 3 scores higher than document 2, because quick and dog are closer together in document 3.

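Tip

Sometimes you run into this situation: if six out of seven terms match, the document is probably relevant enough for the user, but a match_phrase query would exclude it.

You can handle it like this:

Use a simple match query as a must clause. This query decides which documents are included in the result set, and the minimum_should_match parameter can be used to trim the long tail. Then add more specific queries as should clauses. Each one that matches increases the relevance of the matching document.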

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { 
          "title": { "query": "quick brown fox", "minimum_should_match": "30%" } }
      },
      "should": {
        "match_phrase": { 
          "title": { "query": "quick brown fox", "slop": 50 } }
      }
    }
  }
}

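Finding associated words

None of the queries above can solve the following problem: the two sentences I’m not happy I’m working and I’m happy I’m not working contain the same words in similar proximity, yet they mean entirely different things.

Approach:
For the sentence Sue ate the alligator, index not only each individual word (or unigram) as a term:
["sue", "ate", "the", "alligator"]
but also each word together with its neighbor as a single term:
["sue ate", "ate the", "the alligator"]
These word pairs (or bigrams) are called shingles.

Shingles are not limited to word pairs; you can also index word triples (trigrams):
["sue ate the", "ate the alligator"]
Trigrams give higher precision but also greatly increase the number of unique terms in the index. In most cases, bigrams are enough.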
DELETE /my_index
PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2,  // the default min/max shingle size is 2, so these two settings are not strictly needed
                    "max_shingle_size": 2,
                    "output_unigrams":  false  // the shingle token filter outputs unigrams by default, but we want to keep unigrams and bigrams separate
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [ "lowercase", "my_shingle_filter" ]  // my_shingle_analyzer uses our custom my_shingle_filter token filter
                }
            }
        }
    }
}
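
As a rough illustration of what a shingle filter emits, the Python sketch below builds the bigrams and trigrams listed earlier from a token list. The shingles function is a hypothetical helper, and the lowercase whitespace split is only a stand-in for the real tokenizer and lowercase filter.

```python
def shingles(tokens, size=2):
    """Join every run of `size` adjacent tokens into a single term."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = "Sue ate the alligator".lower().split()
print(shingles(tokens, 2))  # ['sue ate', 'ate the', 'the alligator']
print(shingles(tokens, 3))  # ['sue ate the', 'ate the alligator']
```

With output_unigrams set to false, as in the settings above, only the joined terms would be produced, which matches what this sketch returns.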

Partial matching

    WHERE text LIKE "%quick%" AND text LIKE "%brown%" AND text LIKE "%fox%"

To achieve something similar to the SQL statement above, Elasticsearch offers three approaches:

prefix query

GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W1"
        }
    }
}

Wildcard query

GET /my_index/address/_search
{
    "query": {
        "wildcard": {
            "postcode": "W?F*HW" 
        }
    }
}
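// It uses standard shell wildcard syntax: ? matches any single character, * matches zero or more characters.
// Here ? can match 1 and 2, and * can match a space followed by 7 or 8.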
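
The wildcard semantics described in the comments can be checked with a short Python sketch that translates a pattern into an anchored regular expression. wildcard_matches is a hypothetical helper, the sample postcodes are made-up illustration values, and the sketch assumes the whole postcode is stored as a single term.

```python
import re


def wildcard_to_regex(pattern):
    """Translate ? and * into regex syntax; every other character is literal."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")      # exactly one character
        elif ch == "*":
            parts.append(".*")     # zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_matches(pattern, term):
    # The pattern has to cover the whole term, not just a part of it.
    return wildcard_to_regex(pattern).fullmatch(term) is not None


postcodes = ["W1V 3DG", "W2F 8HW", "W1F 7HW", "WC1N 1LZ", "SW5 0BE"]
print([p for p in postcodes if wildcard_matches("W?F*HW", p)])
# ['W2F 8HW', 'W1F 7HW']
```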

Regular expression query (regexp)

GET /my_index/address/_search
{
    "query": {
        "regexp": {
            "postcode": "W[0-9].+" 
        }
    }
}

// This regular expression requires the term to start with W, followed by any single digit from 0 to 9, followed by one or more other characters.
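
The same pattern can be tried out in Python as a quick sanity check. Two assumptions apply: the regexp query matches against the whole term, which re.fullmatch approximates here, and this particular pattern happens to mean the same thing in Python's re module, although the two regex syntaxes differ in general. regexp_matches and the sample postcodes are made up for the illustration.

```python
import re


def regexp_matches(pattern, term):
    """Return True only if the pattern covers the entire term."""
    return re.fullmatch(pattern, term) is not None


sample = ["W1V 3DG", "W2F 8HW", "WC1N 1LZ", "W1", "SW5 0BE"]
print([t for t in sample if regexp_matches(r"W[0-9].+", t)])
# ['W1V 3DG', 'W2F 8HW']
```

Note that "W1" is rejected because .+ requires at least one more character after the digit, and "WC1N 1LZ" is rejected because the second character is not a digit.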