13 Search DSL: Advanced Queries

Latest recommended article published on 2024-03-03 23:10:47

wyaoyao93

Category column: elastic-search

Article link: https://blog.csdn.net/wyaoyao93/article/details/112112150

1 Environment setup

Create an index named item:

PUT item
{
    "mappings":{
        "properties":{
            "id":{
                "type":"long"
            },
            "title":{
                "type":"text",
                "analyzer":"ik_max_word"
            },
            "content":{
                "type":"text",
                "analyzer":"ik_max_word"
            },
            "price":{
                "type":"float"
            },
            "category":{
                "type":"keyword"
            }
        }
    }
}

Insert data:

PUT item/_doc/1
{
  "id":1,
  "title":"小米手机",
  "content":"手机中的性价比之王",
  "price":1000.00,
  "category":"手机"
}

PUT item/_doc/2
{
  "id":2,
  "title":"小米电视",
  "content":"电视中的性价比之王",
  "price":1005.00,
  "category":"电视"
}

PUT item/_doc/3
{
  "id":3,
  "title":"华为电视盒子",
  "content":"电视盒直播网络机顶盒4K高清华为海思芯片机顶盒WIFI宽带电视盒子家用电视合猫播放器",
  "price":1005.00,
  "category":"电视"
}

PUT item/_doc/4
{
  "id":4,
  "title":"海信冰箱",
  "content":"食品保险冷冻首先农品",
  "price":3005.00,
  "category":"冰箱"
}

PUT item/_doc/5
{
  "id":5,
  "title":"华为手机",
  "content":"首款5g手机",
  "price":4005.00,
  "category":"手机"
}

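2 Boolean query (bool)

The bool query combines other queries through must (AND), must_not (NOT), and should (OR) clauses:

must: the clause must match, and it contributes to the relevance score
filter: the clause must match, but its score is ignored (filter does not affect the score)
should: the clause should match. If the bool query has no must or filter clause, a document must match at least one should clause. The minimum number of should clauses that must match can be set with the minimum_should_match parameter
must_not: the clause must not match

The bool query follows a "more matches is better" approach: the scores of all matching must and should clauses are added together to give each document its final _score.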
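The clause rules above can be sketched as a toy model in Python. This is only an illustration of the semantics, not how Elasticsearch is implemented: the function bool_score, the parameter name filters, and the three clause scorers with their made-up score values are all assumptions introduced here.

```python
# Toy model of bool clause semantics. NOT Elasticsearch's implementation.
# A clause is a function doc -> score, where 0.0 means "no match".

def bool_score(doc, must=(), filters=(), should=(), must_not=()):
    """Return the combined score for doc, or None if doc is excluded."""
    # must_not: a single match excludes the document.
    if any(clause(doc) > 0 for clause in must_not):
        return None
    # must and filter: every clause has to match.
    if any(clause(doc) == 0 for clause in list(must) + list(filters)):
        return None
    should_scores = [clause(doc) for clause in should]
    # Without must/filter clauses, at least one should clause has to match.
    if should and not must and not filters and not any(s > 0 for s in should_scores):
        return None
    # Only must and should clauses add to the score; filter clauses do not.
    return sum(clause(doc) for clause in must) + sum(should_scores)


# Hypothetical clause scorers with made-up score values.
def is_phone(doc):
    return 1.0 if doc["category"] == "手机" else 0.0

def mentions_5g(doc):
    return 2.0 if "5g" in doc["content"] else 0.0

def is_huawei(doc):
    return 1.0 if "华为" in doc["title"] else 0.0


huawei_phone = {"title": "华为手机", "content": "首款5g手机", "category": "手机"}
xiaomi_phone = {"title": "小米手机", "content": "手机中的性价比之王", "category": "手机"}

with_filter = bool_score(huawei_phone, must=[is_phone], should=[mentions_5g], filters=[is_huawei])
without_filter = bool_score(huawei_phone, must=[is_phone], should=[mentions_5g])
xiaomi_filtered = bool_score(xiaomi_phone, must=[is_phone], should=[mentions_5g], filters=[is_huawei])
print(with_filter, without_filter, xiaomi_filtered)
```

In this sketch the filter clause only decides whether a document is kept; the score of a kept document is the same with or without it.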

Demo

For example, search for phones priced between 1000 and 20000, where 5g support is optional and the brand is Huawei (华为):

GET item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "手机"
          }
        },
        {
          "range": {
            "price": {
              "gte": 1000,
              "lte": 20000
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "content": "5g"
          }
        }
      ],
      "filter": {
        "term": {
          "title": "华为"
        }
      }
    }
  }
}

If we remove the filter:

GET item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "手机"
          }
        },
        {
          "range": {
            "price": {
              "gte": 1000,
              "lte": 20000
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "content": "5g"
          }
        }
      ]
    }
  }
}

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 7.009481,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 7.009481,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.8754687,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        }
      }
    ]
  }
}

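With and without the filter, the Huawei phone (华为手机) gets the same _score of 7.009481 in both queries, which shows that filter does not affect the score.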

3 Best-matching field

3.1 Introduction

PUT /my_index/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /my_index/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

Let's run the following bool query:

GET my_index/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

{
  "took" : 544,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.90425634,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.90425634,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.77041256,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      }
    ]
  }
}

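Judging by the scores, Document 1 ranks above Document 2. Yet for the search Brown fox, Document 2 looks like the better match, because its body field contains the complete phrase brown fox.

How the bool query arrives at its score:

Run both queries in the should clause
Add up the scores they return
In older versions that use the classic TF/IDF similarity, multiply the sum by the number of matching clauses
In those older versions, divide the result by the total number of clauses

The last two steps form the coordination factor, which newer versions no longer apply. The response above shows the newer behavior: Document 2's score of 0.77041256 is identical to the score that dis_max reports for it in section 3.2, so nothing was halved and each score is simply the sum of the matching clauses' scores.

Document 1 contains brown in both fields, so both match queries match and each contributes a score. Document 2 contains both brown and fox in the body field but none of the search terms in the title field. Its high body score is added to a zero score for title, and the total still ends up below the sum of Document 1's two scores. Where the coordination factor applies, Document 2's total is additionally multiplied by 1 (the number of matching clauses) and divided by 2 (the total number of clauses), which widens the gap.

In this example the title and body fields compete with each other. What we want is the single best-matching field.

What if, instead of combining the scores from every field, we used the score of the best-matching clause as the overall score of the query? That would favor a field that contains both of the words we are looking for over the same word repeated across different fields.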

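3.2 dis_max query

Instead of the bool query we can use the dis_max query (Disjunction Max Query). Disjunction means "OR" (while conjunction means "AND"), so the Disjunction Max Query returns documents that match any of the queries and scores each document with the score of its best-matching query: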

GET my_index/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.77041256,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.77041256,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      }
    ]
  }
}
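The difference between the two responses can be checked by hand. The sketch below is plain Python arithmetic, not an Elasticsearch call. The per-clause scores are inferred from the two responses above, and assigning 0.6931472 to Document 1's title clause (rather than its body clause) is an assumption; the search API's explain option would show the real breakdown.

```python
# Per-clause scores inferred from the bool and dis_max responses above.
# Assumption: for Document 1 the larger score belongs to the title clause.
doc1 = {"title": 0.6931472, "body": 0.90425634 - 0.6931472}
doc2 = {"title": 0.0, "body": 0.77041256}

def bool_should_score(clause_scores):
    # bool/should: add up the scores of all matching clauses.
    return sum(clause_scores)

def dis_max_score(clause_scores):
    # dis_max without tie_breaker: keep only the best clause score.
    return max(clause_scores)

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    print(name, bool_should_score(doc.values()), dis_max_score(doc.values()))
```

Summing ranks Document 1 first, while taking the maximum ranks Document 2 first, which matches the order of the hits in the two responses.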

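3.3 The tie_breaker parameter

What happens if we search for "quick pets"? Both documents contain the word quick, but only Document 2 contains the word pets. Neither document contains both words in a single field:

quick: the title field of Document 1 contains quick; the body field of Document 2 contains quick
pets: Document 1 does not contain pets; the title field of Document 2 contains pets

A simple dis_max query like the one below picks the best-matching query clause and ignores the scores of the other clauses: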

GET my_index/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      }
    ]
  }
}

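As you can see, the two documents get exactly the same score.

We would expect a document that matches in both the title and the body field to rank higher, but that is not what happens. Remember: the dis_max query simply uses the _score of the single best-matching clause.

Use the tie_breaker parameter to take the other matching clauses into account: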

GET my_index/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.3
        }
    }
}

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.14757764, 
        "_source": {
           "title": "Keeping pets healthy",
           "body": "My quick brown fox eats rabbits on a regular basis."
        }
     },
     {
        "_id": "1",
        "_score": 0.124275915, 
        "_source": {
           "title": "Quick brown rabbits",
           "body": "Brown rabbits are commonly seen."
        }
     }
   ]
}

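Now Document 2 scores slightly higher than Document 1, which is closer to what we expect.

The tie_breaker parameter makes the dis_max query behave like a compromise between dis_max and bool. It changes the score calculation as follows:

Take the _score of the best-matching clause.
Multiply the score of each of the other matching clauses by tie_breaker.
Add these scores together and normalize the result.

With tie_breaker, all matching clauses count, but the best-matching clause counts the most.

tie_breaker is a floating-point value between 0 and 1. At 0 only the best-matching clause is used (the same as a dis_max query without tie_breaker); at 1 all matching clauses count equally. The exact value should be tuned to your data and queries, but a reasonable value is close to 0 (for example 0.1 to 0.4) so that it does not overwhelm the best-matching nature of dis_max.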
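The first two steps of this calculation can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the normalization step is left out, the title scores of 0.6931472 are taken from the earlier "Quick pets" response, and the value 0.2 for Document 2's body clause is hypothetical, chosen only to show the effect.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores."""
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)

# [title score, body score]; Document 1 has no match in body.
doc1_clauses = [0.6931472, 0.0]
# 0.2 is a hypothetical body score for Document 2, not a measured value.
doc2_clauses = [0.6931472, 0.2]

for tb in (0.0, 0.3, 1.0):
    print(tb, dis_max_score(doc1_clauses, tb), dis_max_score(doc2_clauses, tb))
```

With tie_breaker at 0 the two documents tie; any positive value lets Document 2's second matching clause lift it above Document 1, and at 1 the result equals the plain sum of the clause scores.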

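4 Filtering (filter)

4.1 Getting started

Filtering within a conditional query

Every query clause affects document scoring and ranking. If we need to narrow down the results without the condition affecting the score, we should not express it as a query condition; use filter instead. This was already shown above: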

GET item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "手机"
          }
        },
        {
          "range": {
            "price": {
              "gte": 1000,
              "lte": 20000
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "content": "5g"
          }
        }
      ],
      "filter": {
        "term": {
          "title": "华为"
        }
      }
    }
  }
}

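4.2 constant_score

If a search consists only of filters, with no query conditions, and no scoring is wanted, we can use constant_score in place of a bool query that has only a filter clause. Performance is the same, but the query becomes more concise and clearer. Every matching document gets a _score equal to boost (1.2 in the example below).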

GET item/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "title": [
            "冰箱",
            "手机"
          ]
        }
      },
      "boost": 1.2
    }
  }
}

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.2,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.2,
        "_source" : {
          "id" : 4,
          "title" : "海信冰箱",
          "content" : "食品保险冷冻首先农品",
          "price" : 3005.0,
          "category" : "冰箱"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.2,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        }
      }
    ]
  }
}
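The behavior visible in this response, where every hit carries the same _score of 1.2, can be mimicked with a toy model in Python. The function constant_score and the substring check are assumptions for illustration only; Elasticsearch matches the terms against analyzed tokens, not raw substrings.

```python
def constant_score(docs, predicate, boost=1.0):
    """Keep the docs that pass the filter and give each the same score."""
    return [(doc["id"], boost) for doc in docs if predicate(doc)]

# Titles copied from the item index used throughout this article.
items = [
    {"id": 1, "title": "小米手机"},
    {"id": 2, "title": "小米电视"},
    {"id": 3, "title": "华为电视盒子"},
    {"id": 4, "title": "海信冰箱"},
    {"id": 5, "title": "华为手机"},
]

def title_has_any(terms):
    # Simplification: substring test instead of analyzed-token lookup.
    return lambda doc: any(term in doc["title"] for term in terms)

hits = constant_score(items, title_has_any(["冰箱", "手机"]), boost=1.2)
print(hits)
```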

5 Highlighting

5.1 Getting started

Use highlight to specify which queried fields to highlight:

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
  "highlight": {
    "fields": {
      "title":{}
    }
  }
}

{
  "took" : 135,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9395274,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "小米<em>手机</em>"
          ]
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "华为<em>手机</em>"  # the <em> tag is used by default
          ]
        }
      }
    ]
  }
}

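5.2 Custom highlight tags

By default Elasticsearch wraps matched keywords in the <em> tag; pre_tags and post_tags override this. Note that the request below sets post_tags to <strong> rather than the closing </strong>, which is why the returned fragments are not closed properly. For valid HTML, set post_tags to </strong>.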

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
 "highlight": {
   "fields": {
     "title": {
        "pre_tags": ["<strong>"],
        "post_tags": ["<strong>"]
     }
   }
 }
}

{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9395274,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "小米<strong>手机<strong>"
          ]
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "华为<strong>手机<strong>"
          ]
        }
      }
    ]
  }
}
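What pre_tags and post_tags do to a fragment can be shown with a small Python sketch. The function highlight is hypothetical and works on raw substrings, whereas Elasticsearch highlights analyzed tokens; it is only meant to show how the two tags wrap each match.

```python
def highlight(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap every occurrence of term in text with the given tags."""
    return text.replace(term, f"{pre_tag}{term}{post_tag}")

default_fragment = highlight("小米手机", "手机")
strong_fragment = highlight("华为手机", "手机", "<strong>", "</strong>")
# Same tag on both sides, as in the request above: the element never closes.
unclosed_fragment = highlight("华为手机", "手机", "<strong>", "<strong>")
print(default_fragment, strong_fragment, unclosed_fragment)
```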

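5.3 Highlighting multiple fields

For example, when searching on the title field we may also want matches in the content field to be highlighted. Set require_field_match to false to do this (the default is true):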

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
 "highlight": {
   "require_field_match": "false", 
   "fields": {
     "title": {
        "pre_tags": ["<strong>"],
        "post_tags": ["<strong>"]
     },
     "content": {}
   }
 }
}

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9395274,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "小米<strong>手机<strong>"
          ],
          "content" : [
            "<em>手机</em>中的性价比之王"
          ]
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "华为<strong>手机<strong>"
          ],
          "content" : [
            "首款5g<em>手机</em>"
          ]
        }
      }
    ]
  }
}

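5.4 Highlighter performance

Elasticsearch has offered three highlighters (this list reflects older versions; in newer versions the default is the unified highlighter and the postings highlighter has been removed):

highlighter: the default
- Re-analyzes the original document stored in _source to produce highlights. It is the slowest, but needs no extra storage
postings-highlighter
- Does not need to re-analyze the original document in _source, but index_options must be set to offsets in the field mapping so that term offsets are stored
fast-vector-highlighter
- The fastest. It requires term_vector to be set to with_positions_offsets in the field mapping so that term positions and offsets are stored, and it takes the most storage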

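6 Sorting

6.1 Default sorting

Elasticsearch sorts results by the relevance between the query and each document, by default in descending order of score: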

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

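For match_all, which simply returns all documents, no scoring is needed (every _score is 1), so the hits come back in index order, which here matches the order in which the documents were added: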

GET item/_search
{
  "query": {
    "match_all": {}
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "id" : 2,
          "title" : "小米电视",
          "content" : "电视中的性价比之王",
          "price" : 1005.0,
          "category" : "电视"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "id" : 3,
          "title" : "华为电视盒子",
          "content" : "电视盒直播网络机顶盒4K高清华为海思芯片机顶盒WIFI宽带电视盒子家用电视合猫播放器",
          "price" : 1005.0,
          "category" : "电视"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "id" : 4,
          "title" : "海信冰箱",
          "content" : "食品保险冷冻首先农品",
          "price" : 3005.0,
          "category" : "冰箱"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        }
      }
    ]
  }
}

6.2 Sorting by multiple fields

For example, sort by price ascending first, then by id descending:

GET item/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": {
        "order": "asc"
      }
    },
    {
      "id": {
        "order": "desc"
      }
    }
  ]
}
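The order this request should produce can be worked out locally. The sketch below applies the same two sort keys in plain Python to the id and price values inserted in section 1; it does not call Elasticsearch.

```python
# id and price of the five documents inserted in section 1.
items = [
    {"id": 1, "price": 1000.0},
    {"id": 2, "price": 1005.0},
    {"id": 3, "price": 1005.0},
    {"id": 4, "price": 3005.0},
    {"id": 5, "price": 4005.0},
]

# price ascending first; negating id gives descending id among equal prices.
ordered = sorted(items, key=lambda doc: (doc["price"], -doc["id"]))
ordered_ids = [doc["id"] for doc in ordered]
print(ordered_ids)
```

Documents 2 and 3 share the price 1005, so the second sort key decides between them and Document 3 comes before Document 2.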
