13 Search with the DSL: Advanced Queries


1 Environment setup

  1. Create an index named item
PUT item
{
    "mappings":{
        "properties":{
            "id":{
                "type":"long"
            },
            "title":{
                "type":"text",
                "analyzer":"ik_max_word"
            },
            "content":{
                "type":"text",
                "analyzer":"ik_max_word"
            },
            "price":{
                "type":"float"
            },
            "category":{
                "type":"keyword"
            }
        }
    }
}
  2. Insert the data
PUT item/_doc/1
{
  "id":1,
  "title":"小米手机",
  "content":"手机中的性价比之王",
  "price":1000.00,
  "category":"手机"
}
PUT item/_doc/2
{
  "id":2,
  "title":"小米电视",
  "content":"电视中的性价比之王",
  "price":1005.00,
  "category":"电视"
}
PUT item/_doc/3
{
  "id":3,
  "title":"华为电视盒子",
  "content":"电视盒直播网络机顶盒4K高清华为海思芯片机顶盒WIFI宽带电视盒子家用电视合猫播放器",
  "price":1005.00,
  "category":"电视"
}
PUT item/_doc/4
{
  "id":4,
  "title":"海信冰箱",
  "content":"食品保险冷冻首先农品",
  "price":3005.00,
  "category":"冰箱"
}
PUT item/_doc/5
{
  "id":5,
  "title":"华为手机",
  "content":"首款5g手机",
  "price":4005.00,
  "category":"手机"
}

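2 Boolean query (bool)

A bool query combines other queries using must (AND), must_not (NOT), and should (OR).

  • must: the clause must appear in matching documents, and it contributes to the score
  • filter: the clause must appear in matching documents, but its score is ignored (filter does not affect the score)
  • should: the clause should appear in matching documents. If the bool query has no must or filter clause, a document must match at least one should clause. The minimum number of should clauses that must match can be set with the minimum_should_match parameter
  • must_not: the clause must not appear in matching documents

A bool query takes a "the more matches, the better" approach: the scores of all matching clauses are added together to give each document its final score (_score).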
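
As a minimal sketch of how the four clause types fit together, the Python helper below assembles a bool query body as a plain dictionary. The helper name build_bool_query and its arguments are hypothetical, introduced only for illustration; the printed JSON is the kind of body a _search request accepts.

```python
import json

def build_bool_query(must=None, should=None, filter_=None, must_not=None,
                     minimum_should_match=None):
    """Assemble a bool query body; clause types with no clauses are left out."""
    bool_body = {}
    for name, clauses in (("must", must), ("should", should),
                          ("filter", filter_), ("must_not", must_not)):
        if clauses:
            bool_body[name] = list(clauses)
    if minimum_should_match is not None:
        bool_body["minimum_should_match"] = minimum_should_match
    return {"query": {"bool": bool_body}}

body = build_bool_query(
    must=[{"match": {"category": "手机"}}],
    should=[{"match": {"content": "5g"}}],
    filter_=[{"term": {"title": "华为"}}],
    minimum_should_match=1,  # illustrative value: at least one should clause must match
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Setting minimum_should_match to 1 here turns the otherwise optional should clause into a requirement, which is the behaviour described in the should bullet above.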

Demo

For example, search for phones priced between 1000 and 20000; 5G support is optional; the brand must be Huawei (华为).

GET item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "手机"
          }
        },
        {
          "range": {
            "price": {
              "gte": 1000,
              "lte": 20000
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "content": "5g"
          }
        }
      ],
      "filter": {
        "term": {
          "title": "华为"
        }
      }
    }
  }
}

Without the filter:

GET item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "手机"
          }
        },
        {
          "range": {
            "price": {
              "gte": 1000,
              "lte": 20000
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "content": "5g"
          }
        }
      ]
    }
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 7.009481,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 7.009481,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.8754687,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        }
      }
    ]
  }
}

Both queries return the Huawei phone with the same _score, 7.009481, so filter does not affect the score.

3 Best-matching field

3.1 Motivation

PUT /my_index/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}
PUT /my_index/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

Now run the following bool query:

GET my_index/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}
{
  "took" : 544,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.90425634,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.90425634,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.77041256,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      }
    ]
  }
}

Judging by the scores, document 1 ranks above document 2. Yet for a search on Brown fox, document 2 looks like the better match: its body field contains the complete phrase brown fox.

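How the bool query arrives at its score:

  1. Run both queries in the should clause.
  2. Add together the scores of the clauses that match.

Releases before Elasticsearch 6.0 also multiplied this sum by the number of matching clauses and divided it by the total number of clauses (the coord factor). The responses shown here do not include that step: document 2 scores 0.77041256 both here and in the dis_max query of section 3.2, so its single matching clause is not halved.

Document 1 contains brown in both fields, so both match queries succeed and their scores are added. Document 2 contains both brown and fox in body but none of the search words in title, so its score is the body score alone. That single high score is still lower than the sum of document 1's two lower scores.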

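In this example, the title and body fields compete with each other. What we want is the single best-matching field.

What if, instead of combining the scores from every field, we used the score of the best-matching clause as the overall score of the query? That would favor a field containing both of the words we are looking for over the same word repeated across different fields.

3.2 The dis_max query

Instead of the bool query, we can use the dis_max query (Disjunction Max Query). Disjunction means "OR" (whereas conjunction means "AND"), so a Disjunction Max Query returns documents that match any of the queries, and the score is that of the best-matching query: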

GET my_index/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.77041256,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.77041256,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      }
    ]
  }
}
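
The two responses can be cross-checked with a little arithmetic. The sketch below assumes per-clause scores inferred from the responses: document 2 matches only in body (0.77041256), and document 1's best clause scores 0.6931472. Document 1's weaker clause score does not appear in any response; the value used here is an assumption, derived as 0.90425634 minus 0.6931472. The function names are hypothetical.

```python
# Per-clause scores for the "Brown fox" query, inferred from the responses above.
# Document 1: best clause 0.6931472 (its dis_max score); the weaker clause is
# an assumption, derived as 0.90425634 - 0.6931472.
doc1_clauses = [0.6931472, 0.21110914]
# Document 2: only the body clause matches (0.77041256 in both responses).
doc2_clauses = [0.77041256]

def bool_should_score(clause_scores):
    """bool/should: add up the scores of the matching clauses."""
    return sum(clause_scores)

def dis_max_score(clause_scores):
    """dis_max without tie_breaker: keep only the best clause score."""
    return max(clause_scores)

for name, clauses in (("doc 1", doc1_clauses), ("doc 2", doc2_clauses)):
    print(name,
          "bool:", round(bool_should_score(clauses), 6),
          "dis_max:", round(dis_max_score(clauses), 6))
```

Summing puts document 1 first, while taking the maximum puts document 2 first, which matches the ordering in the two responses.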

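3.3 The tie_breaker parameter

What happens if we search for "quick pets"? Both documents contain the word quick, but only document 2 contains pets. Neither document contains both words in a single field:

  • quick: the title field of document 1 contains quick, and so does the body field of document 2
  • pets: document 1 does not contain pets; the title field of document 2 does

A simple dis_max query like the one below picks the query clause with the best-matching field and ignores the scores of the other clauses: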

GET my_index/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      }
    ]
  }
}

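As you can see, both documents have exactly the same score.

We would expect the document that matches in both the title and body fields to rank higher, but that is not what happens. Remember: the dis_max query simply uses the _score of the best-matching query clause.

Use the tie_breaker parameter to take the other matching clauses into account: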

GET my_index/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.3
        }
    }
}
{
  "hits": [
     {
        "_id": "2",
        "_score": 0.14757764, 
        "_source": {
           "title": "Keeping pets healthy",
           "body": "My quick brown fox eats rabbits on a regular basis."
        }
     },
     {
        "_id": "1",
        "_score": 0.124275915, 
        "_source": {
           "title": "Quick brown rabbits",
           "body": "Brown rabbits are commonly seen."
        }
     }
   ]
}

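Document 2 now scores slightly higher than document 1, which is closer to what we expect.

The tie_breaker parameter makes the dis_max query behave like a compromise between dis_max and bool. It changes the score calculation as follows:

  1. Take the _score of the best-matching query clause.
  2. Multiply the score of every other matching clause by tie_breaker.
  3. Add the results to the best score.

With tie_breaker, every matching clause contributes, but the best-matching clause contributes the most.

tie_breaker is a floating-point value between 0 and 1. At 0 only the best-matching clause is used (translator's note: the same as a dis_max query without tie_breaker); at 1 all matching clauses count equally. The exact value should be tuned to your data and queries, but a reasonable value is close to 0 (for example, 0.1 to 0.4), so that it does not overwhelm the best-match nature of dis_max.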
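
The calculation can be written out as a short Python sketch. It assumes the score is the best clause score plus tie_breaker times the sum of the other matching clause scores, with no further normalization; the clause scores 0.69 and 0.50 are hypothetical values chosen for illustration, not taken from any response.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times every other matching clause."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others

# Hypothetical clause scores for one document that matches in two fields.
scores = [0.69, 0.50]
for tb in (0.0, 0.3, 1.0):
    print("tie_breaker", tb, "->", round(dis_max_score(scores, tb), 3))
```

With tie_breaker at 0 the result is the plain maximum, and at 1 it is the same sum a bool query with should clauses would produce; values in between interpolate.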

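4 Filtering (filter)

4.1 Getting started

Filtering within a query
Every query clause affects document scores and ranking. If you need to narrow the results without the condition affecting the score, do not write it as a query condition; put it in a filter instead.
This is the query already shown in section 2: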

GET item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "手机"
          }
        },
        {
          "range": {
            "price": {
              "gte": 1000,
              "lte": 20000
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "content": "5g"
          }
        }
      ],
      "filter": {
        "term": {
          "title": "华为"
        }
      }
    }
  }
}

4.2 constant_score

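If a search consists of filters only, with no scoring query, and no scoring is needed, you can use constant_score in place of a bool query that contains only a filter clause. Performance is identical, but the query is shorter and clearer.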

GET item/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "title": [
            "冰箱",
            "手机"
          ]
        }
      },
      "boost": 1.2
    }
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.2,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.2,
        "_source" : {
          "id" : 4,
          "title" : "海信冰箱",
          "content" : "食品保险冷冻首先农品",
          "price" : 3005.0,
          "category" : "冰箱"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.2,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        }
      }
    ]
  }
}
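
To make the relationship between the two forms concrete, here is a small Python sketch that rewrites a filter-only bool query into a constant_score query matching the same documents. The function name to_constant_score is hypothetical, and the sketch only handles bool queries whose sole key is filter.

```python
def to_constant_score(bool_query, boost=1.0):
    """Rewrite a filter-only bool query as a constant_score query."""
    bool_body = bool_query["bool"]
    if set(bool_body) != {"filter"}:
        raise ValueError("only filter-only bool queries can be converted")
    filters = bool_body["filter"]
    # constant_score takes a single filter query, so several filters
    # stay wrapped in a bool/filter clause.
    if isinstance(filters, list):
        inner = filters[0] if len(filters) == 1 else {"bool": {"filter": filters}}
    else:
        inner = filters
    return {"constant_score": {"filter": inner, "boost": boost}}

filter_only = {"bool": {"filter": {"terms": {"title": ["冰箱", "手机"]}}}}
print(to_constant_score(filter_only, boost=1.2))
```

Called with boost=1.2, this yields the same query body as the constant_score request above, where every hit receives the boost value as its _score.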

5 Highlighting

5.1 Getting started

Use highlight to mark up matches in the queried field:

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
  "highlight": {
    "fields": {
      "title":{}
    }
  }
}
{
  "took" : 135,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9395274,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "小米<em>手机</em>"
          ]
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "华为<em>手机</em>"  # <em> is the default tag
          ]
        }
      }
    ]
  }
}

5.2 Custom highlight tags

By default, ES wraps matched keywords in <em> tags. Use pre_tags and post_tags to set your own:

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
 "highlight": {
   "fields": {
     "title": {
        "pre_tags": ["<strong>"],
        "post_tags": ["</strong>"]
     }
   }
 }
}
{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9395274,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "小米<strong>手机</strong>"
          ]
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "华为<strong>手机</strong>"
          ]
        }
      }
    ]
  }
}
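
The tags are emitted exactly as given, so post_tags should hold the closing tag (</strong>), not a second opening tag. The toy Python function below illustrates this; it is a plain string replacement written for illustration, not how ES implements highlighting, and the name wrap_matches is hypothetical.

```python
def wrap_matches(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap every occurrence of term in text with the given tags, verbatim."""
    return text.replace(term, f"{pre_tag}{term}{post_tag}")

# Default tags, then custom tags with a proper closing tag.
print(wrap_matches("小米手机", "手机"))
print(wrap_matches("小米手机", "手机", "<strong>", "</strong>"))
```

Passing "<strong>" for both tags would leave the element unclosed in the rendered HTML.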

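5.3 Highlighting multiple fields

For example, when searching on the title field you may also want matches in the content field highlighted. Set require_field_match to false for this; the default is true.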

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
 "highlight": {
   "require_field_match": false,
   "fields": {
     "title": {
        "pre_tags": ["<strong>"],
        "post_tags": ["</strong>"]
     },
     "content": {}
   }
 }
}
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.9395274,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "小米<strong>手机</strong>"
          ],
          "content" : [
            "<em>手机</em>中的性价比之王"
          ]
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.9395274,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        },
        "highlight" : {
          "title" : [
            "华为<strong>手机</strong>"
          ],
          "content" : [
            "首款5g<em>手机</em>"
          ]
        }
      }
    ]
  }
}

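5.4 Highlighter performance

ES provides three highlighters. The names below come from older releases; since Elasticsearch 6.0 the default is the unified highlighter, the postings highlighter has been removed, and the original highlighter is selected with type plain.

  • highlighter (the plain highlighter): the default in those older releases
    • Has to re-analyze the original document stored in _source, so it is the slowest; the advantage is that it needs no extra storage
  • postings-highlighter
    • Does not re-analyze the original document stored in _source, but requires index_options set to offsets in the field mapping so that term offsets are stored
  • fast-vector-highlighter
    • The fastest, but requires term_vector set to with_positions_offsets in the field mapping so that term positions and offsets are stored; it uses the most storage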
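
As an illustration of the two mapping settings mentioned above, the Python sketch below builds an index mapping as a dictionary and prints it as JSON. Which field gets which option is a hypothetical choice: title stores postings offsets (index_options), and content stores term vectors with positions and offsets (term_vector).

```python
import json

# Hypothetical mapping: `title` stores postings offsets,
# `content` stores term vectors with positions and offsets.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "index_options": "offsets",
            },
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "term_vector": "with_positions_offsets",
            },
        }
    }
}
print(json.dumps(mapping, indent=2))
```

Both settings apply at index time, so they have to be in place before the documents are indexed.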

6 Sorting

6.1 Default sorting

ES sorts results by their relevance to the query; by default, hits are ordered by score, descending:

GET item/_search
{
  "query": {
    "term": {
      "title": "手机"
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

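With match_all, every document is returned and no scoring is needed (every hit gets a score of 1), so the results come back in the order the documents were added: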

GET item/_search
{
  "query": {
    "match_all": {}
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "id" : 1,
          "title" : "小米手机",
          "content" : "手机中的性价比之王",
          "price" : 1000.0,
          "category" : "手机"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "id" : 2,
          "title" : "小米电视",
          "content" : "电视中的性价比之王",
          "price" : 1005.0,
          "category" : "电视"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "id" : 3,
          "title" : "华为电视盒子",
          "content" : "电视盒直播网络机顶盒4K高清华为海思芯片机顶盒WIFI宽带电视盒子家用电视合猫播放器",
          "price" : 1005.0,
          "category" : "电视"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "id" : 4,
          "title" : "海信冰箱",
          "content" : "食品保险冷冻首先农品",
          "price" : 3005.0,
          "category" : "冰箱"
        }
      },
      {
        "_index" : "item",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "id" : 5,
          "title" : "华为手机",
          "content" : "首款5g手机",
          "price" : 4005.0,
          "category" : "手机"
        }
      }
    ]
  }
}

6.2 Sorting on multiple fields

For example, sort by price ascending first, then by id descending:

GET item/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": {
        "order": "asc"
      }
    },
    {
      "id": {
        "order": "desc"
      }
    }
  ]
}
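
To see what ordering this sort produces, here is the same two-level sort applied in plain Python to the id and price values of the five sample documents. It only mimics the ordering logic and does not call ES.

```python
# id and price of the five documents indexed in section 1.
items = [
    {"id": 1, "price": 1000.0},
    {"id": 2, "price": 1005.0},
    {"id": 3, "price": 1005.0},
    {"id": 4, "price": 3005.0},
    {"id": 5, "price": 4005.0},
]

# Primary key: price ascending. Secondary key: id descending (negated).
ordered = sorted(items, key=lambda d: (d["price"], -d["id"]))
print([d["id"] for d in ordered])  # [1, 3, 2, 4, 5]
```

Documents 2 and 3 share the price 1005, so the second sort key decides between them and the higher id comes first.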