Basic Syntax of the Elasticsearch Query DSL

Latest recommended article published on 2024-08-26 22:25:12

liupenglove

Tags: elasticsearch, search engine, lucene, autonomous driving data warehouse

Article link: https://blog.csdn.net/liupenglove/article/details/137205228

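Query clauses

To use the query DSL, pass the query to the query parameter of a search request:

GET /_search
{ "query": YOUR_QUERY_HERE }

An empty query is functionally equivalent to a match_all query, which matches every document:

GET /_search
{ "query": { "match_all": {} } }

You can use a match query to find documents whose tweet field contains elasticsearch: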

GET /_search 
{
  "query": {
    "match": {
      "tweet": "elasticsearch"
    }
  }
}

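bool query

The bool query is a powerful compound query that combines multiple query clauses to express complex search requirements. It has four main parts: must, should, must_not, and filter. For example: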

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "field1": "value1"
          }
        },
        {
          "match": {
            "field2": "value2"
          }
        }
      ],
      "should": [
        {
          "match": {
            "field3": "value3"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "field4": "value4"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "field5": "value5"
          }
        }
      ]
    }
  }
}

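must

Documents must match these clauses to be included.

must_not

Documents must not match these clauses to be included.

should

If a document matches any of these clauses, its _score increases; otherwise they have no effect. They are mainly used to refine the relevance score of each document. Note that when the bool query has no must or filter clause, at least one should clause has to match.

filter

Clauses that must match, but that run in non-scoring, filtering mode. They do not contribute to the score; they only include or exclude documents based on the filter criteria.

Because this is the first query we have seen that contains other queries, it is worth discussing how relevance scores are combined. Each sub-query computes the document's relevance score independently. Once the scores have been computed, the bool query merges them and returns a single score that represents the whole boolean operation.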
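As a rough illustration of the four clause types and of how their scores are merged, here is a minimal Python sketch. It is a toy model, not Elasticsearch's actual implementation: the names Clause and bool_score are hypothetical, each clause is reduced to a predicate plus a fixed score, and the combined score is assumed to be the plain sum of the must scores and the matching should scores.

```python
from typing import Callable, NamedTuple, Optional, Sequence

Doc = dict


class Clause(NamedTuple):
    """Hypothetical stand-in for a query clause: a predicate plus a fixed score."""
    matches: Callable[[Doc], bool]
    score: float = 1.0


def bool_score(doc: Doc,
               must: Sequence[Clause] = (),
               should: Sequence[Clause] = (),
               must_not: Sequence[Clause] = (),
               filter_: Sequence[Clause] = ()) -> Optional[float]:
    """Return the combined score, or None if the document is excluded."""
    # Every must and filter clause is required.
    if not all(c.matches(doc) for c in (*must, *filter_)):
        return None
    # Any matching must_not clause excludes the document.
    if any(c.matches(doc) for c in must_not):
        return None
    matched_should = [c for c in should if c.matches(doc)]
    # With no must or filter clause, at least one should clause has to match.
    if should and not must and not filter_ and not matched_should:
        return None
    # Only must and should contribute to the score; filter never does.
    return sum(c.score for c in must) + sum(c.score for c in matched_should)


# Illustrative clauses with made-up scores.
title = Clause(lambda d: "millions" in d["title"], score=2.0)
spam = Clause(lambda d: d["tag"] == "spam")
starred = Clause(lambda d: d["tag"] == "starred", score=0.5)

doc = {"title": "how to make millions", "tag": "starred"}
print(bool_score(doc, must=[title], should=[starred], must_not=[spam]))  # 2.5
```

A document that matches the must clause but not the should clause still gets a score (2.0 here), while one tagged spam is dropped entirely.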

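The following query finds documents whose title field matches how to make millions and that are not tagged as spam. Documents that are tagged starred or dated 2014-01-01 or later rank higher than those that are not, and documents that satisfy both conditions rank higher still: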

{
  "bool": {
    "must": {
      "match": {
        "title": "how to make millions"
      }
    },
    "must_not": {
      "match": {
        "tag": "spam"
      }
    },
    "should": [
      {
        "match": {
          "tag": "starred"
        }
      },
      {
        "range": {
          "date": {
            "gte": "2014-01-01"
          }
        }
      }
    ]
  }
}

Adding a filtering query

If we do not want the date of a document to affect its score, we can rewrite the previous example with a filter clause:

{
  "bool": {
    "must": {
      "match": {
        "title": "how to make millions"
      }
    },
    "must_not": {
      "match": {
        "tag": "spam"
      }
    },
    "should": [
      {
        "match": {
          "tag": "starred"
        }
      }
    ],
    "filter": {
      "range": {
        "date": {
          "gte": "2014-01-01"
        }
      }
    }
  }
}

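By moving the range query into the filter clause, we turn it into a non-scoring query that no longer affects the relevance ranking. Note that it also becomes a strict requirement: documents dated before 2014-01-01 are now excluded rather than merely ranked lower. Because it is now a non-scoring query, it can benefit from the various optimizations available to filters, which improves performance.

Any query can be used this way: move it into the filter clause of a bool query and it automatically becomes a non-scoring filter.

If you need to filter on several different criteria, the bool query itself can be used as a non-scoring query. Simply place it inside the filter clause and build the boolean logic within it: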

{
  "bool": {
    "must": {
      "match": {
        "title": "how to make millions"
      }
    },
    "must_not": {
      "match": {
        "tag": "spam"
      }
    },
    "should": [
      {
        "match": {
          "tag": "starred"
        }
      }
    ],
    "filter": {
      "bool": {
        "must": [
          {
            "range": {
              "date": {
                "gte": "2014-01-01"
              }
            }
          },
          {
            "range": {
              "price": {
                "lte": 29.99
              }
            }
          }
        ],
        "must_not": [
          {
            "term": {
              "category": "ebooks"
            }
          }
        ]
      }
    }
  }
}

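Combining queries

Query clauses are simple building blocks that can be combined with one another to form more complex queries.

Leaf clauses (such as the match clause) compare a query string against a field (or several fields).
Compound clauses are mainly used to combine other query clauses. For example, a bool clause lets you combine other clauses as needed, whether as must, must_not, or should, and it can also contain non-scoring filters: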

{
  "bool": {
    "must": {
      "match": {
        "tweet": "elasticsearch"
      }
    },
    "must_not": {
      "match": {
        "name": "mary"
      }
    },
    "should": {
      "match": {
        "tweet": "full text"
      }
    },
    "filter": {
      "range": {
        "age": {
          "gt": 30
        }
      }
    }
  }
}

match_all query

Matches all documents. It is the default query when no query is specified:

{ "match_all": {}}

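match query

The match query is the standard query to use on almost any field, whether you are running a full-text search or an exact-value search.

If you run a match query on a full-text field, it analyzes the query string with the appropriate analyzer for that field before executing the search: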

{ "match": { "tweet": "About Search" }}

multi_match query

The multi_match query runs the same match query on multiple fields:

{
  "multi_match": {
    "query": "full text search",
    "fields": [
      "title",
      "body"
    ]
  }
}

range query

The range query finds numbers or dates that fall within a specified range:

{ "range": { "age": { "gte": 20, "lt": 30 } } }

The allowed operators are:

gt: greater than

gte: greater than or equal to

lt: less than

lte: less than or equal to
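To make the operator semantics concrete, here is a small Python sketch that evaluates a range-style bounds object against a single value. It only mimics the comparison logic locally; RANGE_OPS and in_range are hypothetical names and are not part of any Elasticsearch client.

```python
import operator

# Map each range operator to the comparison it stands for.
RANGE_OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, bounds):
    """Return True if value satisfies every bound, e.g. {"gte": 20, "lt": 30}."""
    return all(RANGE_OPS[op](value, limit) for op, limit in bounds.items())


bounds = {"gte": 20, "lt": 30}
print([age for age in (19, 20, 29, 30) if in_range(age, bounds)])  # [20, 29]
```

The lower bound is inclusive (gte) and the upper bound exclusive (lt), so 20 is kept and 30 is rejected.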

term query

The term query matches exact values, which may be numbers, dates, booleans, or not_analyzed strings (keyword fields in Elasticsearch 5.0 and later):

{ "term": { "age": 26 } }
{ "term": { "date": "2014-09-01" } }
{ "term": { "public": true } }
{ "term": { "tag": "full_text" } }

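The term query does not analyze the input text, so it searches for exactly the value given.

terms query

The terms query is the same as the term query, but lets you specify multiple values to match. A document matches if the field contains any of the specified values: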

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

Like the term query, the terms query does not analyze the input text. It looks for exact matches, so differences in case, accents, and whitespace all count.
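The exact-match behaviour can be sketched in a few lines of Python. This is only a local model of the matching rule for a field that is stored without analysis; terms_match is a hypothetical helper, not an Elasticsearch API.

```python
def terms_match(doc, field, values):
    """Return True if doc[field] holds at least one of the given values exactly."""
    stored = doc.get(field, [])
    if not isinstance(stored, list):
        stored = [stored]
    # No lowercasing or tokenizing: values are compared with plain equality.
    return any(v in stored for v in values)


wanted = ["search", "full_text", "nosql"]
print(terms_match({"tag": ["nosql", "java"]}, "tag", wanted))  # True
print(terms_match({"tag": "Full_Text"}, "tag", wanted))        # False
```

The second document does not match because Full_Text differs from full_text in case.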

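exists query and missing query

The exists and missing queries find documents in which the specified field has a value (exists) or has no value (missing). They are similar in nature to IS NULL (missing) and IS NOT NULL (exists) in SQL. Note that the missing query was removed in Elasticsearch 5.0; later versions express it as an exists query inside the must_not clause of a bool query. An exists query looks like this: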

{ "exists": { "field": "title" } }
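Since the missing query is not available in Elasticsearch 5.0 and later, the same condition is usually written as an exists query negated by a bool must_not clause. The Python sketch below only builds the request body as a dictionary; missing_query is a hypothetical helper name, and sending the body to a cluster is left out.

```python
import json


def missing_query(field):
    """Build a bool query that matches documents where `field` has no value."""
    return {"bool": {"must_not": {"exists": {"field": field}}}}


# Print the full search body for documents without a title.
print(json.dumps({"query": missing_query("title")}, indent=2))
```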

constant_score query

The constant_score query can replace a bool query that has only a filter clause. Performance is identical, but the query becomes simpler and clearer:

{ "constant_score": { "filter": { "term": { "category": "ebooks" } } } }

Validating queries

The validate-query API can be used to check whether a query is valid.

GET /gb/tweet/_validate/query 
{
  "query": {
    "tweet": {
      "match": "really powerful"
    }
  }
}

{ "valid" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 } }

The response to the validate request above tells us that the query is invalid. To find out why, add the explain parameter to the query string:

GET /gb/tweet/_validate/query?explain 
{
  "query": {
    "tweet": {
      "match": "really powerful"
    }
  }
}

{
  "valid": false,
  "_shards": {
    ...
  },
  "explanations": [
    {
      "index": "gb",
      "valid": false,
      "error": "org.elasticsearch.index.query.QueryParsingException: [gb] No query registered for [tweet]"
    }
  ]
}

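Clearly, we have mixed up the query type (match) with the field name (tweet).

Understanding queries

For a valid query, the explain parameter returns a human-readable description, which is very useful for understanding exactly how Elasticsearch interprets your query:

GET /_validate/query?explain
{ "query": { "match" : { "tweet" : "really powerful" } } }

An explanation is returned for each index we query, because each index can have its own mappings and analyzers: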

{
  "valid": true,
  "_shards": {
    ...
  },
  "explanations": [
    {
      "index": "us",
      "valid": true,
      "explanation": "tweet:really tweet:powerful"
    },
    {
      "index": "gb",
      "valid": true,
      "explanation": "tweet:realli tweet:power"
    }
  ]
}

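The explanation shows that the match query for really powerful was rewritten into two single-term queries against the tweet field, one for each term that the query string was split into.

For the us index, the two terms are really and powerful, whereas for the gb index they are realli and power. The reason is that we changed the analyzer of the tweet field in the gb index to the english analyzer.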

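Sorting

In Elasticsearch, the relevance score is represented by a floating-point number that is returned in the search results as _score, and by default results are sorted by _score in descending order. Sometimes, however, there is no meaningful score. For example, the following query simply returns every document whose user_id field has the value 1:

GET /_search
{ "query" : { "bool" : { "filter" : { "term" : { "user_id" : 1 } } } } }

There is no meaningful score here: because we are using a filter, we are saying that we only want the documents that match user_id: 1, without trying to determine how relevant they are. In practice the documents are returned in effectively random order, and each one is given a score of zero.

If a score of zero causes problems for you, you can use a constant_score query instead: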

GET /_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "user_id": 1
        }
      }
    }
  }
}

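This applies a constant score (1 by default) to all documents. It performs the same query as the previous one, and the documents are still returned in random order; they simply have a score of one instead of zero.

Sorting by field values

In this case, it makes sense to sort tweets by time, with the most recent tweets first. We can do this with the sort parameter: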

GET /_search 
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "user_id": 1
        }
      }
    }
  },
  "sort": {
    "date": {
      "order": "desc"
    }
  }
}

You will notice two differences in the results:

"hits": {
  "total": 6,
  "max_score": null,
  "hits": [
    {
      "_index": "us",
      "_type": "tweet",
      "_id": "14",
      "_score": null,
      "_source": {
        "date": "2014-09-24"
      },
      "sort": [
        1411516800000
      ]
    }
  ]
}

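First, _score is not calculated, because it is not being used for sorting.

Second, each result contains a new element named sort, which holds the values that were used for sorting. In this case we sorted by date, which is indexed internally as milliseconds since the epoch. The long value 1411516800000 is equivalent to the date string 2014-09-24 00:00:00 UTC.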
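The equivalence between the sort value and the date string can be checked with a few lines of Python using only the standard library. The variable names are just for illustration.

```python
from datetime import datetime, timezone

# Milliseconds since the epoch, as shown in the sort element of the hit.
sort_value = 1411516800000

# fromtimestamp expects seconds, so divide by 1000 and request UTC explicitly.
date = datetime.fromtimestamp(sort_value / 1000, tz=timezone.utc)
print(date.isoformat())  # 2014-09-24T00:00:00+00:00
```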

Multilevel sorting

Suppose we want to combine date and _score when sorting, so that matching results are sorted first by date and then by relevance:

GET /_search 
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "tweet": "manage text search"
        }
      },
      "filter": {
        "term": {
          "user_id": 2
        }
      }
    }
  },
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    },
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

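The order of the sort criteria matters. Results are sorted by the first criterion first; only when results have exactly the same first sort value are they sorted by the second criterion, and so on.

Multilevel sorting does not have to involve _score. You can sort on several different fields, on geo-distance, or on a custom value calculated in a script.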
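The tie-breaking rule is the same one Python applies when sorting on a tuple key, which makes for a compact illustration. The hits below are made-up sample data, not output from a real cluster.

```python
# Hypothetical hits: two share the same date, so _score breaks the tie.
hits = [
    {"_id": "a", "date": "2014-09-22", "_score": 0.9},
    {"_id": "b", "date": "2014-09-24", "_score": 0.3},
    {"_id": "c", "date": "2014-09-24", "_score": 0.7},
]

# ISO dates sort correctly as strings; reverse=True gives desc on both keys.
ranked = sorted(hits, key=lambda h: (h["date"], h["_score"]), reverse=True)
print([h["_id"] for h in ranked])  # ['c', 'b', 'a']
```

Hit a has the highest score but still comes last, because the date is compared first and the score only matters between b and c.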

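Sorting on multivalue fields

When sorting on a field that has more than one value, remember that the values have no inherent order; a multivalue field is just a bag of values. Which one should be used for sorting?

For numbers and dates, you can reduce a multivalue field to a single value by using the min, max, avg, or sum sort mode. For example, you could sort on the earliest date in each dates field as follows: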

"sort": { "dates": { "order": "asc", "mode": "min" } }

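What a sort mode does can be sketched in Python as a reducer applied to each document's values before sorting. This is a local illustration only: SORT_MODES and sort_key are hypothetical names, and a numeric prices field is assumed in place of dates so that avg and sum are also meaningful.

```python
from statistics import mean

# Reducers for the four sort modes named above.
SORT_MODES = {"min": min, "max": max, "avg": mean, "sum": sum}


def sort_key(values, mode):
    """Collapse a multivalue field to the single value used for sorting."""
    return SORT_MODES[mode](values)


# Hypothetical documents with a multivalue numeric field.
docs = [
    {"_id": "a", "prices": [30, 10, 20]},
    {"_id": "b", "prices": [15, 5]},
]

# Ascending order by the smallest price of each document.
ordered = sorted(docs, key=lambda d: sort_key(d["prices"], "min"))
print([d["_id"] for d in ordered])  # ['b', 'a']
```

Switching the mode to max would reverse the order here, since the largest values are 30 and 15.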