ES Study Notes 4: Query DSL

shangmin1990

Categories: JAVA, ES

Article link: https://blog.csdn.net/shangmin1990/article/details/43894137

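This article introduces Elasticsearch's query DSL and filter DSL, highlighting the main differences between them and where each applies. It focuses on the bool query, the match query, and the multi_match query, and on handling null values, caching, filter order, full-text search, and analyzer configuration. It also points out practical considerations, such as avoiding analyzer configuration at the node level and managing it through index settings instead.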

queries and filters

Although we refer to the query DSL, in reality there are two DSLs: the query DSL and the filter DSL. Query clauses and filter clauses are similar in nature, but have slightly different purposes.

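filter: the answer is a plain yes or no. Filters are fast, can be cached, and are generally used for exact-value lookups.

query: asks how relevant each result is to the search content. Queries cannot be cached and are generally used for full-text search.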

most important queries and filters

term filter

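{
    "query": {
        "term": { "field": "value" }
    }
}

terms filter

{
    "query": {
        "terms": { "field": ["a", "b"] }
    }
}

In both examples, field is a placeholder for the name of the field being matched.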

range filter

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

exists and missing filter

The exists and missing filters are used to find documents in which the specified field either has one or more values (exists) or doesn't have any values (missing). They are similar in nature to IS NULL (missing) and IS NOT NULL (exists) in SQL.
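As a rough illustration of these semantics, the following Python sketch models the two filters over in-memory documents. It is a simplified model, not how Elasticsearch implements them: a field is assumed to have no value when it is absent, null, or a list containing only nulls. The function names and sample documents are made up for illustration.

```python
def has_value(doc, field):
    """Return True if the field holds at least one non-null value."""
    value = doc.get(field)
    if value is None:
        return False
    if isinstance(value, list):
        return any(item is not None for item in value)
    return True


def exists_filter(docs, field):
    """Keep documents where the field has one or more values."""
    return [doc for doc in docs if has_value(doc, field)]


def missing_filter(docs, field):
    """Keep documents where the field has no values."""
    return [doc for doc in docs if not has_value(doc, field)]


posts = [
    {"id": 1, "tags": ["search"]},
    {"id": 2, "tags": ["search", "open_source"]},
    {"id": 3, "other_field": "some data"},
    {"id": 4, "tags": None},
    {"id": 5, "tags": ["search", None]},
]

print([doc["id"] for doc in exists_filter(posts, "tags")])   # [1, 2, 5]
print([doc["id"] for doc in missing_filter(posts, "tags")])  # [3, 4]
```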

bool filter

Used for compound queries. It accepts three kinds of clauses: must, should, and must_not.

{
    "query": {
        "bool": {
            "must": {
                "match": { "text": "fadsfdasfds" }
            }
        }
    }
}

QUERIES:

MATCH

The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.

If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:

{ "match": { "tweet": "About Search" }}

If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value:

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}

For exact-value searches, you probably want to use a filter instead of a query, as a filter will be cached.

MULTI_MATCH

bool query

combining queries with filters

GET /_search
{
    "query": {
        "filtered": {
            "query":  { "match": { "email": "business opportunity" }},
            "filter": { "term": { "folder": "inbox" }}
        }
    }
}

just a filter

While in query context, if you need to use a filter without a query (for instance, to match all emails in the inbox), you can just omit the query:

GET /_search
{
    "query": {
        "filtered": {
            "filter":   { "term": { "folder": "inbox" }}
        }
    }
}
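The two request bodies above share one shape, so a small helper can build either one. This is only a sketch in Python: the helper name filtered_body is hypothetical, it produces the ES 1.x-style filtered body used in these notes, and it does not send anything to a cluster.

```python
import json


def filtered_body(filter_clause, query_clause=None):
    """Build a `filtered` request body; the query part is optional."""
    filtered = {"filter": filter_clause}
    if query_clause is not None:
        filtered["query"] = query_clause
    return {"query": {"filtered": filtered}}


inbox = {"term": {"folder": "inbox"}}

# Query plus filter: full-text match restricted to the inbox.
with_query = filtered_body(inbox, {"match": {"email": "business opportunity"}})
# Just a filter: every email in the inbox.
only_filter = filtered_body(inbox)

print(json.dumps(with_query, indent=4))
print(json.dumps(only_filter, indent=4))
```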

You seldom need to use a query as a filter, but we have included it for completeness' sake. The only time you may need it is when you need to use full-text matching while in filter context.

finding multiple exact values

GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "terms" : { 
                    "price" : [20, 30]
                }
            }
        }
    }
}

contains, but does not equal

GET /my_index/my_type/_search
{
    "query": {
        "filtered" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, 
                        { "term" : { "tag_count" : 1 } } 
                    ]
                }
            }
        }
    }
}

When used on date fields, the range filter supports date math operations. For example, if we want to find all documents that have a timestamp sometime in the last hour:

"range" : {
    "timestamp" : {
        "gt" : "now-1h"
    }
}

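Date math can also be applied to an explicit date: append a double pipe (||) followed by the date math expression. For example, an upper bound written as 2014-01-01 00:00:00||+1M means less than January 1, 2014 plus one month.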
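As a sketch of both forms of date math, the Python snippet below builds the corresponding range filter bodies. The helper name range_filter is made up for illustration, and the explicit date strings assume the timestamp field is mapped with a matching date format.

```python
import json

ALLOWED_BOUNDS = {"gt", "gte", "lt", "lte"}


def range_filter(field, **bounds):
    """Build a range filter body, rejecting unknown bound names."""
    unknown = set(bounds) - ALLOWED_BOUNDS
    if unknown:
        raise ValueError(f"unsupported bounds: {sorted(unknown)}")
    return {"range": {field: bounds}}


# Relative to the current time: anything in the last hour.
last_hour = range_filter("timestamp", gt="now-1h")

# Anchored to an explicit date: the `||` separates the date from the math.
january_2014 = range_filter(
    "timestamp",
    gt="2014-01-01 00:00:00",
    lt="2014-01-01 00:00:00||+1M",  # less than January 1, 2014 plus one month
)

print(json.dumps(last_hour, indent=4))
print(json.dumps(january_2014, indent=4))
```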

dealing with null values

GET /my_index/posts/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "exists" : { "field" : "tags" }
            }
        }
    }
}

GET /my_index/posts/_search
{
    "query" : {
        "filtered" : {
            "filter": {
                "missing" : { "field" : "tags" }
            }
        }
    }
}

all about caching

The cache is kept up to date in real time, so there is no need to worry about cache expiry.

Leaf filters have to consult the inverted index on disk, so it makes sense to cache them. Compound filters, on the other hand, use fast bit logic to combine the bitsets resulting from their inner clauses, so it is efficient to recalculate them every time.
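The split between cached leaf filters and cheap compound filters can be pictured with a toy model. The Python sketch below is only an illustration, not Elasticsearch's implementation: each leaf filter is evaluated once into a bitset (a plain integer here) and stored in a dictionary acting as the cache, and the compound filter is recomputed with bitwise operators. All names and sample documents are made up.

```python
def leaf_bitset(docs, predicate):
    """Build a bitset as an int: bit i is 1 when docs[i] matches."""
    bits = 0
    for position, doc in enumerate(docs):
        if predicate(doc):
            bits |= 1 << position
    return bits


def matching_ids(docs, bits):
    """Translate a bitset back into document ids."""
    return [doc["id"] for position, doc in enumerate(docs) if (bits >> position) & 1]


emails = [
    {"id": 1, "folder": "inbox", "read": False},
    {"id": 2, "folder": "inbox", "read": True},
    {"id": 3, "folder": "spam", "read": False},
    {"id": 4, "folder": "sent", "read": True},
]

# Leaf filters are the expensive part, so their bitsets are cached.
cache = {
    "folder:inbox": leaf_bitset(emails, lambda doc: doc["folder"] == "inbox"),
    "read:true": leaf_bitset(emails, lambda doc: doc["read"]),
}

# Compound filter (must folder:inbox, must_not read:true): just bit logic.
all_docs = (1 << len(emails)) - 1
combined = cache["folder:inbox"] & (all_docs & ~cache["read:true"])

print(matching_ids(emails, combined))  # [1]
```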

Certain leaf filters, however, are not cached by default, because it doesn't make sense to do so. For example:

Script filters

The results from script filters cannot be cached because the meaning of the script is opaque to Elasticsearch.

Geo-filters

The geolocation filters, which we cover in more detail in Geolocation, are usually used to filter results based on the geolocation of a specific user. Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them.

Date ranges

Date ranges that use the now function (for example "now-1h") result in values accurate to the millisecond. Every time the filter is run, now returns a new time. Older filters will never be reused, so caching is disabled by default. However, when using now with rounding (for example, now/d rounds to the nearest day), caching is enabled by default.

Sometimes the default caching strategy is not correct. Perhaps you have a complicated bool expression that is reused several times in the same query. Or you have a filter on a date field that will never be reused. The default caching strategy can be overridden on almost any filter by setting the _cache flag:

{
    "range" : {
        "timestamp" : {
            "gt" : "2014-01-02 16:15:14" 
        },
        "_cache": false 
    }
}

filter order

More specific filters should come first. For example, if filter a matches 10,000 documents and filter b matches 10, then b should be placed before a.

Cached filters are very fast, so they should be placed before filters that are not cacheable.
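To see why ordering matters, here is a toy Python model that counts how many documents each filter has to inspect. It is a sketch under simplifying assumptions (filters applied one after another over a plain list, made-up predicates), not a description of Elasticsearch internals.

```python
def apply_in_order(docs, filters):
    """Apply filters one after another; return survivors and the number of checks made."""
    checks = 0
    survivors = docs
    for predicate in filters:
        checks += len(survivors)
        survivors = [doc for doc in survivors if predicate(doc)]
    return survivors, checks


docs = [{"id": i} for i in range(100_000)]


def filter_a(doc):
    """Broad filter: matches 10,000 documents."""
    return doc["id"] % 10 == 0


def filter_b(doc):
    """Specific filter: matches 10 documents."""
    return doc["id"] % 10_000 == 0


broad_first, broad_checks = apply_in_order(docs, [filter_a, filter_b])
specific_first, specific_checks = apply_in_order(docs, [filter_b, filter_a])

print(len(broad_first), broad_checks)        # 10 110000
print(len(specific_first), specific_checks)  # 10 100010
```

Both orders return the same ten documents, but putting the specific filter first leaves far less work for the filter that follows it.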

full-text search

Term-based queries

Queries like the term or fuzzy queries are low-level queries that have no analysis phase. They operate on a single term. A term query for the term Foo looks for that exact term in the inverted index and calculates the TF/IDF relevance _score for each document that contains the term.

It is important to remember that the term query looks in the inverted index for the exact term only; it won't match any variants like foo or FOO. It doesn't matter how the term came to be in the index, just that it is. If you were to index ["Foo","Bar"] into an exact-value not_analyzed field, or Foo Bar into an analyzed field with the whitespace analyzer, both would result in having the two terms Foo and Bar in the inverted index.
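A minimal Python sketch of this behavior, using a dictionary as a stand-in for the inverted index and str.split as a stand-in for the whitespace analyzer (both are simplifications, and the function names are made up):

```python
from collections import defaultdict


def whitespace_analyzer(text):
    """Split on whitespace only; case is preserved."""
    return text.split()


def build_index(docs, analyze):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index, term):
    """Look up the exact term; no analysis is applied to it."""
    return sorted(index.get(term, set()))


index = build_index({1: "Foo Bar"}, whitespace_analyzer)

print(sorted(index))              # ['Bar', 'Foo']
print(term_query(index, "Foo"))   # [1]
print(term_query(index, "foo"))   # []
print(term_query(index, "FOO"))   # []
```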

Full-text queries

Queries like the match or query_string queries are high-level queries that understand the mapping of a field:

If you use them to query a date or integer field, they will treat the query string as a date or integer, respectively.
If you query an exact value (not_analyzed) string field, they will treat the whole query string as a single term.
But if you query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

a single-word query

Our first example explains what happens when we use the match query to search within a full-text field for a single word:

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "QUICK!"
        }
    }
}

Elasticsearch executes the preceding match query as follows:

1. Check the field type.
   The title field is a full-text (analyzed) string field, which means that the query string should be analyzed too.

2. Analyze the query string.
   The query string QUICK! is passed through the standard analyzer, which results in the single term quick. Because we have just a single term, the match query can be executed as a single low-level term query.

3. Find matching docs.
   The term query looks up quick in the inverted index and retrieves the list of documents that contain that term, in this case documents 1, 2, and 3.

4. Score each doc.
   The term query calculates the relevance _score for each matching document by combining the term frequency (how often quick appears in the title field of each document), the inverse document frequency (how often quick appears in the title field in all documents in the index), and the length of each field (shorter fields are considered more relevant). See What Is Relevance?.
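The steps above can be sketched in Python. This is a deliberately simplified model: the analyzer is a rough stand-in for the standard analyzer, the sample titles are assumed, and the scoring uses a plain TF/IDF formula with a length norm (square root of term frequency, 1 + ln(N / (df + 1)), and 1 / sqrt(field length)), without the encoding details of the real scorer.

```python
import math
import re

# Assumed sample documents; only 1, 2 and 3 contain "quick".
TITLES = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the quick dog",
    3: "The quick brown fox jumps over the lazy dog",
    4: "Brown fox brown dog",
}


def standard_like_analyzer(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def match_single_word(docs, query_string):
    # Step 2: analyze the query string into terms.
    terms = standard_like_analyzer(query_string)
    if len(terms) != 1:
        raise ValueError("this sketch only handles single-term queries")
    term = terms[0]

    # Step 3: find the documents whose analyzed field contains the term.
    analyzed = {doc_id: standard_like_analyzer(text) for doc_id, text in docs.items()}
    matching = {doc_id: tokens for doc_id, tokens in analyzed.items() if term in tokens}

    # Step 4: score each matching document.
    idf = 1 + math.log(len(docs) / (len(matching) + 1))
    scores = {}
    for doc_id, tokens in matching.items():
        tf = math.sqrt(tokens.count(term))
        field_norm = 1 / math.sqrt(len(tokens))
        scores[doc_id] = tf * idf * field_norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = match_single_word(TITLES, "QUICK!")
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

In this model, document 1 ranks first because its title is short, and document 2 beats document 3 because quick appears twice in it.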

multiword queries

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": {      
                "query":    "BROWN DOG!",
                "operator": "and"
            }
        }
    }
}
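One way to picture the operator parameter is as a choice between should and must clauses. The Python sketch below assumes, as the Definitive Guide describes it, that a multiword match query is rewritten into a bool query of term queries. The helper name and the rough analyzer are made up, and the real rewrite happens inside Elasticsearch.

```python
import json
import re


def rewrite_match(field, query_string, operator="or"):
    """Sketch of rewriting a match query into low-level term queries."""
    terms = re.findall(r"\w+", query_string.lower())  # rough analyzer stand-in
    clauses = [{"term": {field: term}} for term in terms]
    if len(clauses) == 1:
        return clauses[0]  # a single term needs no bool wrapper
    occurrence = "must" if operator == "and" else "should"
    return {"bool": {occurrence: clauses}}


all_terms = rewrite_match("title", "BROWN DOG!", operator="and")
any_term = rewrite_match("title", "BROWN DOG!")

print(json.dumps(all_terms, indent=4))
print(json.dumps(any_term, indent=4))
```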

controlling precision

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query":                "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}

controlling precision

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "brown" }},
        { "match": { "title": "fox"   }},
        { "match": { "title": "dog"   }}
      ],
      "minimum_should_match": 2 
    }
  }
}

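The query above is equivalent to the following (operator defaults to or):

{
    "query": {
        "match": {
            "title": {
                "query": "brown fox dog",
                "minimum_should_match": 2
            }
        }
    }
}

A percentage works here too, but the computed number is rounded down: 66% of three terms is 1.98, which becomes 1. Use 67% or higher (for example 75%) to require two of the three terms.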
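Percentages for minimum_should_match are rounded down, which matters when there are only a few terms. The Python sketch below models only that rule for positive percentages. The function name is made up, and negative values and combined specifications are left out.

```python
def required_clauses(optional_clauses, percent):
    """Number of clauses that must match for a positive percentage (rounded down)."""
    return optional_clauses * percent // 100


for percent in (75, 67, 66):
    print(f"{percent}% of 3 terms -> {required_clauses(3, percent)}")
# 75% of 3 terms -> 2
# 67% of 3 terms -> 2
# 66% of 3 terms -> 1
```

So with three terms, 75% and 67% both require two matching terms, while 66% requires only one.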

boosting query clauses

This is about scoring: if a particular clause matches, how can it be given more weight in the score? Use the boost parameter.

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {  
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": {
                    "content": {
                        "query": "Elasticsearch",
                        "boost": 3 
                    }
                }},
                { "match": {
                    "content": {
                        "query": "Lucene",
                        "boost": 2 
                    }
                }}
            ]
        }
    }
}

The boost parameter is used to increase the relative weight of a clause (with a boost greater than 1) or decrease the relative weight (with a boost between 0 and 1), but the increase or decrease is not linear. In other words, a boost of 2 does not result in double the _score.

controlling analysis

GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes"}},
                { "match": { "english_title": "Foxes"}}
            ]
        }
    }
}

The validate-query API checks whether a query is valid, and with explain it also shows how the query string is analyzed.

How a document finds the right analyzer at index time

The analyzer hierarchy

At index time, Elasticsearch looks for an analyzer in this order:

The analyzer defined in the field mapping, else
The analyzer defined in the _analyzer field of the document, else
The default analyzer for the type, which defaults to
The analyzer named default in the index settings, which defaults to
The analyzer named default at node level, which defaults to
The standard analyzer

At search time, the sequence is slightly different:

The analyzer defined in the query itself, else
The analyzer defined in the field mapping, else
The default analyzer for the type, which defaults to
The analyzer named default in the index settings, which defaults to
The analyzer named default at node level, which defaults to
The standard analyzer
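Both lookup sequences are fallback chains: the first level that defines an analyzer wins, and the standard analyzer is the last resort. A minimal Python sketch of that idea follows. The function name, the level labels, and the sample analyzer choices are all illustrative assumptions.

```python
def resolve_analyzer(candidates, fallback="standard"):
    """Return (level, analyzer) for the first level that defines an analyzer."""
    for level, analyzer in candidates:
        if analyzer is not None:
            return level, analyzer
    return "built-in", fallback


# Index time: only the index settings define a default analyzer.
index_time = [
    ("field mapping", None),
    ("document _analyzer field", None),
    ("type default", None),
    ("index settings default", "english"),
    ("node default", None),
]

# Search time: the query itself comes first in the chain.
search_time = [
    ("query", "whitespace"),
    ("field mapping", None),
    ("type default", None),
    ("index settings default", "english"),
    ("node default", None),
]

print(resolve_analyzer(index_time))   # ('index settings default', 'english')
print(resolve_analyzer(search_time))  # ('query', 'whitespace')
print(resolve_analyzer([("field mapping", None)]))  # ('built-in', 'standard')
```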

configuring analyzers in practice

use index settings, not config files

The first thing to remember is that, even though you may start out using Elasticsearch for a single purpose or a single application such as logging, chances are that you will find more use cases and end up running several distinct applications on the same cluster. Each index needs to be independent and independently configurable. You don’t want to set defaults for one use case, only to have to override them for another use case later.

This rules out configuring analyzers at the node level. Additionally, configuring analyzers at the node level requires changing the config file on every node and restarting every node, which becomes a maintenance nightmare. It’s a much better idea to keep Elasticsearch running and to manage settings only via the API.

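In short: use index settings rather than editing the ES config files. With several nodes running, changing the default configuration on every node is inconvenient, so index-level analyzers are the recommended approach.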
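As a sketch of what this looks like in practice, the Python snippet below builds a create-index request body that sets the index-level default analyzer. The index name my_index and the choice of the english analyzer are assumptions for illustration. The body would be sent with a request such as PUT /my_index, and nothing here talks to a cluster.

```python
import json


def index_settings_with_default_analyzer(analyzer_type):
    """Build index settings whose analyzer named `default` uses the given type."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {"type": analyzer_type}
                }
            }
        }
    }


body = index_settings_with_default_analyzer("english")
print("PUT /my_index")
print(json.dumps(body, indent=4))
```

Because the setting lives in the index, another index on the same cluster can choose a different default without touching any node's config file.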

relevance is broken!

However, for performance reasons, Elasticsearch doesn't calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.

Each shard scores its query results independently. Because our documents are well distributed, the IDF for both shards will be the same. Now imagine instead that five of the foo documents are on shard 1, and the sixth document is on shard 2. In this scenario, the term foo is very common on one shard (and so of little importance), but rare on the other shard (and so much more important). These differences in IDF can produce incorrect results.

To cut to the conclusion: the real problem is that you don't have enough data. With a large enough data set, each shard becomes representative of the document distribution of the whole index (a matter of statistics and probability), so it is enough to make sure your ES index holds plenty of data.
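The effect can be checked with a few lines of Python. The sketch assumes the classic formula idf = 1 + ln(docCount / (docFreq + 1)) and made-up shard sizes. It compares a tiny, skewed index with a large one where the term is spread almost evenly.

```python
import math


def idf(doc_count, doc_freq):
    """Classic inverse document frequency for one shard (or one whole index)."""
    return 1 + math.log(doc_count / (doc_freq + 1))


# Small index: two shards of 100 docs each (assumed sizes), six foo documents.
evenly_spread = [idf(100, 3), idf(100, 3)]
skewed = [idf(100, 5), idf(100, 1)]
whole_index = idf(200, 6)

print("even:  ", [round(value, 3) for value in evenly_spread])
print("skewed:", [round(value, 3) for value in skewed])
print("global:", round(whole_index, 3))

# Large index: a million docs per shard, foo counts differ only slightly.
large = [idf(1_000_000, 50_000), idf(1_000_000, 50_100)]
print("large: ", [round(value, 4) for value in large])
print("gap:   ", round(abs(large[0] - large[1]), 4))
```

With little data, the two shards disagree noticeably about how rare foo is. With a lot of data, the local IDFs are nearly identical, which is the conclusion drawn above.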
