Elasticsearch by Analogy: Search

wzy0623

Categories: NoSQL, Elasticsearch by Analogy

Article link: https://blog.csdn.net/wzy0623/article/details/86679996

Study notes on "Elasticsearch in Action".

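Figure 1 shows how ES executes a search request. The index in the figure has two shards, each with one replica. After documents are located and scored, only the top 10 are fetched by default. The REST search request goes to whichever node the client is connected to; based on the indices being queried, that node forwards the request to all relevant shards (primary or replica). Once enough sorting and ranking information has been collected from every shard, only the shards that hold the required documents are asked to return their content. This search routing behavior is configurable; the default shown in Figure 1 is called query_then_fetch.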
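
The two phases described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how ES is implemented: the SHARDS data, the function names, and the scores are all made up for illustration.

```python
import heapq

# Hypothetical per-shard data: doc id -> (score, stored document).
SHARDS = {
    "shard0": {"1": (0.9, {"title": "a"}), "2": (0.4, {"title": "b"})},
    "shard1": {"3": (0.7, {"title": "c"}), "4": (0.2, {"title": "d"})},
}

def query_phase(shard_docs, size):
    """Each shard returns only (score, doc id) pairs for its own top `size` hits."""
    pairs = ((score, doc_id) for doc_id, (score, _) in shard_docs.items())
    return heapq.nlargest(size, pairs)

def query_then_fetch(shards, size=10):
    # Query phase: gather lightweight (score, shard, id) entries from every shard.
    candidates = []
    for name, docs in shards.items():
        candidates += [(score, name, doc_id) for score, doc_id in query_phase(docs, size)]
    # The coordinating node merges the entries and keeps the global top `size`.
    top = heapq.nlargest(size, candidates)
    # Fetch phase: only now is the owning shard asked for the document body.
    return [(doc_id, score, shards[name][doc_id][1]) for score, name, doc_id in top]

results = query_then_fetch(SHARDS, size=3)
print([doc_id for doc_id, _, _ in results])
```

The point of the split is that document bodies travel over the network only for the final hits, while the ranking step moves nothing but scores and IDs.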

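I. Structure of a Search Request

ES searches are issued either as a JSON request body or as URL parameters.

1. Defining the search scope

Every REST search request uses the _search endpoint and may be sent as either GET or POST. You can search the whole cluster, or limit the scope by naming indices or types in the search URL:

# Search the entire cluster with no conditions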
curl '172.16.1.127:9200/_search?pretty'
curl '172.16.1.127:9200/_all/_search?pretty'
curl '172.16.1.127:9200/*/_search?pretty'

# Search the get-together index with no conditions, similar to select * from get-together in SQL
curl '172.16.1.127:9200/get-together/_search?pretty'

# ES6 allows only a single type per index, so this behaves the same as the command above
curl '172.16.1.127:9200/get-together/_doc/_search?pretty'

# Search both the get-together and dbinfo indices with no conditions
curl '172.16.1.127:9200/get-together,dbinfo/_doc/_search?pretty'

# Match index names by wildcard: include indices starting with get-toge, but exclude get-together
curl '172.16.1.127:9200/+get-toge*,-get-together/_search?pretty'

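As with a database, restrict queries to as few indices as possible for better performance. Every search request must be sent to all shards of every target index (comparable to a full index scan in a database), so the more indices you search, the more shards are involved.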

2. Basic components of a search request

By analogy with a SQL query:

select ...
  from ...
 where ...
 order by ...
 limit ...

        where <-> query
   select ... <-> _source 
  size + from <-> limit
     order by <-> sort

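The basic components of a search request are:

query: configures the query and filter DSL that restricts what is searched, like the where clause of a SQL query.
size: the number of documents to return, like the row count in a SQL limit clause.
from: used together with size for pagination, like the offset in a SQL limit clause. As the result set grows, fetching pages deep in the results becomes expensive, because every shard has to collect from + size hits before the results can be merged. (The deferred-join idea from SQL should also apply to ES: first search for the IDs on a given page, then fetch the fields by ID.)
_source: specifies how the _source field is returned. By default the complete _source is returned, like select * in SQL; configuring _source filters the returned fields.
sort: like the order by clause in SQL. The default sort is by document score.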
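
The relationship between a page number and from/size is easy to get wrong by one, so here is a minimal sketch. The helper names are hypothetical and not part of any ES client.

```python
def page_to_from_size(page, page_size=10):
    """Translate a 1-based page number into the zero-based from/size pair."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}

def hits_collected_per_shard(page, page_size=10):
    """Each shard has to rank from + size hits before the results are merged."""
    params = page_to_from_size(page, page_size)
    return params["from"] + params["size"]

print(page_to_from_size(2))            # {'from': 10, 'size': 10}
print(hits_collected_per_shard(1000))  # 10000
```

The second function shows why deep pages are costly: page 1000 with 10 hits per page makes every shard rank 10000 hits to deliver 10.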

Here are some simple examples.
(1) Return the second page of 10 results

# from is zero-based in ES
curl '172.16.1.127:9200/get-together/_search?from=10&size=10&pretty'

(2) Sort by date ascending and return the first 10 results

curl '172.16.1.127:9200/get-together/_search?sort=date:asc&pretty'

(3) Sort by date ascending and return only the title and date fields of the first 10 results

curl '172.16.1.127:9200/get-together/_search?sort=date:asc&_source=title,date&pretty'

(4) Match all documents whose title contains "elasticsearch" (compared in lowercase), sorted by date ascending

curl '172.16.1.127:9200/get-together/_search?sort=date:asc&q=title:elasticsearch&pretty'

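3. Request-body search

The search requests so far were all URL-based. For more advanced searches, a request-body search offers more flexibility and more options, and it is the form ES recommends.

(1) Return the second page of 10 results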

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  },
  "from": 10,
  "size": 10
}'

(2) Return specific fields

# Return only the name and date fields
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  },
  "_source": [
    "name",
    "date"
  ]
}'

(3) Use wildcards in _source to select fields

# Return date and the fields starting with location, but not location.geolocation
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  },
  "_source": {
    "includes": [
      "location.*",
      "date"
    ],
    "excludes": [
      "location.geolocation"
    ]
  }
}'
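
To make the include/exclude rules concrete, here is a rough Python sketch that applies the same patterns to a flattened document. It only approximates the behavior: fnmatch also understands [ ] character classes, and the field location.name and all sample values are invented for the example.

```python
from fnmatch import fnmatchcase

def filter_source(flat_source, includes=(), excludes=()):
    """Keep keys that match an include pattern (if any are given) and no exclude pattern."""
    kept = {}
    for key, value in flat_source.items():
        if includes and not any(fnmatchcase(key, p) for p in includes):
            continue
        if any(fnmatchcase(key, p) for p in excludes):
            continue
        kept[key] = value
    return kept

# Hypothetical document, flattened to dotted keys.
doc = {
    "name": "Denver Clojure",
    "date": "2013-04-17T19:00",
    "location.name": "Denver, Colorado",
    "location.geolocation": "39.7392,-104.9847",
}
print(filter_source(doc, ["location.*", "date"], ["location.geolocation"]))
```

Excludes win over includes, which is why location.geolocation is dropped even though it matches location.*.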

(4) Sort the results

# Sorting on the text field name needs fielddata, so enable it in the mapping first
curl -XPOST "172.16.1.127:9200/get-together/_mapping/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "name": {
      "type": "text",
      "fielddata": true
    }
  }
}'

# Similar to order by created_on asc, name desc, _score in SQL
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "created_on": "asc"
    },
    {
      "name": "desc"
    },
    "_score"
  ]
}'

(5) Combining the basic components

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 10,
  "_source": [
    "name",
    "organizer",
    "description"
  ],
  "sort": [
    {
      "created_on": "desc"
    }
  ]
}'

This is similar to the following SQL query:

select name, organizer, description
  from get-together
 order by created_on desc
 limit 0, 10;

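Note that when a document has no value for a requested field, the field name simply does not appear in the _source that ES returns by default. This differs from SQL, where the column would still be present with a null value.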

4. Structure of the response

Let's look at the data structure that an ES search returns.

curl '172.16.1.127:9200/_search?q=title:elasticsearch&_source=title,date&pretty'

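The result:

{
  "took" : 13,                                       # milliseconds the query took
  "timed_out" : false,                               # whether the search timed out
  "_shards" : {
    "total" : 28,                                    # shards searched
    "successful" : 28,                               # shards that succeeded
    "skipped" : 0,                                   # shards skipped
    "failed" : 0                                     # shards that failed
  },
  "hits" : {
    "total" : 7,                                     # number of matching documents
    "max_score" : 1.0128567,                         # highest document score
    "hits" : [                                       # array of matching documents
      {
        "_index" : "get-together",                   # index the document belongs to
        "_type" : "_doc",                            # type the document belongs to
        "_id" : "103",                               # document ID
        "_score" : 1.0128567,                        # relevance score
        "_routing" : "2",                            # routing value that selects the shard
        "_source" : {                                # the requested _source fields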
          "date" : "2013-04-17T19:00",
          "title" : "Introduction to Elasticsearch"
        }
      },
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "105",
        "_score" : 1.0128567,
        "_routing" : "2",
        "_source" : {
          "date" : "2013-07-17T18:30",
          "title" : "Elasticsearch and Logstash"
        }
      },
      ...
    ]
  }
}

If neither the document's _source nor its individual fields are stored, the values cannot be retrieved from ES!
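
As a small illustration of reading this structure from client code, the sketch below parses a trimmed copy of the response (comments removed, only the two hits shown above kept). The summarize helper is a hypothetical name, not part of any ES client.

```python
import json

# Trimmed copy of the response shown above.
RESPONSE = json.loads("""
{
  "took": 13,
  "timed_out": false,
  "_shards": {"total": 28, "successful": 28, "skipped": 0, "failed": 0},
  "hits": {
    "total": 7,
    "max_score": 1.0128567,
    "hits": [
      {"_index": "get-together", "_type": "_doc", "_id": "103",
       "_score": 1.0128567, "_routing": "2",
       "_source": {"date": "2013-04-17T19:00", "title": "Introduction to Elasticsearch"}},
      {"_index": "get-together", "_type": "_doc", "_id": "105",
       "_score": 1.0128567, "_routing": "2",
       "_source": {"date": "2013-07-17T18:30", "title": "Elasticsearch and Logstash"}}
    ]
  }
}
""")

def summarize(response):
    """Pull out the parts of a search response a caller usually needs."""
    hits = response["hits"]
    return {
        # A result is only complete if no shard failed and nothing timed out.
        "complete": not response["timed_out"] and response["_shards"]["failed"] == 0,
        "total": hits["total"],
        "titles": [(hit["_id"], hit["_source"].get("title")) for hit in hits["hits"]],
    }

print(summarize(RESPONSE))
```

Using .get for title mirrors the earlier note: a field with no value is absent from _source rather than present as null.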

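II. Queries and Filters

Queries and filters both act like the where clause of a SQL query, selecting documents by search criteria, but they differ in scoring and in search performance. A query computes a score for specific terms, whereas a filter only answers "does this document match?" with a simple yes or no. Figure 2 shows the main differences between queries and filters.

Because of this difference, filters can be faster than ordinary queries, and their results can be cached.

1. match

(1) match_all
Matches all documents, like a SQL query with no where clause.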

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  }
}'

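In ES6, every document returned by match_all has a _score of 1.0.

(2) match
Matches a condition on a field, loosely like where column='xxx' in SQL, except that the text is analyzed and compared term by term. The following query finds documents with "hadoop" in the title: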

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "title": "hadoop"
    }
  }
}'

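With the default analyzer, match is case-insensitive: both the indexed terms and the input text are lowercased before they are compared. A match query returns a relevance _score for each document.

By default, match uses the OR operator. For example, searching for "Elasticsearch Denver" makes ES search for "Elasticsearch OR Denver", which matches both "Elasticsearch Amsterdam" and "Denver Clojure". The following query returns only results that contain both "Elasticsearch" and "Denver":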

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "name": {
        "query": "Elasticsearch Denver",
        "operator": "and"
      }
    }
  }
}'
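
The difference between the two operators can be imitated in plain Python. This is only a sketch: analyze is a crude stand-in for a real analyzer (lowercase, split on whitespace), and the sample name "Elasticsearch Denver" is invented.

```python
def analyze(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def match(field_value, query_text, operator="or"):
    """Return True if the field matches any (or) or all (and) of the query terms."""
    doc_terms = set(analyze(field_value))
    found = [term in doc_terms for term in analyze(query_text)]
    return all(found) if operator == "and" else any(found)

names = ["Elasticsearch Amsterdam", "Denver Clojure", "Elasticsearch Denver"]
print([n for n in names if match(n, "Elasticsearch Denver")])
print([n for n in names if match(n, "Elasticsearch Denver", operator="and")])
```

The first line prints all three names, the second only the one that contains both terms.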

(3) match_phrase
The following query finds documents whose name field contains the phrase "enterprise london", allowing one word between "enterprise" and "london":

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "enterprise london",
        "slop": 1
      }
    }
  },
  "_source": [
    "name",
    "description"
  ]
}'

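(4) match_phrase_prefix
In the example below, match_phrase_prefix is given "Elasticsearch den". ES treats the last term, "den", as a prefix and looks in the name field for terms that start with "den". max_expansions caps the number of terms the prefix may expand to; because the expansion can produce a very large set, it should be limited.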

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "Elasticsearch den",
        "max_expansions": 1
      }
    }
  },
  "_source": [
    "name"
  ]
}'
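
What max_expansions does to the prefix can be pictured with a sorted term dictionary. The sketch below is a simplification under stated assumptions: the term list is made up, and expand_prefix is a hypothetical helper, not an ES function. The default of 50 mirrors the documented ES default.

```python
import bisect

def expand_prefix(sorted_terms, prefix, max_expansions=50):
    """Return up to max_expansions terms that start with prefix, in dictionary order."""
    start = bisect.bisect_left(sorted_terms, prefix)
    expansions = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(expansions) >= max_expansions:
            break
        expansions.append(term)
    return expansions

# Hypothetical terms indexed in the name field.
terms = sorted(["amsterdam", "denmark", "dental", "denver", "elasticsearch"])
print(expand_prefix(terms, "den", max_expansions=1))  # ['denmark']
print(expand_prefix(terms, "den"))                    # ['denmark', 'dental', 'denver']
```

With max_expansions set to 1 only the first term in dictionary order is tried, so a document containing "denver" could be missed. A very small value trades recall for speed.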

(5) multi_match
Matches multiple terms across multiple fields, similar to where name like '%elasticsearch%' or name like '%hadoop%' or description like '%elasticsearch%' or description like '%hadoop%' in SQL:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "multi_match": {
      "query": "elasticsearch hadoop",
      "fields": [
        "name",
        "description"
      ]
    }
  }
}'

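Just as a match query has phrase and phrase_prefix variants, a multi_match query can be turned into a phrase or phrase_prefix query by setting its type key. Apart from searching several fields instead of one, multi_match can be used just like match.

2. term

term queries and filters specify a document field and a term to search for. Note that the search term is not analyzed, so it must exactly match a term in the document for the document to be returned.

(1) term query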

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "term": {
      "tags": "elasticsearch"
    }
  },
  "_source": [
    "name",
    "tags"
  ]
}'
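
Why an unanalyzed term can fail to match is easiest to see with a toy inverted index. Everything below is illustrative: the analyzer is a crude lowercase-and-split stand-in, and the single document is invented.

```python
def analyze(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs, field):
    """Map each analyzed term to the set of document IDs that contain it."""
    index = {}
    for doc_id, doc in docs.items():
        for term in analyze(doc[field]):
            index.setdefault(term, set()).add(doc_id)
    return index

def term_query(index, term):
    """Look the term up exactly as given, with no analysis."""
    return index.get(term, set())

def match_query(index, text):
    """Analyze the text first, then look up each resulting term."""
    result = set()
    for term in analyze(text):
        result |= index.get(term, set())
    return result

index = build_index({"1": {"name": "Elasticsearch Denver"}}, "name")
print(term_query(index, "Elasticsearch"))   # set()
print(term_query(index, "elasticsearch"))   # {'1'}
print(match_query(index, "Elasticsearch"))  # {'1'}
```

The indexed term is lowercase, so the capitalized term finds nothing while the match-style lookup succeeds.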

(2) term filter
Like the term query, the term filter restricts results to documents that contain a given term, but without computing a score.

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "tags": "elasticsearch"
        }
      }
    }
  }
}'

(3) terms query
Like term, the terms query searches a field for any of several terms. The following query finds documents tagged "jvm" or "hadoop".

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "terms": {
      "tags": [
        "jvm",
        "hadoop"
      ]
    }
  },
  "_source": [
    "name",
    "tags"
  ]
}'

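A minimum number of matching terms per document can also be enforced. The terms query in ES6 has no parameter for this, so write each term as a should clause of a bool query and set minimum_should_match. The query below requires at least two of the three tags:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "should": [
        { "term": { "tags": "jvm" } },
        { "term": { "tags": "hadoop" } },
        { "term": { "tags": "lucene" } }
      ],
      "minimum_should_match": 2
    }
  }
}'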
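
The minimum-match rule described above amounts to counting how many of the wanted terms a document contains. A minimal sketch with invented documents:

```python
def min_should_match(doc_tags, wanted, minimum):
    """True when at least `minimum` of the wanted tags appear in the document."""
    matched = sum(1 for tag in wanted if tag in doc_tags)
    return matched >= minimum

# Hypothetical documents: id -> tags.
docs = {
    "1": ["jvm", "hadoop"],
    "2": ["lucene"],
    "3": ["jvm", "lucene", "solr"],
}
wanted = ["jvm", "hadoop", "lucene"]
print(sorted(d for d, tags in docs.items() if min_should_match(tags, wanted, 2)))
```

Documents 1 and 3 each carry two of the three tags and pass; document 2 carries only one and is dropped.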

3. query_string

The following searches find documents containing "nosql". The two are equivalent: the first is URL-based, the second sends a request body:

curl -XGET '172.16.1.127:9200/get-together/_search?q=nosql&pretty'
curl -XPOST '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "query": "nosql"
    }
  }
}'

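By default, query_string searches across all fields. Older versions did this through the _all field, a combination of every field, which ES6 disables by default for new indices. Use default_field to choose a specific field: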

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "default_field": "description",
      "query": "nosql"
    }
  }
}'

You can also query several fields at once by using fields:

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "fields": ["description", "tags"],
      "query": "nosql"
    }
  }
}'

The following query finds all documents whose name contains "nosql" but excludes those whose description mentions "mongodb":

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "query": "name:nosql AND -description:mongodb"
    }
  }
}'

The following command finds all documents tagged search or lucene that were created between 1999 and 2001:

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "query": "(tags:search OR tags:lucene) AND (created_on:[1999-01-01 TO 2001-01-01])"
    }
  }
}'

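Suggested alternatives to query_string include the term, terms, match, and multi_match queries.

III. Compound Queries

1. bool query

A bool query combines any number of queries in a single query. Its clauses state which parts must match, should match, or must_not match the data in the ES index.

The following example requires that the attendees field contain "david", that it also contain "clint" or "andy", and that date be on or after '2013-06-30'. minimum_should_match is the minimum number of should clauses that must match for a document to be returned. Its default has a subtlety: if a must clause is specified, the default is 0; if there is no must clause, the default is 1.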

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "attendees": "david"
          }
        }
      ],
      "should": [
        {
          "term": {
            "attendees": "clint"
          }
        },
        {
          "term": {
            "attendees": "andy"
          }
        }
      ],
      "must_not": [
        {
          "range": {
            "date": {
              "lt": "2013-06-30T00:00"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}'

This query can be rewritten as follows. It selects the same documents (assuming every document has a date value) but uses only the must option of the bool query, so it is shorter. Scores may differ from the original.

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "attendees": "david"
          }
        },
        {
          "range": {
            "date": {
              "gte": "2013-06-30T00:00"
            }
          }
        },
        {
          "terms": {
            "attendees": [
              "clint",
              "andy"
            ]
          }
        }
      ]
    }
  }
}'
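
The matching logic of must, should, must_not, and the minimum_should_match default can be modeled with ordinary predicates. This is a sketch of the matching rules only (no scoring); the helper names and the four sample documents are made up.

```python
def term(field, value):
    """Clause that is true when the field's list of values contains value."""
    return lambda doc: value in doc.get(field, [])

def date_before(field, bound):
    """Clause that is true when the ISO-formatted date sorts before bound."""
    return lambda doc: field in doc and doc[field] < bound

def bool_match(doc, must=(), should=(), must_not=(), minimum_should_match=None):
    if minimum_should_match is None:
        # Default described in the text: 0 with a must clause, otherwise 1.
        minimum_should_match = 0 if (must or not should) else 1
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    return sum(1 for clause in should if clause(doc)) >= minimum_should_match

DOCS = {
    "a": {"attendees": ["david", "clint"], "date": "2013-07-01T18:00"},
    "b": {"attendees": ["david"], "date": "2013-07-01T18:00"},
    "c": {"attendees": ["david", "andy"], "date": "2013-06-01T18:00"},
    "d": {"attendees": ["clint", "andy"], "date": "2013-07-15T18:00"},
}
CLAUSES = dict(
    must=[term("attendees", "david")],
    should=[term("attendees", "clint"), term("attendees", "andy")],
    must_not=[date_before("date", "2013-06-30T00:00")],
)

def search(**overrides):
    return sorted(k for k, d in DOCS.items() if bool_match(d, **CLAUSES, **overrides))

print(search(minimum_should_match=1))  # ['a']
print(search())                        # ['a', 'b']
```

Leaving minimum_should_match out lets document b through, because with a must clause present the should clauses become optional.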

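2. bool filter

A bool filter behaves essentially like a bool query, except that it combines filters. In ES6 it is written as a bool query nested inside a filter clause. The bool filter of earlier versions did not support minimum_should_match and used a default of 1 instead; the example below likewise leaves it unset, so at least one should clause has to match.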

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "attendees": "david"
              }
            }
          ],
          "should": [
            {
              "term": {
                "attendees": "clint"
              }
            },
            {
              "term": {
                "attendees": "andy"
              }
            }
          ],
          "must_not": [
            {
              "range": {
                "date": {
                  "lt": "2013-06-30T00:00"
                }
              }
            }
          ]
        }
      }
    }
  }
}'

IV. Other Queries and Filters

1. range query and filter

(1) Query

# where created_on > 2012-06-01 and created_on < 2012-09-01
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "range": {
      "created_on": {
        "gt": "2012-06-01",
        "lt": "2012-09-01"
      }
    }
  }
}'

(2) Filter

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "created_on": {
            "gt": "2012-06-01",
            "lt": "2012-09-01"
          }
        }
      }
    }
  }
}'

The range query also supports string ranges. To find documents whose name falls between "c" and "e", use:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "range": {
      "name": {
        "gt": "c",
        "lt": "e"
      }
    }
  }
}'
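
A string range compares terms in dictionary order, which the following sketch imitates with Python string comparison. The term list is invented, and in_range is a hypothetical helper that covers the four bound names used by the range query.

```python
def in_range(term, gt=None, gte=None, lt=None, lte=None):
    """Check a term against optional exclusive (gt, lt) and inclusive (gte, lte) bounds."""
    if gt is not None and not term > gt:
        return False
    if gte is not None and not term >= gte:
        return False
    if lt is not None and not term < lt:
        return False
    if lte is not None and not term <= lte:
        return False
    return True

terms = ["big", "c", "clojure", "denver", "e", "elasticsearch"]
print([t for t in terms if in_range(t, gt="c", lt="e")])  # ['clojure', 'denver']
```

Note that "elasticsearch" sorts after "e" and is therefore outside the range, and that both bounds are exclusive with gt and lt.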

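When using a range query, consider carefully whether a filter would be the better choice. Because range membership is binary ("yes, the document is in range" or "no, it is not"), a range condition does not need to be a query; for better performance it should be a filter. If you are unsure which to use, use a filter. In 99% of use cases the range filter is the right choice.

2. prefix query and filter

The prefix query and filter search for terms that start with a given prefix. The prefix is not analyzed before the search. For example, to find all documents in the index whose title has a term starting with "liber", use: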

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "prefix": {
      "title": "liber"
    }
  }
}'

A filter works the same way:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "prefix": {
          "title": "liber"
        }
      }
    }
  }
}'

Because the prefix is not analyzed, prefix queries and filters are case-sensitive.

3. wildcard query

# Create the index by adding two documents
curl -XPOST '172.16.1.127:9200/wildcard-test/_doc/1?pretty' -H 'Content-Type: application/json' -d '
{
  "title":"The Best Bacon Ever"
}'

curl -XPOST '172.16.1.127:9200/wildcard-test/_doc/2?pretty' -H 'Content-Type: application/json' -d '
{
  "title":"How to raise a barn"
}'

# "ba*n" matches both bacon and barn
curl '172.16.1.127:9200/wildcard-test/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "title": {
        "wildcard": "ba*n"
      }
    }
  }
}'

# "ba?n" matches only barn, not bacon
curl '172.16.1.127:9200/wildcard-test/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "title": {
        "wildcard": "ba?n"
      }
    }
  }
}'

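Note that wildcard queries are not as lightweight as match and other queries. The earlier a wildcard (* or ?) appears in the query term, the more work ES has to do to find matches.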
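
The two patterns from the example can be checked against the indexed terms with Python's fnmatch module. This only approximates the wildcard query (fnmatch additionally supports [ ] character classes), and the term list assumes the two titles were lowercased and split into words.

```python
from fnmatch import fnmatchcase

# Terms assumed to be produced from the two titles indexed above.
terms = ["the", "best", "bacon", "ever", "how", "to", "raise", "a", "barn"]

def wildcard(pattern):
    """Return the terms that match the pattern: * is any run of characters, ? exactly one."""
    return [t for t in terms if fnmatchcase(t, pattern)]

print(wildcard("ba*n"))  # ['bacon', 'barn']
print(wildcard("ba?n"))  # ['barn']
```

The sketch tests every term against the pattern. A pattern with a leading wildcard gives the engine no fixed prefix to narrow the candidates, which is the cost the paragraph above warns about.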

4. exists filter

The exists filter restricts results to documents that have a value in a given field:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "location_event.geolocation"
        }
      }
    }
  }
}'

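5. missing filter

The missing filter finds documents that have no value in a field, or documents that received the default specified in the mapping (the null value, set through null_value). ES6 has no separate missing query; the same effect is achieved by wrapping exists in must_not, although a field populated through null_value counts as having a value for exists. To find documents that lack the reviews field, use the following filter: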

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "reviews"
        }
      }
    }
  }
}'
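
Which documents count as "missing" a field can be sketched as follows. The rule modeled here is that a field exists when it holds at least one non-null value, so null, an empty array, and an array of nulls all count as missing while an empty string counts as a value. The sample documents are invented and only top-level fields are handled.

```python
def field_exists(doc, field):
    """True when the field holds at least one non-null value."""
    value = doc.get(field)
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)

# Hypothetical documents.
docs = {
    "1": {"reviews": 4},
    "2": {"reviews": None},
    "3": {"reviews": []},
    "4": {"name": "no reviews field at all"},
    "5": {"reviews": ""},
}
missing = sorted(d for d, doc in docs.items() if not field_exists(doc, "reviews"))
print(missing)  # ['2', '3', '4']
```

Wrapping the check in not mirrors the must_not around exists in the request above.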

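6. Turning any query into a filter

ES lets you turn any query into a filter. Earlier versions used a dedicated query filter for this; in ES6 it is enough to place the query inside the filter clause of a bool query. For example, a query_string query that matches "Elasticsearch" in the name can be turned into a filter like this: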

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": {
        "query_string": {
          "query": "name:\"Elasticsearch\""
        }
      }
    }
  }
}'

V. Choosing the Best Query for the Job

Table 1 is a guide to which query to use for common ES use cases.

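Use case	Query type to use
You want to take user input from a Google-like interface and search documents with it	Use the simple_query_string query if you want to support +/- or searching in specific fields
You want to treat the input as a phrase and find documents containing it, possibly with some gaps (slop) between its words	To find phrases similar to what the user searched for, use the match_phrase query with some amount of slop
You want to search a not_analyzed field (keyword type in ES6) for a single keyword and know exactly how the word should appear	Use the term query, because the query term is not analyzed
You want to combine many different search requests or search types into a single search	Use the bool query to combine any number of subqueries into one query
You want to search for specific words across several fields of a document	Use the multi_match query, which behaves like match but searches multiple fields
You want to return all documents with one search	Use the match_all query to return every document in a single search
You want to search a field for values within a given range	Use the range query to find documents whose values fall within the range
You want to search a field for values that start with a given string	Use the prefix query to find terms that start with the given string
You want to offer autocomplete for a single keyword based on what the user has typed so far	Use the prefix query: send what the user has typed and get back the matches that start with that text
You want to find all documents that have no value in a given field	Use the missing filter (must_not with exists in ES6) to find documents that lack the field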