Elasticsearch in Practice (Part 5): Searching Data

wuxiaohao1128

Category: ElasticSearch. Tags: ES search; difference between getting a document by id and searching; how to search; multi-index, multi-type search

Article link: https://blog.csdn.net/wuxiaohao1128/article/details/95341150

I. Preface

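Search is what ElasticSearch is ultimately for. The simplest form is a GET request that carries the search conditions in the URL:
curl '192.168.160.128:9200/testindex/testtype/_search?q=name:zhangsan'
Where:
1. name is the field to search. With q=zhangsan alone, the query is not tied to one field: before ES 6.0 it runs against the _all field, and in 6.0 and later, where _all is disabled by default, it runs against all eligible fields.
2. To limit the number of results returned, use size, for example size=1. Without it, at most 10 hits are returned.
3. To restrict which fields come back, use fields on versions before 5.0. The fields parameter was removed in 5.0, so on 5.0 and later use _source instead, for example _source=name.
Note: a search in ES has three parts to pay attention to: where to search (which indices and types), what the response contains, and how to express the search. The sections below cover them in that order.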
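
The URI parameters above can be assembled programmatically. The following is a minimal sketch, not part of the original article: the helper name `build_uri_search` and its argument names are hypothetical, it uses `_source` rather than `fields` on the assumption that the cluster runs ES 5.0 or later, and it only builds the URL without sending anything.

```python
from urllib.parse import urlencode


def build_uri_search(host, index, doc_type, text, field=None,
                     size=None, source_fields=None):
    """Build a URI-search URL (hypothetical helper; no request is sent)."""
    # "field:text" restricts the query to one field; bare text searches all fields.
    q = f"{field}:{text}" if field else text
    params = {"q": q}
    if size is not None:
        params["size"] = size
    if source_fields:
        # Assumes ES 5.0+, where _source replaces the removed fields parameter.
        params["_source"] = ",".join(source_fields)
    # urlencode percent-encodes reserved characters such as ":".
    return f"http://{host}/{index}/{doc_type}/_search?{urlencode(params)}"


url = build_uri_search("192.168.160.128:9200", "testindex", "testtype",
                       "zhangsan", field="name", size=1,
                       source_fields=["name"])
print(url)
```

Percent-encoding the colon is equivalent to the literal `q=name:zhangsan` used in the curl command and is safer when the search text contains spaces or other reserved characters.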

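II. Which indices and types to search

You can search within one specific index and type.
You can search several types of the same index by separating the type names with commas.
You can search several indices, or all indices, by separating the index names with commas.
1. To search all indices, simply leave the index out: curl '192.168.160.128:9200/_search?q=zhangsan'
2. Alternatively, use the _all placeholder as the index name.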
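
As an illustration of the path rules above, here is a small sketch that builds the search path for any combination of indices and types. The function name `search_path` is hypothetical and the second index and type names in the usage lines are made up for the example; it only shows how the comma-separated lists and the _all placeholder fit together.

```python
def search_path(indices=None, types=None):
    """Return the _search path for the given indices and types (hypothetical helper)."""
    if not indices and not types:
        # No index at all means "search every index".
        return "/_search"
    # _all stands in for the index name when only types are given.
    index_part = ",".join(indices) if indices else "_all"
    if types:
        return f"/{index_part}/{','.join(types)}/_search"
    return f"/{index_part}/_search"


print(search_path(["testindex"], ["testtype"]))
print(search_path(["testindex", "otherindex"]))   # otherindex is a made-up name
print(search_path(types=["testtype", "othertype"]))
print(search_path())
```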

III. What the search response contains

A sample search response:

{
	"took": 10,
	"timed_out": false,
	"_shards": {
		"total": 10,
		"successful": 10,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 1,
		"max_score": 0.2876821,
		"hits": [{
			"_index": "testindex",
			"_type": "testtype",
			"_id": "1",
			"_score": 0.2876821,
			"_source": {
				"name": "zhangsan",
				"age": "25"
			}
		}]
	}
}

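The response has four parts: timing, shard information, hit statistics, and the result documents.

Timing:
1. took is the time the search took, in milliseconds.
2. timed_out indicates whether the search timed out. If it is true, the response contains only the results gathered before the timeout.
Shard information (under _shards):
1. total is the number of shards the search ran against (shards, not indices).
2. successful is the number of shards that executed the search successfully.
3. failed is the number of shards on which the search failed.
Hit statistics:
1. total is the total number of matching documents. It can differ from the number of documents actually returned in hits, which is capped by size (10 by default).
2. max_score is the highest score among the matching documents. The score measures how relevant a document is to the search conditions. Before ES 5.0 the default scoring algorithm was TF-IDF (term frequency-inverse document frequency); from 5.0 onward the default is BM25.
Result documents:
1. _index is the index the document belongs to.
2. _type is the type the document belongs to.
3. _id and _score are the document's id and its score.
4. _source holds all of the document's fields. On versions before 5.0, when the query specifies fields, the hit shows a fields section instead of _source; on 5.0 and later, use _source filtering to return only some of the fields.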
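
To make the four parts concrete, the sketch below parses the sample response shown above and pulls out the values a client typically checks. The function name `summarize` and the keys of the returned dictionary are hypothetical; the code assumes the response layout of the sample, where hits.total is a plain number.

```python
import json

# The sample response from this section, as a JSON string.
SAMPLE = """
{"took": 10, "timed_out": false,
 "_shards": {"total": 10, "successful": 10, "skipped": 0, "failed": 0},
 "hits": {"total": 1, "max_score": 0.2876821,
          "hits": [{"_index": "testindex", "_type": "testtype", "_id": "1",
                    "_score": 0.2876821,
                    "_source": {"name": "zhangsan", "age": "25"}}]}}
"""


def summarize(response):
    """Extract timing, shard health, and hit information (hypothetical helper)."""
    shards = response["_shards"]
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        # Results may be incomplete after a timeout or a shard failure.
        "partial": response["timed_out"] or shards["failed"] > 0,
        "total_hits": hits["total"],        # all matching documents
        "returned": len(hits["hits"]),      # documents actually in this response
        "top": [(h["_id"], h["_score"]) for h in hits["hits"]],
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```

Keeping total_hits and returned as separate values reflects the point made above: the total counts every match, while the hits array holds only the page that was requested.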

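IV. How to search

The simplest way to search is through the URI, with every search option placed in the URL. That becomes hard to manage for compound searches, so the query is usually placed in the request body instead. ES lets you specify all search conditions as JSON.
Why use a JSON query:
1. Putting every query condition into a single URL becomes harder and harder to handle as the query grows.
When searching, the topics that usually matter are query string options, getting documents by id, filters, and aggregations. You also need to understand the difference between getting a document by id and searching.
Using query string options
1. Example: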
```
curl -H "Content-Type: application/json" '192.168.160.128:9200/testindex/testtype/_search' -d '{"query":{"query_string":{"query":"zhangsan"}}}'
```
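Where:
1. query_string accepts more options than just the query text itself.
2. The example above searches all fields (_all). To search a specific field, set default_field.
3. If the query text contains several terms separated by spaces (for example "zhangsan lisi"), ES combines them with OR by default, that is, name=zhangsan OR name=lisi. To require all terms, set default_operator to AND: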

curl -H "Content-Type: application/json" '192.168.160.128:9200/testindex/testtype/_search' -d '{"query":{"query_string":{"query":"zhangsan lisi","default_field":"name","default_operator":"AND"}}}'
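
The request body in the curl command above can also be built as a dictionary and serialized, which avoids hand-written JSON quoting. This is a minimal sketch; the helper name `query_string_body` is hypothetical, while default_field and default_operator are the options discussed above.

```python
import json


def query_string_body(text, default_field=None, default_operator=None):
    """Build a query_string search body (hypothetical helper)."""
    options = {"query": text}
    if default_field:
        options["default_field"] = default_field
    if default_operator:
        # "OR" is the default; "AND" requires every term to match.
        options["default_operator"] = default_operator
    return {"query": {"query_string": options}}


body = query_string_body("zhangsan lisi", default_field="name",
                         default_operator="AND")
print(json.dumps(body))
```

The printed string can be passed to curl with -d exactly as in the command above.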

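Choosing a suitable query type:
1. The examples above use query_string, but there are many other query types. For example, term performs an exact match, such as requiring name to be exactly zhangsan.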
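
For comparison, here is a sketch of the bodies for a term query and a match query on the same field. The helper names are hypothetical. One point worth knowing: a term query does not analyze its input, so against a text field indexed with the standard analyzer (which lowercases tokens) the value has to be given in its indexed, lowercase form.

```python
import json


def term_query(field, value):
    """Exact match on the indexed term; the value is not analyzed."""
    return {"query": {"term": {field: value}}}


def match_query(field, text):
    """Full-text match; the text is analyzed before matching."""
    return {"query": {"match": {field: text}}}


print(json.dumps(term_query("name", "zhangsan")))
print(json.dumps(match_query("name", "Zhangsan")))
```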

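Using filters

The difference between a filter and a query is that a filter does not compute relevance scores, so filters run faster and are easier to cache.
Note: the filtered query was deprecated in ES 2.0 and removed in 5.0. On 5.0 and later, a request that uses it fails with the error below.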

curl -H "Content-Type: application/json" '192.168.160.128:9200/testindex/testtype/_search?pretty' -d '{"query":{"filtered":{"filter":{"term":{"name":"zhangsan"}}}}}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "no [query] registered for [filtered]",
        "line" : 1,
        "col" : 22
      }
    ],
    "type" : "parsing_exception",
    "reason" : "no [query] registered for [filtered]",
    "line" : 1,
    "col" : 22
  },
  "status" : 400
}

The fix is to use a bool query and put the non-scoring condition in its filter clause:

curl -H "Content-Type: application/json" '192.168.160.128:9200/testindex/testtype/_search?pretty' -d '{"query":{"bool":{"filter":{"term":{"name":"zhangsan"}}}}}'
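
A bool query can carry scoring conditions in must and non-scoring conditions in filter, which is how the old filtered query maps onto current versions. The sketch below builds such a body; the helper name `bool_query` and its argument names are hypothetical, and the age condition is only an illustration based on the sample document.

```python
import json


def bool_query(must=None, filter_=None):
    """Build a bool query body (hypothetical helper).

    must clauses contribute to the score; filter clauses only include
    or exclude documents and do not affect scoring.
    """
    clauses = {}
    if must:
        clauses["must"] = must
    if filter_:
        clauses["filter"] = filter_
    return {"query": {"bool": clauses}}


# Filter only: the direct replacement for filtered + filter.
filter_only = bool_query(filter_={"term": {"name": "zhangsan"}})

# Scored match on name, narrowed by a non-scoring condition on age.
combined = bool_query(must={"match": {"name": "zhangsan"}},
                      filter_={"term": {"age": "25"}})

print(json.dumps(filter_only))
print(json.dumps(combined))
```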

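Using aggregations

Note: since ES 5.0, fielddata is disabled by default on text fields. A terms aggregation on a text field such as name needs fielddata, an in-memory structure built by uninverting the inverted index, so it has to be enabled explicitly (or the aggregation has to target a keyword field instead). Otherwise the request fails with the following error: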

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "testindex",
        "node" : "M5mt6K1nTq24H0CMBFcvnA",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status" : 400
}

curl -H "Content-Type: application/json" '192.168.160.128:9200/testindex/_mapping/testtype' -d '{"properties":{"name":{"type":"text","fielddata":true}}}'

Once fielddata is enabled with the command above, the aggregation can be run:

curl -H "Content-Type: application/json" '192.168.160.128:9200/testindex/testtype/_search?pretty' -d '{"aggregations":{"names":{"terms":{"field":"name"}}}}'
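
The error message also suggests an alternative to enabling fielddata: aggregating on a keyword field. The sketch below builds a terms aggregation body for that case. It assumes the index has a name.keyword sub-field, which dynamic mapping creates for string values by default on ES 5.0 and later but which may not exist in a custom mapping. The helper name `terms_agg_body` is hypothetical.

```python
import json


def terms_agg_body(agg_name, field, hits_size=0):
    """Build a terms aggregation body (hypothetical helper).

    hits_size=0 asks ES to return only the aggregation, not the hits.
    """
    return {
        "size": hits_size,
        "aggregations": {agg_name: {"terms": {"field": field}}},
    }


# Assumption: name.keyword exists in the mapping of testindex.
agg_body = terms_agg_body("names", "name.keyword")
print(json.dumps(agg_body))
```

Because keyword fields store their values unanalyzed, the buckets would hold whole field values rather than individual analyzed tokens, and no fielddata needs to be loaded into memory.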


{
  "took" : 74,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "testindex",
        "_type" : "testtype",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "zhangsan",
          "age" : "25"
        }
      }
    ]
  },
  "aggregations" : {
    "names" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "zhangsan",
          "doc_count" : 1
        }
      ]
    }
  }
}

Use aggregations to drill down into the available data and obtain up-to-date statistics over it.
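
Reading the buckets out of the aggregation response is straightforward. The sketch below uses the aggregations part of the response shown above; the function name `bucket_counts` is hypothetical.

```python
# The aggregations section of the response shown above.
RESPONSE = {
    "aggregations": {
        "names": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{"key": "zhangsan", "doc_count": 1}],
        }
    }
}


def bucket_counts(response, agg_name):
    """Map each bucket key of a terms aggregation to its document count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


counts = bucket_counts(RESPONSE, "names")
print(counts)
```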

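Getting a document by id
1. Example:
```
curl '192.168.160.128:9200/testindex/testtype/1?pretty'
{
  "_index" : "testindex",
  "_type" : "testtype",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "zhangsan",
    "age" : "25"
  }
}
```
  Where:
  1. The response contains the index, type, id, version, and found (whether the document exists), plus _source when the document is found.
2. Getting a document by id is faster than a regular search and costs fewer resources.
3. Differences between getting a document by id and searching:
  1. Getting by id is real-time: as soon as an index operation completes, the new document can be fetched through the GET API.
  2. Search is near real-time: a new document becomes searchable only after the next refresh, which by default runs once per second.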
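
The following sketch shows how a client might handle a get-by-id response and where the refresh fits in. All helper names are hypothetical, and nothing is sent over the network. The `_refresh` endpoint is the standard way to force a refresh when a just-indexed document must be searchable immediately, for example in tests.

```python
def doc_path(index, doc_type, doc_id):
    """Path for a real-time GET of one document."""
    return f"/{index}/{doc_type}/{doc_id}"


def refresh_path(index):
    """Path of the refresh API; POSTing to it makes recent writes searchable."""
    return f"/{index}/_refresh"


def source_or_none(get_response):
    """Return the document body if found is true, otherwise None."""
    if get_response.get("found"):
        return get_response["_source"]
    return None


found = {"_index": "testindex", "_type": "testtype", "_id": "1",
         "_version": 1, "found": True,
         "_source": {"name": "zhangsan", "age": "25"}}
missing = {"_index": "testindex", "_type": "testtype", "_id": "2",
           "found": False}

print(doc_path("testindex", "testtype", 1))
print(refresh_path("testindex"))
print(source_or_none(found))
print(source_or_none(missing))
```

Forcing a refresh on every write is costly, so in normal operation it is usually better to rely on the default interval and use the GET API when a document has to be read right after it was indexed.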