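Normally, a query in ES returns only the first 10 results by default. When a query matches tens of thousands of documents, how do we retrieve all of them? We can raise the number of returned hits with size, but pulling a very large result set back in a single response is expensive.
The ES API provides scan and scroll for this case; they work somewhat like a cursor in a traditional database.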
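Method 1: use the scroll API provided by ES directly
Step 1: send the following GET request to the ES server. The content inside {} goes in the request body. Here, scroll=10m keeps the scroll open for 10 minutes.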
GET /old_index/_search?scroll=10m
{
"query": { "match_all": {}},
"size": 1000
}
After this request, the ES server responds with JSON similar to the following:
{"_scroll_id":"c2NhbjszOzM1MTpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzM1MDpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzIzNzpnV2JCMkQ1RVFBdV90d3ZJOEVhOTl3OzE7dG90YWxfaGl0czo2NjUyOw==","took":3,"timed_out":false,"_shards":{"total":3,"successful":3,"failed":0},"hits":{"total":6652,"max_score":0.0,"hits":[]}}
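In this response, _scroll_id is very important for the following steps; it plays the role of a cursor object in a traditional database. The hits array is empty in this sample, which matches the behaviour of search_type=scan in older ES versions, where the first response carries only the _scroll_id.
Step 2: send the following GET request to the server, passing the returned _scroll_id as the parameter. The second line goes in the request body.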
GET /_search/scroll?scroll=1m
c2NhbjszOzM1MTpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzM1MDpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzIzNzpnV2JCMkQ1RVFBdV90d3ZJOEVhOTl3OzE7dG90YWxfaGl0czo2NjUyOw==
After this request, the ES server responds with JSON similar to the following:
{"_scroll_id":"c2NhbjszOzM1MTpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzM1MDpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzIzNzpnV2JCMkQ1RVFBdV90d3ZJOEVhOTl3OzE7dG90YWxfaGl0czo2NjUyOw==","took":2,"timed_out":false,"_shards":{"total":3,"successful":3,"failed":0},"hits":{"total":101,"max_score":null,"hits":[{"_index":"old_index","_type":"3","_id":"AVCoH6dlYbq5kuCt6S7A","_score":1.0,"_source":{document}},{"_index":"old_index","_type":"3","_id":"AVCoH6dlYbq5kuCt6S7A","_score":1.0,"_source":{document}}]}}
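Looking closely, the _scroll_id returned by ES here is the same as the one we sent, which shows that it refers to the same scroll. This is not guaranteed, though: the ID can change between requests, so always pass the _scroll_id from the most recent response.
Step 3: repeat step 2 until hits comes back empty. At that point, all data for the query has been retrieved.
Step 4: delete the _scroll_id. The DELETE request looks like this: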
DELETE /_search/scroll
c2NhbjszOzM1MTpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzM1MDpvRkJrRHNWbFNiV2RPLVhlbWlYc1h3OzIzNzpnV2JCMkQ1RVFBdV90d3ZJOEVhOTl3OzE7dG90YWxfaGl0czo2NjUyOw==
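Steps 1 to 4 can also be driven from Python. The following is a minimal sketch, not a definitive implementation: scroll_all is a hypothetical helper name, and client is assumed to be an object exposing search(), scroll() and clear_scroll() with the keyword arguments shown, as the elasticsearch-py client does in the releases I am aware of (check the names against your installed version).

```python
def scroll_all(client, index, body, scroll="10m"):
    """Yield every hit for a query by following the scroll cursor.

    `client` is assumed to behave like an elasticsearch-py client.
    """
    # Step 1: open the scroll and remember the first _scroll_id.
    resp = client.search(index=index, body=body, scroll=scroll)
    scroll_id = resp["_scroll_id"]
    try:
        # A plain scroll returns the first batch here; with
        # search_type=scan this list is empty, so we do not stop on it.
        for hit in resp["hits"]["hits"]:
            yield hit
        while True:
            # Step 2: ask for the next batch with the latest _scroll_id.
            resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = resp["_scroll_id"]
            hits = resp["hits"]["hits"]
            # Step 3: an empty batch means everything has been read.
            if not hits:
                break
            for hit in hits:
                yield hit
    finally:
        # Step 4: release the scroll on the server.
        client.clear_scroll(scroll_id=scroll_id)
```

With a real client this might be called as scroll_all(es, "old_index", {"query": {"match_all": {}}, "size": 1000}), where es is an Elasticsearch client instance created elsewhere.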
Note:
The response to this scroll request includes the first batch of results. Although we specified a size of 1,000, we get back many more documents. When scanning, the size is applied to each shard, so you will get back a maximum of size * number_of_primary_shards documents in each batch.
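To make the note concrete, here is the arithmetic as a small sketch. max_docs_per_batch is a hypothetical helper name; the shard count of 3 is taken from _shards.total in the sample responses above and is assumed to equal the number of primary shards of old_index.

```python
def max_docs_per_batch(size, primary_shards):
    # When scanning, size is applied to each shard, so one batch can
    # hold up to size * number_of_primary_shards documents.
    return size * primary_shards


# size=1000 comes from the request body; 3 is _shards.total in the sample.
print(max_docs_per_batch(1000, 3))  # 3000
```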
Method 2: use the helpers.scan method provided by the Python client
Code that uses scan:
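from elasticsearch import helpers

# es is an Elasticsearch client instance; _body, _index and _doc_type
# are the query body, index name and document type defined by the caller.
scanResp = helpers.scan(es, _body, scroll="10m", index=_index, doc_type=_doc_type, timeout="10m")

# helpers.scan returns a generator that yields one hit at a time.
for resp in scanResp:
    print(resp)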
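Because helpers.scan yields hits one by one, it is often convenient to group them before further processing. The following is a sketch under assumptions: batched_sources is a hypothetical helper name, the batch size of 500 is an arbitrary default, and each hit is assumed to be a dict with a _source key, as in the sample response in Method 1.

```python
from itertools import islice


def batched_sources(hits, batch_size=500):
    """Group the _source of each hit into lists of at most batch_size."""
    sources = (hit["_source"] for hit in hits)
    while True:
        # Pull the next batch_size items without loading everything at once.
        batch = list(islice(sources, batch_size))
        if not batch:
            return
        yield batch
```

It accepts any iterable of hits, so the generator returned by helpers.scan can be passed in directly, for example: for batch in batched_sources(scanResp): ...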
Original article: https://blog.csdn.net/woshicheng1990/article/details/49452051