Preface
We use the search API all the time. By default it returns 10 records per request, and the from and size parameters let you change the page size and paginate. But when you need to return a large amount of data, you have to use scan and scroll. Used together, they let you retrieve a huge number of results from Elasticsearch efficiently, without paying the cost of deep pagination.
For details, see: https://es.xiaoleilu.com/060_Distributed_Search/20_Scan_and_scroll.html
Unlike the article linked above, this post describes the Python implementation.
Data Description
The index hz contains 29999 records in total. The code for bulk-importing the data is available at:
http://blog.csdn.net/xsdxs/article/details/72849796
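The linked post covers the full import; roughly speaking, the actions fed to elasticsearch.helpers.bulk are dicts of the following shape. This is a hypothetical sketch for illustration only: the field names in _source ("id", "name") are made up, not the actual fields of the hz index.

```python
def build_bulk_actions(num_docs, index='hz', doc_type='xyd'):
    # Build the list of action dicts that helpers.bulk consumes.
    # The _source fields here are placeholder values, not the real schema.
    actions = []
    for i in range(num_docs):
        actions.append({
            '_index': index,
            '_type': doc_type,
            '_source': {'id': i, 'name': 'doc_%d' % i}
        })
    return actions


actions = build_bulk_actions(29999)
print(len(actions))  # 29999
```

These actions would then be passed to helpers.bulk(es_client, actions) to populate the index in one batched call.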
Code Examples
The ES client code:
# -*- coding: utf-8 -*-
import elasticsearch

ES_SERVERS = [{
    'host': 'localhost',
    'port': 9200
}]

es_client = elasticsearch.Elasticsearch(
    hosts=ES_SERVERS
)
Search code using the search API:
# -*- coding: utf-8 -*-
from es_client import es_client


def search(search_offset, search_size):
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options, search_offset, search_size)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    result_items = es_result['hits']['hits']
    for item in result_items:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, search_offset, search_size, index='hz', doc_type='xyd'):
    es_result = es_client.search(
        index=index,
        doc_type=doc_type,
        body=es_search_options,
        from_=search_offset,
        size=search_size
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search(0, 1000)
    print len(final_results)
So far everything seems fine and it prints 1000 as expected. But now suppose the requirement changes and we want to fetch 20000 records:
if __name__ == '__main__':
    final_results = search(0, 20000)
This raises the following error:
elasticsearch.exceptions.TransportError: TransportError(500, u'search_phase_execution_exception', u'Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.')
Explanation: the search API can only page through the first 10000 results, i.e. from + size must not exceed the index.max_result_window setting, which defaults to 10000. That is why the request fails here.
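Conceptually, Elasticsearch performs a check like the following before executing the query. This is a simplified in-memory illustration of the rule, not the actual server code:

```python
MAX_RESULT_WINDOW = 10000  # default value of index.max_result_window


def check_result_window(from_, size, max_result_window=MAX_RESULT_WINDOW):
    # Mirrors the server-side rule: from + size must stay within the window.
    window = from_ + size
    if window > max_result_window:
        raise ValueError(
            'Result window is too large, from + size must be less than or '
            'equal to: [%d] but was [%d]' % (max_result_window, window))


check_result_window(0, 1000)  # within the window, no error
try:
    check_result_window(0, 20000)  # exceeds the window, raises
except ValueError as e:
    print(e)
```

As the error message itself notes, the limit can be raised by changing index.max_result_window on the index, but for large result sets the scroll API is the recommended approach.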
Without further ado, here is the implementation based on scan and scroll:
# -*- coding: utf-8 -*-
from es_client import es_client
from elasticsearch import helpers


def search():
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    for item in es_result:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, scroll='5m', index='hz', doc_type='xyd', timeout="1m"):
    es_result = helpers.scan(
        client=es_client,
        query=es_search_options,
        scroll=scroll,
        index=index,
        doc_type=doc_type,
        timeout=timeout
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search()
    print len(final_results)
The output shows that all 29999 records were retrieved.
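Under the hood, helpers.scan wraps the scroll loop for you: it issues an initial search with a scroll timeout, then repeatedly calls the scroll API with the returned _scroll_id until a batch comes back empty. Its behavior can be sketched with a plain in-memory analogy, with no network calls (the batch slicing stands in for one scroll round-trip):

```python
def scan_like(all_hits, batch_size=1000):
    # In-memory analogy of helpers.scan: fetch one batch per "scroll" round
    # and yield hits one at a time until a round returns no hits.
    offset = 0
    while True:
        batch = all_hits[offset:offset + batch_size]  # one scroll round-trip
        if not batch:
            break  # empty batch means the scroll is exhausted
        for hit in batch:
            yield hit
        offset += batch_size


hits = [{'_source': {'id': i}} for i in range(29999)]
results = [hit['_source'] for hit in scan_like(hits)]
print(len(results))  # 29999
```

Because scan_like is a generator, results are streamed batch by batch rather than materialized all at once, which is exactly why scroll avoids the memory and sorting cost of deep from/size pagination.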