1.Introduction
Elasticsearch provides three ways to page through or traverse search results: from/size, scroll, and search_after.
2.from/size
(1).query
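Search for documents whose job field matches "Java engineer", starting at the second hit (from is a zero-based offset, so from 1 skips the first hit) and returning 2 hits.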
POST /employee/_search
{
"query": {
"match": {
"job": "Java engineer"
}
},
"from": 1,
"size": 2
}
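- from: zero-based offset of the first hit to return
- size: maximum number of hits to return
Note that the sample response below lists all four matching documents, which is what the query returns without from and size; with from 1 and size 2, only the second and third hits would come back.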
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : 0.6931472,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "wol1hnsBEsHOdz1YHsp3",
"_score" : 0.6931472,
"_source" : {
"name" : "Stephen Curry",
"job" : "Java engineer",
"age" : 27,
"salary" : 20000.0,
"birthday" : "1995-08-06"
}
},
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "wYl0hnsBEsHOdz1Y4cqT",
"_score" : 0.47000363,
"_source" : {
"name" : "James Harden",
"job" : "Java engineer",
"age" : 31,
"salary" : 30000.0,
"birthday" : "1991-01-01"
}
},
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "xol1hnsBEsHOdz1YXcqt",
"_score" : 0.47000363,
"_source" : {
"name" : "Chirs Paul",
"job" : "Java engineer",
"age" : 33,
"salary" : 29000.0,
"birthday" : "1988-12-02"
}
},
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "x4l1hnsBEsHOdz1YbMo-",
"_score" : 0.2876821,
"_source" : {
"name" : "Jason Tatum",
"job" : "Java engineer",
"age" : 24,
"salary" : 15000.0,
"birthday" : "1997-08-02"
}
}
]
}
}
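As a small illustration of how a user-facing page number maps onto these two parameters, the following Python sketch builds the request body for a given page. The helper name `page_request`, the 1-based page convention, and the guard value of 10000 (the Elasticsearch default for index.max_result_window) are assumptions made for this example and are not part of the original request.

```python
import json

MAX_RESULT_WINDOW = 10000  # Elasticsearch default for index.max_result_window


def page_request(query, page, size, max_window=MAX_RESULT_WINDOW):
    """Build a from/size search body for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size  # "from" is a zero-based offset
    if offset + size > max_window:
        raise ValueError("from + size exceeds index.max_result_window")
    return {"query": query, "from": offset, "size": size}


body = page_request({"match": {"job": "Java engineer"}}, page=2, size=2)
print(json.dumps(body))
```

Rejecting out-of-range pages on the client side is only one possible design; the cluster would reject such a request anyway.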
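(2).Deep pagination
If a request uses from 990 and size 10, that is, it asks for the hits at positions 990 to 1000, Elasticsearch first collects the top 1000 documents on every shard. The coordinating node then merges the results from all shards (1000 × the number of shards), sorts them, and keeps only the 10 requested documents.
The deeper from goes, the more documents must be processed, the more memory is used, and the longer the request takes. To guard against this, Elasticsearch limits from + size through index.max_result_window, which defaults to 10000.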
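The cost described above can be made concrete with a toy model. The sketch below is not Elasticsearch code: the shard contents are random made-up scores, and the function names `shard_top` and `coordinate` are hypothetical. It only imitates the "each shard returns from + size candidates, the coordinating node merges and slices" flow.

```python
import heapq
import random


def shard_top(docs, n):
    """Each shard returns its own top-n scores (n = from + size)."""
    return heapq.nlargest(n, docs)


def coordinate(shards, offset, size):
    """Merge per-shard candidates, sort globally, keep only the requested page."""
    candidates = []
    for docs in shards:
        candidates.extend(shard_top(docs, offset + size))
    candidates.sort(reverse=True)
    return candidates[offset:offset + size], len(candidates)


random.seed(0)
# 5 shards holding 2000 hypothetical scores each
shards = [[random.random() for _ in range(2000)] for _ in range(5)]
page, examined = coordinate(shards, offset=990, size=10)
print(len(page), examined)  # 10 documents returned, 5000 candidates merged
```

In this model, 5000 candidates are merged to return 10 documents, and the number of merged candidates grows linearly with from.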
3.scroll
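(1).Introduction
The scroll API traverses a document set and avoids the deep pagination problem by working on a snapshot. It has the following characteristics.
- It is not suitable for real-time search, because the snapshot does not reflect later changes
- Avoid complex sort conditions where possible; sorting by _doc is the most efficient
- It is somewhat more complicated to use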
(2).query
a:Send the initial scroll search request
When Elasticsearch receives the request, it creates a snapshot of the matching document IDs based on the query.
POST /employee/_search?scroll=5m
{
"size": 2
}
{
"_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAOM0FjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjNRYwSkRvU0JBVlFyMjlaQjFteXRTUWd3AAAAAAAA4zYWMEpEb1NCQVZRcjI5WkIxbXl0U1FndwAAAAAAAOMyFjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjMxYwSkRvU0JBVlFyMjlaQjFteXRTUWd3",
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "wol1hnsBEsHOdz1YHsp3",
"_score" : 1.0,
"_source" : {
"name" : "Stephen Curry",
"job" : "Java engineer",
"age" : 27,
"salary" : 20000.0,
"birthday" : "1995-08-06"
}
},
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "xYl1hnsBEsHOdz1YTsrN",
"_score" : 1.0,
"_source" : {
"name" : "Kevin Durant",
"job" : "Vue engineer",
"age" : 30,
"salary" : 28000.0,
"birthday" : "1992-05-01"
}
}
]
}
}
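- scroll: how long the scroll snapshot is kept alive
- size: number of documents returned by each scroll call
- _scroll_id: the identifier to pass to subsequent scroll API calls
b:Call the scroll API
Call it repeatedly, passing the most recently returned _scroll_id each time, until the returned hits.hits array is empty; the pages collected along the way make up the full document set.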
POST /_search/scroll
{
"scroll": "5m",
"scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAqFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALRZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3AAAAAAAAAC4WVndfQnNqUDVSUmVldWV4OE42RG15dwAAAAAAAAArFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALBZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3"
}
{
"_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAOM0FjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjNRYwSkRvU0JBVlFyMjlaQjFteXRTUWd3AAAAAAAA4zYWMEpEb1NCQVZRcjI5WkIxbXl0U1FndwAAAAAAAOMyFjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjMxYwSkRvU0JBVlFyMjlaQjFteXRTUWd3",
"took" : 9,
"timed_out" : false,
"terminated_early" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "wYl0hnsBEsHOdz1Y4cqT",
"_score" : 1.0,
"_source" : {
"name" : "James Harden",
"job" : "Java engineer",
"age" : 31,
"salary" : 30000.0,
"birthday" : "1991-01-01"
}
},
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "w4l1hnsBEsHOdz1YM8pq",
"_score" : 1.0,
"_source" : {
"name" : "LeBron James",
"job" : "Technical director",
"age" : 35,
"salary" : 50000.0,
"birthday" : "1987-12-25"
}
}
]
}
}
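c:Delete the snapshot
Every open scroll holds resources on the cluster, so too many of them consume a lot of memory. Use the clear scroll API to delete snapshots that are no longer needed.
DELETE /_search/scroll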
{
"scroll_id": ["DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAqFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALRZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3AAAAAAAAAC4WVndfQnNqUDVSUmVldWV4OE42RG15dwAAAAAAAAArFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALBZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3"]
}
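The three steps above (start, iterate, clear) can be sketched as a single Python generator. This is a minimal sketch under stated assumptions: `start`, `fetch_next` and `clear` are hypothetical callables standing in for the three HTTP requests, and `FakeCluster` is an in-memory stand-in used only so the example runs without a cluster.

```python
def scroll_all(start, fetch_next, clear):
    """Yield every hit: start a scroll, page until hits.hits is empty, then clear.

    start, fetch_next and clear are hypothetical callables standing in for
    POST /<index>/_search?scroll=5m, POST /_search/scroll and
    DELETE /_search/scroll respectively.
    """
    response = start()
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = fetch_next(scroll_id)
            scroll_id = response["_scroll_id"]  # always reuse the latest id
    finally:
        clear(scroll_id)  # release the snapshot even if iteration stops early


class FakeCluster:
    """In-memory stand-in for the three scroll endpoints (illustration only)."""

    def __init__(self, docs, size):
        self.docs, self.size, self.pos, self.cleared = docs, size, 0, []

    def _page(self):
        hits = self.docs[self.pos:self.pos + self.size]
        self.pos += len(hits)
        return {"_scroll_id": "fake-id", "hits": {"hits": hits}}

    def start(self):
        return self._page()

    def fetch_next(self, scroll_id):
        return self._page()

    def clear(self, scroll_id):
        self.cleared.append(scroll_id)


cluster = FakeCluster([{"_id": str(i)} for i in range(7)], size=2)
ids = [hit["_id"] for hit in scroll_all(cluster.start, cluster.fetch_next, cluster.clear)]
print(ids, cluster.cleared)
```

Putting the clear call in a `finally` block is a design choice of this sketch, so that the snapshot is released even when the consumer stops reading early.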
4.search_after
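(1).Introduction
search_after avoids the performance problems of deep pagination and fetches the next page of documents in real time. It has the following characteristics.
- Drawback: the from parameter cannot be used, so you cannot jump to an arbitrary page
- You can only move to the next page, not back to the previous one
- It is simple to use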
(2).query
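a:Run the first search with a sort whose combined values are unique for every document (here age, with _id as the tiebreaker).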
POST /employee/_search
{
"size": 2,
"sort": [
{"age": "desc"},
{"_id": "desc"}
]
}
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "w4l1hnsBEsHOdz1YM8pq",
"_score" : null,
"_source" : {
"name" : "LeBron James",
"job" : "Technical director",
"age" : 35,
"salary" : 50000.0,
"birthday" : "1987-12-25"
},
"sort" : [
35,
"w4l1hnsBEsHOdz1YM8pq"
]
},
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "xol1hnsBEsHOdz1YXcqt",
"_score" : null,
"_source" : {
"name" : "Chirs Paul",
"job" : "Java engineer",
"age" : 33,
"salary" : 29000.0,
"birthday" : "1988-12-02"
},
"sort" : [
33,
"xol1hnsBEsHOdz1YXcqt"
]
}
]
}
}
b:Query again, passing the sort values of the last document from the previous step as search_after.
POST /employee/_search
{
"size": 1,
"search_after": [33, "xol1hnsBEsHOdz1YXcqt"],
"sort": [
{"age": "desc"},
{"_id": "desc"}
]
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "wYl0hnsBEsHOdz1Y4cqT",
"_score" : null,
"_source" : {
"name" : "James Harden",
"job" : "Java engineer",
"age" : 31,
"salary" : 30000.0,
"birthday" : "1991-01-01"
},
"sort" : [
31,
"wYl0hnsBEsHOdz1Y4cqT"
]
}
]
}
}
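Copying the sort values by hand is error-prone, so a client would normally derive the next request from the previous response. The sketch below shows one way to do that in Python; the helper name `next_page_body` is hypothetical, and `first_page` is a trimmed copy of the response from step a.

```python
def next_page_body(previous_response, size, sort):
    """Build the follow-up request from the last hit of the previous page."""
    hits = previous_response["hits"]["hits"]
    if not hits:
        return None  # an empty page means there is nothing left to fetch
    return {"size": size, "search_after": hits[-1]["sort"], "sort": sort}


sort = [{"age": "desc"}, {"_id": "desc"}]
# Trimmed copy of the step a response: only the fields this helper reads
first_page = {
    "hits": {
        "hits": [
            {"_id": "w4l1hnsBEsHOdz1YM8pq", "sort": [35, "w4l1hnsBEsHOdz1YM8pq"]},
            {"_id": "xol1hnsBEsHOdz1YXcqt", "sort": [33, "xol1hnsBEsHOdz1YXcqt"]},
        ]
    }
}
body = next_page_body(first_page, size=1, sort=sort)
print(body["search_after"])  # [33, 'xol1hnsBEsHOdz1YXcqt']
```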
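(3).How it avoids deep pagination
Because the unique sort value pinpoints where the previous page ended, each shard only has to return the size documents that follow it, so the number of documents processed per request stays within size on every shard, no matter how deep the page is.
[Diagram: search_after locating the next page by its unique sort value]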
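The bounded cost can be illustrated with a toy model in Python. This is not how Elasticsearch is implemented internally: the shard contents are made-up (age, _id) sort keys, and `shard_after` and `search_after` are hypothetical names. The point is only that each shard contributes at most size candidates per request.

```python
import heapq


def shard_after(docs, cursor, size):
    """Return up to `size` sort keys that come after the cursor on one shard.

    Keys are (age, _id) tuples sorted descending, so "after" means smaller.
    """
    eligible = [d for d in docs if cursor is None or d < cursor]
    return heapq.nlargest(size, eligible)


def search_after(shards, cursor, size):
    """Merge at most `size` candidates per shard and keep the global top `size`."""
    candidates = []
    for docs in shards:
        candidates.extend(shard_after(docs, cursor, size))
    return heapq.nlargest(size, candidates), len(candidates)


# Hypothetical (age, _id) sort keys spread over three shards
shards = [
    [(35, "a"), (27, "d"), (24, "g")],
    [(33, "b"), (31, "c")],
    [(30, "e"), (28, "f")],
]

cursor, pages, max_examined = None, [], 0
while True:
    page, examined = search_after(shards, cursor, size=2)
    if not page:
        break
    max_examined = max(max_examined, examined)
    pages.append(page)
    cursor = page[-1]  # unique sort key of the last hit on this page
print(pages)
print(max_examined)  # never exceeds size * number of shards
```

Unlike the from/size case, the number of merged candidates here does not grow as the traversal goes deeper.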
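5.Summary
Type | Use case |
---|---|
from/size | Fetch the top few documents in real time, with free navigation between pages |
scroll | Fetch all documents, for example to export all data |
search_after | Fetch all documents, without free navigation between pages |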