7.2 Elasticsearch Internals: Pagination and Traversal

1. Introduction
Elasticsearch provides three ways to page through and traverse results: from/size, scroll, and search_after.

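2. from/size
(1). Query
Find documents whose job field matches "Java engineer", skip the first hit, and return the next 2. Because from is a 0-based offset, from: 1 starts at the second hit.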

POST /employee/_search
{
  "query": {
    "match": {
      "job": "Java engineer"
    }
  },
  "from": 1,
  "size": 2
}
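  • from: offset of the first hit to return (0-based, defaults to 0)
  • size: number of hits to return (defaults to 10)
The sample response below lists all four matching documents, which is the result without from/size; with from: 1 and size: 2 at most two hits are returned.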
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "wol1hnsBEsHOdz1YHsp3",
        "_score" : 0.6931472,
        "_source" : {
          "name" : "Stephen Curry",
          "job" : "Java engineer",
          "age" : 27,
          "salary" : 20000.0,
          "birthday" : "1995-08-06"
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "wYl0hnsBEsHOdz1Y4cqT",
        "_score" : 0.47000363,
        "_source" : {
          "name" : "James Harden",
          "job" : "Java engineer",
          "age" : 31,
          "salary" : 30000.0,
          "birthday" : "1991-01-01"
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "xol1hnsBEsHOdz1YXcqt",
        "_score" : 0.47000363,
        "_source" : {
          "name" : "Chirs Paul",
          "job" : "Java engineer",
          "age" : 33,
          "salary" : 29000.0,
          "birthday" : "1988-12-02"
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "x4l1hnsBEsHOdz1YbMo-",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "Jason Tatum",
          "job" : "Java engineer",
          "age" : 24,
          "salary" : 15000.0,
          "birthday" : "1997-08-02"
        }
      }
    ]
  }
}
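
To make the relationship between a page number and from/size concrete, here is a minimal sketch that builds the request body for a 1-based page. The helper name page_request and its max_window parameter (set to 10000 to mirror the default index.max_result_window) are assumptions for illustration and are not part of any Elasticsearch client.

```python
def page_request(query, page, size, max_window=10000):
    """Build a from/size search body for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be >= 1")
    offset = (page - 1) * size  # `from` is a 0-based offset, not a page number
    if offset + size > max_window:
        raise ValueError("from + size exceeds the result window")
    return {"query": query, "from": offset, "size": size}


# Page 2 with 2 hits per page skips the first 2 hits.
body = page_request({"match": {"job": "Java engineer"}}, page=2, size=2)
print(body["from"], body["size"])  # 2 2
```

The resulting dict can be sent as the JSON body of POST /employee/_search.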

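(2). Deep pagination
Suppose a request has from set to 990 and size set to 10, i.e. it wants the hits ranked 991 to 1000. Elasticsearch first collects the top 1000 (from + size) hits on every shard. The coordinating node then merges the results from all shards, sorts them again, discards the first 990, and returns the remaining 10.
The deeper from goes, the more documents have to be processed, the more memory is used, and the longer the request takes. To guard against deep pagination, Elasticsearch caps from + size with the index.max_result_window setting, which defaults to 10000.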
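
The cost described above can be reproduced with a small in-memory simulation. This is only a sketch under simplifying assumptions: each shard is a plain list of numeric scores, and the function from_size_search is a made-up name, not Elasticsearch code.

```python
import heapq


def from_size_search(shards, offset, size):
    """Simulate a from/size request; each shard is a list of scores."""
    window = offset + size
    # Query phase: every shard returns its own top (from + size) hits.
    per_shard = [heapq.nlargest(window, shard) for shard in shards]
    collected = sum(len(hits) for hits in per_shard)
    # Coordinating node: merge all candidates, sort, then drop the first `from`.
    merged = sorted((s for hits in per_shard for s in hits), reverse=True)
    return merged[offset:window], collected


# 5 shards holding 2000 documents each, with scores 0..9999 spread across them.
shards = [list(range(i, 10000, 5)) for i in range(5)]
page, collected = from_size_search(shards, offset=990, size=10)
print(page)       # the 10 hits at global positions 990..999
print(collected)  # 5000 (5 shards x 1000 hits each)
```

Only 10 hits are returned, yet the coordinating node had to hold 5000 candidates, and that number grows with both from and the shard count.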

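3. scroll
(1). Introduction
The scroll API traverses a result set. It avoids the deep pagination problem by working against a snapshot, and has the following characteristics.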

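  • Not suitable for real-time search, because the snapshot does not reflect later changes to the data
  • Avoid complex sort conditions where possible; sorting by _doc is the most efficient
  • Slightly more complicated to use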

(2). Query
a: Send the initial scroll search request
On receiving the request, Elasticsearch creates a snapshot of the matching document IDs according to the query.

POST /employee/_search?scroll=5m
{
	"size": 2
}
{
  "_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAOM0FjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjNRYwSkRvU0JBVlFyMjlaQjFteXRTUWd3AAAAAAAA4zYWMEpEb1NCQVZRcjI5WkIxbXl0U1FndwAAAAAAAOMyFjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjMxYwSkRvU0JBVlFyMjlaQjFteXRTUWd3",
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "wol1hnsBEsHOdz1YHsp3",
        "_score" : 1.0,
        "_source" : {
          "name" : "Stephen Curry",
          "job" : "Java engineer",
          "age" : 27,
          "salary" : 20000.0,
          "birthday" : "1995-08-06"
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "xYl1hnsBEsHOdz1YTsrN",
        "_score" : 1.0,
        "_source" : {
          "name" : "Kevin Durant",
          "job" : "Vue engineer",
          "age" : 30,
          "salary" : 28000.0,
          "birthday" : "1992-05-01"
        }
      }
    ]
  }
}
  • scroll: how long the scroll snapshot is kept alive
  • size: number of documents returned by each scroll call
  • _scroll_id: the ID that subsequent scroll API calls must pass

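b: Call the scroll API
Keep calling it, passing the _scroll_id from the most recent response, until the returned hits.hits array is empty. At that point the whole result set has been fetched.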

POST /_search/scroll
{
	"scroll": "5m",
	"scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAqFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALRZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3AAAAAAAAAC4WVndfQnNqUDVSUmVldWV4OE42RG15dwAAAAAAAAArFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALBZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3"
}
{
  "_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAOM0FjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjNRYwSkRvU0JBVlFyMjlaQjFteXRTUWd3AAAAAAAA4zYWMEpEb1NCQVZRcjI5WkIxbXl0U1FndwAAAAAAAOMyFjBKRG9TQkFWUXIyOVpCMW15dFNRZ3cAAAAAAADjMxYwSkRvU0JBVlFyMjlaQjFteXRTUWd3",
  "took" : 9,
  "timed_out" : false,
  "terminated_early" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "wYl0hnsBEsHOdz1Y4cqT",
        "_score" : 1.0,
        "_source" : {
          "name" : "James Harden",
          "job" : "Java engineer",
          "age" : 31,
          "salary" : 30000.0,
          "birthday" : "1991-01-01"
        }
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "w4l1hnsBEsHOdz1YM8pq",
        "_score" : 1.0,
        "_source" : {
          "name" : "LeBron James",
          "job" : "Technical director",
          "age" : 35,
          "salary" : 50000.0,
          "birthday" : "1987-12-25"
        }
      }
    ]
  }
}

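c: Delete the snapshot
Every open scroll holds resources on the cluster, so too many of them consume a lot of memory. Use the clear scroll API to delete snapshots that are no longer needed.

DELETE /_search/scroll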
{
	"scroll_id": ["DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAAqFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALRZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3AAAAAAAAAC4WVndfQnNqUDVSUmVldWV4OE42RG15dwAAAAAAAAArFlZ3X0JzalA1UlJlZXVleDhONkRteXcAAAAAAAAALBZWd19Cc2pQNVJSZWV1ZXg4TjZEbXl3"]
}
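
The three scroll steps (open, iterate, clear) can be tied together in one loop. The following is a sketch, not an official client: send(method, path, body) is a hypothetical transport supplied by the caller that performs the HTTP request and returns the decoded JSON, and scroll_all is an illustrative name. The clear step uses the DELETE /_search/scroll endpoint.

```python
def scroll_all(send, index, page_size=1000, keep_alive="5m"):
    """Yield every hit of `index` through the scroll API.

    `send(method, path, body)` is a hypothetical, caller-supplied transport
    that returns the decoded JSON response as a dict.
    """
    resp = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "sort": ["_doc"]})
    scroll_id = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            # Always pass the most recent _scroll_id; it may change between calls.
            resp = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the snapshot instead of waiting for it to expire.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Because the function is a generator, the finally block also runs when the caller stops iterating early and closes it, so the snapshot is released in both cases.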

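4. search_after
(1). Introduction
search_after avoids the performance problems of deep pagination and fetches the next page of documents in real time. It has the following characteristics.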

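  • Drawback: the from parameter cannot be used, so you cannot jump to an arbitrary page
  • You can only move to the next page, not back to the previous one
  • Simple to use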

(2). Query
a: Search with an explicit sort whose values are guaranteed to be unique (here _id acts as the tiebreaker).

POST /employee/_search
{
	"size": 2,
	"sort": {
		"age": "desc",
		"_id": "desc"
	}
}
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "w4l1hnsBEsHOdz1YM8pq",
        "_score" : null,
        "_source" : {
          "name" : "LeBron James",
          "job" : "Technical director",
          "age" : 35,
          "salary" : 50000.0,
          "birthday" : "1987-12-25"
        },
        "sort" : [
          35,
          "w4l1hnsBEsHOdz1YM8pq"
        ]
      },
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "xol1hnsBEsHOdz1YXcqt",
        "_score" : null,
        "_source" : {
          "name" : "Chirs Paul",
          "job" : "Java engineer",
          "age" : 33,
          "salary" : 29000.0,
          "birthday" : "1988-12-02"
        },
        "sort" : [
          33,
          "xol1hnsBEsHOdz1YXcqt"
        ]
      }
    ]
  }
}

b: Query again, passing the sort values of the last document from the previous step as search_after.

POST /employee/_search
{
	"size": 1,
	"search_after": [33, "xol1hnsBEsHOdz1YXcqt"],
	"sort": {
		"age": "desc",
		"_id": "desc"
	}
}
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "employee",
        "_type" : "_doc",
        "_id" : "wYl0hnsBEsHOdz1Y4cqT",
        "_score" : null,
        "_source" : {
          "name" : "James Harden",
          "job" : "Java engineer",
          "age" : 31,
          "salary" : 30000.0,
          "birthday" : "1991-01-01"
        },
        "sort" : [
          31,
          "wYl0hnsBEsHOdz1Y4cqT"
        ]
      }
    ]
  }
}
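
Repeating step b until no hits come back walks the whole result set. The loop below is a minimal sketch of that idea: search(body) is a hypothetical callable that sends body to POST /employee/_search and returns the decoded JSON, and search_after_all is an illustrative name.

```python
def search_after_all(search, sort, page_size=2):
    """Yield hits page by page using search_after.

    `search(body)` is a hypothetical, caller-supplied callable that returns
    the decoded JSON response. `sort` must end with a unique tiebreaker so
    that no two hits share the same sort values.
    """
    body = {"size": page_size, "sort": sort}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume strictly after the last hit of this page.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Each request carries only the previous page's last sort values, so no state has to be kept on the cluster between calls.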

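(3). How it avoids deep pagination
The unique sort values locate exactly where the previous page ended, so each shard only has to collect size documents per request, however deep the traversal goes.
[Figure: search_after locating the next page by its unique sort values]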
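
The effect can be sketched with an in-memory simulation in which every shard is a list of unique sort keys. The function name search_after_page is made up for illustration, and a real shard would locate the cursor through its index structures rather than by filtering a list.

```python
import heapq


def search_after_page(shards, size, after=None):
    """Simulate one search_after request over shards of unique sort keys."""
    per_shard = []
    for shard in shards:
        # Each shard ignores keys at or before the cursor, then keeps `size` hits.
        candidates = shard if after is None else [k for k in shard if k < after]
        per_shard.append(heapq.nlargest(size, candidates))
    collected = sum(len(hits) for hits in per_shard)
    # Coordinating node: merge at most (shards x size) candidates.
    merged = sorted((k for hits in per_shard for k in hits), reverse=True)
    return merged[:size], collected


# 5 shards holding 2000 documents each, with keys 0..9999 spread across them.
shards = [list(range(i, 10000, 5)) for i in range(5)]
page, collected = search_after_page(shards, size=10, after=9010)
print(page)       # the 10 keys that follow 9010 in descending order
print(collected)  # 50 (5 shards x 10 hits each)
```

However far the cursor has advanced, the coordinating node merges at most size hits per shard.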

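5. Summary

Type          Scenario
from/size     Fetch the top portion of the results in real time, with free page navigation
scroll        Need all documents, e.g. an export-all-data feature
search_after  Need all documents, without free page navigation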