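_count and an unfiltered _search report different totals (hits.total.value)
Today I ran into an odd problem that I had never noticed before, so I am writing it down. Here is what happened: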
Request:
POST doc_1/_search
{
"query": {
"match_all": {}
}
}
Response:
{
......,
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
......
},
......
}
Request:
POST doc_1/_count
Response:
{
"count" : 1700000,
......
}
In Elasticsearch, the _count API and the _search API behave differently:
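- What the _count API returns: POST doc_1/_count returns a count value, which is the actual number of documents in the index that match the query (with no query given, every document matches). Here count is 1,700,000, so the doc_1 index really contains 1,700,000 documents.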
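- What the _search API returns (hits.total.value): POST doc_1/_search {"query": {"match_all": {}}} runs with the default from and size parameters (0 and 10), which only control which hits are returned, not how many are counted. The cap comes from track_total_hits: unless it is set explicitly, _search counts hits accurately only up to 10,000 in order to optimize performance.
  - In this case, Elasticsearch returns hits.total.value as 10,000 together with relation: "gte", meaning the total number of matching documents is greater than or equal to 10,000.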
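To make the difference concrete, here is a minimal Python sketch that reads hits.total from a parsed _search response and tells whether the number is exact or only a lower bound. The helper names describe_total and is_consistent are hypothetical, the sample dicts simply mirror the two responses shown earlier (other fields left out), and the code assumes the response shape where hits.total is an object with value and relation.

```python
def describe_total(search_response):
    """Return (value, is_exact) from a parsed _search response body."""
    total = search_response["hits"]["total"]
    # relation "eq" means the value is exact; "gte" means it is a lower bound.
    return total["value"], total["relation"] == "eq"


def is_consistent(search_response, count_response):
    """Check that a _search total does not contradict a _count result."""
    value, is_exact = describe_total(search_response)
    count = count_response["count"]
    # An exact total must equal the count; a lower bound must not exceed it.
    return value == count if is_exact else count >= value


# Sample data mirroring the responses shown earlier (other fields elided).
search_response = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
count_response = {"count": 1700000}

print(describe_total(search_response))                  # (10000, False)
print(is_consistent(search_response, count_response))   # True
```

Read this way, the two numbers do not actually disagree: 10,000 with "gte" is a lower bound, and 1,700,000 satisfies it.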
Solution
To get the actual total number of matching documents, add the track_total_hits parameter to the search request, for example set to true:
POST doc_1/_search
{
"query": {
"match_all": {}
},
"track_total_hits": true
}
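If the request is built from code rather than typed into the console, the same body can be assembled as a plain dictionary. This is only a sketch using the Python standard library: the function name exact_count_body is hypothetical, and adding "size": 0 (return no hits, only the total) is my own assumption about the use case, not something the request above requires.

```python
import json


def exact_count_body(query=None):
    """Build a _search body that asks for an exact total and no hits."""
    return {
        "query": query if query is not None else {"match_all": {}},
        "track_total_hits": True,  # count every match instead of stopping early
        "size": 0,                 # assumption: only the total is needed
    }


body = exact_count_body()
# The JSON printed here is what would be sent as the body of POST doc_1/_search.
print(json.dumps(body, indent=2))
```

When only the number matters and no hits are needed, _count remains the more direct choice; track_total_hits is useful when the same request must return both hits and an exact total.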
Official explanation
Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as “there are at least 10000 hits”, the default is set to 10,000. This means that requests will count the total hit accurately up to 10,000 hits. It is a good trade off to speed up searches if you don’t need the accurate number of hits after a certain threshold.
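The trade-off described above can be tuned, because track_total_hits accepts an integer threshold as well as true and false. The sketch below builds the three variants; the helper name with_total_tracking is hypothetical, the threshold of 100000 is an arbitrary illustrative value, and rejecting negative numbers is a restriction of this sketch rather than a statement about what Elasticsearch accepts.

```python
def with_total_tracking(body, track):
    """Return a copy of a _search body with track_total_hits set.

    track may be True (exact total), False (do not track the total),
    or a non-negative int (count accurately up to that many hits).
    """
    is_bool = isinstance(track, bool)
    is_threshold = isinstance(track, int) and not is_bool and track >= 0
    if not (is_bool or is_threshold):
        # Restriction of this sketch: only the three forms above are allowed.
        raise ValueError("track must be a bool or a non-negative int")
    updated = dict(body)  # shallow copy so the caller's dict is not modified
    updated["track_total_hits"] = track
    return updated


base = {"query": {"match_all": {}}}
exact = with_total_tracking(base, True)        # exact total, slowest
bounded = with_total_tracking(base, 100000)    # accurate up to 100,000 hits
untracked = with_total_tracking(base, False)   # total is not tracked at all
```

With the bounded variant, an index the size of doc_1 would be expected to report a lower bound at the chosen threshold with relation "gte", while the untracked variant leaves hits.total out of the response.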