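Problem
- Search results need to be ranked by relevance.
- When several documents have similar text and identical scores, a secondary sort rule is needed, for example by date.
- It later turned out that some of these similar documents received a score different from the rest, so they ended up in the wrong position.
- Take the data below as an example. For the fuzzy search
"上半年经济运行" ("economic performance in the first half of the year")
hits should be matched on the title, and hits with the same score should then be sorted by date in descending order. In practice the 2009 document came first and the 2021 document second, which is not acceptable.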
[
{
"createDate": "2009-07-21",
"id": "7917561",
"title": "2009年上半年全省经济运行情况"
},
{
"createDate": "2021-08-02",
"id": "8193901",
"title": "2021年上半年全省经济运行情况"
},
{
"createDate": "2020-08-02",
"id": "8193891",
"title": "2020年上半年全省经济运行情况"
},
{
"createDate": "2019-08-02",
"id": "8193881",
"title": "2019年上半年全省经济运行情况"
},
{
"createDate": "2014-08-02",
"id": "8193861",
"title": "2014年上半年全省经济运行情况"
},
{
"createDate": "2019-07-18",
"id": "4271871",
"title": "2019年上半年全省经济运行情况"
},
{
"createDate": "2017-08-02",
"id": "8193871",
"title": "2017年上半年全省经济运行情况"
},
{
"createDate": "2017-01-23",
"id": "7914371",
"title": "2016年全省经济运行情况"
},
{
"createDate": "2016-01-22",
"id": "7914981",
"title": "2015年全省经济运行情况"
},
{
"createDate": "2015-01-22",
"id": "7915411",
"title": "2014年全省经济运行情况"
},
{
"createDate": "2014-01-23",
"id": "7915791",
"title": "2013年全省经济运行情况"
},
{
"createDate": "2012-01-20",
"id": "7916451",
"title": "2011年全省经济运行情况"
},
{
"createDate": "2011-01-24",
"id": "7916941",
"title": "2010年全省经济运行情况"
},
{
"createDate": "2010-01-23",
"id": "7917271",
"title": "2009年全省经济运行情况"
}
]
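The intended ordering can be sketched as a plain Java comparator: score descending first, then `createDate` descending for ties. This is only an illustration of the ranking rule, not the author's code. The `Hit` class and all score values are hypothetical; in a real query the same rule would be expressed as a sort on `_score` followed by `createDate`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ScoreThenDateSort {
    // Hypothetical holder for one search hit; not an Elasticsearch class.
    static final class Hit {
        final String id;
        final String createDate; // yyyy-MM-dd strings compare correctly as text
        final double score;

        Hit(String id, String createDate, double score) {
            this.id = id;
            this.createDate = createDate;
            this.score = score;
        }
    }

    // Highest score first; equal scores fall back to newest createDate first.
    static final Comparator<Hit> BY_SCORE_THEN_DATE =
            Comparator.comparingDouble((Hit h) -> h.score).reversed()
                    .thenComparing((Hit h) -> h.createDate, Comparator.<String>reverseOrder());

    static List<String> sortedIds(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_SCORE_THEN_DATE);
        List<String> ids = new ArrayList<>();
        for (Hit h : copy) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Scores are invented for illustration.
        List<Hit> equalScores = Arrays.asList(
                new Hit("7917561", "2009-07-21", 3.0),
                new Hit("8193901", "2021-08-02", 3.0),
                new Hit("8193891", "2020-08-02", 3.0));
        System.out.println(sortedIds(equalScores)); // [8193901, 8193891, 7917561]

        // If the 2009 document scores slightly higher, the date tie-break never reaches it.
        List<Hit> skewedScores = Arrays.asList(
                new Hit("7917561", "2009-07-21", 3.2),
                new Hit("8193901", "2021-08-02", 3.0),
                new Hit("8193891", "2020-08-02", 3.0));
        System.out.println(sortedIds(skewedScores)); // [7917561, 8193901, 8193891]
    }
}
```

The second call reproduces the symptom described above: the tie-break on date only helps when the scores really are equal.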
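Root cause
Shards and Lucene
- For the same data, the relevance score can differ across indices and across shards.
- Each shard is a Lucene instance, and Lucene computes relevance with a TF/IDF-style algorithm (BM25 is the default similarity since Elasticsearch 5.0 and relies on the same kind of statistics). Each Lucene instance keeps only its own TF and IDF statistics, so a shard knows how often a term occurs in its own documents, not in the whole cluster.
TF: Term Frequency, how often the term occurs in the current document.
IDF: Inverse Document Frequency, the inverse of how common the term is across all documents (here, all documents in the shard).
- Under TF/IDF, the more often a term occurs in the current document, the higher the score; the rarer the term is across all documents, the higher the score. A term's score therefore depends not only on the matching document but also on how many documents the shard holds and what they contain.
- Documents are assigned to shards by a hash (of the routing value, which defaults to the document ID), so shards do not always hold the same number of documents. The imbalance can be noticeable when the total number of documents is small. As a result, the same text can receive a different score for a term depending on the shard it lives in.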
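To make the effect concrete, here is a minimal sketch that computes an idf value per shard. The formula `1 + ln(docCount / (docFreq + 1))` is a simplified classic TF/IDF-style idf chosen for illustration, not necessarily what a given Elasticsearch version uses, and the shard statistics are invented.

```java
class ShardLocalIdf {
    // Simplified classic-style idf (an assumption for illustration):
    // the fewer documents contain the term, the larger the value.
    static double idf(long docCount, long docFreq) {
        return 1.0 + Math.log((double) docCount / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Invented statistics for the same term on two shards of one index.
        double shardA = idf(40, 7); // 40 documents, 7 contain the term
        double shardB = idf(60, 3); // 60 documents, 3 contain the term

        System.out.println("idf on shard A: " + shardA);
        System.out.println("idf on shard B: " + shardB);
        // Two documents with identical text get different scores
        // if one lives on shard A and the other on shard B.
        System.out.println("same score? " + (shardA == shardB));
    }
}
```

With these numbers the term looks rarer on shard B, so a matching document there scores higher than an identical one on shard A.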
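searchType
QUERY_THEN_FETCH
- Elasticsearch uses QUERY_THEN_FETCH by default when searching.
- According to the official documentation, a QUERY_THEN_FETCH search runs as follows:
  - The query is sent to every shard.
  - Each shard finds all matching documents and scores them using its local TF/IDF statistics.
  - Each shard builds a priority queue of its results (sorting, pagination, and so on).
  - Each shard returns just enough metadata about its results to the requesting node. Note that document content is not included.
  - The requesting node merges the scores from all shards and sorts them to select the documents for the requested page and size.
  - The actual documents are then retrieved from the shards that hold them (this time including document content).
  - The results are packaged as requested and returned to the user.
- As the steps show, the default mode does not guarantee that identical documents receive identical scores.
- In practice, when the accuracy requirement is not that strict, the results are still very good, so the default is adequate for typical search scenarios.
- Elasticsearch assigns documents to shards by hash. When the data volume is large, hashing evens out the document counts across shards, so the default mode also gives quite good results.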
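The steps above can be sketched as a small scatter/gather simulation. This is a toy model under stated assumptions, not Elasticsearch code: `ShardHit`, `queryPhase`, `merge`, and `fetchPhase` are hypothetical names, and the per-shard scores are passed in as ready-made numbers to stand in for local scoring.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryThenFetchSketch {
    // Metadata returned by the query phase: shard, id and score, no document content.
    static final class ShardHit {
        final int shard;
        final String id;
        final double score;

        ShardHit(int shard, String id, double score) {
            this.shard = shard;
            this.id = id;
            this.score = score;
        }
    }

    static final Comparator<ShardHit> BEST_FIRST =
            Comparator.comparingDouble((ShardHit h) -> h.score).reversed();

    // Query phase on one shard: rank with local scores and return the local top `size`.
    static List<ShardHit> queryPhase(int shard, Map<String, Double> localScores, int size) {
        List<ShardHit> hits = new ArrayList<>();
        for (Map.Entry<String, Double> e : localScores.entrySet()) {
            hits.add(new ShardHit(shard, e.getKey(), e.getValue()));
        }
        hits.sort(BEST_FIRST);
        return new ArrayList<>(hits.subList(0, Math.min(size, hits.size())));
    }

    // Requesting node: merge the per-shard lists and keep the global top `size`.
    static List<ShardHit> merge(List<List<ShardHit>> perShard, int size) {
        List<ShardHit> all = new ArrayList<>();
        for (List<ShardHit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(BEST_FIRST);
        return new ArrayList<>(all.subList(0, Math.min(size, all.size())));
    }

    // Fetch phase: only now is document content read, and only for the winners.
    static List<String> fetchPhase(List<ShardHit> winners, List<Map<String, String>> shardDocs) {
        List<String> titles = new ArrayList<>();
        for (ShardHit h : winners) {
            titles.add(shardDocs.get(h.shard).get(h.id));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Invented local scores: shard 0 happens to score its document higher.
        Map<String, Double> scores0 = new HashMap<>();
        scores0.put("7917561", 3.2);
        Map<String, Double> scores1 = new HashMap<>();
        scores1.put("8193901", 3.0);
        scores1.put("8193891", 2.5);

        Map<String, String> docs0 = new HashMap<>();
        docs0.put("7917561", "2009年上半年全省经济运行情况");
        Map<String, String> docs1 = new HashMap<>();
        docs1.put("8193901", "2021年上半年全省经济运行情况");
        docs1.put("8193891", "2020年上半年全省经济运行情况");

        List<List<ShardHit>> perShard = new ArrayList<>();
        perShard.add(queryPhase(0, scores0, 2));
        perShard.add(queryPhase(1, scores1, 2));

        List<Map<String, String>> shardDocs = new ArrayList<>();
        shardDocs.add(docs0);
        shardDocs.add(docs1);

        System.out.println(fetchPhase(merge(perShard, 2), shardDocs));
    }
}
```

Because the merge compares scores that were computed against different shard-local statistics, a document on one shard can outrank a near-identical document on another.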
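DFS_QUERY_THEN_FETCH
- The search_type parameter selects a different search mode. DFS_QUERY_THEN_FETCH is what Elasticsearch provides to address the problem above.
- It is largely the same as QUERY_THEN_FETCH.
- The difference is an initial scatter phase in which DFS_QUERY_THEN_FETCH asks every shard for its TF/IDF statistics, in order to get more accurate scores.
- When each shard then runs the query, it can score with the global TF/IDF statistics collected in that pre-query step.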
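The idea behind the extra scatter phase can be sketched by summing the per-shard statistics before scoring. As before, the idf formula is a simplified classic-style one chosen for illustration, the numbers are invented, and `globalIdf` is a hypothetical helper, not an Elasticsearch API.

```java
class DfsGlobalStats {
    // Simplified classic-style idf (an assumption for illustration).
    static double idf(long docCount, long docFreq) {
        return 1.0 + Math.log((double) docCount / (docFreq + 1));
    }

    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // The value every shard would score with once the statistics
    // of all shards have been collected and combined.
    static double globalIdf(long[] docCounts, long[] docFreqs) {
        return idf(sum(docCounts), sum(docFreqs));
    }

    public static void main(String[] args) {
        long[] docCounts = {40, 60}; // documents per shard (invented)
        long[] docFreqs = {7, 3};    // documents containing the term per shard (invented)

        for (int shard = 0; shard < docCounts.length; shard++) {
            System.out.println("shard " + shard + " local idf = "
                    + idf(docCounts[shard], docFreqs[shard]));
        }
        // With global statistics, identical text scores the same on every shard.
        System.out.println("global idf = " + globalIdf(docCounts, docFreqs));
    }
}
```

The price is one more round trip to every shard before the query phase, which is where the small performance cost of this mode comes from.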
Source code
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.action.search;

/**
 * Search type represent the manner at which the search operation is executed.
 *
 *
 */
public enum SearchType {
    /**
     * Same as {@link #QUERY_THEN_FETCH}, except for an initial scatter phase which goes and computes the distributed
     * term frequencies for more accurate scoring.
     */
    DFS_QUERY_THEN_FETCH((byte) 0),
    /**
     * The query is executed against all shards, but only enough information is returned (not the document content).
     * The results are then sorted and ranked, and based on it, only the relevant shards are asked for the actual
     * document content. The return number of hits is exactly as specified in size, since they are the only ones that
     * are fetched. This is very handy when the index has a lot of shards (not replicas, shard id groups).
     */
    QUERY_THEN_FETCH((byte) 1),
    // 2 used to be DFS_QUERY_AND_FETCH

    /**
     * Only used for pre 5.3 request where this type is still needed
     */
    @Deprecated
    QUERY_AND_FETCH((byte) 3);

    /**
     * The default search type ({@link #QUERY_THEN_FETCH}.
     */
    public static final SearchType DEFAULT = QUERY_THEN_FETCH;

    private byte id;

    SearchType(byte id) {
        this.id = id;
    }

    /**
     * The internal id of the type.
     */
    public byte id() {
        return this.id;
    }

    /**
     * Constructs search type based on the internal id.
     */
    public static SearchType fromId(byte id) {
        if (id == 0) {
            return DFS_QUERY_THEN_FETCH;
        } else if (id == 1
            || id == 3) { // This is a BWC layer for pre 5.3 indices where QUERY_AND_FETCH was id 3 but we don't have it anymore from 5.3 on
            return QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + id + "]");
        }
    }

    /**
     * The a string representation search type to execute, defaults to {@link SearchType#DEFAULT}. Can be
     * one of "dfs_query_then_fetch"/"dfsQueryThenFetch", "dfs_query_and_fetch"/"dfsQueryAndFetch",
     * "query_then_fetch"/"queryThenFetch" and "query_and_fetch"/"queryAndFetch".
     */
    public static SearchType fromString(String searchType) {
        if (searchType == null) {
            return SearchType.DEFAULT;
        }
        if ("dfs_query_then_fetch".equals(searchType)) {
            return SearchType.DFS_QUERY_THEN_FETCH;
        } else if ("query_then_fetch".equals(searchType)) {
            return SearchType.QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + searchType + "]");
        }
    }
}
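Solution
- If scores must be consistent, use DFS_QUERY_THEN_FETCH. This mode may cost a little query performance; in our production environment the overhead has been negligible so far.
searchRequestBuilder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH).get();
- If the data volume is small, consider using a single shard by setting number_of_shards=1 in the index settings. This setting is fixed when the index is created, so an existing index has to be recreated (or shrunk) for it to take effect.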