Elasticsearch: Fixing inconsistent scores and scrambled sort order for identical search content

坚持是一种态度

Article link: https://blog.csdn.net/u010882234/article/details/119328056

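Problem

Search results need to be ranked by relevance.
When several documents have similar text and identical scores, a secondary sort rule such as time should apply.
We later found that some of these similar documents received a different _score from the rest, which pushed them down the ranking.
Take the data below as an example. A fuzzy search for "上半年经济运行" should match on the title and then order equally scored hits by time, newest first. In practice the 2009 document came first and the 2021 document second, which is unacceptable.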

[
    {
        "createDate": "2009-07-21",
        "id": "7917561",
        "title": "2009年上半年全省经济运行情况"
    },
    {
        "createDate": "2021-08-02",
        "id": "8193901",
        "title": "2021年上半年全省经济运行情况"
    },
    {
        "createDate": "2020-08-02",
        "id": "8193891",
        "title": "2020年上半年全省经济运行情况"
    },
    {
        "createDate": "2019-08-02",
        "id": "8193881",
        "title": "2019年上半年全省经济运行情况"
    },
    {
        "createDate": "2014-08-02",
        "id": "8193861",
        "title": "2014年上半年全省经济运行情况"
    },
    {
        "createDate": "2019-07-18",
        "id": "4271871",
        "title": "2019年上半年全省经济运行情况"
    },
    {
        "createDate": "2017-08-02",
        "id": "8193871",
        "title": "2017年上半年全省经济运行情况"
    },
    {
        "createDate": "2017-01-23",
        "id": "7914371",
        "title": "2016年全省经济运行情况"
    },
    {
        "createDate": "2016-01-22",
        "id": "7914981",
        "title": "2015年全省经济运行情况"
    },
    {
        "createDate": "2015-01-22",
        "id": "7915411",
        "title": "2014年全省经济运行情况"
    },
    {
        "createDate": "2014-01-23",
        "id": "7915791",
        "title": "2013年全省经济运行情况"
    },
    {
        "createDate": "2012-01-20",
        "id": "7916451",
        "title": "2011年全省经济运行情况"
    },
    {
        "createDate": "2011-01-24",
        "id": "7916941",
        "title": "2010年全省经济运行情况"
    },
    {
        "createDate": "2010-01-23",
        "id": "7917271",
        "title": "2009年全省经济运行情况"
    }
]

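Root cause

Shards and Lucene

The same content can receive different scores depending on which shard (or which index) holds it.
Each shard is a Lucene instance, and Lucene computes relevance with a TF/IDF-style similarity (BM25 is the default since Elasticsearch 5.0, and it relies on the same kind of term statistics). Each Lucene instance keeps only its own TF and IDF statistics, so a shard knows how often a term occurs in its own documents, not across the whole cluster.

TF: Term Frequency, how often the term occurs in the current document
IDF: Inverse Document Frequency, an inverse measure of how many documents contain the term; the rarer the term, the higher the IDF

Under TF/IDF, the more often a term occurs in the current document, the higher the score, and the fewer documents contain the term, the higher the score. A term's score therefore depends not only on the matched document but also on how many documents the shard holds and what they contain.
Documents are assigned to shards by hashing the routing value (the document _id by default), so shard sizes are not always equal. The imbalance can be pronounced when the total number of documents is small, and the same content can then score differently for the same term depending on its shard.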
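
To make the effect of shard-local statistics concrete, here is a minimal sketch. It assumes the BM25-style IDF formula ln(1 + (N - df + 0.5) / (df + 0.5)), where N is the number of documents on the shard and df is the number of those documents containing the term. The shard sizes and frequencies are made-up values for illustration, not measurements from the article's cluster.

```java
public class ShardLocalIdf {
    /**
     * BM25-style inverse document frequency.
     * docCount: documents on the shard; docFreq: documents containing the term.
     */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Hypothetical statistics for the same term on two shards.
        double shardA = idf(3, 3);   // small shard: every document contains the term
        double shardB = idf(11, 3);  // larger shard: the term is comparatively rare
        System.out.println("shard A idf = " + shardA);
        System.out.println("shard B idf = " + shardB);
        // The term weighs more on shard B, so an otherwise identical
        // document stored there gets a higher score.
    }
}
```

Because each shard plugs its own counts into the formula, two documents with the same title can end up with different scores purely because of where they were routed.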

searchType

QUERY_THEN_FETCH

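Elasticsearch uses QUERY_THEN_FETCH by default when searching.
According to the official documentation, a QUERY_THEN_FETCH search proceeds as follows:
- Send the query to every shard
- Find all matching documents and score them using the shard-local TF/IDF statistics
- Build a priority queue of the results (sorting, pagination, etc.)
- Return enough metadata about the results to the requesting node; document content is not included
- Merge the scores from all shards and sort them on the requesting node to select the documents for the requested page and size
- Finally, fetch the actual documents from the shards that hold them (this time including the content)
- Package the results as requested and return them to the user
As the steps show, the default mode does not guarantee that identical documents get identical scores.
When the accuracy requirement is not strict, however, the results are still good, so the default is adequate for typical search scenarios.
Elasticsearch routes documents to shards by hash. With a large volume of data the hash evens out the document counts across shards, and the default mode gives quite good results.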
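
The merge step on the requesting node can be sketched as follows. This is a simplified simulation, not Elasticsearch's implementation: the ShardHit type, the merge method, and the scores are hypothetical, while the two ids and dates are taken from the sample data. It sorts by score descending and uses createDate descending only as a tie-breaker, which is the ordering the article wants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class QueryThenFetchSketch {
    /** Metadata a shard returns in the query phase: id and sort values, no document body. */
    static final class ShardHit {
        final String id;
        final String createDate; // ISO yyyy-MM-dd, so string order equals date order
        final double score;

        ShardHit(String id, String createDate, double score) {
            this.id = id;
            this.createDate = createDate;
            this.score = score;
        }
    }

    /** Requesting-node step: merge per-shard hits, sort them, and cut out the requested page. */
    static List<ShardHit> merge(List<List<ShardHit>> perShard, int from, int size) {
        List<ShardHit> all = new ArrayList<>();
        for (List<ShardHit> hits : perShard) {
            all.addAll(hits);
        }
        // Score descending first; createDate descending only breaks exact ties.
        all.sort(Comparator.comparingDouble((ShardHit h) -> h.score).reversed()
                .thenComparing((ShardHit h) -> h.createDate, Comparator.<String>reverseOrder()));
        int start = Math.min(from, all.size());
        int end = Math.min(from + size, all.size());
        return new ArrayList<>(all.subList(start, end));
    }

    public static void main(String[] args) {
        // Hypothetical shard-local scores for two near-identical titles.
        List<ShardHit> shard0 = Arrays.asList(new ShardHit("7917561", "2009-07-21", 2.31));
        List<ShardHit> shard1 = Arrays.asList(new ShardHit("8193901", "2021-08-02", 2.17));
        for (ShardHit h : merge(Arrays.asList(shard0, shard1), 0, 10)) {
            System.out.println(h.id + " " + h.createDate + " " + h.score);
        }
    }
}
```

With these made-up scores the 2009 document is listed before the 2021 one, because the date tie-breaker only applies when the scores are exactly equal. That is the symptom described in the problem statement.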

DFS_QUERY_THEN_FETCH

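The search_type parameter selects a different search mode, and DFS_QUERY_THEN_FETCH is the solution Elasticsearch provides for the problem above.
It is largely the same as QUERY_THEN_FETCH.
The difference is an initial scatter phase in which DFS_QUERY_THEN_FETCH asks every shard for its term and document frequencies to obtain more accurate scoring.
Each shard then runs the query using the global statistics gathered in that pre-query phase.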
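
Conceptually, the pre-query phase amounts to summing the per-shard counts and scoring with the totals. The sketch below illustrates that idea only; the aggregate helper and the numbers are hypothetical, and the IDF formula is the BM25-style one, assumed here for illustration.

```java
public class DfsGlobalStats {
    /** BM25-style inverse document frequency. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Sums the {docCount, docFreq} pairs reported by each shard into one global pair. */
    static long[] aggregate(long[][] shardStats) {
        long docCount = 0;
        long docFreq = 0;
        for (long[] stats : shardStats) {
            docCount += stats[0];
            docFreq += stats[1];
        }
        return new long[] {docCount, docFreq};
    }

    public static void main(String[] args) {
        // Hypothetical per-shard statistics for one term: {docCount, docFreq}.
        long[][] shardStats = {{3, 3}, {11, 3}};
        long[] global = aggregate(shardStats);
        // Every shard now scores with the same global IDF value.
        System.out.println("global idf = " + idf(global[0], global[1]));
    }
}
```

Since every shard uses the same totals, documents with the same content receive the same score no matter where they are stored, and the secondary sort on time can take effect.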

Source code

/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.action.search;

/**
 * Search type represent the manner at which the search operation is executed.
 *
 *
 */
public enum SearchType {
    /**
     * Same as {@link #QUERY_THEN_FETCH}, except for an initial scatter phase which goes and computes the distributed
     * term frequencies for more accurate scoring.
     */
    DFS_QUERY_THEN_FETCH((byte) 0),
    /**
     * The query is executed against all shards, but only enough information is returned (not the document content).
     * The results are then sorted and ranked, and based on it, only the relevant shards are asked for the actual
     * document content. The return number of hits is exactly as specified in size, since they are the only ones that
     * are fetched. This is very handy when the index has a lot of shards (not replicas, shard id groups).
     */
    QUERY_THEN_FETCH((byte) 1),
    // 2 used to be DFS_QUERY_AND_FETCH

    /**
     * Only used for pre 5.3 request where this type is still needed
     */
    @Deprecated
    QUERY_AND_FETCH((byte) 3);

    /**
     * The default search type ({@link #QUERY_THEN_FETCH}.
     */
    public static final SearchType DEFAULT = QUERY_THEN_FETCH;

    private byte id;

    SearchType(byte id) {
        this.id = id;
    }

    /**
     * The internal id of the type.
     */
    public byte id() {
        return this.id;
    }

    /**
     * Constructs search type based on the internal id.
     */
    public static SearchType fromId(byte id) {
        if (id == 0) {
            return DFS_QUERY_THEN_FETCH;
        } else if (id == 1
            || id == 3) { // This is a BWC layer for pre 5.3 indices where QUERY_AND_FETCH was id 3 but we don't have it anymore from 5.3 on
            return QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + id + "]");
        }
    }

    /**
     * The a string representation search type to execute, defaults to {@link SearchType#DEFAULT}. Can be
     * one of "dfs_query_then_fetch"/"dfsQueryThenFetch", "dfs_query_and_fetch"/"dfsQueryAndFetch",
     * "query_then_fetch"/"queryThenFetch" and "query_and_fetch"/"queryAndFetch".
     */
    public static SearchType fromString(String searchType) {
        if (searchType == null) {
            return SearchType.DEFAULT;
        }
        if ("dfs_query_then_fetch".equals(searchType)) {
            return SearchType.DFS_QUERY_THEN_FETCH;
        } else if ("query_then_fetch".equals(searchType)) {
            return SearchType.QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + searchType + "]");
        }
    }
}

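Solution

If scores must be consistent, use DFS_QUERY_THEN_FETCH. It adds an extra round trip to every shard and so costs some query performance; in our production environment the overhead is currently negligible.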

searchRequestBuilder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH).get();
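
For readers who call the REST API directly, the same setting is the search_type=dfs_query_then_fetch query parameter. The sketch below only builds the request with the JDK 11+ HTTP client and does not send it. The host, the index name, the use of a match query on title, and createDate being mapped as a sortable field are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DfsSearchRequest {
    /**
     * Builds a _search request that scores with global term statistics and
     * sorts by _score, then by createDate, both descending.
     * queryText is inserted as-is, so it must not contain characters that need JSON escaping.
     */
    static HttpRequest build(String baseUrl, String index, String queryText) {
        String body = "{"
                + "\"query\":{\"match\":{\"title\":\"" + queryText + "\"}},"
                + "\"sort\":[{\"_score\":\"desc\"},{\"createDate\":\"desc\"}]"
                + "}";
        URI uri = URI.create(baseUrl + "/" + index + "/_search?search_type=dfs_query_then_fetch");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Assumed host and index name; pass the search text as the first argument.
        String text = args.length > 0 ? args[0] : "title keywords";
        HttpRequest request = build("http://localhost:9200", "articles", text);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

For the example in this article, the search text would be "上半年经济运行". Listing _score before createDate in the sort keeps relevance as the primary order and uses time only when scores are equal.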

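If the data volume is small, consider a single shard by setting number_of_shards=1 on the index. This setting is fixed when the index is created, so an existing index has to be recreated and reindexed (or shrunk) to apply it.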
