Optimizing Compound Sorting in Elasticsearch

Author: 李昂的数字之旅

Published 2023-05-14 19:24:45

Tags: java, database, algorithms, elasticsearch

Article link: https://blog.csdn.net/xsgnzb/article/details/130672371

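Background

Our project is a video platform. It has a filter page where users narrow down videos by various conditions and browse the results page by page under a chosen sort order. One of the sort orders works like this: videos published within the last 30 days are sorted by play popularity (play_score) in descending order, and videos older than 30 days are sorted by publish time (publish_time) in descending order. Elasticsearch (ES) offers several ways to implement this requirement.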

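Option A - Two separate queries

Split the data into two sets: published within the last 30 days, and older than 30 days.

If the page falls entirely within the 30-day set, sort by play_score descending.
If the page falls entirely within the older set, sort by publish_time descending.
If the page straddles the 30-day boundary, query #1 fetches x documents, query #2 fetches y documents, and the two results are concatenated.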
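The three cases above come down to offset arithmetic on the page boundaries. The following is a minimal sketch of that arithmetic only, not the code used in the project; the class name TwoQueryPlan and its fields are hypothetical, and pageIndex is assumed to be 1-based.

```java
// Hypothetical helper: works out which slice each of the two queries must fetch.
final class TwoQueryPlan {
    final long recentFrom; // offset into the "last 30 days" set (play_score desc)
    final int recentSize;  // rows to take from that set
    final long olderFrom;  // offset into the "older than 30 days" set (publish_time desc)
    final int olderSize;   // rows to take from that set

    private TwoQueryPlan(long recentFrom, int recentSize, long olderFrom, int olderSize) {
        this.recentFrom = recentFrom;
        this.recentSize = recentSize;
        this.olderFrom = olderFrom;
        this.olderSize = olderSize;
    }

    /** pageIndex is assumed to be 1-based, as sent by the client. */
    static TwoQueryPlan plan(long in30dayCount, int pageIndex, int pageSize) {
        long start = (long) (pageIndex - 1) * pageSize; // global offset of the page's first row
        if (start + pageSize <= in30dayCount) {
            // Case 1: the whole page lies within the 30-day set.
            return new TwoQueryPlan(start, pageSize, 0, 0);
        }
        if (start >= in30dayCount) {
            // Case 2: the whole page lies within the older set.
            return new TwoQueryPlan(0, 0, start - in30dayCount, pageSize);
        }
        // Case 3: the page straddles the boundary, so x rows come from query #1 and y from query #2.
        int fromRecent = (int) (in30dayCount - start);
        return new TwoQueryPlan(start, fromRecent, 0, pageSize - fromRecent);
    }

    public static void main(String[] args) {
        TwoQueryPlan p = plan(45, 3, 20);
        // prints 40+5, 0+15
        System.out.println(p.recentFrom + "+" + p.recentSize + ", " + p.olderFrom + "+" + p.olderSize);
    }
}
```

With 45 documents in the 30-day set and a page size of 20, page 3 takes the last 5 recent documents plus the first 15 older ones, and page 4 continues in the older set at offset 15.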

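Leaving aside the performance cost of running several queries, the logic alone is awkward to maintain. Here is code someone else wrote for it. There is no need to follow the logic; the point is only that this approach is not elegant.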

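// Sort by play_score within the last 30 days first; once those are used up, take documents older than 30 days. This can only be fetched into memory and paged there.
// Three cases: 1) entirely within 30 days, 2) partly within and partly outside, 3) entirely outside.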

Date lastTime = DateUtils.addDays(new Date(), -30);
listQuery.add(QueryBuilders.rangeQuery("publish_time").gte(lastTime.getTime()));
boolQueryBuilder.must().addAll(listQuery);
long in30dayCount = elasticsearchTemplate.count(searchQuery,  LongVideoEsItemVO.class);
log.info("in30dayCount={}", in30dayCount);
// Entirely within 30 days
if (in30dayCount >= pageSize * pageIndex) {
    searchQuery.addSort(Sort.by(Sort.Direction.DESC,"play_score"));
    // pageIndex from the client is 1-based, so subtract 1
    searchQuery.setPageable(PageRequest.of(pageIndex - 1, pageSize));
    AggregatedPage<LongVideoEsItemVO> page = elasticsearchTemplate.queryForPage(searchQuery, LongVideoEsItemVO.class);
    List<LongVideoEsItemVO> list = page.getContent();
    return list;
}
// Entirely older than 30 days
if (in30dayCount <= (pageSize * (pageIndex - 1))) {
    // Not sure whether removing the clause this way works
    boolQueryBuilder.must().remove(QueryBuilders.rangeQuery("publish_time").gte(lastTime.getTime()));
    // The time range now covers all documents
    searchQuery.addSort(Sort.by(Sort.Direction.DESC,"publish_time"));
    // pageIndex from the client is 1-based, so subtract 1
    searchQuery.setPageable(PageRequest.of(pageIndex - 1, pageSize));
    AggregatedPage<LongVideoEsItemVO> page = elasticsearchTemplate.queryForPage(searchQuery, LongVideoEsItemVO.class);
    List<LongVideoEsItemVO> list = page.getContent();
    return list;
}
// Partly within 30 days, partly older
{
    searchQuery.addSort(Sort.by(Sort.Direction.DESC,"play_score"));
    // pageIndex from the client is 1-based, so subtract 1
    searchQuery.setPageable(PageRequest.of(pageIndex - 1, pageSize));
    AggregatedPage<LongVideoEsItemVO> page = elasticsearchTemplate.queryForPage(searchQuery, LongVideoEsItemVO.class);
    int remainCount = pageSize;
    List<LongVideoEsItemVO> listAll = new ArrayList<>();
    List<LongVideoEsItemVO> list1 = page.getContent();
    if (!CollectionUtils.isEmpty(list1)) {
        remainCount = pageSize - list1.size();
        listAll.addAll(list1);
    }
    // Fetch the remainder from documents older than 30 days
    // Drop the earlier range filter and sort condition
    listQuery.remove(QueryBuilders.rangeQuery("publish_time").gte(lastTime.getTime()));
    listQuery.add(QueryBuilders.rangeQuery("publish_time").lt(lastTime.getTime()));

    BoolQueryBuilder boolQueryBuilder2 = QueryBuilders.boolQuery();
    boolQueryBuilder2.must().addAll(listQuery);

    SearchQuery searchQuery2 = new NativeSearchQuery(boolQueryBuilder2);
    searchQuery2.addSort(Sort.by(Sort.Direction.DESC,"publish_time"));

    searchQuery2.setPageable(PageRequest.of(0, remainCount));
    AggregatedPage<LongVideoEsItemVO> page2 = elasticsearchTemplate.queryForPage(searchQuery2, LongVideoEsItemVO.class);
    List<LongVideoEsItemVO> list2 = page2.getContent();
    if (!CollectionUtils.isEmpty(list2)) {
        listAll.addAll(list2);
    }
    return listAll;
}

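Option B - Sort script

The first idea is to return the result with a single query, which also makes pagination straightforward. ES supports script-based sorting, where the sort rule is written in the Painless language inside the sort clause. The requirement above translates to the following request: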

{
    "query": {
        "match_all": {
        }
    },
    "_source": [
        "publish_time",
        "play_score"
    ],
    "from": 0,
    "size": 20,
    "sort": [
        {
            "_script": {
                "type": "number",
                "script": {
                    "lang": "painless",
                    "source": "doc['publish_time'].value.toInstant().toEpochMilli() > params.currentTime ? doc['play_score'].value : -1",
                    "params": {
                        "currentTime": 1681315200000
                    }
                },
                "order": "desc"
            }
        },
        {
            "publish_time": {
                "order": "desc"
            }
        }
    ]
}

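The sort clause defines two rules: the first is a sort script, the second is publish_time descending. ES applies the rules in order: documents are sorted by the value of rule 1 in descending order, and documents with equal rule 1 values are then sorted by rule 2 in descending order.

In the sort script, currentTime is, despite its name, the timestamp of the cutoff n days ago (in epoch milliseconds). If a document's publish time is later than currentTime, rule 1 uses doc['play_score'] as the sort value; otherwise the sort value is -1. Provided play_score is never negative, every older document gets the same value, sorts after all recent documents, and is ordered by rule 2.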
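To see the two rules working together, the ordering can be reproduced in memory with a plain Java comparator. This is only an illustration of the sort semantics, not how ES executes the sort; the Video class, the sample data, and the cutoff value 1000 are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortScriptDemo {
    // Hypothetical stand-in for an indexed document.
    static final class Video {
        final String id;
        final long publishTime; // epoch milliseconds
        final double playScore;

        Video(String id, long publishTime, double playScore) {
            this.id = id;
            this.publishTime = publishTime;
            this.playScore = playScore;
        }
    }

    // Mirrors the Painless expression: publish_time > cutoff ? play_score : -1
    static double scriptKey(Video v, long cutoffMillis) {
        return v.publishTime > cutoffMillis ? v.playScore : -1;
    }

    static List<String> sortedIds(List<Video> videos, long cutoffMillis) {
        Comparator<Video> rule1 =
                Comparator.comparingDouble((Video v) -> scriptKey(v, cutoffMillis)).reversed();
        Comparator<Video> rule2 =
                Comparator.comparingLong((Video v) -> v.publishTime).reversed();
        List<Video> copy = new ArrayList<>(videos);
        copy.sort(rule1.thenComparing(rule2)); // rule 2 only decides ties of rule 1
        List<String> ids = new ArrayList<>();
        for (Video v : copy) {
            ids.add(v.id);
        }
        return ids;
    }

    static List<String> demo() {
        List<Video> videos = Arrays.asList(
                new Video("a", 2000, 5),   // recent
                new Video("b", 1500, 9),   // recent, highest play_score
                new Video("c", 900, 100),  // old: play_score is ignored
                new Video("d", 500, 50),   // old
                new Video("e", 1200, 5));  // recent, ties with "a" on play_score
        return sortedIds(videos, 1000);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [b, a, e, c, d]
    }
}
```

Documents "a" and "e" tie on rule 1 and are separated by publish time, while "c" and "d" both get -1 and fall back to publish time entirely.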

On an index with 1 million documents, the query took about 250 ms.
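The currentTime parameter in the request is a fixed number, so the caller has to compute it for each request. A small sketch of how that might look in plain Java follows; the class and method names are hypothetical, and the JSON is assembled by string concatenation only to keep the example free of client-library dependencies.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class SortScriptParams {
    // Cutoff in epoch milliseconds: "now" minus the given number of days.
    static long cutoffMillis(Instant now, int days) {
        return now.minus(days, ChronoUnit.DAYS).toEpochMilli();
    }

    // Builds the first sort rule of the request with the cutoff filled in.
    static String sortClause(long cutoffMillis) {
        return "{\"_script\": {\"type\": \"number\", \"script\": {\"lang\": \"painless\", "
                + "\"source\": \"doc['publish_time'].value.toInstant().toEpochMilli() > params.currentTime"
                + " ? doc['play_score'].value : -1\", "
                + "\"params\": {\"currentTime\": " + cutoffMillis + "}}, \"order\": \"desc\"}}";
    }

    public static void main(String[] args) {
        long cutoff = cutoffMillis(Instant.now(), 30);
        System.out.println(sortClause(cutoff));
    }
}
```

Passing the cutoff through params rather than splicing it into the script source keeps the script text identical across requests. As a check on the arithmetic, 30 days before 2023-05-12T16:00:00Z gives 1681315200000, the value used in the request above.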

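Option C - function score

Besides a sort script, there is another route: ES sorts by _score by default, so a custom _score achieves the same ordering. The function_score query provides this, and it supports several scoring functions: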

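script_score: custom script
field_value_factor: score derived from a field value
weight: fixed weight
random_score: random number
decay functions: score decays with distance from a reference value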

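Of these, script_score and field_value_factor can meet the requirement.

The script_score approach

Idea 1: compute a single _score for every document from publish_time and play_score, then sort by _score.

Idea 2: for documents from the last 30 days, set _score to the value of play_score; for older documents, set _score to 1. Then sort by _score first and by publish_time second.

DSL for idea 1: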

{
    "query": {
        "function_score": {
            "query": {
                "match_all": {
                }
            },
            "functions": [
                {
                    "script_score": {
                        "script": {
                            "source": "def of30DayAgoTimestamp = 1681401; def publishDate = doc['publish_time'].value.toInstant().toEpochMilli()/1000000; if (publishDate == 0) { 0 } else if (publishDate > of30DayAgoTimestamp) { doc['play_score'].value + 2000000 } else { publishDate }"
                        }
                    }
                }
            ],
            "boost_mode": "replace"
        }
    },
    "_source": [
        "publish_time",
        "play_score"
    ],
    "sort": [
        {
            "_score" : {
                "order": "desc"
            }
        }, 
        {
            "publish_time": {
                "order": "desc"
            }
        }
    ],
    "from": 0,
    "size": 20
}

Unfortunately, the query ran for more than 5 seconds and timed out.

DSL for idea 2:
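The scoring expression in the idea 1 script is dense, so here is the same arithmetic mirrored in plain Java as a rough sketch. The class name is hypothetical and play_score is assumed to be a non-negative number; the constants 1_000_000 and 2_000_000 are taken from the script.

```java
final class CombinedScore {
    static final long BUCKET_MILLIS = 1_000_000L; // publish_time is divided by this in the script
    static final long RECENT_OFFSET = 2_000_000L; // added to play_score for recent documents

    // cutoffBucket corresponds to of30DayAgoTimestamp in the script (1681401 there).
    static double score(long publishMillis, double playScore, long cutoffBucket) {
        long bucket = publishMillis / BUCKET_MILLIS;
        if (bucket == 0) {
            return 0; // no usable publish time
        }
        if (bucket > cutoffBucket) {
            return playScore + RECENT_OFFSET; // recent: ranked by play_score, above all older documents
        }
        return bucket; // older: ranked by (coarsened) publish time
    }

    public static void main(String[] args) {
        System.out.println(score(1683000000000L, 500, 1681401));    // recent document
        System.out.println(score(1670000000000L, 999999, 1681401)); // older document
    }
}
```

Two things follow from the numbers. Each bucket spans 1,000,000 ms (about 16.7 minutes), so older documents published within the same bucket get the same score and rely on the secondary publish_time sort. And the offset of 2,000,000 only keeps recent documents on top while epoch milliseconds divided by 1,000,000 stays below it, which holds until the year 2033.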

{
    "query": {
        "function_score": {
            "query": {
                "match_all": {
                }
            },
            "functions": [
                {
                    "filter": {
                        "range": {
                            "publish_time": {
                                "gte": "now-30d/d"
                            }
                        }
                    },
                    "script_score": {
                        "script": {
                            "source": "doc['play_score'].value == null? 0 : doc['play_score'].value"
                        }
                    }
                }
            ],
            "boost_mode": "replace"
        }
    },
    "_source": [
        "publish_time",
        "play_score"
    ],
    "sort": [
        {
            "_score" : {
                "order": "desc"
            }
        }, 
        {
            "publish_time": {
                "order": "desc"
            }
        }
    ],
    "from": 0,
    "size": 20
}

The query took about 70 ms.

The field_value_factor approach

The logic is the same as idea 2 of script_score. Since no script is involved, the query took about 50 ms.
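The resulting order of idea 2 can be checked with a small in-memory model. This is a sketch under the assumption stated in the text, namely that documents not matched by the filter end up with _score 1; the Doc class, the sample data, and the cutoff value 1000 are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class FilteredScoreDemo {
    // Hypothetical stand-in for an indexed document; playScore may be missing (null).
    static final class Doc {
        final String id;
        final long publishTime;
        final Double playScore;

        Doc(String id, long publishTime, Double playScore) {
            this.id = id;
            this.publishTime = publishTime;
            this.playScore = playScore;
        }
    }

    // Recent documents score play_score (0 when missing); older ones are assumed to score 1.
    static double score(Doc d, long cutoffMillis) {
        if (d.publishTime >= cutoffMillis) {
            return d.playScore == null ? 0 : d.playScore;
        }
        return 1.0;
    }

    static List<String> demo() {
        final long cutoff = 1000;
        List<Doc> docs = new ArrayList<>(Arrays.asList(
                new Doc("a", 2000, 5.0),   // recent
                new Doc("b", 1500, 0.5),   // recent, but play_score below 1
                new Doc("c", 900, 100.0),  // old
                new Doc("d", 500, 50.0))); // old
        docs.sort(Comparator.comparingDouble((Doc d) -> score(d, cutoff)).reversed()
                .thenComparing(Comparator.comparingLong((Doc d) -> d.publishTime).reversed()));
        List<String> ids = new ArrayList<>();
        for (Doc d : docs) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a, c, d, b]
    }
}
```

In this model the recent document "b", whose play_score is below 1, ends up after the older documents. If play_score can be that small in real data, the ordering differs from the sort script of Option B, where older documents get -1.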

{
    "query": {
        "function_score": {
            "query": {
                "match_all": {
                }
            },
            "functions": [
                {
                    "filter": {
                        "range": {
                            "publish_time": {
                                "gte": "now-30d/d"
                            }
                        }
                    },
                    "field_value_factor": {
                        "field": "play_score",
                        "missing": 0
                    }
                }
            ],
            "boost_mode": "replace"
        }
    },
    "_source": [
        "title_for_search",
        "publish_time",
        "play_score"
    ],
    "sort": [
        {
            "_score" : {
                "order": "desc"
            }
        }, 
        {
            "publish_time": {
                "order": "desc"
            }
        }
    ],
    "size": 20
}

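Summary

In ES, sort order can be controlled through _score and sort. Starting from a compound sort on play_score and publish_time, this article compared the implementation and performance of four approaches: multiple queries, a sort script, a script_score scoring script, and the field_value_factor scoring function. Two conclusions follow: 1. Using ES custom sort rules simplifies the sorting implementation. 2. The built-in field_value_factor function performs better than scripts (sort script, script_score script).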
