Advanced, Part 20: Deep Dive into Search Technology: Using the rescoring mechanism to optimize the performance of proximity match searches

两点一刻

Category: elasticsearch. Tags: elasticsearch

Article link: https://blog.csdn.net/qq_35524586/article/details/88426872

Differences between match and phrase match (proximity match)

match

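match --> As soon as a single term matches, the doc containing that term is returned as a result. It only has to scan the inverted index; once a term is found there, that is enough.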
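To make the inverted-index lookup concrete, here is a minimal Python sketch. It is not how Elasticsearch is implemented; it assumes whitespace tokenization and the default OR behaviour of match, and the helper names `build_inverted_index` and `match` are made up for illustration. The two docs reuse the content fields from the search result shown later in this article.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of doc ids that contain it (hypothetical helper)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def match(index, query):
    """A doc is a hit as soon as it contains at least one query term."""
    result = set()
    for term in query.lower().split():
        result |= index.get(term, set())
    return result


docs = {
    "2": "i think java is the best programming language",
    "5": "spark is best big data solution based on scala ,an programming "
         "language similar to java spark",
}
index = build_inverted_index(docs)
print(sorted(match(index, "java spark")))  # ['2', '5']
```

No positions are looked at here, which is why this kind of query is cheap.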

phrase match

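phrase match --> First it scans the doc list of every term; then it finds the docs that contain all of the terms; then, for each of those docs, it computes the position of every term and checks whether the positions fall within the specified range. With slop, a more complex computation is needed to decide whether the terms can be moved, within slop steps, so that the doc matches.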
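The position check can be sketched in Python as follows. This is a simplified model, not Lucene's actual scorer: it assumes whitespace tokenization and query terms that are all different, and it brute-forces every combination of positions. The function names `min_slop` and `phrase_match` are hypothetical.

```python
from itertools import product


def min_slop(doc_tokens, query_terms):
    """Smallest slop at which the doc matches, or None if a term is missing."""
    positions = {}
    for pos, token in enumerate(doc_tokens):
        positions.setdefault(token, []).append(pos)

    shifted = []
    for offset, term in enumerate(query_terms):
        if term not in positions:
            return None  # phrase queries need every term to be present
        # Subtract the term's offset in the query: for an exact phrase
        # all shifted positions collapse to the same value.
        shifted.append([p - offset for p in positions[term]])

    # Try one position per term; the spread of the shifted positions is
    # the number of moves needed to line the terms up.
    return min(max(combo) - min(combo) for combo in product(*shifted))


def phrase_match(doc_tokens, query_terms, slop=0):
    needed = min_slop(doc_tokens, query_terms)
    return needed is not None and needed <= slop


doc = "java is faster than spark".split()
print(min_slop(doc, ["java", "spark"]))              # 3
print(phrase_match(doc, ["java", "spark"], slop=2))  # False
print(phrase_match(doc, ["java", "spark"], slop=3))  # True
```

Even in this toy version every candidate doc needs its position lists walked, which is the extra work a plain match query avoids.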

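Differences

A match query performs much better than phrase match and proximity match (phrase match with slop), because the latter two both have to compute the distance between term positions.

As a rough rule of thumb, a match query is about 10 times faster than phrase match and about 20 times faster than proximity match.

But don't worry too much. ES generally responds within milliseconds: a match query usually takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take from tens to hundreds of milliseconds, so they are still acceptable.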

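Optimizing proximity match

Optimizing proximity match performance generally means reducing the number of documents that the proximity match has to run on. The main idea is to use a match query first to filter out the data you need, and then use proximity match to boost the score of docs according to term distance. The proximity match only acts on the top n docs by score on each shard and readjusts their scores; this process is called rescoring. Because users usually page through results and only look at the first few pages, there is no need to run proximity match on every result.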

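As said earlier, match + proximity match together give you both recall and precision.

By default, match may hit 1000 docs, and proximity match has to run on every one of them, checking whether each doc can be matched within slop moves and then contributing its own score.

In many cases, though, even if match returns 1000 docs, users are paging through the results and will only look at the first few pages. For example, with 10 results per page and at most 5 pages viewed, that is 50 docs.

So proximity match only needs to do its slop matching and score contribution for the top 50 docs, not for all 1000.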

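rescore: re-scoring

match: 1000 docs, each of which already has a score at this point; proximity match: rescore only the top 50 docs. Among those 50 docs, the closer the terms are to each other, the higher the doc ranks.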
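The idea can be sketched in plain Python. This only illustrates the scoring arithmetic for a single shard, under the assumption that the original and rescore scores are added together with weights of 1.0 (Elasticsearch exposes these as `query_weight`, `rescore_query_weight` and `score_mode`). The function `rescore` and the hit dictionaries are hypothetical, and `rescore_fn` stands in for the expensive proximity match.

```python
def rescore(hits, rescore_fn, window_size=50,
            query_weight=1.0, rescore_query_weight=1.0):
    """Re-rank only the top `window_size` hits; the rest keep their order."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]

    rescored = []
    for hit in window:
        # The expensive query runs on the window only, never on `rest`.
        bonus = rescore_fn(hit)
        new_score = query_weight * hit["score"] + rescore_query_weight * bonus
        rescored.append({**hit, "score": new_score})

    rescored.sort(key=lambda h: h["score"], reverse=True)
    return rescored + rest


hits = [
    {"id": "a", "score": 3.0},
    {"id": "b", "score": 2.0},
    {"id": "c", "score": 1.0},
]
# Made-up proximity scores: "c" would get the largest boost.
proximity = {"a": 0.0, "b": 1.5, "c": 5.0}

result = rescore(hits, lambda h: proximity[h["id"]], window_size=2)
print([h["id"] for h in result])  # ['b', 'a', 'c']
```

Note that "c" stays last even though it would have received the largest boost: it is outside the window, so it is never rescored. That is the trade-off to keep in mind when choosing window_size.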

GET /forum/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}

Here window_size: 50 means that only the top 50 docs returned by the match query above are rescored.

Result:

{
  "took": 33,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams"
        }
      }
    ]
  }
}
