[Paper Reading] Ranking Relevance in Yahoo Search

抖腿大刘

Article link: https://blog.csdn.net/u010496169/article/details/87720183

How the paper proceeds

Problem -> Method -> Evaluation -> Conclusion

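Problem: issues in commercial search

General:

Conversion-rate issues:

Recency issues

Distance issues

.....

User experience:

Bad results ranked near the top (the worse of the two)

Good results not ranked near the top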

Method

Improvements targeting the problems above; this is the focus of the paper.

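Evaluation

Evaluated with expert judgments, using DCG

It focuses on the top n positions

It is a combined measure of relevance and position

Two datasets are used: offline data and Yahoo search data. Results on the offline data are not especially good, but results on the Yahoo data are better. The two datasets have different distributions; in essence, the proportion of negative samples differs.

Reference: https://www.cnblogs.com/memento/p/8673309.html (common information retrieval metrics: MAP, CG, DCG, NDCG)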
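
To make "a combined measure of relevance and position" concrete, here is a minimal sketch of DCG over the top n positions. It assumes the common form in which the gain at rank i is divided by log2(i + 1); the numeric gains in the example are made up for illustration and are not the grade values used in the paper.

```python
import math

def dcg_at_n(gains, n):
    """DCG over the top-n positions; gains[i] is the graded gain at rank i + 1."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:n], start=1))

# Same three results, two orderings (gains are illustrative only).
good_first = dcg_at_n([3, 2, 0], n=3)
bad_first = dcg_at_n([0, 2, 3], n=3)
print(good_first > bad_first)
```

Placing the bad result first lowers the score even though the set of results is identical, which is why the metric is sensitive to bad URLs at top positions.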

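Conclusion

Positive. The specific conclusions are covered step by step together with the methods.

Method in detail

Improvements: before and after

Before:

Now: the framework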

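Second round: core ranking

GBDT + logistic loss (better than hinge loss): it reduces the chance of bad URLs being ranked at top positions.

The main change is to the pseudo-response (that is, the step where the derivative is taken):

From the original:

changed to:

This reads as a coarse-to-fine change. Five grades were originally defined: Perfect, Excellent, Good, Fair, Bad. But when the labels are assigned, Perfect, Excellent and Good become +1 while Fair and Bad become -1, which throws away some information.

So at this step the paper defines scale(Perfect) = 3, scale(Excellent) = 2 and scale(Good/Fair/Bad) = 1, hoping to push the better results further up.

The paper says this is different from adding sample weights.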

This method is actually equivalent to adding different weights on loss for different labels, but it is different from adding sample weights, which actually affect the tree growth of each stage m but do not change pseudo-response.

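The reason is that sample weights do not affect the pseudo-response itself. (Other material I have read says that conventional sample weights essentially multiply the loss function by the weight, whereas here the multiplication is applied to the pseudo-response.)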
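
The scaling step can be sketched as follows. This is only an illustration under assumptions: it takes the loss to be the standard logistic loss log(1 + exp(-y * F)) with y in {+1, -1}, whose negative gradient with respect to F is y / (1 + exp(y * F)), and then multiplies that value by the scale of the grade. The names SCALE, POSITIVE_GRADES and scaled_pseudo_response are hypothetical.

```python
import math

# Scale factors as described in the notes above.
SCALE = {"perfect": 3.0, "excellent": 2.0, "good": 1.0, "fair": 1.0, "bad": 1.0}
# Grades mapped to the binary label +1; the rest map to -1.
POSITIVE_GRADES = {"perfect", "excellent", "good"}

def scaled_pseudo_response(grade, score):
    """Negative gradient of the logistic loss at `score`, times the grade scale."""
    y = 1.0 if grade in POSITIVE_GRADES else -1.0
    base = y / (1.0 + math.exp(y * score))  # assumed standard logistic-loss gradient
    return base * SCALE[grade]

for grade in ("perfect", "good", "bad"):
    print(grade, scaled_pseudo_response(grade, score=0.0))
```

At the same current score, a Perfect document gets a regression target three times larger than a Good one, so each boosting stage pulls it up harder, while the tree-growing procedure itself is left unweighted.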

Effect:

1. Bad URLs are reduced by 40%: On the other aspect, the percentage of bad/embarrassing results is reduced by 40% through LogisticRank. We observe that the improvement of LogisticRank is mainly from bad results removal.

2. DCG improves

On this dataset GBRank performs better than LambdaMART, so later comparisons against earlier methods use only GBRank. Most comparisons, though, are made directly against the improved LogisticRank; the "base" in the paper is LogisticRank.

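Third round: contextual reranking

Why it is introduced:

Core ranking only considers query x URL pair features and ignores the context formed by the other results

A thought:

If two docs are identical and both very good, should they really be placed at positions 1 and 2?

Contextual information is extracted over a few dozen (30) results and used for reranking.

The main contextual features are:

rank: the relative rank of the URL; mean; variance; normalized value; topic features used to compute the similarity between URLs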
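
As a rough illustration of the first four contextual features, the sketch below computes them for one numeric signal across the candidate list. The choice of signal, the descending sort used for the rank, and the function name contextual_features are all assumptions made for the example; the topic-similarity feature is left out.

```python
from statistics import mean, pvariance

def contextual_features(values):
    """Per-result rank, list mean, list variance and normalized value for one signal."""
    mu = mean(values)
    var = pvariance(values)
    std = var ** 0.5
    # Rank 1 goes to the largest value (sort direction is an assumption).
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    rank_of = {idx: pos + 1 for pos, idx in enumerate(order)}
    return [
        {
            "rank": rank_of[i],
            "mean": mu,
            "variance": var,
            "normalized": (v - mu) / std if std > 0 else 0.0,
        }
        for i, v in enumerate(values)
    ]

features = contextual_features([3.0, 1.0, 2.0])
print(features[0])
```

In practice the list would hold the top 30 results for a query rather than three values.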

Open question: the notes found no statement of which model is used, nor of which feature types matter most

Effect:

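Now: the features

Three main families of features are added: CS, TTM and DSM

My impression is that these first target the tail-query problem: (1) expand to high-frequency terms => CS; (2) translate into queries that have behavior data => TTM

Then look for features that go beyond the text level and also carry user-intent information => DSM

CS: click similarity

My own impression is that this is essentially about finding similar terms

Docs and queries, especially short queries, are expanded based on click relations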

During the propagation process, queries are enriched with new and relevant terms.

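Queries drive the doc representations, and the doc representations in turn feed back into the query representations.

Propagation formula:

Example (based on the click graph given in the paper):

In practice, only non-zero weights are considered and only the top K terms are kept. The top K terms are already sufficient, and their values are less likely to lose precision during iteration; this also makes convergence easier.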
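
The back-and-forth propagation with top K truncation can be sketched like this. It is a simplified illustration, not the paper's exact formula: it assumes each doc vector is the click-weighted sum of the vectors of the queries that clicked it (and vice versa), followed by keeping the K largest positive weights and L2-normalizing. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict

def _top_k_unit(vec, k):
    """Keep the k largest positive weights, then scale to unit L2 norm."""
    top = sorted(((t, w) for t, w in vec.items() if w > 0),
                 key=lambda tw: tw[1], reverse=True)[:k]
    norm = math.sqrt(sum(w * w for _, w in top))
    return {t: w / norm for t, w in top} if norm > 0 else {}

def _spread(clicks, source_vecs, from_query):
    """Push term weights across click edges, weighted by click count."""
    acc = defaultdict(lambda: defaultdict(float))
    for (q, d), count in clicks.items():
        src, dst = (q, d) if from_query else (d, q)
        for term, w in source_vecs.get(src, {}).items():
            acc[dst][term] += count * w
    return acc

def propagate(clicks, query_vecs, n_iters=1, k=20):
    """clicks: {(query, doc): count}; query_vecs: {query: {term: weight}}."""
    doc_vecs = {}
    for _ in range(n_iters):
        doc_vecs = {d: _top_k_unit(v, k)
                    for d, v in _spread(clicks, query_vecs, True).items()}
        updated = {q: _top_k_unit(v, k)
                   for q, v in _spread(clicks, doc_vecs, False).items()}
        query_vecs = {**query_vecs, **updated}  # queries without clicks keep their vector
    return query_vecs, doc_vecs

clicks = {("cheap flights", "d1"): 3, ("flight deals", "d1"): 1}
start = {"cheap flights": {"cheap": 1.0, "flights": 1.0},
         "flight deals": {"flight": 1.0, "deals": 1.0}}
queries, docs = propagate(clicks, start)
print(sorted(queries["cheap flights"]))
```

After one round, the query "cheap flights" has picked up the term "deals" from a co-clicked query, which is the enrichment effect described above.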

Effect:

1. Ranks first in feature importance

2. DCG improves

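TTM: translated text matching

In essence, a translation model rewrites the original query into 10 queries, and each of them is compared with the doc title by cosine similarity, for example taking the maximum.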
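
A minimal sketch of the "take the maximum cosine" variant is shown below. It assumes whitespace tokenization and raw term counts as the vectors, and it takes the rewrites as given (the translation model is not modeled). The names cosine and ttm_max_feature are hypothetical.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def ttm_max_feature(rewrites, title):
    """Best match between any rewritten query and the doc title."""
    return max((cosine(r, title) for r in rewrites), default=0.0)

rewrites = ["how old is obama", "obama age"]  # made-up rewrites for illustration
print(ttm_max_feature(rewrites, "Obama age"))
```

Other aggregations over the 10 rewrites (mean, weighted sum) fit the same shape by swapping out max.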

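The paper also lists two features that perform especially well.

Effect:

1. Those two useful features rank 7th and 10th in feature importance

2. DCG improves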

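DSM: deep semantic matching

The features above only work at the word level, but there is a semantic gap between query and doc, so semantic matching has to be handled by other means, which also helps understand user intent better. E.g. a search for "Beijing to Jinan train tickets" could surface Meituan Travel, and a search for "food delivery app" could surface Meituan or Ele.me.

In addition, torso and tail queries are generally longer than top queries, so they are more likely to carry more semantic information.

In essence it is a DSSM: the query and the first 10 docs are each fed in as vectors. The likelihood function is:

In production, Yahoo made the following optimizations:

1. Some simple methods are used to remove junk data

2. In the training set, the first doc is the positive example and the other 9 docs are negatives

3. For the doc, not only the title but also the domain is used

4. To reduce the vocabulary size, we use 3-letter shingling of words as proposed in DSSM (3-letter shingling is the usual word-splitting granularity for English in DSSM, chosen mainly with vector-space size and word collisions in mind) and normalized bag-of-words vectors of dimension 30,000 are used as input features.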
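
The 3-letter shingling in item 4 can be sketched as follows. It assumes the usual DSSM convention of padding each word with a boundary marker "#" before sliding a window of three characters, and it returns a sparse L2-normalized count vector keyed by trigram instead of a fixed 30,000-dimensional array. The function names are hypothetical.

```python
import math
from collections import Counter

def letter_trigrams(word):
    """Split one word into letter trigrams, with '#' as the boundary marker."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_vector(text):
    """Sparse, L2-normalized bag of letter trigrams for a query or title."""
    counts = Counter(g for w in text.split() for g in letter_trigrams(w))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()} if norm else {}

print(letter_trigrams("good"))
print(len(trigram_vector("good food")))
```

Because the number of distinct letter trigrams is far smaller than the number of distinct words, the input layer stays small and unseen words still map onto known trigrams.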

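Effect:

1. Ranks 8th in feature importance

2. DCG improves

On top queries DSM performs best; on torso and tail queries CS performs best.

CS does well on torso and tail queries, TTM does well on torso queries, and DSM performs about the same across all three query types.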

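Now: recall

Query Rewrite

This addresses the semantic gap between query and doc from the recall side. The main framework is shown in the figure below:

Some earlier approaches simply replaced the original query with the rewrite queries for retrieval, but doing so is risky.

Mainly, the original query is passed through a translation model to obtain rewrite queries.

Then each of these queries (the original query plus the rewrite queries) is scored for relevance against the top N docs

A doc's final score is max{score of each of these queries x this doc}; docs are sorted by final score and the top N are taken as the final output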
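
The merge step described above (max over queries per doc, then keep the top N) can be sketched as below. The scores and identifiers are made up, and merge_by_max is a hypothetical name; how each query x doc score is produced is outside the sketch.

```python
def merge_by_max(scores_per_query, top_n):
    """scores_per_query: {query: {doc: score}} -> top_n (doc, best score) pairs."""
    best = {}
    for doc_scores in scores_per_query.values():
        for doc, score in doc_scores.items():
            if doc not in best or score > best[doc]:
                best[doc] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

scores = {
    "original query": {"d1": 0.2, "d2": 0.9},
    "rewrite query": {"d1": 0.8, "d3": 0.5},
}
print(merge_by_max(scores, top_n=2))
```

Keeping the original query in the pool means a poor rewrite can only add candidates, never remove the docs the original query already matched well.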

The translation model has two phases: (1) the learning phase that learns phrase-level translations from queries to documents; and (2) the decoding phase that generates candidates for a given query.

The first phase learns phrase-level translations from the click graph, over query x title pairs. Titles carry a lot of information, and since queries are usually short and titles are also relatively short, a title resembles the query more than the doc body does.

The second phase picks the most likely query among the candidate rewrite queries, mainly through the following formula:

It uses some common feature functions (We here adopt widely used feature functions in traditional statistical machine translation systems including translation scores provided by the learning phase, language model of original queries, word penalty, phrase penalty and distortion), and also adds some useful hand-built feature functions, namely h():

Query feature functions:

h1-number of words in q, h2-number of stop words in q, h3-language model score of the query q, h4-query frequency of q, h5-average length of words in q;

Rewrite query feature functions:

h6-number of words in qc, h7-number of stop words in qc, h8-language model score of the query rewrite qc, h9-query frequencies of the query rewrite qc, h10-average length of words in qc;

Pair feature functions:

h11-Jaccard similarity of URLs shared by q and qc in the query-URL graph, h12-difference between the frequencies of the original query q and the rewrite candidate qc, h13-word-level cosine similarity between q and qc, h14-difference between the number of words between q and qc, h15-number of common words in q and qc, h16-difference of language model scores between q and qc, h17-difference between the number of stop words between q and qc, h18-difference between the average length of words in q and qc.

Of these, h11, h12 and h13 (the bolded ones) are the more important features
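
A few of the pair feature functions are simple enough to sketch directly. The example below covers h11, h13, h14 and h15 under assumptions: whitespace tokenization, raw counts for the cosine, and q minus qc as the sign convention for the difference. The URL sets and the name pair_features are hypothetical; features that need a language model or query logs (h12, h16) are omitted.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def word_cosine(words_a, words_b):
    """Cosine similarity between word-count vectors."""
    va, vb = Counter(words_a), Counter(words_b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def pair_features(q, qc, urls_q, urls_qc):
    """Subset of the pair features for an original query q and a rewrite qc."""
    wq, wqc = q.lower().split(), qc.lower().split()
    return {
        "h11_url_jaccard": jaccard(urls_q, urls_qc),
        "h13_word_cosine": word_cosine(wq, wqc),
        "h14_word_count_diff": len(wq) - len(wqc),  # sign convention assumed
        "h15_common_words": len(set(wq) & set(wqc)),
    }

feats = pair_features("how much tesla", "price tesla", {"u1", "u2"}, {"u2", "u3"})
print(feats)
```

h11 stands out because it uses click behavior and not the query text: two queries that lead users to the same URLs are likely to share intent even when they share few words.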

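Effect:

In production, to deal with latency, Yahoo uses a cache; on a cache miss, it falls back to down-weighting according to behavior

Two typical conversion-related aspects: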

Recency-sensitive ranking

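Thinking: if a search for query = "Trump" returns only news about his past as a businessman, with nothing about him as president or about the government shutdown, then the results do not meet expectations.

Goal:

Push fresh results up while leaving the scores of the existing older results unchanged. (We use a time-sensitive classifier to decide whether the component should be added, which prevents changing the score of non-recency queries and non-timesensitive documents )

Solution: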