[Search Engines Notes] 16: Ranked retrieval: Feature-based models


cos2cot


Article link: https://blog.csdn.net/cos2cot/article/details/78838679


References:

Jamie's course slides: http://boston.lti.cs.cmu.edu/classes/11-642/

阿衡's SE notes: http://www.shuang0420.com/categories/NLP/Search-Engines/

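Why Learning to Rank:

We have already covered many sources of retrieval evidence:

Retrieval models: Vector Space, BM25, language models…
Representations: title, body, url, inlink…
Query templates: sequential dependency models…
Query-independent evidence: PageRank, url depth…

Evidence from these different methods can be combined to improve search engine accuracy. Because there are many potential combinations and many parameters to tune, combining them by hand is impractical. Learning to Rank was developed to do this combination automatically.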

Main idea:

Learn a model that combines many types of evidence (each type as a feature).

As in other ML problems:

Given training data that maps feature vectors to desired scores, learn a model Y = f(X; w).
For a new data point x, plug it into the model to get y = f(x; w).
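
The two steps above can be sketched with a tiny linear model. This is only an illustration under assumptions that are not in the notes: f is taken to be linear (f(x; w) = x · w), it is fit by plain gradient descent on squared error, and the training numbers are made up.

```python
def f(x, w):
    """Linear model f(x; w) = x . w (an assumed form, for illustration only)."""
    return sum(xi * wi for xi, wi in zip(x, w))

def fit(X, Y, lr=0.05, epochs=2000):
    """Learn w by batch gradient descent on mean squared error."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(X, Y):
            err = f(x, w) - y
            for k, xk in enumerate(x):
                grad[k] += 2 * err * xk / n
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w

# Made-up training data: feature vectors and their desired scores.
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
Y_train = [2.0, 1.0, 3.0, 5.0]

w = fit(X_train, Y_train)     # step 1: learn the model from training data
y_new = f([3.0, 2.0], w)      # step 2: score a new feature vector
```

The toy targets were generated by w = [2, 1], so the fitted weights should land close to those values.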

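Introduction:

LeToR is a supervised learning problem. The familiar supervised tasks are classification (hard, because a wrong class is completely wrong) and regression (easy, because the loss is the distance from the target, so even a large error is not completely wrong). The ranking task is to find the best ranking (order) of the given documents, but in practice it is usually recast as finding ranking scores.

In a large search engine, retrieval runs as a sequence of retrieval models:

exact-match Boolean: form a set of docs. It has to be fast, and at this stage document quality is very uneven, so a crude algorithm is good enough;
best-match retrieval: rank the set and keep the top part;
L2R: rerank and keep the top part. By now there are few documents and their quality is similar, so a more complex method is needed to tell them apart.

In large search engines LeToR is used at the top level (reranking a small number of documents) to balance efficiency and effectiveness.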
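
A rough sketch of that three-stage cascade follows. Everything in it is a stand-in chosen for illustration: raw term frequency plays the role of the best-match model, the reranker is a hand-set linear model over two hypothetical features (term frequency and a spam score), and the documents are invented.

```python
def boolean_match(query_terms, docs):
    """Stage 1: exact-match Boolean AND; ids of docs containing every query term."""
    return [doc_id for doc_id, text in docs.items()
            if all(t in text.split() for t in query_terms)]

def best_match(query_terms, docs, candidates, keep):
    """Stage 2: rank by a cheap score (raw term frequency, standing in for BM25)."""
    def tf(doc_id):
        words = docs[doc_id].split()
        return sum(words.count(t) for t in query_terms)
    return sorted(candidates, key=tf, reverse=True)[:keep]

def rerank(candidates, features, w, keep):
    """Stage 3: rerank the few survivors with a linear model over richer features."""
    def score(doc_id):
        return sum(fi * wi for fi, wi in zip(features[doc_id], w))
    return sorted(candidates, key=score, reverse=True)[:keep]

docs = {
    "d1": "cheap flights cheap flights cheap",
    "d2": "cheap flights to boston",
    "d3": "flights to boston",
    "d4": "cheap cheap flights deals",
}
query = ["cheap", "flights"]

stage1 = boolean_match(query, docs)
stage2 = best_match(query, docs, stage1, keep=2)

# Hypothetical features per doc: [term frequency, spam score]; weights set by hand.
features = {"d1": [5.0, 0.9], "d4": [3.0, 0.1]}
stage3 = rerank(stage2, features, w=[0.1, -2.0], keep=1)
```

Each stage sees fewer documents than the one before it, which is why the most expensive model can be afforded only at the end.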

L2R has three main dimensions:

Document representation: features
Type of training data
ML algorithm

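L2R Framework:

S1: For each query, run feature extraction on the docs, representing each doc d as a feature vector x.

S2: Use the training data to learn a model h.

The model gives each doc a score with respect to the query.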

Document representation (features):

VSM
coordinationMatch: number of query terms matching doc d
BM25 on the whole doc or on individual fields (url, body, inlink, title, keywords…)
Indri
PageRank
Spam
URLDepth
Wiki score
Avg word length
…
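
A few of the simpler features in this list can be computed directly. The sketch below is illustrative only: the function names, the whitespace tokenization, the definition of URL depth as the number of non-empty path segments, and the sample document are all assumptions, not taken from the course material.

```python
from urllib.parse import urlparse

def coordination_match(query_terms, doc_terms):
    """Number of distinct query terms that appear in the document."""
    return len(set(query_terms) & set(doc_terms))

def url_depth(url):
    """Number of non-empty path segments in the URL (an assumed definition)."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

def avg_word_length(doc_terms):
    """Mean token length in characters; 0.0 for an empty document."""
    return sum(len(t) for t in doc_terms) / len(doc_terms) if doc_terms else 0.0

def extract_features(query, doc):
    """Build the feature vector for one (query, doc) pair."""
    q_terms = query.lower().split()
    d_terms = doc["body"].lower().split()
    return [
        float(coordination_match(q_terms, d_terms)),
        float(url_depth(doc["url"])),
        avg_word_length(d_terms),
    ]

# Made-up document for illustration.
doc = {"url": "http://example.com/a/b/page.html",
       "body": "learning to rank for search"}
x = extract_features("learning rank models", doc)
```

Note that coordinationMatch depends on the query, while URLDepth and average word length are query-independent; all of them end up in the same vector.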

Types of training data:

Binary assessments: relevant or non-relevant (two-valued)
Document scores: a real-valued score or one of several levels
Preferences (di > dj)
Rankings (di > dj > dk > dm > …)

L2R Approaches:

Pointwise
- Training data is the class or score of a single document
- An accurate score does not imply an accurate ranking
- Position information is ignored
Pairwise
- Training data is a preference between a pair of documents
- An accurate preference does not imply an accurate ranking
- Position information is ignored
Listwise
- Training data is a ranking of docs
- Directly optimizing ranking metrics is hard
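
The pointwise caveat, that an accurate score is not the same as an accurate ranking, can be seen in a small constructed example. The labels and the two sets of predictions below are made up; mean squared error is used as an assumed pointwise loss.

```python
def mse(pred, target):
    """Mean squared error, a typical pointwise loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def ranking(scores):
    """Document indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

labels  = [2.0, 1.0, 0.0]   # desired scores for three docs
model_a = [1.4, 1.5, 0.0]   # close to the labels, but swaps the top two docs
model_b = [5.0, 4.0, 3.0]   # far from the labels, but orders the docs correctly

loss_a, loss_b = mse(model_a, labels), mse(model_b, labels)
rank_a, rank_b = ranking(model_a), ranking(model_b)
```

Model A has the much smaller pointwise loss yet produces the wrong order, while model B has a large loss and a perfect ranking.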

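What they share:

All use a trained model h to estimate the score of x (the doc's feature vector);

All rank by the scores that h produces.

Where they differ:

Different training methods;

Different training data.

Pointwise is the weakest.

Pro: simple
Con: it focuses on the wrong task, which hurts effectiveness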

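Pairwise and listwise are about equally useful:

Pairwise has an imperfect learning target but is easy to implement
- It minimizes misordering errors, whereas what we want is the best ranking
- Simpler learning problem, with theoretical guarantees

-------------

- Queries whose results have roughly equal numbers of rel and non-rel docs take up most of the training data (the number of rel/non-rel pairs is largest when the two counts are balanced);
- More sensitive to noisy labels. (Open question: how does it merge results?)
Listwise has an excellent learning target but is hard to implement
- Perfect learning target
- Harder learning problem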

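Pointwise:

Training a model using individual documents.

Doc score: can be binary, a relevance level, or a real number?

Score-learning methods assume that each doc's score in the training data is the desired score. In fact the scores can be arbitrary as long as they give the correct order. (Open question: does the choice really have no effect at all?)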
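
One part of that open question is easy to check: the ranking induced by a set of scores does not change under any strictly increasing transformation of them. The sketch below uses made-up scores and two arbitrary transforms. It says nothing about whether the choice of target scores affects what a regression model learns, which is the harder half of the question.

```python
def ranking(scores):
    """Document indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

scores   = [0.2, 1.7, 0.9, -0.4]
rescaled = [10 * s - 3 for s in scores]   # strictly increasing linear transform
cubed    = [s ** 3 for s in scores]       # strictly increasing nonlinear transform

order = ranking(scores)
```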

Pairwise:

Use the scores or relevance levels of the docs in the data to generate di > dj pairs as training data.

Goal: minimize the number of misclassified doc pairs.

Open question: are all pairs used in scoring?
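
A minimal sketch of the pair generation and of the quantity being minimized, assuming graded labels for the docs of one query. Using every pair with unequal labels is one possible choice, not necessarily what the course does; counting ties in the model scores as errors is also an assumption.

```python
from itertools import combinations

def make_pairs(labels):
    """All (i, j) with labels[i] > labels[j], i.e. training preferences di > dj."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
        # equal labels express no preference, so no pair is generated
    return pairs

def misordered(pairs, scores):
    """Number of preference pairs the model orders wrongly (ties count as wrong)."""
    return sum(1 for i, j in pairs if scores[i] <= scores[j])

labels = [2, 0, 1, 0]             # made-up relevance levels for one query
pairs  = make_pairs(labels)
scores = [0.9, 0.5, 0.4, 0.1]     # made-up model scores h(x) for the same docs
errors = misordered(pairs, scores)
```

Here the only mistake is that the doc with label 1 is scored below a doc with label 0.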

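Listwise:

Generating training data: sort the docs by relevance score. (Open question: does the training data also contain metrics?)

Challenge: many metrics are discontinuous or not convex, which makes them hard to optimize.

Solutions:

Find another objective that is easier to optimize, such as the likelihood of the best ranking
Optimize an approximation of the metric
Bound the objective function (???)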
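
To see why a ranking metric is awkward to optimize directly, the sketch below evaluates one as a function of a single model weight. DCG is used only as an example metric (the notes do not name one), and the two documents and the scoring function w*x1 + x2 are invented.

```python
import math

def dcg(labels_in_rank_order):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels_in_rank_order))

def dcg_of_model(w, docs):
    """Rank docs by the score w*x1 + x2 and return the DCG of that ranking."""
    ranked = sorted(docs, key=lambda d: w * d["x1"] + d["x2"], reverse=True)
    return dcg([d["rel"] for d in ranked])

docs = [
    {"x1": 1.0, "x2": 0.0, "rel": 2},   # relevant doc, scored w
    {"x1": 0.0, "x2": 1.0, "rel": 0},   # non-relevant doc, scored 1
]

low_a,  low_b  = dcg_of_model(0.2, docs), dcg_of_model(0.9, docs)
high_a, high_b = dcg_of_model(1.1, docs), dcg_of_model(5.0, docs)
```

The metric is flat on each side of w = 1 and jumps there, so its gradient with respect to w is zero almost everywhere and gives a gradient-based learner nothing to follow.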

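ListMLE: see the printed slides!!

Big picture:

Define a probability for a ranking, P(x1, x2, x3, ...)
Find a model h(x) that maximizes the probability of the best rankings in the training data
Use the scores from h(x) to rank new data

How to define the probability P(x1, x2, x3, …)?

Modeling it directly is impractical because the space is too large!

So ListMLE relies on an independence assumption, which shrinks the space.

If independence is assumed, does that mean it cannot be used directly for diversification? Yes!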
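
The notes defer the formula to the slides, so the following is a sketch under an assumption: it uses the Plackett-Luce form that ListMLE is usually described with, where the doc at each position is chosen from the remaining docs with probability proportional to exp(h(x)). Each doc enters only through its own score, which is where the independence assumption shows up. The function names are hypothetical.

```python
import math

def log_likelihood(scores_in_best_order):
    """Log-probability of the best ranking under a Plackett-Luce model.

    P(ranking) = prod_i exp(s_i) / sum_{j >= i} exp(s_j),
    where s_i = h(x_i) is the model score of the doc at position i.
    """
    total = 0.0
    for i in range(len(scores_in_best_order)):
        tail = scores_in_best_order[i:]
        m = max(tail)  # subtract the max before exponentiating, for stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        total += scores_in_best_order[i] - log_norm
    return total

def listmle_loss(scores_in_best_order):
    """Negative log-likelihood; training would minimize this over the model's weights."""
    return -log_likelihood(scores_in_best_order)

# Scores listed in the order of the best (ground-truth) ranking.
loss_uniform = listmle_loss([0.0, 0.0, 0.0])   # model cannot tell the docs apart
loss_good    = listmle_loss([3.0, 2.0, 1.0])   # scores agree with the best ranking
loss_bad     = listmle_loss([1.0, 2.0, 3.0])   # scores reverse the best ranking
```

With three indistinguishable docs every one of the 3! orderings has probability 1/6, so the loss is log 6. Scores that agree with the best ranking give a lower loss than scores that reverse it. Unlike a ranking metric, this loss is smooth in the scores.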
