[Translation][Paper][WWW'10] Classification-Enhanced Ranking (2)

hillbird

1. INTRODUCTION
Many have speculated that classifying web pages can improve a search engine's ranking of results relevant to a query [12, 23, 26, 13]. Intuitively, results should be more relevant when they match the class of a query and less relevant when they do not. While much research has focused on improving query and document classification under the assumption that relevance gains would follow, we focus on why and how classification can directly improve relevance. In particular, we present a simple framework for classification-enhanced ranking that uses clicks in combination with the classification of web pages to derive a class distribution for the query. At the heart of our approach is the assumption that clicks on a specific URL should not only serve as a relevance signal for that URL but can also be used as a relevance signal for other URLs that are related to the same class.

This situation is depicted in Figure 1a, where we observe a user issuing the query "historical deaths" and clicking on the second result. The gray boxes depict the automatically predicted classes for each result. Note that these classes are not actually displayed to the user. We interpret a click on the second result as evidence that the user desires results from the "Society/History" class in general, including both the clicked URL and other URLs belonging to the same class. That is to say, using the query logs we implicitly derive that the class of this query is "Society/History". Rather than create a heuristic to balance evidence for the class of the URL against other features, we introduce this evidence within a machine learning framework and let the model learn the weights that balance these competing features. Using click information solely on a per-URL basis would only improve the ranking by moving the clicked result higher. However, Figure 1b shows that by using class information to boost results that share the same class as the clicked result, we improve the relevance of the results in positions 1-3 (with the clicked result in position 1).
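To make the "historical deaths" example concrete, the sketch below derives a query class distribution from click counts and per-URL class predictions. This section does not give the paper's actual estimator, so the click-weighted average used here is an assumption, and the function name, the URL keys, and the "Arts/Movies" class are hypothetical.

```python
from collections import defaultdict


def query_class_distribution(clicks, url_classes):
    """Estimate P(class | query) as a click-weighted average of URL class distributions.

    clicks:      dict mapping URL -> click count observed for one query
    url_classes: dict mapping URL -> {class label: predicted probability}

    Assumption: a plain click-weighted average; the paper's estimator may differ.
    """
    totals = defaultdict(float)
    for url, count in clicks.items():
        # Each click votes for the classes of the clicked URL.
        for label, prob in url_classes.get(url, {}).items():
            totals[label] += count * prob
    norm = sum(totals.values())
    if norm == 0:
        return {}  # no click evidence for this query
    return {label: weight / norm for label, weight in totals.items()}


# Hypothetical log for the query "historical deaths".
clicks = {"url_2": 3, "url_5": 1}
url_classes = {
    "url_2": {"Society/History": 1.0},
    "url_5": {"Arts/Movies": 1.0},
}
print(query_class_distribution(clicks, url_classes))
# {'Society/History': 0.75, 'Arts/Movies': 0.25}
```

With this distribution in hand, any unclicked URL predicted to be in "Society/History" can receive class-based evidence even though it has no clicks of its own.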

Although previous research [12] has demonstrated that categorical interfaces can significantly decrease the time a user spends finding a search result, these interfaces are fraught with their own challenges that have prevented widespread adoption (e.g. precision of the classifiers, labels of the classes, ordering of the classes). As a consequence, research in this area has often focused on improving these components or the underlying precision of the prediction mechanisms, assuming categories will be directly displayed to the user. In this paper, we do not propose to show the categories to users but rather to use them to improve ranking. By using the classes of clicked results we can boost those results as well as other results that share the same classes. To achieve this, we define a variety of features that capture the match between the class distributions of a web page and a query, the ambiguity of a query, and the coverage of a retrieved result relative to a query's set of classes. Experimental results demonstrate that a ranker learned with these features significantly improves ranking over a competitive baseline. In an empirical study, we demonstrate significant improvements over a large set of web queries. In a further breakdown of those web queries, the statistically significant gains continue to hold even on tail and long queries where click data is sparse. Furthermore, our methodology is agnostic with respect to the classification space, and is therefore general enough to be used as a mechanism to derive query classes for a variety of alternative taxonomies.
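The three feature families named above (match, ambiguity, coverage) can be illustrated with a minimal sketch. The introduction does not define the features, so the formulas below are assumptions chosen for illustration: a dot product for match, Shannon entropy for ambiguity, and covered probability mass for coverage. All function names are hypothetical.

```python
import math


def class_match(query_dist, url_dist):
    """Assumed match feature: dot product of query and URL class distributions."""
    return sum(p * url_dist.get(label, 0.0) for label, p in query_dist.items())


def query_ambiguity(query_dist):
    """Assumed ambiguity feature: Shannon entropy (bits) of the query class distribution."""
    return -sum(p * math.log2(p) for p in query_dist.values() if p > 0)


def class_coverage(query_dist, url_dist):
    """Assumed coverage feature: query class mass on classes the URL is predicted to have."""
    return sum(p for label, p in query_dist.items() if url_dist.get(label, 0.0) > 0)


query_dist = {"Society/History": 0.75, "Arts/Movies": 0.25}
url_dist = {"Society/History": 1.0}

features = {
    "match": class_match(query_dist, url_dist),        # 0.75
    "ambiguity": query_ambiguity(query_dist),          # about 0.811 bits
    "coverage": class_coverage(query_dist, url_dist),  # 0.75
}
print(features)
```

Values like these would be appended to a URL's existing feature vector so that the learned ranker, not a hand-tuned heuristic, decides how much weight class evidence deserves.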

In the remainder of the paper, we first give a brief overview of related work, placing this work in the broader context of query classification research. Then, we describe our approach in detail. After that, we present the results of an empirical study that highlight the improvements achievable via this method, and we conduct a small study of feature importance to indicate promising avenues of further study. Finally, we discuss future work and conclude.

Introduction (translated):

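Many have speculated that classifying web pages can improve how well a search engine's ranked results match a query [12, 23, 26, 13]. Intuitively, a result is more relevant when it matches the query's class and less relevant when it does not. Much research has tried to improve query and document classification on the assumption that relevance gains would follow; we instead focus on why and how classification can directly improve relevance. Specifically, we propose a simple framework for classification-enhanced ranking that combines clicks with web page classification to obtain a class distribution for a query. The core assumption is that a click on a particular URL is a relevance signal not only for that URL but also for other URLs in the same class.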

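Figure 1a depicts this situation: a user issues the query "historical deaths" and clicks the second result. The gray boxes show the automatically predicted class of each result; these classes are not shown to the user. We take the click on the second result as evidence that the user wants results from the "Society/History" class in general, including the clicked URL and other URLs in that class. In other words, from the query logs we implicitly infer that this query belongs to "Society/History". Rather than design a heuristic to balance URL class evidence against other features, we bring this evidence into a machine learning framework and let the model learn the weights. Using click information on a per-URL basis alone would only move the clicked result higher. As Figure 1b shows, however, using class information to boost results that share the clicked result's class improves the relevance of positions 1-3 (with the clicked result at position 1).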

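Although earlier research [12] showed that categorical interfaces can significantly reduce the time users spend finding a search result, such interfaces face challenges of their own that have hindered adoption (for example, classifier precision, class labels, and class ordering). Research in this area has therefore concentrated on these components or on the underlying precision of the prediction mechanisms, assuming the categories will be displayed directly to users. In this paper we do not propose showing categories to users; we use them to improve ranking. Using the classes of clicked results, we can boost those results as well as other results in the same classes.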

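To do this, we define a variety of features that capture the match between the class distributions of a web page and a query, the ambiguity of the query, and the coverage of a retrieved result relative to the query's set of classes. Experiments show that a ranker learned with these features significantly improves on a competitive baseline. In an empirical study we show significant improvements over a large set of web queries, and the statistically significant gains hold even on tail and long queries where click data is sparse. In addition, our method is agnostic to the classification space, so it is general enough to serve as a mechanism for deriving query classes under a variety of other taxonomies.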

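In the rest of the paper, we first give a brief overview of related work and place this work in the broader context of query classification research. We then describe our approach in detail. Next, we present the results of an empirical study that highlight the improvements achievable with this method, and we conduct a small study of feature importance to point to promising directions for further work. Finally, we discuss future work and conclude.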