[Machine Learning] Notes on Clustering: Applications of Clustering in Information Retrieval

Category: Machine Learning Notes. Tags: machine learning, feature engineering, clustering, information retrieval.

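This post is reposted and translated from Introduction to Information Retrieval (Manning, Raghavan and Schütze; Cambridge University Press), whose online edition is hosted at Stanford.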

I am still learning, and I added some of my own understanding while translating. Discussion is welcome!

Some applications of clustering in information retrieval (reposted):

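First, let us look at the cluster hypothesis. It states the fundamental assumption we make when using clustering in information retrieval: documents in the same cluster behave similarly with respect to relevance to information needs (my reading: the documents in one cluster convey very similar information). The hypothesis says that if a document from a cluster is relevant to a search request, then other documents from the same cluster are likely to be relevant as well. This is because clustering puts together documents that share many terms.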

Table 1: Some applications of clustering in information retrieval
Application	What is clustered	Benefit	Example
Search result clustering	search results	more effective information presentation to user	Figure 1
Scatter-Gather	(subsets of) collection	alternative user interface: "search without typing"	Figure 2
Collection clustering	collection	effective information presentation for exploratory browsing
Language modeling	collection	increased precision and/or recall
Cluster-based retrieval	collection	higher efficiency: faster search

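Table 1 shows some of the main applications of clustering in information retrieval. They differ in the set of documents they cluster (search results, the collection, or subsets of the collection) and in the aspect of the information retrieval system they try to improve (user experience, user interface, or the effectiveness or efficiency of the search system). All of them rest on the basic assumption stated by the cluster hypothesis (same cluster, similar document content).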

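The first application in Table 1 is search result clustering, where search results are the documents returned in response to a query. The default presentation of search results in information retrieval is a simple list. Users scan the list from top to bottom until they find the information they are looking for. Search result clustering instead clusters the results so that similar documents appear together. Scanning a few coherent groups is usually easier than scanning many individual documents. This is particularly useful when a search term has several word senses. The example in Figure 1 is jaguar. Three frequent senses on the web refer to the car, the animal, and an Apple operating system. The "Clustered Results" panel returned by the Vivísimo search engine (http://vivisimo.com) can be a more effective user interface for understanding what is in the search results than a simple list of documents.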

Figure 1 (clust01.eps). None of the top four search results covers the animal sense of jaguar, but the user can reach that cluster through the Clustered Results panel on the left.

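The second application in Table 1 is Scatter-Gather, whose goal is a better user interface. Scatter-Gather clusters the whole collection to obtain groups of documents that the user can select, or gather. The selected groups are merged, and the resulting set is clustered again. The process repeats until a cluster of interest is found. Figure 2 shows an example.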

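Figure 2 (clust02.eps). The top row shows a collection of New York Times stories scattered into eight categories/clusters (the "Scatter" in Scatter-Gather). The user manually picks three of these eight clusters and gathers them into a smaller collection, International Stories. Scattering International Stories again yields eight clusters, from which the user picks two to obtain Smaller International Stories. The process repeats until a small cluster of relevant documents is found (e.g., W. Africa).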
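The scatter and gather steps can be sketched in a few lines of Python. This is only a minimal illustration under loud assumptions: the documents are made up, the user's choices are hard-coded, and the clustering is a naive "leader" clustering on Jaccard word overlap, swapped in because it is short and deterministic. Raising the threshold in the second round is a stand-in for the finer clusters that re-clustering a smaller collection would produce; the source does not say which clustering algorithm Scatter-Gather uses.

```python
def tokens(text):
    """Represent a document as its set of words."""
    return set(text.split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scatter(docs, threshold):
    """Naive leader clustering (an illustrative stand-in).

    Each document joins the first cluster whose leader (first member)
    is similar enough; otherwise it starts a new cluster.
    """
    clusters = []
    for doc_id in sorted(docs):
        for cluster in clusters:
            leader = cluster[0]
            if jaccard(tokens(docs[doc_id]), tokens(docs[leader])) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


def gather(docs, clusters, chosen):
    """Merge the clusters the user selected into a smaller collection."""
    return {d: docs[d] for i in chosen for d in clusters[i]}


# Hypothetical toy collection.
DOCS = {
    "d1": "africa election vote",
    "d2": "africa election result",
    "d3": "africa drought aid",
    "d4": "oil price market",
    "d5": "oil market opec",
    "d6": "sports football cup",
}

round1 = scatter(DOCS, threshold=0.15)       # scatter the whole collection
smaller = gather(DOCS, round1, chosen=[0])   # simulated user picks cluster 0
round2 = scatter(smaller, threshold=0.4)     # scatter the gathered subset
print(round1)
print(round2)
```

In a real interface the `chosen` indices would come from the user, and the loop would continue until a cluster of interest is reached.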

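Automatically generated clusters such as those in Figure 2 are not as neatly organized as a manually constructed hierarchical tree. Finding descriptive labels for clusters automatically is also a hard problem. Still, cluster-based navigation is an interesting alternative to keyword search, the standard information retrieval paradigm. This is especially true when users prefer browsing over searching because they are unsure which search terms to use.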

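As an alternative to the user-mediated iterative clustering in Scatter-Gather, we can also compute a static hierarchical clustering of a collection that is not influenced by user interactions ("Collection clustering" in Table 1). Google News and its precursor, the Columbia NewsBlaster system, are examples of this approach. For news, we need to recompute the clustering frequently so that users can access the latest stories. Clustering is well suited to accessing a collection of news stories, because news reading is not really search but rather a process of selecting a subset of stories about recent events.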

The fourth application of clustering exploits the cluster hypothesis directly for improving search results, based on a clustering of the entire collection. We use a standard inverted index to identify an initial set of documents that match the query, but we then add other documents from the same clusters even if they have low similarity to the query. For example, if the query is car and several car documents are taken from a cluster of automobile documents, then we can add documents from this cluster that use terms other than car (automobile, vehicle, etc.). This can increase recall since a group of documents with high mutual similarity is often relevant as a whole.
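The expansion step might look like the following sketch. The documents, the cluster assignment `CLUSTER_OF`, and the function names are all made up for illustration; the clustering of the collection is assumed to have been computed beforehand by some method the text does not specify.

```python
from collections import defaultdict

# Hypothetical toy collection: doc id -> text.
DOCS = {
    "d1": "car dealer sells used car",
    "d2": "automobile engine repair",
    "d3": "vehicle insurance for drivers",
    "d4": "jaguar habitat in the rainforest",
}
# Assumed precomputed clustering of the whole collection: doc id -> cluster id.
CLUSTER_OF = {"d1": "autos", "d2": "autos", "d3": "autos", "d4": "animals"}


def build_inverted_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def expand_with_clusters(query, index, cluster_of):
    """Return (documents matching the query, those plus their cluster mates)."""
    initial = set()
    for term in query.split():
        initial |= index.get(term, set())
    hit_clusters = {cluster_of[d] for d in initial}
    expanded = {d for d, c in cluster_of.items() if c in hit_clusters}
    return initial, expanded


index = build_inverted_index(DOCS)
initial, expanded = expand_with_clusters("car", index, CLUSTER_OF)
print(sorted(initial))   # documents that literally contain "car"
print(sorted(expanded))  # plus documents from the same cluster
```

Here the query term matches only `d1`, while the expanded set also contains `d2` and `d3`, which talk about automobiles and vehicles without using the word car.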

More recently this idea has been used for language modeling. Equation 102 showed that to avoid sparse data problems in the language modeling approach to IR, the model of document d can be interpolated with a collection model. But the collection contains many documents with terms untypical of d. By replacing the collection model with a model derived from d's cluster, we get more accurate estimates of the occurrence probabilities of terms in d.
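A small numeric sketch may make the difference concrete. It assumes the interpolation is a linear mixture with weight `lam` (the referenced equation is not reproduced here, so this form, the value 0.5, and all the toy texts are assumptions) and compares smoothing document d with its cluster against smoothing with the whole collection.

```python
from collections import Counter


def mle(term, texts):
    """Maximum likelihood estimate of a term's probability in a set of texts."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    return counts[term] / total if total else 0.0


def smoothed_prob(term, doc, background, lam=0.5):
    """Linear interpolation of the document model with a background model."""
    return lam * mle(term, [doc]) + (1 - lam) * mle(term, background)


# Hypothetical data: d sits in a cluster of automobile documents.
d = "car dealer sells car"
cluster = [d, "automobile dealer", "automobile repair"]
collection = cluster + ["jaguar habitat rainforest", "rainforest animal habitat"]

# "automobile" never occurs in d, so its estimate comes from the background.
p_cluster = smoothed_prob("automobile", d, cluster)
p_collection = smoothed_prob("automobile", d, collection)
print(p_cluster, p_collection)
```

With the cluster as background, the unseen but topically related term automobile gets a higher probability than with the whole collection, whose unrelated documents dilute the estimate.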

Clustering can also speed up search. As we saw in Section 6.3.2, search in the vector space model amounts to finding the nearest neighbors to the query. The inverted index supports fast nearest-neighbor search for the standard IR setting. However, sometimes we may not be able to use an inverted index efficiently, e.g., in latent semantic indexing (Chapter 18). In such cases, we could compute the similarity of the query to every document, but this is slow. The cluster hypothesis offers an alternative: Find the clusters that are closest to the query and only consider documents from these clusters. Within this much smaller set, we can compute similarities exhaustively and rank documents in the usual way. Since there are many fewer clusters than documents, finding the closest cluster is fast; and since the documents matching a query are all similar to each other, they tend to be in the same clusters. While this algorithm is inexact, the expected decrease in search quality is small. This is essentially the application of clustering that was covered in Section 7.1.6.
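The two-stage search described above can be sketched as follows. This is a minimal illustration, assuming raw term-count vectors, cosine similarity, clusters represented by the sum of their members' vectors, and a made-up precomputed clustering `CLUSTERS`; a real system would also precompute the centroids rather than rebuild them per query.

```python
import math
from collections import Counter


def vec(text):
    """Term-count vector of a text."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of member vectors (cosine ignores scale, so no division is needed)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


# Hypothetical precomputed clustering: cluster id -> {doc id: text}.
CLUSTERS = {
    "autos": {
        "d1": "car dealer sells used car",
        "d2": "car engine repair",
        "d3": "vehicle insurance",
    },
    "animals": {
        "d4": "jaguar habitat rainforest",
        "d5": "jaguar cat hunting",
    },
}


def cluster_search(query, clusters, n_clusters=1):
    """Rank only the documents in the clusters closest to the query."""
    q = vec(query)
    ranked = sorted(
        clusters,
        key=lambda c: cosine(q, centroid(vec(t) for t in clusters[c].values())),
        reverse=True,
    )
    candidates = {}
    for c in ranked[:n_clusters]:
        candidates.update(clusters[c])
    # Exhaustive similarity computation, but only within the chosen clusters.
    return sorted(candidates, key=lambda d: cosine(q, vec(candidates[d])), reverse=True)


result = cluster_search("car repair", CLUSTERS)
print(result)
```

For the query "car repair" only the documents in the `autos` cluster are scored, so `d4` and `d5` are never compared with the query. Increasing `n_clusters` trades speed for a lower chance of missing relevant documents in other clusters.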