Introduction to Information Retrieval, Chapter 17 Notes (in English)

Hierarchical clustering

Abstract

The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic.

Hierarchical clustering outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.

The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM.

Different applications of flat and hierarchical clustering in IR:
we select flat clustering when efficiency is important, and hierarchical clustering when one of the potential problems of flat clustering is a concern.

Hierarchical agglomerative clustering

HAC is bottom-up hierarchical clustering; it is used frequently in IR.

  • HAC is visualized as a dendrogram

[Figure: a dendrogram of an HAC clustering]

The merge operation is monotonic: if s1, s2, . . . , sK−1 are the combination similarities of the successive merges of an HAC, then s1 ≥ s2 ≥ . . . ≥ sK−1 holds.

  • Cutting the dendrogram
  1. Cut at a prespecified level of similarity.
  2. Cut the dendrogram where the gap between two successive combination similarities is largest.
  3. Apply the following equation:

[Equation: criterion for choosing the number of clusters K′]

where K’ refers to the cut of the hierarchy that results in K’ clusters
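
A minimal Python sketch of the first two criteria, assuming the combination similarities of the successive merges are available in merge order (function and variable names are mine, not the book's). The equation behind criterion 3 is presumably the Chapter 16 criterion applied to the hierarchy, i.e. K = argmin over K′ of [RSS(K′) + λK′], where λ penalizes each additional cluster.

```python
import numpy as np

def cut_by_similarity(comb_sims, threshold):
    """Criterion 1: stop merging once combination similarity drops below
    a prespecified level; returns the number of clusters to keep."""
    n_docs = len(comb_sims) + 1
    kept_merges = sum(1 for s in comb_sims if s >= threshold)
    return n_docs - kept_merges

def cut_by_largest_gap(comb_sims):
    """Criterion 2: cut where the gap between two successive combination
    similarities is largest (comb_sims is the non-increasing merge sequence)."""
    n_docs = len(comb_sims) + 1
    if len(comb_sims) < 2:
        return 1                                   # trivial hierarchy
    gaps = [comb_sims[k] - comb_sims[k + 1] for k in range(len(comb_sims) - 1)]
    k = int(np.argmax(gaps))                       # largest gap is after merge k
    return n_docs - (k + 1)                        # clusters left after k+1 merges
```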

  • Pseudocode

[Figure: pseudocode for a simple HAC algorithm]
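
As a concrete stand-in for the pseudocode, here is a hedged sketch of naive HAC over a precomputed similarity matrix; it implements only the single-link and complete-link merge criteria, and the function names are mine.

```python
import numpy as np

def naive_hac(S, linkage="complete"):
    """Naive HAC over an NxN document similarity matrix S (roughly Theta(N^3)):
    repeatedly find the most similar pair of active clusters, merge them, and
    update one row/column of the cluster-similarity matrix."""
    N = len(S)
    C = np.array(S, dtype=float)       # cluster-similarity matrix
    active = list(range(N))
    merges = []
    for _ in range(N - 1):
        # most similar pair among the currently active clusters
        sim, a, b = max(((C[a][b], a, b) for i, a in enumerate(active)
                         for b in active[i + 1:]), key=lambda t: t[0])
        merges.append((a, b, sim))
        # update similarities of the merged cluster (kept under index a)
        for k in active:
            if k not in (a, b):
                C[a][k] = C[k][a] = (max(C[a][k], C[b][k]) if linkage == "single"
                                     else min(C[a][k], C[b][k]))
        active.remove(b)
    return merges

# Example: similarities of four documents
S = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
print(naive_hac(S))
```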

  • The four merge criteria of the four HAC variants

[Table: cluster-similarity definitions of the four HAC variants]
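
For reference, the four criteria as I recall them from the chapter (ωi, ωj are clusters with Ni, Nj documents, d⃗ length-normalized document vectors, μ⃗ centroids); treat this as a sketch rather than the book's exact notation:

```latex
% The four cluster-similarity (merge) criteria -- a sketch from memory
\begin{align*}
\text{single-link:}   &\quad \mathrm{SIM}(\omega_i,\omega_j) = \max_{d \in \omega_i,\, d' \in \omega_j} \operatorname{sim}(d,d') \\
\text{complete-link:} &\quad \mathrm{SIM}(\omega_i,\omega_j) = \min_{d \in \omega_i,\, d' \in \omega_j} \operatorname{sim}(d,d') \\
\text{centroid:}      &\quad \mathrm{SIM}(\omega_i,\omega_j) = \vec{\mu}(\omega_i) \cdot \vec{\mu}(\omega_j) \\
\text{group-average:} &\quad \mathrm{SIM}(\omega_i,\omega_j) =
  \frac{1}{(N_i+N_j)(N_i+N_j-1)} \sum_{d_m \in \omega_i \cup \omega_j} \;
  \sum_{\substack{d_n \in \omega_i \cup \omega_j \\ d_n \neq d_m}} \vec{d}_m \cdot \vec{d}_n
\end{align*}
```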

Single-link and complete-link clustering

The single-link merge criterion is local: we pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the clusters and the clusters' overall structure are not taken into account.

The complete-link merge criterion is nonlocal: the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but it also causes sensitivity to outliers.

  • Comparison

[Figure: single-link (left) vs. complete-link (right) similarity of two-point clusters]

Left: The single-link similarity of the two upper two-point clusters is the similarity of d2 and d3 (solid line), which is greater than the single-link similarity of the two left two-point clusters (dashed line).

Right: The complete-link similarity of the two upper two-point clusters is the similarity of d1 and d4 (dashed line), which is smaller than the complete-link similarity of the two left two-point clusters (solid line).

In contrast to the single-link dendrogram above, complete-link clustering of the same documents produces a different dendrogram:

[Figure: a complete-link dendrogram of the same documents]

Note:
Graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step k are maximal sets of points that are linked via at least one link (a single link) of similarity s ≥ sk; complete-link clusters at step k are maximal sets of points that are completely linked with each other via links of similarity s ≥ sk.

Single-link clustering can produce straggling clusters:

[Figure: a single-link clustering that exhibits chaining]

This effect is called chaining.

Complete-link clustering pays too much attention to outliers:

[Figure: a complete-link clustering affected by an outlier]

As a result, it sometimes cannot find the most intuitive cluster structure.

Time complexity

Naive HAC runs in Θ(N^3).

A more efficient algorithm is the priority-queue algorithm, which has a time complexity of Θ(N^2 log N):

[Figure: pseudocode for the priority-queue HAC algorithm]
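
A hedged Python sketch of the priority-queue idea (my own rendering, not the book's pseudocode): each cluster keeps a max-heap of its similarities to the other clusters, stale entries are discarded lazily, and each merge updates one row of the similarity matrix. Only single-link and complete-link updates are shown; the centroid and group-average variants would additionally carry vector sums.

```python
import heapq
import numpy as np

def hac_priority_queue(S, linkage="complete"):
    """HAC with one priority queue (max-heap via negation) per cluster and
    lazy deletion of stale entries. S is an NxN similarity matrix."""
    N = len(S)
    C = np.array(S, dtype=float)                  # current cluster similarities
    active = [True] * N
    heaps = [[(-C[i][j], j) for j in range(N) if j != i] for i in range(N)]
    for h in heaps:
        heapq.heapify(h)
    merges = []
    for _ in range(N - 1):
        best = None
        for i in range(N):                        # scan the best entry of each queue
            if not active[i]:
                continue
            h = heaps[i]
            # drop entries that refer to inactive clusters or outdated similarities
            while h and (not active[h[0][1]] or -h[0][0] != C[i][h[0][1]]):
                heapq.heappop(h)
            if h and (best is None or -h[0][0] > best[0]):
                best = (-h[0][0], i, h[0][1])
        sim, i, j = best
        merges.append((i, j, sim))
        active[j] = False                         # cluster j is absorbed into i
        for k in range(N):
            if active[k] and k != i:
                C[i][k] = C[k][i] = (max(C[i][k], C[j][k]) if linkage == "single"
                                     else min(C[i][k], C[j][k]))
                heapq.heappush(heaps[i], (-C[i][k], k))
                heapq.heappush(heaps[k], (-C[i][k], i))
    return merges
```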

  • four different similarity measures

[Table: the similarity measures of the four HAC variants]

  • Single-link optimization

Single-link clustering can be optimized further with a next-best-merge (NBM) array, which brings the time complexity down to Θ(N^2):

[Figure: pseudocode for single-link clustering with a next-best-merge array]
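
A sketch of the NBM idea (names mine): each cluster keeps a pointer to its current best merge partner; under single-link, after a merge only the surviving cluster's pointer must be recomputed, and pointers to the retired cluster are simply redirected, giving Θ(N^2) overall.

```python
import numpy as np

def single_link_nbm(S):
    """Single-link HAC with a next-best-merge (NBM) array, Theta(N^2) overall.
    S is an NxN similarity matrix; the diagonal is ignored."""
    N = len(S)
    C = np.array(S, dtype=float)
    np.fill_diagonal(C, -np.inf)
    active = [True] * N
    nbm = [int(np.argmax(C[i])) for i in range(N)]   # best partner of each cluster
    merges = []
    for _ in range(N - 1):
        # the active cluster whose next-best merge is the best overall
        i = max((k for k in range(N) if active[k]), key=lambda k: C[k][nbm[k]])
        j = nbm[i]
        merges.append((i, j, C[i][j]))
        # single-link update: similarity to the merged cluster is the maximum
        for k in range(N):
            if active[k] and k not in (i, j):
                C[i][k] = C[k][i] = max(C[i][k], C[j][k])
        active[j] = False
        C[j, :] = C[:, j] = -np.inf                   # retire cluster j
        nbm[i] = int(np.argmax(C[i]))                 # recompute only for i
        for k in range(N):
            if active[k] and nbm[k] == j:             # redirect stale pointers
                nbm[k] = i
    return merges
```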

Group-average agglomerative clustering

GAAC evaluates cluster quality based on all similarities between documents; it is also called group-average clustering and average-link clustering.

GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs of documents from the same cluster, but self-similarities are not included in the average:

[Equation: definition of SIM-GA]

  • Compute SIM-GA efficiently

[Equation: the sum of pairwise dot products rewritten in terms of the sum of the document vectors]

so we have:

[Equation: efficient formula for SIM-GA]

Note:
GAAC Requires:

  1. documents represented as vectors
  2. length normalization of vectors, so that self-similarities are 1.0
  3. the dot product for computing the similarity between vectors and sums of vectors.
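
A small Python check of the two forms above (assuming length-normalized vectors, as required): the square of the summed vectors counts every ordered pair once plus the self-similarities, so subtracting Ni + Nj and normalizing reproduces the pairwise average. Function names are mine.

```python
import numpy as np

def sim_ga(cluster_i, cluster_j):
    """Efficient SIM-GA: (sum of vectors)^2 counts every ordered pair once
    plus (Ni+Nj) self-similarities of 1.0, which we subtract."""
    docs = np.vstack([cluster_i, cluster_j])
    n = len(docs)
    total = docs.sum(axis=0)
    return (total @ total - n) / (n * (n - 1))

def sim_ga_naive(cluster_i, cluster_j):
    """Direct average over all ordered pairs of distinct documents."""
    docs = np.vstack([cluster_i, cluster_j])
    n = len(docs)
    s = sum(docs[a] @ docs[b] for a in range(n) for b in range(n) if a != b)
    return s / (n * (n - 1))

# length-normalized random "documents"
rng = np.random.default_rng(0)
docs = rng.random((5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
assert np.isclose(sim_ga(docs[:2], docs[2:]), sim_ga_naive(docs[:2], docs[2:]))
```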

Centroid clustering

In centroid clustering, the similarity of two clusters is defined as the similarity of their centroids:

[Equation: SIM-CENT as the dot product of the two cluster centroids]

The difference between GAAC and centroid clustering is that GAAC considers all pairs of documents in computing the average pairwise similarity, whereas centroid clustering only considers pairs of documents from different clusters.
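
A minimal sketch of SIM-CENT for length-normalized vectors (names mine); the second function makes the equivalence explicit: the dot product of the centroids equals the average similarity over pairs drawn from different clusters.

```python
import numpy as np

def sim_cent(cluster_i, cluster_j):
    """Centroid similarity: dot product of the two cluster centroids."""
    return np.mean(cluster_i, axis=0) @ np.mean(cluster_j, axis=0)

def avg_cross_pair_sim(cluster_i, cluster_j):
    """Average similarity over pairs with one document from each cluster;
    algebraically identical to sim_cent for dot-product similarity."""
    return np.mean([di @ dj for di in cluster_i for dj in cluster_j])
```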

  • Example

Three iterations of centroid clustering. Each iteration merges the two clusters whose centroids are closest.

[Figure: three iterations of centroid clustering]

Note:
Centroid clustering is not monotonic; inversions can occur:

[Figure: a centroid-clustering dendrogram with an inversion; the crossing (intersection) is circled]
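
A small numeric illustration of such an inversion (the points are my own choice, not the book's example): the second centroid merge is closer, i.e. more similar, than the first, so combination similarity increases and the dendrogram crosses itself.

```python
import numpy as np

# Three nearly equidistant points; an illustrative configuration (mine, not the book's).
d1, d2, d3 = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.5, 0.9])

first_merge = np.linalg.norm(d1 - d2)             # distance 1.0: merge {d1}, {d2}
centroid_12 = (d1 + d2) / 2.0
second_merge = np.linalg.norm(centroid_12 - d3)   # distance 0.9: merge {d1,d2}, {d3}

# The later merge is *closer* (more similar) than the earlier one, violating
# monotonicity -- this is the inversion that shows up as a crossing.
print(first_merge, second_merge)
```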

  • Similarity measure

The centroid similarity measure is conceptually simpler than the average of all pairwise similarities used in GAAC.

Optimality of hierarchical agglomerative clustering

Define COMB-SIM as the smallest combination similarity of a clustering:

[Equation: COMB-SIM of a clustering is the minimum combination similarity of its clusters]

A clustering is optimal if its COMB-SIM is the largest among all clusterings with the same number of clusters. Centroid clustering is not optimal because inversions can occur.

  • single-link
    The combination similarity of a cluster ω is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition is the largest similarity between any two documents from the two parts:

[Equation: combination similarity for single-link]

  • complete link
    The combination similarity of a cluster ω is the smallest similarity of any two points in ω:

[Equation: combination similarity for complete-link]

  • GAAC:
    The combination similarity of a cluster ω is the average of all pairwise similarities in ω

[Equation: combination similarity for GAAC]
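
Spelled out from the verbal definitions above (notation mine):

```latex
% Combination similarities for the three criteria -- a sketch from the verbal definitions
\begin{align*}
\text{single-link:}   &\quad \text{COMB-SIM}(\omega) =
  \min_{\emptyset \neq \omega' \subsetneq \omega} \; \max_{d_i \in \omega',\, d_j \in \omega \setminus \omega'} \operatorname{sim}(d_i,d_j) \\
\text{complete-link:} &\quad \text{COMB-SIM}(\omega) = \min_{d_i, d_j \in \omega,\, i \neq j} \operatorname{sim}(d_i,d_j) \\
\text{GAAC:}          &\quad \text{COMB-SIM}(\omega) =
  \frac{1}{|\omega|\,(|\omega|-1)} \sum_{d_i \in \omega} \;
  \sum_{\substack{d_j \in \omega \\ j \neq i}} \vec{d}_i \cdot \vec{d}_j
\end{align*}
```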

Note:
If we use these definitions of combination similarity, then optimality is a property of a set of clusters and not of a process that produces a set of clusters.

  • Comparison of HAC algorithms

[Table: comparison of the four HAC algorithms]

GAAC is recommended for document clustering because it generally produces the clustering with the best properties for applications: it does not suffer from chaining, from sensitivity to outliers, or from inversions.

  • First story detection (novelty detection)
  1. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents.

For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001, form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space – and not global structure – that is important in this case.

  2. A similar approach can be applied to duplicate detection on the web.

Cluster labeling

Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters.

Three labeling methods applied to a K-means clustering:
[Table: three labeling methods applied to a K-means clustering]

Cluster-internal labeling computes a label that solely depends on the cluster itself, not on other clusters. Labeling a cluster with the title of the document closest to the centroid is one cluster-internal method.
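
A hedged sketch of both flavours: a differential scorer that ranks terms by mutual information between term occurrence and cluster membership, and cluster-internal labeling via the title of the document closest to the centroid. All function names and details are mine.

```python
import numpy as np

def mutual_information(n11, n10, n01, n00):
    """MI of a 2x2 term/cluster contingency table (term present/absent vs.
    in-cluster/out-of-cluster); zero counts contribute zero."""
    n = n11 + n10 + n01 + n00
    def part(nij, n_term, n_class):
        return (nij / n) * np.log2(n * nij / (n_term * n_class)) if nij > 0 else 0.0
    return (part(n11, n11 + n10, n11 + n01) + part(n10, n11 + n10, n10 + n00) +
            part(n01, n01 + n00, n11 + n01) + part(n00, n01 + n00, n10 + n00))

def differential_labels(doc_terms, in_cluster, k=5):
    """Differential labeling: rank terms by MI with cluster membership.
    doc_terms: one set of terms per document; in_cluster: one bool per document."""
    scores = {}
    for t in set().union(*doc_terms):
        n11 = sum(c and t in d for d, c in zip(doc_terms, in_cluster))
        n10 = sum((not c) and t in d for d, c in zip(doc_terms, in_cluster))
        n01 = sum(c and t not in d for d, c in zip(doc_terms, in_cluster))
        n00 = sum((not c) and t not in d for d, c in zip(doc_terms, in_cluster))
        scores[t] = mutual_information(n11, n10, n01, n00)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def cluster_internal_label(doc_vectors, titles, member_idx):
    """Cluster-internal labeling: the title of the document closest to the
    cluster centroid (doc_vectors is a 2-D array of normalized vectors)."""
    members = doc_vectors[member_idx]
    centroid = members.mean(axis=0)
    return titles[member_idx[int(np.argmax(members @ centroid))]]
```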

Implementation notes

Most problems that require the computation of a large number of dot products benefit from an inverted index. This is also the case for HAC clustering.

However, the centroids of large clusters are dense, which makes these dot products expensive. Two optimizations address this:

  1. truncating centroids (keeping only highly weighted terms)
  2. representing clusters by means of sparse medoids instead of dense centroids.

These optimizations can also be applied to GAAC and centroid clustering.
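
A minimal sketch of the two optimizations (names and the cutoff are mine): truncate a dense centroid to its highest-weighted terms, or replace it with a sparse medoid, one common definition being the member document closest to the centroid.

```python
import numpy as np

def truncate_centroid(centroid, keep=50):
    """Keep only the `keep` highest-weighted terms of a (dense) centroid,
    zeroing the rest, so dot products via an inverted index stay cheap."""
    truncated = np.zeros_like(centroid)
    top = np.argsort(centroid)[-keep:]            # indices of the largest weights
    truncated[top] = centroid[top]
    return truncated

def sparse_medoid(doc_vectors, member_idx):
    """Represent the cluster by a medoid: the member document closest to the
    centroid (documents are naturally sparse, unlike centroids)."""
    members = doc_vectors[member_idx]
    centroid = members.mean(axis=0)
    return members[int(np.argmax(members @ centroid))]
```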

We can employ an HAC algorithm to compute high-quality seeds for K-means. If the HAC algorithm is applied to a document subset of size √N, then the overall runtime of K-means cum HAC seed generation is Θ(N). An appropriate adjustment can be made for a Θ(N^2 log N) HAC algorithm to guarantee linearity. This combination is called the Buckshot algorithm.
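
A hedged sketch of this seeding scheme (assuming length-normalized document vectors and an HAC routine that returns (i, j, sim) merge triples, such as the naive_hac sketch earlier); scikit-learn's KMeans accepts explicit seed centroids via its init parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def buckshot_seeds(doc_vectors, K, hac_fn):
    """Cluster a sqrt(N)-sized sample with HAC and return K seed centroids.
    Assumes sqrt(N) >= K; hac_fn(similarity_matrix) returns (i, j, sim) merges."""
    N = len(doc_vectors)
    sample_idx = np.random.choice(N, size=int(np.sqrt(N)), replace=False)
    sample = doc_vectors[sample_idx]
    S = sample @ sample.T                        # cosine sims (normalized vectors)
    merges = hac_fn(S)[: len(sample) - K]        # stop when K clusters remain
    # rebuild cluster membership from the merge sequence (simple union-find)
    parent = list(range(len(sample)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, j, _ in merges:
        parent[find(j)] = find(i)
    groups = {}
    for m in range(len(sample)):
        groups.setdefault(find(m), []).append(m)
    return np.array([sample[g].mean(axis=0) for g in groups.values()])

def buckshot_kmeans(doc_vectors, K, hac_fn):
    """Run K-means with HAC-derived seeds (Buckshot-style)."""
    seeds = buckshot_seeds(doc_vectors, K, hac_fn)
    return KMeans(n_clusters=K, init=seeds, n_init=1).fit(doc_vectors)
```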
