Clustering by Passing Messages Between Data Points
Brendan J. Frey* and Delbert Dueck
Abstract: Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called “exemplars.” The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.
Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): for points $x_i$ and $x_k$, $s(i,k) = -\|x_i - x_k\|^2$. Indeed, the method described here can be applied when the optimization criterion is much more general. Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, when appropriate, similarities may be set by hand.
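As an illustration of the squared-error case, the following minimal sketch builds the similarity matrix from a list of coordinate tuples. The function name `negative_squared_distance` and the sample points are hypothetical, chosen only for this example; the diagonal entries are placeholders that the preferences described next would overwrite.

```python
def negative_squared_distance(points):
    """Return s[i][k] = -||x_i - x_k||^2 for a list of coordinate tuples."""
    n = len(points)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Sum of squared coordinate differences, negated so that
            # closer points receive larger (less negative) similarity.
            s[i][k] = -sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
    return s


# Hypothetical two-dimensional points used only for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
similarities = negative_squared_distance(points)
print(similarities[0][1], similarities[0][2], similarities[1][2])
```

For asymmetric or non-Euclidean problems, this function would simply be swapped for any other routine that fills in s(i,k).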
Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number s(k,k) for each data point k so that data points with larger values of s(k,k) are more likely to be chosen as exemplars. These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If a priori, all data points are equally suitable as exemplars, the preferences should be set to a common value—this value can be varied to produce different numbers of clusters. The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).
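A small sketch of the shared-preference choice described above. It assumes, as an interpretation not spelled out in the text, that the median or minimum is taken over the off-diagonal similarities; the helper name `with_shared_preference` is hypothetical.

```python
from statistics import median


def with_shared_preference(s, choice="median"):
    """Return a copy of s whose diagonal s[k][k] holds one shared preference."""
    n = len(s)
    # Assumption: the statistic is computed over off-diagonal entries only.
    off_diagonal = [s[i][k] for i in range(n) for k in range(n) if i != k]
    value = median(off_diagonal) if choice == "median" else min(off_diagonal)
    result = [row[:] for row in s]
    for k in range(n):
        result[k][k] = value
    return result


s = [[0, -1, -25],
     [-1, 0, -16],
     [-25, -16, 0]]
moderate = with_shared_preference(s, "median")  # tends to give more clusters
few = with_shared_preference(s, "minimum")      # tends to give fewer clusters
print(moderate[0][0], few[0][0])
```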
There are two kinds of messages exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The “responsibility” r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B). The “availability” a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: a(i,k) = 0. Then, the responsibilities are computed using the rule
$$r(i,k) \leftarrow s(i,k) - \max_{k' \,\text{s.t.}\, k' \neq k} \{\, a(i,k') + s(i,k') \,\}$$
In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar, minus the largest of the similarities between point i and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below. These negative availabilities will decrease the effective values of some of the input similarities s(i,k′) in the above rule, removing the corresponding candidate exemplars from competition. For k = i, the responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point i and all other candidate exemplars. This “self-responsibility” reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.
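The responsibility update described above can be sketched as a direct, undamped double loop. This is an illustrative reading of the rule (each r(i,k) is s(i,k) minus the best competing a(i,k′) + s(i,k′)), not the authors' code; the function name and the 3×3 example matrix, whose diagonal holds a preference of −2, are hypothetical.

```python
def update_responsibilities(s, a):
    """One undamped pass: r(i,k) = s(i,k) - max over k' != k of a(i,k') + s(i,k')."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Strongest competing candidate exemplar for point i, excluding k.
            best_other = max(a[i][kp] + s[i][kp] for kp in range(n) if kp != k)
            r[i][k] = s[i][k] - best_other
    return r


# Hypothetical similarities; the diagonal holds the shared preference -2.
s = [[-2, -1, -9],
     [-1, -2, -4],
     [-9, -4, -2]]
a = [[0.0] * 3 for _ in range(3)]  # availabilities start at zero
r = update_responsibilities(s, a)
for row in r:
    print(row)
```

With zero availabilities, point 0 sends a positive responsibility only to candidate 1, its most similar neighbor, while its self-responsibility is negative.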
Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:
$$a(i,k) \leftarrow \min\Big\{ 0,\; r(k,k) + \sum_{i' \,\text{s.t.}\, i' \notin \{i,k\}} \max\{0, r(i',k)\} \Big\}$$
The availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self-responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” a(k,k) is updated differently:
$$a(k,k) \leftarrow \sum_{i' \,\text{s.t.}\, i' \neq k} \max\{0, r(i',k)\}$$
This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.
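The availability and self-availability updates described above can be sketched in the same undamped style. The function name and the example responsibilities are hypothetical; the code follows the prose: positive incoming responsibilities are summed, r(k,k) is added and the result is capped at zero for i ≠ k, and the self-availability uses the uncapped sum alone.

```python
def update_availabilities(r):
    """One undamped pass of the availability and self-availability updates."""
    n = len(r)
    a = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Positive part of every responsibility sent to candidate exemplar k.
        positive = [max(0.0, r[ip][k]) for ip in range(n)]
        for i in range(n):
            if i == k:
                # Self-availability: positive support from all other points.
                a[k][k] = sum(positive[ip] for ip in range(n) if ip != k)
            else:
                support = sum(positive[ip] for ip in range(n) if ip not in (i, k))
                a[i][k] = min(0.0, r[k][k] + support)
    return a


# Hypothetical responsibilities for three points.
r = [[-1.0, 1.0, -8.0],
     [1.0, -1.0, -3.0],
     [-7.0, -2.0, 2.0]]
a = update_availabilities(r)
for row in a:
    print(row)
```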
The above update rules require only simple, local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar if k = i, or identifies the data point that is the exemplar for point i. The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages, it is important that they be damped to avoid numerical oscillations that arise in some circumstances. Each message is set to λ times its value from the previous iteration plus 1 − λ times its prescribed updated value, where the damping factor λ is between 0 and 1. In all of our experiments (3), we used a default damping factor of λ = 0.5, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.
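Putting the three steps together, a compact dense sketch of the damped iteration might look as follows. It is a plain-Python illustration under stated assumptions, not the authors' implementation: the function name, the `max_iters` cap, and the one-dimensional example data are hypothetical, and the preference is set to the median of the off-diagonal similarities.

```python
from statistics import median


def affinity_propagation(s, damping=0.5, stable_iters=10, max_iters=200):
    """Return, for each point i, the index k maximizing a(i,k) + r(i,k)."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    labels, previous, unchanged = None, None, 0
    for _ in range(max_iters):
        # (i) Damped responsibility update given the availabilities.
        for i in range(n):
            for k in range(n):
                rival = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # (ii) Damped availability update given the responsibilities.
        for k in range(n):
            pos = [max(0.0, r[j][k]) for j in range(n)]
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - pos[k]
                else:
                    new = min(0.0, r[k][k] + total - pos[i] - pos[k])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
        # (iii) Combine messages and stop once decisions are stable.
        labels = [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]
        unchanged = unchanged + 1 if labels == previous else 0
        previous = labels
        if unchanged >= stable_iters:
            break
    return labels


# Hypothetical data: two well-separated groups on a line.
xs = [0, 1, 2, 10, 11, 12]
n = len(xs)
s = [[-float((xs[i] - xs[k]) ** 2) for k in range(n)] for i in range(n)]
preference = median(s[i][k] for i in range(n) for k in range(n) if i != k)
for k in range(n):
    s[k][k] = preference
labels = affinity_propagation(s)
print(labels)
```

Each entry of `labels` is the exemplar index chosen by that point; a point whose label equals its own index is itself an exemplar.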
Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity. One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the message passing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar. Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).
We next studied the problem of clustering images of faces using the standard optimization criterion of squared error. We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3). Affinity propagation found exemplars with much lower squared error than the best of 100 runs of k-centers clustering (Fig. 2A), which took about the same amount of computer time. We asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of k-centers clustering, plotted against the number of clusters. Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).
Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative. To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “non-exon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of −∞, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between points i and k if s(i,k) = −∞, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27% or 15 million) of data point pairs.
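One way to exploit this sparsity is to store only the finite similarities and let each point keep a list of its candidate exemplars, so that updates loop over those pairs alone. The sketch below is a hypothetical illustration of that bookkeeping (the dictionary layout and the name `candidate_lists` are assumptions, not the representation used in the experiments).

```python
import math


def candidate_lists(n, known):
    """Map each point i to the candidates k whose similarity s(i,k) is finite."""
    candidates = {i: [] for i in range(n)}
    for (i, k), value in known.items():
        if value != -math.inf:
            candidates[i].append(k)
    return candidates


# Hypothetical sparse input: missing or -inf pairs exchange no messages.
known = {(0, 0): -2.0, (0, 1): -1.0,
         (1, 1): -2.0, (1, 0): -1.5,
         (2, 2): -2.0, (2, 1): -math.inf}
candidates = candidate_lists(3, known)
message_pairs = sum(len(ks) for ks in candidates.values())
print(candidates, message_pairs)
```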
Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the non-exon exemplar. The reconstruction errors for affinity propagation and k-centers clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took 6 min, whereas k-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the true-positive (TP) rate against the false-positive (FP) rate, using the labels provided in the RefSeq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At an FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best k-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.
Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metric-space clustering techniques such as k-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., s(i,k) ≠ s(k,i)] and to problems where the similarities do not satisfy the triangle inequality [i.e., s(i,k) < s(i,j) + s(j,k)]. To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence i to sentence k based on the cost of encoding the words in sentence i using the words in sentence k. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using λ = 0.8) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.
We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated, because the trip from i to k included a long stopover delay in city j, so it took longer than the sum of the durations of the trips from i to j and j to k. When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).
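Statistics of this kind can be computed directly from a similarity matrix. The sketch below measures the fraction of asymmetric pairs and the fraction of ordered pairs for which some third point j gives s(i,k) < s(i,j) + s(j,k). The function names and the three-city matrix of negative travel times are hypothetical, and the exact counting convention used for the reported percentages is an assumption.

```python
def fraction_asymmetric(s):
    """Fraction of unordered pairs {i,k} with s(i,k) != s(k,i)."""
    n = len(s)
    pairs = [(i, k) for i in range(n) for k in range(i + 1, n)]
    return sum(s[i][k] != s[k][i] for i, k in pairs) / len(pairs)


def fraction_violating_triangle(s):
    """Fraction of ordered pairs (i,k) with s(i,k) < s(i,j) + s(j,k) for some j."""
    n = len(s)
    pairs = [(i, k) for i in range(n) for k in range(n) if i != k]
    violated = sum(
        any(s[i][k] < s[i][j] + s[j][k] for j in range(n) if j not in (i, k))
        for i, k in pairs
    )
    return violated / len(pairs)


# Hypothetical negative travel times (hours) between three cities.
s = [[0, -2, -10],
     [-3, 0, -2],
     [-10, -2, 0]]
asym = fraction_asymmetric(s)
tri = fraction_violating_triangle(s)
print(asym, tri)
```

Here the direct trip between cities 0 and 2 is slower than routing through city 1, so both directions of that pair violate the inequality.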
Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of N hidden labels, $c_1, \ldots, c_N$, corresponding to the N data points. Each label indicates the exemplar to which the point belongs, so that $s(i,c_i)$ is the similarity of data point i to its exemplar. $c_i = i$ is a special case indicating that point i is itself an exemplar, so that $s(i,c_i)$ is the input preference for point i. Not all configurations of the labels are valid; a configuration c is valid when for every point i, if some other point i′ has chosen i as its exemplar (i.e., $c_{i'} = i$), then i must be an exemplar (i.e., $c_i = i$). The energy of a valid configuration is $E(c) = -\sum_{i=1}^{N} s(i,c_i)$. Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraints on the labels and the energy function (2).
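The validity condition and the energy can be checked in a few lines. This is a minimal sketch with a hypothetical function name and example matrix; it rejects any configuration in which a chosen exemplar does not label itself.

```python
def energy(s, labels):
    """Return E(c) = -sum_i s(i, c_i), or raise if the configuration is invalid."""
    for i, c in enumerate(labels):
        # Every chosen exemplar c must itself be an exemplar: labels[c] == c.
        if labels[c] != c:
            raise ValueError(f"point {i} chose {c}, which is not an exemplar")
    return -sum(s[i][c] for i, c in enumerate(labels))


# Hypothetical similarities with a shared preference of -2 on the diagonal.
s = [[-2, -1, -9],
     [-1, -2, -4],
     [-9, -4, -2]]
one_cluster = energy(s, [1, 1, 1])   # point 1 is the only exemplar
two_clusters = energy(s, [0, 0, 2])  # points 0 and 2 are exemplars
print(one_cluster, two_clusters)
```

With these numbers the two-exemplar configuration has the lower energy; raising or lowering the preference shifts that balance.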
In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if s(1,2) = s(2,1) and s(1,1) = s(2,2), then the solutions c1 = c2 = 1 and c1 = c2 = 2 both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and nonexemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations, or by increasing the damping factor.
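The noise remedy can be sketched as follows. The noise scale, the fixed seed, and the function name are illustrative assumptions; the text only says that a tiny amount of noise is added, so in practice the scale would be chosen far below the smallest meaningful difference between similarities.

```python
import random


def add_tie_breaking_noise(s, scale=1e-9, seed=0):
    """Return a copy of s with a tiny random offset added to every entry."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [[value + scale * rng.random() for value in row] for row in s]


# The degenerate two-point case from the text: s(1,2) = s(2,1), s(1,1) = s(2,2).
s = [[-3.0, -1.0],
     [-1.0, -3.0]]
noisy = add_tie_breaking_noise(s)
print(noisy[0][1] != noisy[1][0], noisy[0][0] != noisy[1][1])
```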
Affinity propagation has several advantages over related techniques. Methods such as k-centers clustering (1), k-means clustering (1), and the expectation maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques (13) randomly search for good solutions, but do not share affinity propagation’s advantage of considering many possible solutions all at once.
Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.
In (8), it was shown that the related metric k-median problem could be relaxed to form a linear program with a constant factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).
Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages (17) in a “loopy graph” has been used to approach Shannon’s limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard two-dimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.