【论文翻译】Clustering by Passing Messages Between Data Points

论文题目:Clustering by Passing Messages Between Data Points
论文来源:Clustering by Passing Messages Between Data Points

翻译人:BDML@CQUT实验室

Clustering by Passing Messages Between Data Points

Brendan J. Frey* and Delbert Dueck

Abstract

Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

摘要

在处理感知信号和检测数据中的模式中,通过识别具有代表性的例子进行数据聚类是十分重要的,这样的“exemplars”可以通过随机选择初始化数据点集合,然后迭代精炼该集合,这种方法只有在随机选择接近于一个良好的解时是很有效果的。我们设计了一种称为“affinity propagation”的方法,它将数据点对之间的相似性作为输入度量。实值消息在数据点之间交换,直到逐渐出现一组高质量的样本和相应的簇。我们使用仿射传播对人脸图像进行聚类,检测微阵列数据中的基因,识别手稿中的代表性句子,以及识别航空旅行可有效访问的城市。仿射传播发现聚类的错误比其他方法要低得多,而且只花了不到百分之一的时间。

正文

Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called “exemplars.” The popular k-centers clustering technique begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.

基于相似度度量的聚类是科学数据分析和工程系统中的关键步骤。一种常见的方法是使用数据来学习一组中心,以便数据点与其最近中心之间的平方误差之和很小。当从实际数据点中选择中心时,这些选择的数据中心实例点叫做exemplars。流行的k-centers聚类技术就是从一组初始的随机选择的样本开始,迭代地细化这组样本,以减少平方和误差。k-centers聚类对exemplars的初始选择非常敏感,因此通常会在不同的初始化情况下多次重新计算,试图找到一个最优的解。但是,只有当聚类的规模小,并且至少有一个随机初始化接近一个良好的解时,这种方法才能很好地工作。我们采用了一种完全不同的方法,并引入了一种同时考虑所有数据点作为潜在exemplars的方法。将每个数据点视为网络中的一个节点,我们设计了一种方法,该方法沿着网络的边缘递归地传递实值信息,直到出现一组良好的exemplars和相应的簇。正如后面所述,消息是在搜索适当选择的能量函数的极小值的简单公式的基础上更新的。在任何时间点,每个消息的大小反映了一个数据点在选择另一个数据点作为exemplars时的当前affinity,因此我们将我们的方法称为“仿射传播”。图1A说明了在消息传递过程中,簇是如何逐步产生的。

Fig. 1. How affinity propagation works.
(A) Affinity propagation is illustrated for two-dimensional data points, where negative Euclidean distance (squared error) was used to measure similarity. Each point is colored according to the current evidence that it is a cluster center (exemplar). The darkness of the arrow directed from point i to point k corresponds to the strength of the transmitted message that point i belongs to exemplar point k. (B) “Responsibilities” r(i,k) are sent from data points to candidate exemplars and indicate how strongly each data point favors the candidate exemplar over other candidate exemplars. (C) “Availabilities” a(i,k) are sent from candidate exemplars to data points and indicate to what degree each candidate exemplar is available as a cluster center for the data point. (D) The effect of the value of the input preference (common for all data points) on the number of identified exemplars (number of clusters) is shown. The value that was used in (A) is also shown, which was computed from the median of the pairwise similarities.

图1. AP的工作原理。
(A)AP以二维数据点为例,其中负的欧几里得距离(平方差)用于衡量相似度。每个点根据它是聚类中心(exemplar)的当前证据着色。从点i指向点k的箭头的深浅对应于所传递的“点i属于exemplar点k”这一消息的强度。(B)“responsibility” r(i,k)从数据点发送到候选exemplar,表明每个数据点相对于其他候选exemplar对该候选exemplar的支持强度。(C)“availability” a(i,k)从候选exemplar发送到数据点,并指出每个候选exemplar作为数据点的聚类中心可用的程度。(D)展示输入参考度(所有数据点共有)对已识别exemplar数目(簇数)的影响。(A)中使用的值也展示出来,它是从两两相似度的中值计算出来的。

Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): For points $x_i$ and $x_k$, $s(i,k) = -\|x_i - x_k\|^2$. Indeed, the method described here can be applied when the optimization criterion is much more general. Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, when appropriate, similarities may be set by hand.

AP将数据点之间实值的相似度作为输入,相似度s(i,k)表示多大程度上k索引的数据点适合作为数据点i的exemplar。当目标是最小化平方误差时,每一个相似度设定为负的平方误差(欧几里得距离):对于点$x_i$和$x_k$,$s(i,k) = -\|x_i - x_k\|^2$。确实,这里描述的方法可以应用于优化准则更加一般的情况下。随后,我们描述了从成对的图像、成对的微阵列测量、成对的英语句子和成对的城市中获得相似度的任务。当exemplar依赖的概率模型可以得到时,s(i,k)可以设定为给定数据点i的exemplar为k时,数据点i的对数似然函数。或者,在适当的时候,可以手动设置相似度。
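The squared-error similarity described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `negative_squared_distance` and `similarity_matrix` and the three sample points are assumptions made for this example, not part of the paper.

```python
def negative_squared_distance(x, y):
    """Return -||x - y||^2 for two equal-length coordinate sequences."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))


def similarity_matrix(points):
    """Build s[i][k] = -||x_i - x_k||^2 for every pair of points.

    The diagonal comes out as zero here; in affinity propagation it is
    later overwritten with the input "preferences" s(k,k).
    """
    n = len(points)
    return [[negative_squared_distance(points[i], points[k]) for k in range(n)]
            for i in range(n)]


# Hypothetical sample data: three points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0)]
s = similarity_matrix(points)
```

Larger (less negative) entries mean that point k is a better candidate exemplar for point i.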

Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number s(k,k) for each data point k so that data points with larger values of s(k,k) are more likely to be chosen as exemplars. These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If a priori, all data points are equally suitable as exemplars, the preferences should be set to a common value— this value can be varied to produce different numbers of clusters. The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).

AP将实数s(k,k)作为每个数据点k的输入,并不需要预先指定簇的数量。这样s(k,k)值越大的数据点更有可能被选择为exemplar。这些值称为“preferences(参考度)”。识别exemplars的数量(簇的数量)受到输入参考度的影响,从消息传递过程中产生。如果有一个先验,所有数据点都同样适合作为exemplars,那么参考度应该设置为一个共同的值——这个值的改变可以产生不同数量的簇。这个值可以是输入相似度的中值(产生适中的簇数目),也可以是它们的最小值(产生小数量的簇数目)。
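A minimal sketch of setting a shared preference, assuming the median or minimum is taken over the off-diagonal similarities only (the text says "input similarities" without spelling this out). The function name `set_common_preference` is hypothetical.

```python
from statistics import median


def set_common_preference(s, how="median"):
    """Return (copy of s with a shared diagonal value, that value).

    how="median" tends to give a moderate number of clusters,
    how="min" a small number, as described in the text.
    """
    n = len(s)
    # Assumption: only similarities between distinct points are considered.
    off_diagonal = [s[i][k] for i in range(n) for k in range(n) if i != k]
    preference = median(off_diagonal) if how == "median" else min(off_diagonal)
    out = [row[:] for row in s]  # leave the caller's matrix untouched
    for k in range(n):
        out[k][k] = preference
    return out, preference


s = [[0.0, -25.0, -1.0],
     [-25.0, 0.0, -20.0],
     [-1.0, -20.0, 0.0]]
s_median, p_median = set_common_preference(s, "median")
s_min, p_min = set_common_preference(s, "min")
```

Sweeping the shared value between these two settings is one way to explore different numbers of clusters.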

There are two kinds of message exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The “responsibility” r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B). The “availability” a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: a(i,k) = 0. Then, the responsibilities are computed using the rule $r(i,k)\leftarrow s(i,k)-\max_{k'\ \text{s.t.}\ k'\ne k}\{a(i,k')+s(i,k')\}$. In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar, minus the largest of the similarities between point i and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below. These negative availabilities will decrease the effective values of some of the input similarities s(i,k′) in the above rule, removing the corresponding candidate exemplars from competition.
For k = i, the responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point i and all other candidate exemplars. This “self-responsibility” reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.

数据点之间交换的消息有两种,每一种都考虑到不同类型的竞争。消息可以在任何阶段进行组合,以决定哪些点是exemplars,对于其他的每个点,它属于哪个exemplars。“responsibility”r(i,k),从点i发送到候选exemplars点k,考虑到i点的其他潜在exemplars (图1B),反映了k作为点i的exemplar的合适程度。“availability”a(i,k),从候选exemplars点k发送到点i,考虑来自其他点对k应该成为一个exemplar的支持(图1C),反映的是点i选择点k作为它的exemplars的合适程度。r(i,k)和a(i,k)可以看成是对数概率比。首先,availability被初始化为0,即a(i,k) = 0。然后,使用下面的规则计算responsibility:$r(i,k)\leftarrow s(i,k)-\max_{k'\ \text{s.t.}\ k'\ne k}\{a(i,k')+s(i,k')\}$。在第一次迭代中,由于availability为0,r(i,k)被设为k作为i的exemplar的输入相似度减去i与其他候选exemplar的最大相似度。这个竞争性的更新是数据驱动的,并且没有考虑有多少其他的点支持每个候选exemplar。在以后的迭代中,当一些点被有效地分配给其他exemplars时,它们的availability将会降到0以下,这是下面的更新规则所规定的。这些负的availability会降低上述规则中某些输入相似度s(i,k′)的有效值,使相应的候选exemplars从竞争中消失。对于k = i,responsibility r(k,k)被设为输入参考度s(k,k)减去i与其他候选exemplar的最大相似度。这种“self-responsibility”反映了k点是一个exemplar的累积证据:它基于点k的输入参考度,并根据点k被分配给其他exemplar的不合适程度进行调整。
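The responsibility rule can be written out directly. The sketch below is an undamped, dense, O(n³) illustration in plain Python; the function name `update_responsibilities` and the sample matrix (three points with a shared preference of -20 on the diagonal) are assumptions made for this example.

```python
def update_responsibilities(s, a):
    """Apply r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')].

    s and a are n-by-n lists of lists with n >= 2. The same rule covers
    k = i, where s(k,k) is the input preference.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Strongest competing candidate exemplar for point i.
            best_other = max(a[i][kp] + s[i][kp] for kp in range(n) if kp != k)
            r[i][k] = s[i][k] - best_other
    return r


# Hypothetical similarities with a shared preference of -20 on the diagonal.
s = [[-20.0, -25.0, -1.0],
     [-25.0, -20.0, -20.0],
     [-1.0, -20.0, -20.0]]
a = [[0.0] * 3 for _ in range(3)]  # availabilities start at zero
r = update_responsibilities(s, a)
```

With all availabilities at zero, each entry is simply the similarity minus the best competing similarity, so a positive r(i,k) means k currently beats every other candidate for point i.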

Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar: $a(i,k)\leftarrow \min\{0,\ r(k,k)+\sum_{i'\ \text{s.t.}\ i'\notin\{i,k\}}\max\{0,r(i',k)\}\}$. The availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self-responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” a(k,k) is updated differently: $a(k,k)\leftarrow \sum_{i'\ \text{s.t.}\ i'\ne k}\max\{0,r(i',k)\}$. This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.

上面responsibility更新让所有候选exemplars为数据点的所有权而竞争,下面availability更新从数据点收集了是否每个候选exemplars都能成为一个好exemplars的证据:$a(i,k)\leftarrow \min\{0,\ r(k,k)+\sum_{i'\ \text{s.t.}\ i'\notin\{i,k\}}\max\{0,r(i',k)\}\}$。availability a(i,k)设为self-responsibility r(k,k)加上候选exemplar点k从其他数据点获得的正的responsibility的总和。只有进来的responsibility的正的部分被加,因为一个好的exemplar只需要能很好地解释一部分数据点(positive responsibilities),而不管它对于解释其他数据点(negative responsibilities)有多么不充分。如果self-responsibility r(k,k)是负的(表明点k目前更适合属于另一个exemplar,而不是自身作为exemplar),那么当其他一些点对k作为其exemplar具有正的responsibility时,k作为exemplar的availability可以增加。为了限制positive responsibilities的强大影响,总和被设定了阈值,使其不可能超过零。“self-availability”a(k,k)的更新规则不同:$a(k,k)\leftarrow \sum_{i'\ \text{s.t.}\ i'\ne k}\max\{0,r(i',k)\}$。该消息基于其他数据点传送给候选exemplar点k的正的responsibility,反映了k作为一个exemplar的累积证据。
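The two availability rules can be sketched the same way. Again this is an undamped illustration in plain Python; the function name `update_availabilities` is an assumption, and the input `r` is a hypothetical responsibility matrix chosen so the arithmetic is easy to follow.

```python
def update_availabilities(r):
    """Apply the availability and self-availability rules to an n-by-n r.

    a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
    a(k,k) <- sum over i' != k of max(0, r(i',k))
    """
    n = len(r)
    a = [[0.0] * n for _ in range(n)]
    for k in range(n):
        positive = [max(0.0, r[ip][k]) for ip in range(n)]
        # Positive responsibilities sent to k by every point other than k.
        support = sum(positive) - positive[k]
        for i in range(n):
            if i == k:
                a[k][k] = support
            else:
                # Exclude point i's own contribution, then cap at zero.
                a[i][k] = min(0.0, r[k][k] + support - positive[i])
    return a


# Hypothetical responsibilities for three points.
r = [[-19.0, -24.0, 19.0],
     [-5.0, 0.0, 0.0],
     [19.0, -19.0, -19.0]]
a = update_availabilities(r)
```

Here the off-diagonal availabilities are never positive, while the self-availabilities on the diagonal are never negative.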

The above update rules require only simple, local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar if k = i, or identifies the data point that is the exemplar for point i. The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages, it is important that they be damped to avoid numerical oscillations that arise in some circumstances. Each message is set to λ times its value from the previous iteration plus 1 – λ times its prescribed updated value, where the damping factor λ is between 0 and 1. In all of our experiments (3), we used a default damping factor of λ = 0.5, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.

上面的更新规则只需要简单的局部计算,并且容易实现 (2),消息只在已知相似度的成对结点之间交换。在AP过程的任何时候,每一个点使用availability和responsibility组合起来以识别exemplars。对于点i,无论k=i是否成立,使得a(i,k) + r(i,k)取得最大值的那个k就是点i的exemplars。消息传递过程终止于固定的迭代次数,或者消息的变化低于一个阈值,或局部决定在一定迭代次数中为常量。在更新消息时,重要的是使它们衰减,以避免在某些情况下出现的数值振荡。每条消息被设置为先前迭代里它的值的λ倍加上(1 – λ)倍指定的更新值,其中阻尼系数λ在0到1之间。我们所有的实验中(3),我们使用一个默认的阻尼系数λ = 0.5,并且每次AP迭代包括(i)更新所有被给了availability 的responsibility (ii)更新所有被给了responsibility的availability (iii)结合availability和responsibility监视exemplars决定和当这些决定在10次迭代中不在改变时,终止算法。
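Putting the three steps of an iteration together with damping gives the loop below. It is a small dense sketch in plain Python, not the authors' implementation: the function name `affinity_propagation`, the `max_iter` cap of 200, and the two-point toy input are assumptions; λ = 0.5 and the 10-iteration stopping rule follow the text.

```python
def affinity_propagation(s, damping=0.5, max_iter=200, stable_iter=10):
    """Return, for each point i, the index of its chosen exemplar."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities, initialized to zero
    labels, unchanged = None, 0
    for _ in range(max_iter):
        # (i) update all responsibilities given the availabilities
        for i in range(n):
            for k in range(n):
                rival = max(a[i][kp] + s[i][kp] for kp in range(n) if kp != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # (ii) update all availabilities given the responsibilities
        for k in range(n):
            pos = [max(0.0, r[ip][k]) for ip in range(n)]
            support = sum(pos) - pos[k]  # positive responsibilities from i' != k
            for i in range(n):
                new = support if i == k else min(0.0, r[k][k] + support - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
        # (iii) each point i picks the k that maximizes a(i,k) + r(i,k)
        new_labels = [max(range(n), key=lambda k: a[i][k] + r[i][k])
                      for i in range(n)]
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged >= stable_iter:
            break
    return labels


# Toy input: point 0 has a much higher preference (0) than point 1 (-100).
s = [[0.0, -1.0],
     [-1.0, -100.0]]
labels = affinity_propagation(s)  # both points end up choosing point 0
```

A label equal to its own index marks an exemplar. The triple loop costs O(n³) per iteration and is meant only to mirror the formulas; a practical implementation would compute the maxima and column sums once per row or column.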

Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity. One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the message-passing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar. Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).

图1A展示了AP应用于25个二维数据点(3)的运动过程,使用负平方差作为相似度。AP的一个优点是不需要预先指定exemplar的数量。相反,合适的exemplar的数量从消息传递方法中产生,并取决于输入exemplar的参考度。这能够自动化模型选择是基于前面的对每个点如何作为exemplar合适的描述。图1D展示了输入共同的参考度对簇的数量的影响。这个关系与精确地最小化平方差(2)所发现的关系几乎相同。

Fig. 2. Clustering faces.
Exemplars minimizing the standard squared error measure of similarity were identified from 900 normalized face images (3). For a common preference of −600, affinity propagation found 62 clusters, and the average squared error was 108. For comparison, the best of 100 runs of k-centers clustering with different random initializations achieved a worse average squared error of 119. (A) The 15 images with highest squared error under either affinity propagation or k-centers clustering are shown in the top row. The middle and bottom rows show the exemplars assigned by the two methods, and the boxes show which of the two methods performed better for that image, in terms of squared error. Affinity propagation found higher-quality exemplars. (B) The average squared error achieved by a single run of affinity propagation and 10,000 runs of k-centers clustering, versus the number of clusters. The colored bands show different percentiles of squared error, and the number of exemplars corresponding to the result from (A) is indicated. (C) The above procedure was repeated using the sum of absolute errors as the measure of similarity, which is also a popular optimization criterion.

图2.聚类人脸。
从900个标准化的人脸图像中识别出最小化相似度的标准平方差的exemplars(3)。对于-600的参考度,AP找到62个簇,平均平方差为108。为进行比较,使用不同随机初始化的100个k-centers聚类的最佳运行获得了较差的平均平方差119。(A)顶行显示在AP或k-centers聚类下具有最高平方差的15张图像。 中间和底部两行显示了这两种方法分配的exemplars,而方框则显示了根据平方差,这两种方法中哪一种对图像效果更好。AP发现了更高质量的exemplars。(B)通过单次AP和10,000次k-centers聚类所获得的平均平方差与聚类数的关系。彩色带显示了不同百分比的平方差,并表示与(A)结果相对应的exemplars数目。(C)使用绝对误差之和作为相似度的度量重复上述过程,这也是一种常用的优化准则。

We next studied the problem of clustering images of faces using the standard optimization criterion of squared error. We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3). Affinity propagation found exemplars with much lower squared error than the best of 100 runs of k-centers clustering (Fig. 2A), which took about the same amount of computer time. We asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of k-centers clustering, plotted against the number of clusters. Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).

然后我们研究了利用平方差的标准优化准则聚类人脸图像的问题。我们使用AP和k-centers聚类识别exemplars在900灰度图像提取奥利维蒂面部数据库(3)。发现AP算法exemplars的平方误差远低于k-centers聚类100次运行的最佳结果(图2),花了计算机相同的时间。我们期望大量的随机重新启动k-centers聚类是否能达到相同的平方误差。图2B展示了运行一次AP所获得的错误,以及10000次k-centers聚类所获得的错误分布,并对比了簇的数量。AP在花费更少时间方面超过两个数量级均获得了更低的误差。另一个常用的优化标准是绝对像素差的和(它能更好地容忍外围像素强度),所以我们使用这个误差测度重复了上述过程。AP再次均实现了更低的误差(图2C)。

Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative. To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “non-exon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of −∞, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between point i and k if s(i,k) = −∞, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27% or 15 million) of data point pairs.

许多任务需要在稀疏相关的数据中识别exemplars,例如:大多数相似度要么是未知的,要么是特别大的负数。为了检验在这种背景下的AP,我们提出了聚类假定的外显子发现基因的任务,使用来源于微阵列数据和报告(4)的稀疏相似矩阵。在那项任务中,75066段DNA(60基长)与从老鼠的基因组染色体1被开采的假定的外显子相一致。在12个组织样本中测量它们的转录水平,并计算每一对假定的外显子之间的相似度(数据点)。外显子之间相似度的衡量是基于它们在基因组中的接近度以及它们在12个组织样本中的转录水平的协调程度。为了说明假定的外显子不是外显子(如内含子), 我们加入了一个额外的人工exemplar,并使用整个数据集上的统计数据确定了每个数据点与此“ non-exon exemplars”的相似度。由此得到的75067×75067相似度矩阵(3)由99.73%相似度和负无穷组成,对应于不可能是名义基因的遥远的DNA片段。我们将AP应用到这个相似度矩阵中,但是由于s(i,k) =−∞时,点i和k之间不需要交换消息,因此AP每次迭代只需要在很小的子集(0.27%或1500万)的数据点对之间交换消息。
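The sparsity argument can be illustrated by listing only the pairs with finite similarity, since no messages need to be exchanged where s(i,k) = −∞. The helper name `sparse_edges` and the tiny matrix below are assumptions for illustration; they are unrelated to the exon data.

```python
import math

NEG_INF = -math.inf


def sparse_edges(s):
    """Map each point i to the candidate exemplars k with finite s(i,k)."""
    return {i: [k for k, v in enumerate(row) if v != NEG_INF]
            for i, row in enumerate(s)}


# Hypothetical 3-point problem in which point 2 is unrelated to points 0 and 1.
s = [[-1.0, -2.0, NEG_INF],
     [-2.0, -1.0, NEG_INF],
     [NEG_INF, NEG_INF, -1.0]]
edges = sparse_edges(s)
n_messages = sum(len(ks) for ks in edges.values())
fraction = n_messages / (len(s) ** 2)
```

An implementation that iterates over `edges` instead of the full matrix does work proportional to the number of finite similarities rather than to n².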

Fig. 3. Detecting genes.
Affinity propagation was used to detect putative exons (data points) comprising genes from mouse chromosome 1. Here, squared error is not appropriate as a measure of similarity, but instead similarity values were derived from a cost function measuring proximity of the putative exons in the genome and coexpression of the putative exons across 12 tissue samples (3). (A) A small portion of the data and the emergence of clusters during each iteration of affinity propagation are shown. In each picture, the 100 boxes outlined in black correspond to 100 data points (from a total of 75,066 putative exons), and the 12 colored blocks in each box indicate the transcription levels of the corresponding DNA segment in 12 tissue samples. The box on the far left corresponds to an artificial data point with infinite preference that is used to account for nonexon regions (e.g., introns). Lines connecting data points indicate potential assignments, where gray lines indicate assignments that currently have weak evidence and solid lines indicate assignments that currently have strong evidence. (B) Performance on minimizing the reconstruction error of genes, for different numbers of detected clusters. For each number of clusters, affinity propagation took 6 min, whereas 10,000 runs of k-centers clustering took 208 hours on the same computer. In each case, affinity propagation achieved a significantly lower reconstruction error than k-centers clustering. (C) A plot of true-positive rate versus false-positive rate for detecting exons [using labels from RefSeq (5)] shows that affinity propagation also performs better at detecting biologically verified exons than k-centers clustering.

图3.检测基因。
AP用于检测包含来自小鼠1号染色体基因的假定外显子(数据点)。在这里,平方误差不适合作为相似度性的度量,而相似度的值是从测量基因组中假定外显子的接近度和假定的外显子在12个组织样本中的共表达(3)的成本函数。(A)在AP的每次迭代中显示了一小部分数据和簇的出现。在每张图片中,100个黑色轮廓的方框对应于100个数据点(来自总共75,066个假定的外显子),每个框中的12个彩色块表示12个组织样本中相应DNA片段的转录水平。最左侧的框对应于具有无限参考度的人工数据点,该点用于解释非外显子区域(例如内含子)。连接数据点的线表示潜在的分配,其中灰线表示当前证据不足的分配,实线表示当前证据较强的分配。(B)对于不同数目的检测到的簇,使基因的重建误差最小化。对于每个数量的簇,AP花费6分钟,而在同一台计算机上进行10,000次k-centers聚类花费了208小时。在每种情况下,AP实现的重建误差均远低于k-centers聚类。 (C)检测外显子的真阳性率与假阳性率的关系图(使用来自RefSeq(5)的标签)显示,AP在检测经过生物学验证的外显子方面也比k-centers聚类更好。

Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the nonexon exemplar. The reconstruction errors for affinity propagation and k-centers clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took 6 min, whereas k-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the true-positive (TP) rate against the false-positive (FP) rate, using the labels provided in the RefSeq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At a FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best k-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.

图3A说明了基因簇的识别和一些数据点对非外显子exemplar的分配。图3B比较了AP和k-centers聚类的重建误差。对于每个数量的聚类,AP运行一次,耗时6分钟,而k-centers聚类运行10,000次,耗时208小时。为了解决这些方法在检测真正的基因片段表现如何的问题,图3 C使用RefSeq数据库提供的标签(5)绘制了真阳性(TP)率对假阳性(FP)率。AP取得了更高的TP率,特别是在低FP率下,这对于生物学家而言最重要。在FP率为3%时,AP的TP率为39%,而k-centers的最佳聚类结果为17%。相比之下,在相同的FP率下,层次聚类(2)的最佳TP率为19%,而(4)中所述的工程工具(具有附加的生物学知识)的TP率为43%。

Fig. 4. Identifying key sentences and air-travel routing.
Affinity propagation can be used to explore the identification of exemplars on the basis of nonstandard optimization criteria. (A) Similarities between pairs of sentences in a draft of this manuscript were constructed by matching words. Four exemplar sentences were identified by affinity propagation and are shown. (B) Affinity propagation was applied to similarities derived from air-travel efficiency (measured by estimated travel time) between the 456 busiest commercial airports in Canada and the United States: the travel times for both direct flights (shown in blue) and indirect flights (not shown), including the mean transfer time of up to a maximum of one stopover, were used as negative similarities (3). (C) Seven exemplars identified by affinity propagation are color-coded, and the assignments of other cities to these exemplars are shown. Cities located quite near to exemplar cities may be members of other more distant exemplars due to the lack of direct flights between them (e.g., Atlantic City is 100 km from Philadelphia, but is closer in flight time to Atlanta). (D) The inset shows that the Canada-USA border roughly divides the Toronto and Philadelphia clusters, due to a larger availability of domestic flights compared to international flights. However, this is not the case on the west coast as shown in (E), because extraordinarily frequent airline service between Vancouver and Seattle connects Canadian cities in the northwest to Seattle.

图4.识别关键句子和航线。
AP可用于探索基于非标准优化准则exemplar的识别。(A)在这篇手稿的草稿中,句子对之间的相似度是通过匹配单词来构建的。通过AP识别并显示了四个exemplar句子。(B)将AP应用于加拿大和美国456个最繁忙的商业机场之间的航空旅行效率(通过估计的旅行时间衡量)得出的相似度——直飞(蓝色显示)和间接飞行的旅行时间 (未显示)(包括最多一个中途停留的平均转换时间)被用作负相似度(3)。(C)通过AP识别的七个exemplars进行了颜色编码,并显示了其他城市对这些exemplars的分配。 由于城市之间缺乏直飞航班,因此距离exemplars很近的城市可能是其他较远的exemplars的成员(例如,大西洋城距离费城100公里,但到亚特兰大的飞行时间更近)。(D)插图显示,由于与国际航班相比,国内航班更多,因此加拿大-美国边界将大致划分为多伦多和费城的簇。但是,(E)中所示的西海岸情况并非如此,因为温哥华和西雅图之间非常频繁的航空服务将加拿大西北部的城市和西雅图连接起来。

Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metric-space clustering techniques such as k-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., s(i,k) ≠ s(k,i)] and to problems where the similarities do not satisfy the triangle inequality [i.e., s(i,k) < s(i, j) + s( j,k)]. To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence i to sentence k based on the cost of encoding the words in sentence i using the words in sentence k. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using λ = 0.8) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.

AP能够在非标准优化准则进行操作,因此非常适合使用异常相似度度量进行探索性数据分析。与度量空间聚类技术(例如k-means聚类(1))不同,AP可以应用于数据不在连续空间中的问题。 实际上,它可以应用于相似度不对称的问题[即s(i,k)≠s(k,i)],以及相似度不满足三角不等式的问题[即s(i, k)<s(i,j)+ s(j,k)]。为了在该手稿的草稿中识别出概括其他句子的少量句子,我们将每个句子视为“单词袋”(6),并根据在句子i中,使用句子k中的单词,对单词进行编码的成本来计算句子i与句子k的相似度。我们发现97%的结果相似度(2、3)不是对称的。 调整参考度以识别(使用λ= 0.8)不同数量的代表性exemplar语句(2),图4A显示了具有四个语句的解决方案。

We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated, because the trip from i to k included a long stopover delay in city j so it took longer than the sum of the durations of the trips from i to j and j to k. When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).

我们还应用了AP来探讨根据估计的商业航空公司旅行时间来确定数量有限的加拿大和美国城市的问题,这些城市最容易被其他城市的大部分子集访问。 每个数据点都是一个城市,相似度s(i,k)设置为乘飞机从城市i到城市k所花费的负时间,包括估计的中途停留时间(3)。 由于逆风,在许多情况下,航程时间会根据行进方向而有所不同,因此36%的相似度是不对称的。 此外,对于97%的城市对i和k,存在第三个城市j,从而违反了三角不等式,因为从i到k的行程在城市j中包含了较长的中途停留延迟,因此花费的时间比从i到j和j到k的航行持续时间总和更长。 当“最易到达的城市”的数量被限制为七个时(通过适当地调整输入参考度),确定了图4B至图4E中所示的城市。有趣的是,几个主要城市没有被选中,这是因为大量的国际旅行使它们不适合作为容易到达的国内目的地(如纽约,洛杉矶),或者因为可以通过其他目的地更有效地到达其附近一带(如亚特兰大、费城和明尼阿波利斯覆盖了芝加哥的目的地,同时避免了潜在的机场延误)。
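The two properties discussed here, asymmetry and triangle-inequality violations, are easy to check on any similarity matrix. The sketch below uses hypothetical helper names and a made-up three-city travel-time table in minutes; none of the numbers come from the paper's airline data.

```python
def asymmetric_fraction(s):
    """Fraction of unordered pairs {i, k} with s(i,k) != s(k,i)."""
    n = len(s)
    pairs = [(i, k) for i in range(n) for k in range(i + 1, n)]
    return sum(s[i][k] != s[k][i] for i, k in pairs) / len(pairs)


def violates_triangle(s, i, k):
    """True if some third point j gives s(i,k) < s(i,j) + s(j,k)."""
    return any(s[i][k] < s[i][j] + s[j][k]
               for j in range(len(s)) if j not in (i, k))


# Made-up travel times in minutes; similarity is the negative travel time.
minutes = [[0, 60, 200],
           [70, 0, 60],
           [200, 60, 0]]
s = [[-float(t) for t in row] for row in minutes]

frac = asymmetric_fraction(s)        # only the pair (0, 1) is asymmetric
slow = violates_triangle(s, 0, 2)    # 200 min exceeds 60 + 60 via city 1
fine = violates_triangle(s, 0, 1)    # 60 min is shorter than any detour
```

Because the update rules use only the values s(i,k), neither property has to hold for affinity propagation to be applicable.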

Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of N hidden labels, $c_1, \ldots, c_N$, corresponding to the N data points. Each label indicates the exemplar to which the point belongs, so that $s(i,c_i)$ is the similarity of data point i to its exemplar. $c_i = i$ is a special case indicating that point i is itself an exemplar, so that $s(i,c_i)$ is the input preference for point i. Not all configurations of the labels are valid; a configuration c is valid when for every point i, if some other point i′ has chosen i as its exemplar (i.e., $c_{i'} = i$), then i must be an exemplar (i.e., $c_i = i$). The energy of a valid configuration is $E(c)=-\sum_{i=1}^N s(i,c_i)$. Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraints on the labels and the energy function (2).

AP可以被视为一种搜索能量函数(7)最小值的方法,该函数取决于与N个数据点相对应的一组N个隐藏标签$c_1, \ldots, c_N$。每个标签都指示该点所属的exemplar,因此$s(i,c_i)$是数据点i与其exemplar的相似度。$c_i = i$是一种特殊情况,表示点i本身就是一个exemplar,因此$s(i,c_i)$是点i的输入参考度。并非标签的所有配置都有效;配置c有效的条件是:对于每个点i,如果其他某个点i′选择i作为其exemplar(即$c_{i'} = i$),那么i必须是一个exemplar(即$c_i = i$)。有效配置的能量为$E(c)=-\sum_{i=1}^N s(i,c_i)$。 精确地最小化能量在计算上是棘手的,因为这种最小化问题的特例是NP-hard k-median问题(8)。但是,用于AP的更新规则对应于最小化Bethe自由能(9)近似值的定点递归。AP最容易通过描述了标签约束和能量函数(2)的因子图(10)中的max-sum算法实例得到。
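The validity condition and the energy can be checked on small examples. In the sketch below, the function names are hypothetical and raising an error for invalid configurations is a choice made for this illustration; the text above defines the energy only for valid configurations.

```python
def is_valid(c):
    """True if every label that some point uses is itself an exemplar.

    c[i] is the exemplar chosen by point i, so the condition is
    c[c[i]] == c[i] for every i.
    """
    return all(c[c_i] == c_i for c_i in c)


def energy(s, c):
    """Return E(c) = -sum over i of s(i, c_i) for a valid configuration c."""
    if not is_valid(c):
        raise ValueError("invalid configuration: a chosen exemplar is not "
                         "its own exemplar")
    return -sum(s[i][c_i] for i, c_i in enumerate(c))


# Hypothetical similarities with a shared preference of -20 on the diagonal.
s = [[-20.0, -25.0, -1.0],
     [-25.0, -20.0, -20.0],
     [-1.0, -20.0, -20.0]]

one_cluster = energy(s, [2, 2, 2])      # everyone chooses point 2
all_exemplars = energy(s, [0, 1, 2])    # every point is its own exemplar
chain_ok = is_valid([1, 2, 2])          # point 0 chooses 1, but 1 chooses 2
```

For these numbers the single cluster around point 2 has lower energy than making every point an exemplar, which shows how the preference trades off against within-cluster similarity.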

In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if s(1,2) = s(2,1) and s(1,1) = s(2,2), then the solutions c1 = c2 = 1 and c1 = c2 = 2 both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and nonexemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations, or by increasing the damping factor.

在某些退化的情况下,能量函数可能具有多个最小值,更新规则也相应地具有多个不动点,这可能会阻碍收敛。例如,如果s(1,2) = s(2,1)且s(1,1) = s(2,2),则解c1 = c2 = 1和c1 = c2 = 2的能量相同。在这种情况下,AP可能会振荡,两个数据点在exemplar和非exemplar之间交替。在实践中,我们发现总是可以通过向相似度添加极少量噪声以防止退化情况,或者通过增大阻尼系数来避免振荡。
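Both remedies are simple to express. The sketch below uses hypothetical helper names and 0-based indices. It reproduces the two-point tie from the example, breaks it with a tiny random perturbation, and shows damping as a convex combination of the old and new message values. That form of damping, and the noise scale of 1e-9, are assumptions of this sketch.

```python
import random

def add_tiny_noise(s, scale=1e-9, seed=0):
    """Return a copy of the similarity matrix with a small random offset on each entry."""
    rng = random.Random(seed)
    return [[value + scale * rng.random() for value in row] for row in s]

def damp(old, new, damping=0.5):
    """Damped update: keep a `damping` share of the old message, take the rest from the new one."""
    return damping * old + (1.0 - damping) * new

def energy(s, c):
    """E(c) = -sum_i s(i, c_i) for a configuration assumed to be valid."""
    return -sum(s[i][c_i] for i, c_i in enumerate(c))

# Degenerate case from the text: s(1,2) = s(2,1) and s(1,1) = s(2,2).
s = [
    [-5.0, -1.0],
    [-1.0, -5.0],
]
tie_before = energy(s, [0, 0]) == energy(s, [1, 1])

noisy = add_tiny_noise(s)
tie_after = energy(noisy, [0, 0]) == energy(noisy, [1, 1])

# A larger damping factor moves a message more slowly toward its new value.
slow = damp(old=2.0, new=-2.0, damping=0.9)
fast = damp(old=2.0, new=-2.0, damping=0.5)
```

The noise only has to be large enough to separate the tied energies, so it can stay far below the scale of the similarities themselves.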

Affinity propagation has several advantages over related techniques. Methods such as k-centers clustering (1), k-means clustering (1), and the expectation maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques (13) randomly search for good solutions, but do not share affinity propagation’s advantage of considering many possible solutions all at once.

AP相对于相关技术具有若干优势。k-centers聚类(1)、k-means聚类(1)和期望最大化(EM)算法(11)等方法在每一步只保存一组相对较小的估计聚类中心。从大量簇开始再进行删减的方法(12)对这些技术有所改进,但它们仍然依赖随机采样,并且会做出无法挽回的硬性删减决定。相反,AP同时将所有数据点视为候选中心并逐步识别簇,因此能够避免许多由运气不佳的初始化和硬性决定导致的不良解。马尔可夫链蒙特卡罗技术(13)通过随机搜索寻找好的解,但不具备AP同时考虑多个可能解的优势。

Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.

凝聚层次聚类(14)和谱聚类(15)解决的是一个相当不同的问题:通过递归地比较点对来寻找数据的划分。这些技术并不要求簇内的所有点都与单个中心相似,因此不适合许多任务。特别是,两个本不应在同一簇中的点,可能会因为一连串不恰当的成对分组而被归到一起。
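The "unfortunate sequence of pairwise groupings" can be seen in a small example. The sketch below assumes single-linkage merging, which is one common agglomerative variant chosen here for illustration and not named in the paper, and it uses made-up 1-D points. A chain of evenly spaced points is merged link by link, so its two ends land in the same cluster even though they are farther from each other than the chain's end is from the neighbouring group.

```python
def single_linkage(points, n_clusters):
    """Repeatedly merge the two clusters whose closest members are nearest to each other."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(abs(p - q) for p in clusters[x] for q in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] += clusters.pop(y)  # y > x, so index x is unaffected by the pop
    return clusters

# A chain with spacing 1.0, plus a separate tight pair.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 7.5, 8.0]
clusters = sorted(sorted(c) for c in single_linkage(points, n_clusters=2))
```

Here 0.0 and 5.0 end up together although they are 5.0 apart, while 5.0 and 7.5 are only 2.5 apart and are kept in different clusters. No single point in the chain is close to all of its members.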

In (8), it was shown that the related metric k-median problem could be relaxed to form a linear program with a constant factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).

在(8)中,证明了相关的度量k-median问题可以松弛为一个具有常数因子近似的线性规划。在那里,假定输入是度量的,即非负、对称并满足三角不等式。相比之下,AP可以将一般的非度量相似度作为输入。AP还提供了一种概念上全新且在实践中效果良好的方法。线性规划松弛难以求解,需要使用复杂的软件包(例如CPLEX),而AP使用直观的消息更新,只需几行代码即可实现(2)。
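To give a sense of how short the message updates are, here is a minimal pure-Python sketch of the responsibility and availability updates in their commonly stated form. It is not the authors' reference code (2). The damping value, the fixed iteration count, the made-up 1-D points and the preference of -1.0 are all assumptions chosen for illustration.

```python
def affinity_propagation(s, damping=0.5, iterations=200):
    """Return, for each point i, the index k maximizing a(i,k) + r(i,k).
    `s` is an n x n similarity matrix (n >= 2) with preferences on the diagonal."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) <- sum over i' != k of max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]

# Made-up 1-D data: two well-separated groups of three points.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
preference = -1.0  # shared input preference (assumed value)
s = [
    [preference if i == k else -(x - y) ** 2 for k, y in enumerate(points)]
    for i, x in enumerate(points)
]
labels = affinity_propagation(s)
```

Each entry of `labels` is the index of the exemplar chosen by the corresponding point. A practical implementation would also stop early once the decisions stop changing, and would vectorize the two loops.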

Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages (17) in a “loopy graph” has been used to approach Shannon’s limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard two-dimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.

在思想上,AP与最近在截然不同的学科中取得破纪录结果的技术相关(16)。在“有环图”(loopy graph)中递归传播消息的方法(17)已被用于在纠错解码中逼近Shannon极限(18, 19),求解规模提高一个数量级的随机可满足性问题(20),求解NP-hard的二维相位解缠问题的实例(21),以及从成对的立体图像中高效地估计深度(22)。然而,据我们所知,AP是第一个利用这一思想来解决数据聚类这一古老而基本的问题的方法。由于其简单性、普遍适用性和性能,我们相信AP将在科学和工程中被证明具有广泛的价值。
