Clustering by Passing Messages Between Data Points [Paper Translation]

Yzhang98

Published 2020-08-27 11:06:27

Tags: clustering, machine learning, computer vision, python, algorithms

Article link: https://blog.csdn.net/zxggghjju/article/details/108128825

Clustering by Passing Messages Between Data Points

通过在数据点之间传递信息进行聚类处理

Brendan J. Frey and Delbert Dueck*

Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

通过确定一个有代表性的例子子集来对数据进行聚类，对于处理感官信号和检测数据中的模式非常重要。这样的 "范例 "可以通过随机选择一个初始数据点子集，然后反复完善来找到，但这只有在初始选择接近于一个好的解决方案时才会有好的效果。我们设计了一种名为 "亲和力传播 "的方法，它将数据点对之间的相似度作为输入措施。在数据点之间交换实值信息，直到逐渐形成一个高质量的范例和相应的聚类。我们使用亲和力传播对人脸图像进行聚类，检测微阵列数据中的基因，识别本手稿中的代表性句子，以及识别航空旅行有效访问的城市。亲和传播发现聚类的误差远低于其他方法，而且在不到百分之一的时间内完成。

Main text

Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called “exemplars.” The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.

基于相似度的测量对数据进行聚类是科学数据分析和工程系统中的关键步骤。一种常见的方法是利用数据学习一组中心，使数据点与其最近的中心之间的平方误差之和很小。当中心从实际数据点中选取时，它们被称为 "范例"。流行的k中心聚类技术从一个随机选择的范例的初始集开始，并迭代地完善这个集合，以减少平方误差之和。k中心聚类对范例的初始选择相当敏感，因此通常用不同的初始化重新运行多次，以试图找到一个好的解决方案。然而，只有当聚类的数量较少，并且至少有一个随机初始化是接近一个好的解决方案的机会时，这种方法才会有很好的效果。我们采用了一种完全不同的方法，并引入了一种同时将所有数据点视为潜在范例的方法。通过将每个数据点看作是网络中的一个节点，我们设计了一种方法，它沿着网络的边缘递归地传输实值信息，直到出现一组好的范例和相应的簇正如后面所描述的那样，信息是根据搜索适当选择的能量函数的最小值的简单公式来更新的。在任何时间点上，每个消息的大小都反映了一个数据点选择另一个数据点作为其范例的当前亲和力，所以我们称我们的方法为 "亲和力传播"。图1A说明了簇是如何在消息传递过程中逐渐出现的。

Fig. 1. How affinity propagation works. (A) Affinity propagation is illustrated for two-dimensional data points, where negative Euclidean distance (squared error) was used to measure similarity. Each point is colored according to the current evidence that it is a cluster center (exemplar). The darkness of the arrow directed from point i to point k corresponds to the strength of the transmitted message that point i belongs to exemplar point k. (B) “Responsibilities” r(i,k) are sent from data points to candidate exemplars and indicate how strongly each data point favors the candidate exemplar over other candidate exemplars. (C) “Availabilities” a(i,k) are sent from candidate exemplars to data points and indicate to what degree each candidate exemplar is available as a cluster center for the data point. (D) The effect of the value of the input preference (common for all data points) on the number of identified exemplars (number of clusters) is shown. The value that was used in (A) is also shown, which was computed from the median of the pairwise similarities.

图1.亲和力传播如何工作。（A）亲和力传播被解释为二维数据点，其中负欧氏距离（平方误差）被用来衡量相似度。每个点都根据当前的证据显示它是一个簇中心（范例）而着色。从点i指向点k的箭头的暗度对应于传输的信息的强度，即点i属于范例点k。 (B) "责任 "r(i,k)从数据点发送到候选范例，并表示每个数据点比其他候选范例更倾向于候选范例的程度。(C) "可用性 "a(i,k)从候选范例发送到数据点，并表示每个候选范例作为数据点的集群中心的可用程度。(D)显示了输入偏好的值(所有数据点共同的)对识别的范例数(簇数)的影响。还显示了(A)中使用的值，该值是由对偶相似度的中位数计算出来的。

Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): For points x_i and x_k, s(i,k) = −||x_i − x_k||². Indeed, the method described here can be applied when the optimization criterion is much more general. Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, when appropriate, similarities may be set by hand.

亲和传播将数据点之间的实值相似度集合作为输入，其中相似度s(i,k)表示指数为k的数据点适合成为数据点i的范例的程度，当目标是最小化平方误差时，每个相似度被设置为负平方误差（欧氏距离）。对于点x~i~和x~k~，s(i,k) =-||x~i~ - x~k~||^2^。事实上，当优化标准更为通用时，这里描述的方法可以应用。之后，我们描述了对图像、对微阵列测量、对英语句子和对城市的相似性的任务。当有一个依赖于范例的概率模型时，s(i,k)可以设置为数据点i的对数似然，给定它的范例是点k。另外，在合适的情况下，也可以手工设置相似度。
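The squared-error similarity s(i,k) = −||x_i − x_k||² can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names and the three sample points are hypothetical, and the diagonal is simply left at 0.0 as a placeholder.

```python
def negative_squared_distance(x_i, x_k):
    """s(i,k) = -||x_i - x_k||^2 for two points given as sequences of floats."""
    return -sum((a - b) ** 2 for a, b in zip(x_i, x_k))


def similarity_matrix(points):
    """Return S with S[i][k] = s(i,k) for i != k.

    The diagonal is left at 0.0 as a placeholder (an assumption of this sketch).
    """
    n = len(points)
    return [
        [negative_squared_distance(points[i], points[k]) if i != k else 0.0
         for k in range(n)]
        for i in range(n)
    ]


# Hypothetical two-dimensional sample points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
S = similarity_matrix(points)
```

Larger (less negative) entries mean that point k is better suited to be the exemplar of point i.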

Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number s(k,k) for each data point k so that data points with larger values of s(k,k) are more likely to be chosen as exemplars. These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If a priori, all data points are equally suitable as exemplars, the preferences should be set to a common value; this value can be varied to produce different numbers of clusters. The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).

亲和力传播并不要求预先指定聚类的数量，而是将每个数据点k的实数s(k,k)作为输入，这样s(k,k)值较大的数据点更有可能被选为样本。这些值被称为 "偏好"。识别出的样本数量，即聚类数量，其受输入偏好值的影响，但也会从消息传递过程中出现。如果先验地认为所有的数据点都同样适合作为样本，则应该将偏好设置为一个共同值，这个值可以变化以产生不同数量的簇。共同值可以是输入相似性的中位数，即产生中等数量的聚类或它们的最小值，即产生少量的聚类。
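As a small sketch of the two shared-preference choices mentioned above (median or minimum of the input similarities), the helper names `off_diagonal` and `with_common_preference` are hypothetical, and the 3×3 matrix is made-up sample data.

```python
from statistics import median


def off_diagonal(S):
    """Collect the input similarities s(i,k) with i != k."""
    n = len(S)
    return [S[i][k] for i in range(n) for k in range(n) if i != k]


def with_common_preference(S, preference):
    """Return a copy of S whose diagonal s(k,k) holds one shared preference."""
    out = [row[:] for row in S]
    for k in range(len(out)):
        out[k][k] = preference
    return out


# Hypothetical similarities (negative squared distances between three points).
S = [[0.0, -1.0, -4.0],
     [-1.0, 0.0, -5.0],
     [-4.0, -5.0, 0.0]]
sims = off_diagonal(S)
S_moderate = with_common_preference(S, median(sims))  # moderate number of clusters
S_few = with_common_preference(S, min(sims))          # small number of clusters
```

Sweeping the shared value between these two settings is one way to explore different numbers of clusters.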

There are two kinds of messages exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The “responsibility” r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B). The “availability” a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: a(i,k) = 0. Then, the responsibilities are computed using the rule
$r(i,k) \leftarrow s(i,k) - \max_{k' \text{ s.t. } k' \neq k} \left\{ a(i,k') + s(i,k') \right\}$ (1)
In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar, minus the largest of the similarities between point i and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below. These negative availabilities will decrease the effective values of some of the input similarities s(i,k′) in the above rule, removing the corresponding candidate exemplars from competition. For k = i, the responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point i and all other candidate exemplars. This “self-responsibility” reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.

数据点之间交换的信息有两种，每一种都考虑到了不同的竞争。信息可以在任何阶段组合起来，以决定哪些点是范例点，对于其他每一个点，它属于哪个范例点。"责任 "r(i,k)，从数据点i发送给候选范例点k，反映了k点有多适合作为点i的范例的累积证据，同时考虑到点i的其他潜在范例（图1B）。从候选范例点k向点i发送的 "可用性 "a(i,k)，反映了点i选择点k作为其范例的合适程度的累积证据，同时考虑到其他点对点k应该成为范例的支持(见图1C)。r(i,k)和a(i,k)可以看作是对数概率。首先，可用性初始化为零：a(i,k)=0.然后，使用以下规则计算责任：
$r(i,k) \leftarrow s(i,k) - \max_{k' \text{ s.t. } k' \neq k} \left\{ a(i,k') + s(i,k') \right\}$ (1)
在第一次迭代中，由于可利用性为零，r(i,k)被设置为点i和作为其范例的点k之间的输入相似度，减去点i和其他候选范例之间的最大相似度。这种竞争性更新是数据驱动的，并没有考虑到有多少其他的点有利于每个候选范例。在以后的迭代中，当一些点被有效地分配给其他范例时，它们的可利用性将按照下面的更新规则降到零以下。这些负的可用性将降低上述规则中一些输入相似度s(i,k′)的有效值，将相应的候选范例从竞争中移除。对于k=i，责任r(k,k)被设置为选择点k作为范例的输入偏好s(k,k)减去点i与所有其他候选范例之间的最大相似性。这种 "自我责任 "反映了累积的证据，证明点k是一个范例，基于它的输入偏好，由它的不适合被分配到另一个范例的程度来节制。
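Rule (1) can be written out directly for a dense similarity matrix. The following is a plain, undamped sketch under stated assumptions: `update_responsibilities` is a hypothetical name, matrices are lists of lists, and the sample values are made up (diagonal entries are the preferences).

```python
def update_responsibilities(S, A):
    """Rule (1): r(i,k) <- s(i,k) - max over k' != k of (a(i,k') + s(i,k'))."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Strongest competing candidate exemplar for point i, excluding k.
            rival = max(A[i][kp] + S[i][kp] for kp in range(n) if kp != k)
            R[i][k] = S[i][k] - rival
    return R


# Hypothetical similarities with a shared preference of -4 on the diagonal.
S = [[-4.0, -1.0, -4.0],
     [-1.0, -4.0, -5.0],
     [-4.0, -5.0, -4.0]]
A = [[0.0] * 3 for _ in range(3)]  # availabilities start at zero
R = update_responsibilities(S, A)
```

With zero availabilities, each r(i,k) is just s(i,k) minus the best competing similarity in row i, as described for the first iteration.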

Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:
$a(i,k) \leftarrow \min\left\{ 0,\; r(k,k) + \sum_{i' \text{ s.t. } i' \notin \{i,k\}} \max\left\{ 0, r(i',k) \right\} \right\}$ (2)
The availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self-responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” a(k,k) is updated differently:
$a(k,k) \leftarrow \sum_{i' \text{ s.t. } i' \neq k} \max\left\{ 0, r(i',k) \right\}$ (3)
This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.

上面的职责更新让所有候选范例争夺一个数据点的所有权，而下面的可用性更新则从数据点中收集证据，证明每个候选范例是否会成为一个好范例。
$a(i,k) \leftarrow \min\left\{ 0,\; r(k,k) + \sum_{i' \text{ s.t. } i' \notin \{i,k\}} \max\left\{ 0, r(i',k) \right\} \right\}$ (2)
可用性a(i,k)设置为自我责任r(k,k)加上候选范例k从其他点获得的积极责任之和。只有传入责任的正向部分才会被添加，因为一个好的范例只需要很好地解释一些数据点（正向责任），而不管它对其他数据点的解释有多差（负向责任）。如果自我责任r(k,k)是负的(表明点k目前更适合作为属于另一个范例而不是本身就是一个范例)，如果其他一些点对点k是其范例有正的责任，则可以增加点k作为范例的可用性。为了限制强进的正责任的影响，对总和进行阈值化，使其不能超过零。"自我可用性 "a(k,k)的更新方式不同。
$a(k,k) \leftarrow \sum_{i' \text{ s.t. } i' \neq k} \max\left\{ 0, r(i',k) \right\}$ (3)
该信息反映了根据其他点向候选范例k发出的积极责任，加强了k点是范例的证据。
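Rules (2) and (3) can be sketched the same way. Again this is an undamped illustration with a hypothetical function name and made-up responsibilities, not the authors' implementation.

```python
def update_availabilities(R):
    """Rules (2) and (3) for a dense responsibility matrix R (list of lists)."""
    n = len(R)
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Positive parts of the responsibilities sent to candidate exemplar k.
        pos = [max(0.0, R[ip][k]) for ip in range(n)]
        others = sum(pos) - pos[k]  # sum over i' != k
        for i in range(n):
            if i == k:
                A[k][k] = others  # rule (3): self-availability
            else:
                # Rule (2): drop point i's own term, add r(k,k), cap at zero.
                A[i][k] = min(0.0, R[k][k] + others - pos[i])
    return A


# Hypothetical responsibilities for three points.
R = [[-3.0, 3.0, -3.0],
     [3.0, -3.0, -4.0],
     [0.0, -1.0, 0.0]]
A = update_availabilities(R)
```

Only a(k,k) can become positive; every a(i,k) with i ≠ k is capped at zero, which matches the thresholding described above.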

The above update rules require only simple, local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar if k = i, or identifies the data point that is the exemplar for point i. The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages, it is important that they be damped to avoid numerical oscillations that arise in some circumstances. Each message is set to λ times its value from the previous iteration plus 1 − λ times its prescribed updated value, where the damping factor λ is between 0 and 1. In all of our experiments (3), we used a default damping factor of λ = 0.5, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.

上述更新规则只需要简单的、易于实现的局部计算，消息只需要在具有已知相似性的点对之间进行交换。在亲和力传播过程中的任何一点上，都可以将可用性和责任结合起来，以确定示范点。对于点i来说，如果k=i，则最大化a(i,k)+r(i,k)的k值可以将点i识别为范例，或者识别出数据点是点i的范例。消息传递过程可以在固定的迭代次数之后，在消息的变化低于阈值之后，或者在局部决定在一些迭代次数上保持不变之后终止。当更新消息时，重要的是要对它们进行阻尼，以避免在某些情况下出现数值振荡。每个消息被设置为l倍于上一次迭代的值加上1-λ倍于其规定的更新值，其中阻尼系数l在0和1之间，在我们所有的实验(3)中，我们使用默认的阻尼系数l=0.5，亲和力传播的每个迭代包括(i)更新所有给定的责任，(ii)更新所有给定的责任，和(iii)结合给定的责任和责任来监控范例决策，当这些决策在10次迭代中没有变化时，终止算法。
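Putting steps (i) to (iii) together, the whole procedure might look like the sketch below. It is a dense, cubic-time illustration for small inputs only, assuming a list-of-lists similarity matrix with the preferences on its diagonal; the function name, the `max_iter` cap, the sample points, and the preference of −50 are all assumptions made for this example.

```python
def affinity_propagation(S, damping=0.5, max_iter=200, stable_iters=10):
    """Return labels, where labels[i] is the k maximizing a(i,k) + r(i,k)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i,k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i,k), initialized to zero
    labels, unchanged = None, 0
    for _ in range(max_iter):
        # (i) Update all responsibilities given the availabilities, rule (1).
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1.0 - damping) * (S[i][k] - rival)
        # (ii) Update all availabilities given the responsibilities, rules (2), (3).
        for k in range(n):
            pos = [max(0.0, R[j][k]) for j in range(n)]
            others = sum(pos) - pos[k]  # positive responsibilities from i' != k
            for i in range(n):
                target = others if i == k else min(0.0, R[k][k] + others - pos[i])
                A[i][k] = damping * A[i][k] + (1.0 - damping) * target
        # (iii) Combine the messages and monitor the exemplar decisions.
        new_labels = [max(range(n), key=lambda k: A[i][k] + R[i][k])
                      for i in range(n)]
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged >= stable_iters:
            break
    return labels


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
preference = -50.0  # hypothetical shared preference s(k,k)
S = [[preference if i == k else -sum((a - b) ** 2 for a, b in zip(p, q))
      for k, q in enumerate(points)]
     for i, p in enumerate(points)]
labels = affinity_propagation(S)
```

On this toy input the three points near the origin end up sharing one exemplar and the three points near (10, 10) share another. For sparse problems one would exchange messages only between pairs with known similarities, which this dense sketch does not attempt.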

Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity. One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the message-passing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar. Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).

图1A显示了应用于25个二维数据点(3)的亲和力传播动态，以负平方误差作为的相似性。亲和力传播的一个优点是，不需要事先指定范例的数量。取而代之的是，适当的范例数量从消息传递方法中出现，并取决于输入的范例偏好。这使得自动模型选择，基于事先指定每个点作为范例的优选程度.图1D显示了共同输入偏好的值对聚类数量的影响。这种关系与精确最小化平方误差(2)所发现的关系几乎相同。

We next studied the problem of clustering images of faces using the standard optimization criterion of squared error. We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3). Affinity propagation found exemplars with much lower squared error than the best of 100 runs of k-centers clustering (Fig. 2A), which took about the same amount of computer time. We asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of k-centers clustering, plotted against the number of clusters. Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).

接下来我们研究了利用平方误差的标准优化准则对人脸图像进行聚类的问题。我们同时使用了亲和传播和k中心聚类来识别从Olivetti人脸数据库中提取的900张灰度图像中的范例。亲和传播发现的样本的方差误差比k-centers聚类的100次最佳运行（见图2A）要低得多，而k-centers聚类需要的计算机时间差不多。图2B显示了亲和传播的一次运行所实现的误差和k-centers聚类的10000次运行所实现的误差分布，并对聚类数量进行了绘制。亲和传播均匀地实现了更低的误差，时间减少了两个数量级以上。另一个流行的优化标准是绝对像素差值的总和（它能更好地容忍离群的像素强度），所以我们使用这个误差度量重复上述过程。亲和传播再次统一实现了较低的误差（见图2C）。

Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative. To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “nonexon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of −∞, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between point i and k if s(i,k) = −∞, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27% or 15 million) of data point pairs.

许多任务需要在稀疏相关的数据中识别范例，即大多数相似性是未知的或大而负的。为了研究这种情况下的亲和力传播，我们解决了聚类推定外显子以寻找基因的任务，使用从微阵列数据中得出的稀疏相似性矩阵，并在（4）中报道。在该工作中，从小鼠1号染色体的基因组中挖掘出对应于推定外显子的75,066段DNA（60个碱基长）。在12个组织样本中测量它们的转录水平，并计算每对推定外显子（数据点）之间的相似性。根据外显子在基因组中的接近程度和12个组织中转录水平的协调程度来衡量它们之间的相似性。为了考虑到假定的外显子不是外显子（例如，内含子），我们包括一个额外的人工范例，并确定每个其他数据点的相似性，这个 "非外显子范例"使用统计整个数据集。所得的75,067×75,067相似度矩阵由99.73%的相似度组成，其值为-∞，对应于不可能是同一基因的远距离DNA片段。我们对这个相似度矩阵应用了亲和力传播，但由于如果s(i,k)=-∞，点i和k之间不需要交换信息，所以亲和力传播的每次迭代只需要在极小的数据点对子集(0.27%或1500万)之间交换信息。

Fig. 2. Clustering faces. Exemplars minimizing the standard squared error measure of similarity were identified from 900 normalized face images (3). For a common preference of −600, affinity propagation found 62 clusters, and the average squared error was 108. For comparison, the best of 100 runs of k-centers clustering with different random initializations achieved a worse average squared error of 119. (A) The 15 images with highest squared error under either affinity propagation or k-centers clustering are shown in the top row. The middle and bottom rows show the exemplars assigned by the two methods, and the boxes show which of the two methods performed better for that image, in terms of squared error. Affinity propagation found higher-quality exemplars. (B) The average squared error achieved by a single run of affinity propagation and 10,000 runs of k-centers clustering, versus the number of clusters. The colored bands show different percentiles of squared error, and the number of exemplars corresponding to the result from (A) is indicated. (C) The above procedure was repeated using the sum of absolute errors as the measure of similarity, which is also a popular optimization criterion.

图2 脸部聚类。从900张归一化人脸图像中识别出相似度标准平方误差最小的范例。对于-600的共同偏好，亲和力传播发现62个簇，平均平方误差为108。作为比较，在100次运行中，用不同的随机初始化的k中心聚类的最好的一次，取得了较差的平均平方误差为119。(A)上行显示的是亲和传播或k中心聚类下平方误差最高的15幅图像。中行和下行显示了两种方法分配的范例，方框显示了两种方法中哪种方法对该图像的平方误差表现更好。亲和传播发现了更高质量的范例。(B)亲和传播的单次运行和k中心聚类的10,000次运行所实现的平均平方误差与聚类数量的关系。彩色的带子显示了平方误差的不同百分位数，并表示了与（A）的结果相对应的范例数量。(C)使用绝对误差之和作为相似度的衡量标准，重复上述过程，这也是一种流行的优化标准。

Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the nonexon exemplar. The reconstruction errors for affinity propagation and k-centers clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took 6 min, whereas k-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the true-positive (TP) rate against the false-positive (FP) rate, using the labels provided in the RefSeq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At a FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best k-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.

图3A示出了基因簇的识别和一些数据点对非外显子范例的分配。图3B中比较了亲和传播和kcenters聚类的重建误差。对于每个簇的数量，亲和传播运行一次，耗时6分钟，而k-centers聚类运行10000次，耗时208小时。为了解决这些方法在检测真正基因段方面的表现如何的问题，图3C利用RefSeq数据库中提供的标签，绘制了真阳性(TP)率与假阳性(FP)率的对比图。亲和传播实现了显著更高的TP率，特别是在低FP率下，这对生物学家来说是最重要的。在3%的FP率下，亲和传播实现了39%的TP率，而最好的k中心聚类结果是17%。相比之下，在相同的FP率下，分层聚类的最佳TP率为19%，而(4)中所述的工程工具，占额外的生物知识，实现了43%的TP率。

Fig. 3. Detecting genes. Affinity propagation was used to detect putative exons (data points) comprising genes from mouse chromosome 1. Here, squared error is not appropriate as a measure of similarity, but instead similarity values were derived from a cost function measuring proximity of the putative exons in the genome and coexpression of the putative exons across 12 tissue samples (3). (A) A small portion of the data and the emergence of clusters during each iteration of affinity propagation are shown. In each picture, the 100 boxes outlined in black correspond to 100 data points (from a total of 75,066 putative exons), and the 12 colored blocks in each box indicate the transcription levels of the corresponding DNA segment in 12 tissue samples. The box on the far left corresponds to an artificial data point with infinite preference that is used to account for nonexon regions (e.g., introns). Lines connecting data points indicate potential assignments, where gray lines indicate assignments that currently have weak evidence and solid lines indicate assignments that currently have strong evidence. (B) Performance on minimizing the reconstruction error of genes, for different numbers of detected clusters. For each number of clusters, affinity propagation took 6 min, whereas 10,000 runs of k-centers clustering took 208 hours on the same computer. In each case, affinity propagation achieved a significantly lower reconstruction error than k-centers clustering. (C) A plot of true-positive rate versus false-positive rate for detecting exons [using labels from RefSeq (5)] shows that affinity propagation also performs better at detecting biologically verified exons than k-centers clustering.

图3.检测基因。亲和力传播用于检测由小鼠1号染色体的基因组成的推定外显子（数据点）。在这里，平方误差并不适合作为相似性的衡量标准，而是从测量基因组中推定外显子的接近度和跨12个组织样本的共表达的成本函数得出相似性值。(A)一小部分数据和亲和力传播的每一次迭代过程中出现的簇显示。在每张图片中。黑色的100个方框对应100个数据点（来自总共75,066个推定外显子），每个方框中的12个彩色块表示12个组织样本中相应DNA片段的转录水平。最左边的方框对应于一个具有无限优先权的人工数据点，该数据点用于核算非外显子区域（如内含子）。连接数据点的线表示潜在的赋值，其中灰色线表示目前证据较弱的赋值，实线表示目前证据较强的赋值。(B)对于不同数量的检测到的聚类，将基因的重建误差最小化的性能。对于每个簇的数量，亲和传播需要6分钟，而在同一台计算机上运行10000次k-centers聚类需要208小时。在每一种情况下，亲和传播实现了比k中心聚类显著更低的重建误差。(C)检测外显子的真阳性率与假阳性率对比图[使用RefSeq(5)的标签]显示，亲和传播在检测生物验证外显子方面的表现也比k-centers聚类更好。

Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metric-space clustering techniques such as k-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., s(i,k) ≠ s(k,i)] and to problems where the similarities do not satisfy the triangle inequality [i.e., s(i,k) < s(i,j) + s(j,k)]. To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence i to sentence k based on the cost of encoding the words in sentence i using the words in sentence k. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using λ = 0.8) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.

亲和传播能够在非标准优化标准的基础上进行操作，这使得它适合于使用不寻常的相似性测量方法进行探索性数据分析。与度量空间聚类技术（如k-means聚类（1））不同，亲和传播可以应用于数据不在连续空间的问题。事实上，它可以应用于相似性不对称的问题[即s(i,k)≠s(k,i)]，也可以应用于相似性不满足三角形不等式的问题[即s(i,k)<s(i,j)+s( j,k)]。为了识别本手稿中少量总结其他句子的句子，我们将每个句子视为一个 "词袋"(6)，并根据句子i中词的编码成本计算句子i与句子k的相似度，用句子k中的单词，我们发现97%的结果相似度）不是对称的。调整偏好以确定当使用λ- 0.8时不同数量的代表性例句(2)，有4个句子的解决方案如图4A所示。

We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated, because the trip from i to k included a long stopover delay in city j so it took longer than the sum of the durations of the trips from i to j and j to k. When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).

我们还应用了亲和力传播来探索识别加拿大和美国有限数量的城市的问题，这些城市是是其他城市的大子集最容易到达的地方，从估计的商业航空旅行时间来看。每个数据点都是一个城市，相似度s(i,k)设置为乘坐航空公司从城市i到城市k所需时间的负值，包括估计的中途停留延误。由于逆风的影响，在很多情况下，中转时间根据旅行方向的不同而不同，因此36%的相似度是不对称的。此外，对于97%的城市对i和k，有第三个城市j，这样就违反了三角形不等式，因为从i到k的行程中包含了在城市j的长时间停留延迟，所以它的时间比从i到j和j到k的行程持续时间之和还要长。当 "最容易到达的城市 "的数量被约束为7个（通过适当调整输入偏好），图4中B到E所示的城市被确定。有趣的是，有几个主要城市没有被选中，要么是因为大量的国际旅行使它们不适合作为容易到达的国内目的地，如纽约市、洛杉矶，要么是因为它们的附近地区可以通过其他目的地更有效地到达，如亚特兰大、费城和明尼阿波利斯占芝加哥的目的地，同时避免潜在的机场延。
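The two statistics quoted for the airline data (the share of asymmetric similarities and the share of city pairs with a triangle-inequality violation) can be computed as in the sketch below. The function names and the three-city travel times are hypothetical, and the exact counting conventions used in the paper are not stated, so unordered pairs are assumed for asymmetry and ordered pairs for triangle violations.

```python
def fraction_asymmetric(S):
    """Share of unordered pairs {i,k} with s(i,k) != s(k,i)."""
    n = len(S)
    pairs = [(i, k) for i in range(n) for k in range(i + 1, n)]
    return sum(S[i][k] != S[k][i] for i, k in pairs) / len(pairs)


def fraction_violating_triangle(S):
    """Share of ordered pairs (i,k) for which some third point j gives
    s(i,k) < s(i,j) + s(j,k)."""
    n = len(S)
    pairs = [(i, k) for i in range(n) for k in range(n) if i != k]

    def violated(i, k):
        return any(S[i][k] < S[i][j] + S[j][k]
                   for j in range(n) if j not in (i, k))

    return sum(violated(i, k) for i, k in pairs) / len(pairs)


# Hypothetical travel times in hours between three cities; T[i][k] is i -> k.
T = [[0.0, 2.0, 7.0],
     [2.5, 0.0, 3.0],
     [7.0, 3.0, 0.0]]
S = [[-t for t in row] for row in T]  # similarity = negative travel time
asym = fraction_asymmetric(S)
tri = fraction_violating_triangle(S)
```

In this made-up example the trip between cities 0 and 2 takes longer than going via city 1 in either direction, so two of the six ordered pairs violate the triangle inequality, and one of the three unordered pairs is asymmetric.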

Fig. 4. Identifying key sentences and air-travel routing. Affinity propagation can be used to explore the identification of exemplars on the basis of nonstandard optimization criteria. (A) Similarities between pairs of sentences in a draft of this manuscript were constructed by matching words. Four exemplar sentences were identified by affinity propagation and are shown. (B) Affinity propagation was applied to similarities derived from air-travel efficiency (measured by estimated travel time) between the 456 busiest commercial airports in Canada and the United States; the travel times for both direct flights (shown in blue) and indirect flights (not shown), including the mean transfer time of up to a maximum of one stopover, were used as negative similarities (3). (C) Seven exemplars identified by affinity propagation are color-coded, and the assignments of other cities to these exemplars are shown. Cities located quite near to exemplar cities may be members of other more distant exemplars due to the lack of direct flights between them (e.g., Atlantic City is 100 km from Philadelphia, but is closer in flight time to Atlanta). (D) The inset shows that the Canada-USA border roughly divides the Toronto and Philadelphia clusters, due to a larger availability of domestic flights compared to international flights. However, this is not the case on the west coast as shown in (E), because extraordinarily frequent airline service between Vancouver and Seattle connects Canadian cities in the northwest to Seattle.

图4 识别关键句子和空中旅行路线。基于非标准优化标准，可以使用亲和传播来探索范例的识别。（A）该手稿草稿中成对句子之间的相似性是通过匹配单词来构造的。通过亲和力传播识别并显示了四个范例性句子。(B)亲和力传播应用于从加拿大和美国456个最繁忙的商业机场之间的航空旅行效率，其以估计旅行时间衡量，来得出的相似性，直航(蓝色显示)和间接航班(未显示)的旅行时间，包括最多一次停留的平均转机时间，被用作负相似性。(C)通过亲和力传播确定的7个示范城市用颜色编码，并显示出其他城市对这些示范城市的分配情况。离示范城市相当近的城市可能是其他较远的示范城市的成员，因为它们之间没有直航，例如，大西洋城距离费城100公里，但在飞行时间上更接近亚特兰大。(D)插图显示，加拿大与美国的边界大致划分了多伦多和费城的群落，这是因为与国际航班相比，国内航班的可用性更大。然而，西海岸的情况并非如此，如(E)所示，因为温哥华和西雅图之间非常频繁的空运服务将西北部的加拿大城市连接到了西雅图。

Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of N hidden labels, c_1, …, c_N, corresponding to the N data points. Each label indicates the exemplar to which the point belongs, so that s(i,c_i) is the similarity of data point i to its exemplar. c_i = i is a special case indicating that point i is itself an exemplar, so that s(i,c_i) is the input preference for point i. Not all configurations of the labels are valid; a configuration c is valid when for every point i, if some other point i′ has chosen i as its exemplar (i.e., c_i′ = i), then i must be an exemplar (i.e., c_i = i). The energy of a valid configuration is $E(c)=-\sum_{i=1}^{N} s(i,c_i)$. Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraints on the labels and the energy function (2).

亲和传播可以看作是一种寻找能量函数的最小值的方法，该方法依赖于一组N个隐藏标签c1，...，cN，对应N个数据点。每个标签表示该点所属的范例，所以s(i,ci)是数据点i与其范例的相似度，c~i~=i是一种特殊情况，表示点i本身就是范例，所以s(i,ci)是点i的输入偏好，并不是所有标签的配置都是有效的，当对于每一个点i来说，如果其他一些点i′选择了i作为范例(即。c~i′~=i），则i一定是一个范例（即ci=i）。一个有效配置的能量是 $E(c)=-\sum_{i=1}^{N} s(i,c_i)$ . 准确地将能量最小化在计算上是难以解决的，因为这个最小化问题的一个特殊情况是NP-hard的k-median问题(8)。然而，亲和传播的更新规则对应于最小化Bethe自由能量近似的定点递归。亲和传播最容易推导为描述标签和能量函数约束的因子图(10)中最大和算法的实例。
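The validity condition and the energy E(c) = −Σ s(i,c_i) translate almost literally into code. This is a small sketch with hypothetical function names and made-up similarities; labels are 0-based indices here.

```python
def is_valid(c):
    """Valid if every point chosen as an exemplar is its own exemplar."""
    return all(c[c_i] == c_i for c_i in c)


def energy(S, c):
    """E(c) = -sum_i s(i, c_i), defined for valid configurations only."""
    if not is_valid(c):
        raise ValueError("configuration is not valid")
    return -sum(S[i][c[i]] for i in range(len(c)))


# Hypothetical similarities with preferences of -4 on the diagonal.
S = [[-4.0, -1.0, -4.0],
     [-1.0, -4.0, -5.0],
     [-4.0, -5.0, -4.0]]
e_single = energy(S, [0, 0, 0])  # point 0 is the only exemplar
e_other = energy(S, [1, 1, 1])   # point 1 is the only exemplar
```

Here `[1, 0, 2]` would be rejected, because point 0 chooses point 1 while point 1 is not its own exemplar.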

In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if s(1,2) = s(2,1) and s(1,1) = s(2,2), then the solutions c_1 = c_2 = 1 and c_1 = c_2 = 2 both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and nonexemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations, or by increasing the damping factor. Affinity propagation has several advantages over related techniques. Methods such as k-centers clustering (1), k-means clustering (1), and the expectation maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques (13) randomly search for good solutions, but do not share affinity propagation’s advantage of considering many possible solutions all at once.

Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.
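
The grouping problem described above can be illustrated with a toy example. The sketch below uses single-linkage agglomeration with a distance threshold, which is only one variant of hierarchical agglomerative clustering and is chosen here as an assumption. The function name, the one-dimensional data, and the threshold are hypothetical.

```python
def single_linkage(points, threshold):
    """Repeatedly merge the two closest clusters while their closest
    members are within `threshold` of each other (1-D points)."""
    clusters = [[p] for p in points]
    while True:
        best = None  # (distance, index_a, index_b) of the closest mergeable pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid

# Evenly spaced points: each neighbour is 1.0 away, the endpoints are 5.0 apart.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
threshold = 1.5
clusters = single_linkage(points, threshold)
span = max(clusters[0]) - min(clusters[0])
```

Every merge joins points that are close to each other, yet the chain of merges places the two endpoints in the same cluster even though their distance is far above the threshold. No single member is required to be similar to all the others, which is the contrast with exemplar-based clustering drawn in the text.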

In (8), it was shown that the related metric k-median problem could be relaxed to form a linear program with a constant-factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).
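
To give a sense of how compact the message updates are, here is a minimal pure-Python sketch of the responsibility and availability updates with damping. It is not the authors' implementation (2, 3). The function name, the default damping factor and iteration count, the rule for reading off exemplars from the diagonal, the fallback when no exemplar emerges, and the toy data with a median preference are all assumptions.

```python
from statistics import median

def affinity_propagation(S, damping=0.5, iterations=200):
    """Return labels[i] = index of the exemplar assigned to point i.
    S[i][k] is the similarity s(i,k); S[k][k] is the preference of k."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i,k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i,k)
    for _ in range(iterations):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(k,k) <- sum over i' != k of max(0, r(i',k))
        # a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        for k in range(n):
            support = sum(max(0.0, R[j][k]) for j in range(n) if j != k)
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, R[k][k] + support - max(0.0, R[i][k]))
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Assumption: a point is an exemplar when r(k,k) + a(k,k) > 0.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:  # fallback (not part of the algorithm): pick the best candidate
        exemplars = [max(range(n), key=lambda k: R[k][k] + A[k][k])]
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]

# Hypothetical 1-D data; similarity is negative squared distance.
x = [0.0, 0.2, 0.4, 10.0, 10.2, 10.4]
S = [[-(a - b) ** 2 for b in x] for a in x]
pref = median(S[i][k] for i in range(len(x)) for k in range(len(x)) if i != k)
for i in range(len(x)):
    S[i][i] = pref  # shared preference for every point

labels = affinity_propagation(S)
```

Nothing in the updates requires the similarities to be symmetric or to satisfy the triangle inequality, so the same function accepts general nonmetric input. The inner loops cost O(N³) per iteration as written, which keeps the sketch short but is only suitable for small examples.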

Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages (17) in a “loopy graph” has been used to approach Shannon’s limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard two-dimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.

References and Notes

1. J. MacQueen, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. Le Cam, J. Neyman, Eds. (Univ. of California Press, Berkeley, CA, 1967), vol. 1, pp. 281-297.
2. Supporting material is available on Science Online.
3. Software implementations of affinity propagation, along with the data sets and similarities used to obtain the results described in this manuscript, are available at www.psi.toronto.edu/affinitypropagation.
4. B. J. Frey et al., Nat. Genet. 37, 991 (2005).
5. K. D. Pruitt, T. Tatusova, D. R. Maglott, Nucleic Acids Res. 31, 34 (2003).
6. C. D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, Cambridge, MA, 1999).
7. J. J. Hopfield, Proc. Natl. Acad. Sci. U.S.A. 79, 2554 (1982).
8. M. Charikar, S. Guha, A. Tardos, D. B. Shmoys, J. Comput. Syst. Sci. 65, 129 (2002).
9. J. S. Yedidia, W. T. Freeman, Y. Weiss, IEEE Trans. Inf. Theory 51, 2282 (2005).
10. F. R. Kschischang, B. J. Frey, H.-A. Loeliger, IEEE Trans. Inf. Theory 47, 498 (2001).
11. A. P. Dempster, N. M. Laird, D. B. Rubin, Proc. R. Stat. Soc. B 39, 1 (1977).
12. S. Dasgupta, L. J. Schulman, Proc. 16th Conf. UAI (Morgan Kaufmann, San Francisco, CA, 2000), pp. 152-159.
13. S. Jain, R. M. Neal, J. Comput. Graph. Stat. 13, 158 (2004).
14. R. R. Sokal, C. D. Michener, Univ. Kans. Sci. Bull. 38, 1409 (1958).
15. J. Shi, J. Malik, IEEE Trans. Pattern Anal. Mach. Intell. 22, 888 (2000).
16. M. Mézard, Science 301, 1685 (2003).
17. J. Pearl, Probabilistic Reasoning in Intelligent Systems (Morgan Kaufmann, San Mateo, CA, 1988).
18. D. J. C. MacKay, IEEE Trans. Inf. Theory 45, 399 (1999).
19. C. Berrou, A. Glavieux, IEEE Trans. Commun. 44, 1261 (1996).
20. M. Mézard, G. Parisi, R. Zecchina, Science 297, 812 (2002).
21. B. J. Frey, R. Koetter, N. Petrovic, in Proc. 14th Conf. NIPS (MIT Press, Cambridge, MA, 2002), pp. 737-743.
22. T. Meltzer, C. Yanover, Y. Weiss, in Proc. 10th Conf. ICCV (IEEE Computer Society Press, Los Alamitos, CA, 2005), pp. 428-435.
23. We thank B. Freeman, G. Hinton, R. Koetter, Y. LeCun, S. Roweis, and Y. Weiss for helpful discussions and P. Dayan, G. Hinton, D. MacKay, M. Mézard, S. Roweis, and C. Tomasi for comments on a previous draft of this manuscript. We acknowledge funding from Natural Sciences and Engineering Research Council of Canada, Genome Canada/Ontario Genomics Institute, and the Canadian Institutes of Health Research. B.J.F. is a Fellow of the Canadian Institute for Advanced Research.
