【论文翻译】Clustering by Passing Messages Between Data Points

论文题目/作者信息:Clustering by Passing Messages Between Data Points (Brendan J. Frey and Delbert Dueck)
翻译人:jingxingv

Abstract

Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points.

Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel.

Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

在处理感知信号和检测数据中的模式时,通过识别具有代表性的例子(exemplars)来聚类数据是十分重要的。这样的exemplars可以通过随机选择一个初始数据点子集、然后迭代精炼该子集来找到,但该方法只有在初始选择接近一个好的解时才有效。我们设计了一种称为affinity propagation(亲和传播)的方法,它将成对数据点之间的相似性度量作为输入。数据点之间交换实值消息,直到逐渐产生一组高质量的exemplars及其对应的簇。我们使用亲和传播对人脸图像进行聚类、在微阵列数据中检测基因、识别这份手稿中有代表性的句子,以及识别通过航空旅行可以高效到达的城市。与其他方法相比,亲和传播以低得多的误差找到簇,并且所花时间不到其百分之一。

正文

Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small.

基于相似性度量的数据聚类是科学数据分析和工程系统中的关键步骤。一种常见的方法是使用数据来学习一组中心,以便数据点和它们最近的中心之间的平方误差之和最小。

When the centers are selected from actual data points, they are called “exemplars.” The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors.

当从实际数据点中选择中心时,它们被称为“exemplars”。流行的k-centers聚类技术就是开始于一些随机的exemplars集,并以减少误差平方和为目标,迭代地改进这个集合。

k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution.

k-centers聚类对初始选择的exemplars十分敏感,因此通常需要用许多不同的初始化重复运行,以期找到一个好的解。然而,这种方法只有在簇的数目较少、并且至少有一次随机初始化有较大机会接近好的解时才有效。

We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges.

我们采用了一种完全不同的方法,并引入了一种同时将所有数据点视为潜在exemplars的方法。通过将每个数据点视为网络中的一个节点,我们设计了一种沿网络的边递归传递实值消息的方法,直到产生一组好的exemplars和相应的簇。

As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.

正如后文所述,消息基于搜索某个适当选取的能量函数最小值的简单公式进行更新。在任何时刻,每条消息的大小都反映了一个数据点选择另一个数据点作为其exemplar的当前亲和度(affinity),因此我们将该方法称为“affinity propagation”(亲和传播)。图1A展示了簇是如何在消息传递过程中逐渐显现的。


Fig.1. 亲和传播的工作原理。(A)展示了二维数据点上的亲和传播,其中负欧氏距离(平方误差)被用来度量相似度。每个点根据其当前作为聚类中心(exemplar)的证据着色。从点i指向点k的箭头的深浅,对应于“点i隶属于exemplar点k”这一当前证据的强度。(B)“responsibility” r(i,k)从数据点发送到候选exemplars,表明相对于其他候选exemplars,每个数据点在多大程度上支持该候选exemplar。(C)“availability” a(i,k)从候选exemplars发送到数据点,表明每个候选exemplar在多大程度上适合作为该数据点的聚类中心。(D)显示了输入偏好值(所有数据点共用)对识别出的exemplar数目(簇数)的影响。图中同时标出了(A)中使用的值,它由两两相似度的中位数计算得到。

Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i.

亲和传播将数据点之间的实值相似度作为输入,相似度s(i,k)表示索引为k的数据点在多大程度上适合作为数据点i的exemplar。

When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): for points x_i and x_k, s(i,k) = −||x_i − x_k||². Indeed, the method described here can be applied when the optimization criterion is much more general.

当目标是最小化平方误差时,每个相似度被设置为负的平方误差(欧氏距离):对于点x_i和x_k,s(i,k) = −||x_i − x_k||²。实际上,这里描述的方法可以应用于优化准则更一般的情况。
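下面用一小段Python示意如何由数据点构造上述负平方欧氏距离相似度矩阵(非论文原代码,函数名 similarity_matrix 为自拟,仅作演示):

```python
import numpy as np

def similarity_matrix(X):
    """Pairwise similarities s(i,k) = -||x_i - x_k||^2 (negative squared Euclidean distance)."""
    # use the identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return -np.maximum(d2, 0.0)  # clip tiny negative values from rounding

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
S = similarity_matrix(X)
# S[0, 1] = -1.0, S[0, 2] = -50.0
```

对角线元素此时为0,后面会被偏好值s(k,k)覆盖。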

Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, when appropriate, similarities may be set by hand.

随后,我们描述相似度分别来自成对图像、成对微阵列测量、成对英语句子和成对城市的任务。当可以获得依赖于exemplar的概率模型时,s(i,k)可设为在点k为其exemplar的条件下数据点i的对数似然。或者,在适当的时候,相似度也可以手工设定。

Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number s(k,k) for each data point k so that data points with larger values of s(k,k) are more likely to be chosen as exemplars.

亲和传播不要求预先指定簇的数目,而是将每个数据点k的一个实数s(k,k)作为输入,使得s(k,k)值较大的数据点更有可能被选为exemplars。

These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If a priori, all data points are equally suitable as exemplars, the preferences should be set to a common value—this value can be varied to produce different numbers of clusters.

这些值(即输入的s(k,k))被称为“preferences”(偏好值)。识别出的exemplars数量(即簇的数量)受输入偏好值的影响,但也是从消息传递过程中自然产生的。如果先验地认为所有数据点都同等适合作为exemplars,则偏好值应设为一个共同的值——改变这个值可以产生不同数目的簇。

The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).

这个共享值可以取输入相似度的中位数(产生中等数量的簇)或它们的最小值(产生少量的簇)。
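以下是把共享偏好值(默认取非对角相似度的中位数)填入矩阵对角线的一个示意写法(set_preferences 为自拟的辅助函数,并非论文或某个库的API):

```python
import numpy as np

def set_preferences(S, value=None):
    """Return a copy of S whose diagonal holds the shared preference s(k,k).

    By default the median of the off-diagonal similarities is used,
    which the paper suggests yields a moderate number of clusters."""
    S = S.copy()
    off = S[~np.eye(len(S), dtype=bool)]       # all off-diagonal entries
    p = np.median(off) if value is None else value
    np.fill_diagonal(S, p)
    return S

S = np.array([[ 0.0, -1.0, -4.0],
              [-1.0,  0.0, -9.0],
              [-4.0, -9.0,  0.0]])
Sp = set_preferences(S)    # off-diagonal median here is -4.0
```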

There are two kinds of message exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The “responsibility” r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B).

数据点之间交换两种消息,每一种都考虑一种不同的竞争。可以在任何阶段将这些消息组合起来,以决定哪些点是exemplars,以及每个其他点属于哪个exemplar。从数据点i发送到候选exemplar点k的“responsibility” r(i,k),反映了在考虑点i的其他潜在exemplars的情况下,点k适合作为点i的exemplar的累积证据(图1B)。

The “availability” a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: a(i,k) = 0. Then, the responsibilities are computed using the rule

$$r(i,k) \leftarrow s(i,k) - \max_{k'\ \mathrm{s.t.}\ k'\neq k}\left\{a(i,k') + s(i,k')\right\} \qquad (1)$$

从候选exemplar点k发送到点i的“availability” a(i,k),反映了在考虑其他点对“点k应成为exemplar”的支持的情况下,点i选择点k作为其exemplar的合适程度的累积证据(图1C)。r(i,k)和a(i,k)可以看作对数概率比。算法开始时,availability被初始化为零:a(i,k) = 0;然后用规则(1)计算responsibility。

In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar, minus the largest of the similarities between point i and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar.

在第一次迭代中,由于availability为零,r(i,k)被设置为点i与点k(作为其exemplar)之间的输入相似度,减去点i与其他候选exemplars之间相似度的最大值。这种竞争性的更新是数据驱动的,并不考虑有多少其他点支持每个候选exemplar。
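规则(1)的responsibility更新可以向量化地示意如下(基于NumPy的自拟草图,update_responsibility 并非论文原代码):

```python
import numpy as np

def update_responsibility(S, A):
    """One responsibility sweep: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]."""
    AS = A + S
    n = len(S)
    idx = np.arange(n)
    first = AS.argmax(axis=1)            # index of each row's largest value
    max1 = AS[idx, first].copy()
    AS[idx, first] = -np.inf
    max2 = AS.max(axis=1)                # runner-up, used where k itself is the maximum
    R = S - max1[:, None]
    R[idx, first] = S[idx, first] - max2
    return R

# first iteration: availabilities are all zero (diagonal of S holds the preferences)
S = np.array([[-4.0, -1.0, -9.0],
              [-1.0, -4.0, -9.0],
              [-9.0, -9.0, -4.0]])
R = update_responsibility(S, np.zeros_like(S))
```

每行的最大值在减去时必须排除列k本身,因此对最大值所在的列改用次大值。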

In later iterations,when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below.

在以后的迭代中,当一些点被有效地分配给其他样本时,它们的可用性将下降到零以下,如下面的更新规则所规定的。

These negative availabilities will decrease the effective values of some of the input similarities s(i,k′) in the above rule, removing the corresponding candidate exemplars from competition. For k = i, the responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point i and all other candidate exemplars.

这些负的availability将降低上述规则中一些输入相似度s(i,k′)的有效值,从而将相应的候选exemplars移出竞争。对于k = i,responsibility r(k,k)被设置为点k被选为exemplar的输入偏好值s(k,k),减去点i与所有其他候选exemplars之间相似度的最大值。

This “self-responsibility” reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.

这种“自我责任”(self-responsibility)反映了点k是一个exemplar的累积证据:它基于点k的输入偏好值,并由点k被分配给其他exemplar的不适合程度加以调节。

Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:
$$a(i,k) \leftarrow \min\left\{0,\; r(k,k) + \sum_{i'\ \mathrm{s.t.}\ i'\notin\{i,k\}} \max\{0, r(i',k)\}\right\} \qquad (2)$$
尽管上述责任更新让所有候选样本竞争数据点的所有权,但以下可用性更新从数据点收集证据,以确定每个候选样本是否会成为好样本:

The availability a(i,k) is set to the self responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar.

availability a(i,k)被设置为自我责任r(k,k)加上候选exemplar k从其他点收到的正responsibility之和。只累加传入responsibility中正的部分,因为一个好的exemplar只需要很好地解释某些数据点(正的responsibility),而不管它对其他数据点解释得多差(负的responsibility)。如果自我责任r(k,k)为负(表明点k当前更适合隶属于另一个exemplar,而不是自己作为exemplar),那么当某些其他点对以点k作为它们的exemplar持有正的responsibility时,点k作为exemplar的availability仍可以增加。

To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” a(k,k) is updated differently:
$$a(k,k) \leftarrow \sum_{i'\ \mathrm{s.t.}\ i'\neq k} \max\{0, r(i',k)\} \qquad (3)$$
This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.

该消息反映了基于从其他点发送给候选样本k的积极责任,点k是样本的累积证据。
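规则(2)和(3)的availability更新可以示意如下(同样是基于NumPy的自拟草图,update_availability 并非论文原代码):

```python
import numpy as np

def update_availability(R):
    """One availability sweep (Eqs. 2 and 3):
    a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
    a(k,k) = sum_{i' != k} max(0, r(i',k))
    """
    Rp = np.maximum(R, 0.0)
    np.fill_diagonal(Rp, 0.0)              # the sums exclude r(k,k) itself
    col = Rp.sum(axis=0)                   # sum over i' != k of max(0, r(i',k))
    # subtract the i' = i term (Rp) so the sum skips both i and k
    A = np.minimum(0.0, R.diagonal()[None, :] + (col[None, :] - Rp))
    np.fill_diagonal(A, col)               # self-availability, Eq. 3
    return A

R = np.array([[2.0, -1.0],
              [3.0, -2.0]])
A = update_availability(R)
```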

The above update rules require only simple, local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar if k = i, or identifies the data point that is the exemplar for point i.

上述更新规则只需要简单且易于实现的局部计算(2),并且消息只需要在具有已知相似度的点对之间交换。在亲和传播过程中的任何时刻,都可以将availability和responsibility结合起来识别exemplars:对于点i,使a(i,k) + r(i,k)最大的k值,要么在k = i时将点i标识为一个exemplar,要么标识出作为点i的exemplar的那个数据点。
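把a(i,k) + r(i,k)组合起来读出exemplars和归属关系的一种示意写法(assignments 为自拟名称,仅为草图):

```python
import numpy as np

def assignments(A, R):
    """Combine messages: for each point i, the k maximizing a(i,k) + r(i,k).
    k == i marks point i as an exemplar."""
    c = (A + R).argmax(axis=1)
    exemplars = np.flatnonzero(c == np.arange(len(c)))
    return c, exemplars

A = np.array([[1.0,  0.0],
              [0.0, -5.0]])
R = np.array([[2.0,  0.0],
              [0.0,  1.0]])
c, exemplars = assignments(A, R)
# point 0 chooses itself (exemplar); point 1 chooses point 0
```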

The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages,it is important that they be damped to avoid numerical oscillations that arise in some circumstances.

消息传递过程可以在固定次数的迭代之后、在消息中的变化低于阈值之后、或者在局部决策在一定次数的迭代中保持不变之后终止。当更新消息时,重要的是对它们进行阻尼,以避免在某些情况下出现数值振荡。

Each message is set to l times its value from the previous iteration plus 1 – l times its prescribed updated value, where the damping factor l is between 0 and 1. In all of our experiments (3), we used a default damping factor of l = 0.5, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.

每条消息被设置为其上一次迭代值的l倍,加上其规定更新值的(1 − l)倍,其中阻尼因子l介于0和1之间。在我们所有的实验(3)中,我们使用默认阻尼因子l = 0.5,并且亲和传播的每次迭代包括:(i)在给定availability的情况下更新所有responsibility;(ii)在给定responsibility的情况下更新所有availability;(iii)将availability和responsibility结合起来监控exemplar决策,并在这些决策连续10次迭代不变时终止算法。
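把上述三个步骤加上阻尼组合成完整迭代,可以写成如下草图(affinity_propagation 为自拟的最小实现,只为说明流程,并非论文作者的代码):

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200, conv_iter=10):
    """Minimal affinity-propagation sketch: damped message passing that stops
    once the exemplar decisions stay constant for `conv_iter` iterations.
    S must already carry the preferences s(k,k) on its diagonal."""
    n = len(S)
    idx = np.arange(n)
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    prev, stable = None, 0
    for _ in range(max_iter):
        # (i) responsibilities given availabilities (Eq. 1), damped
        AS = A + S
        first = AS.argmax(axis=1)
        max1 = AS[idx, first].copy()
        AS[idx, first] = -np.inf
        max2 = AS.max(axis=1)
        Rnew = S - max1[:, None]
        Rnew[idx, first] = S[idx, first] - max2
        R = damping * R + (1.0 - damping) * Rnew
        # (ii) availabilities given responsibilities (Eqs. 2 and 3), damped
        Rp = np.maximum(R, 0.0)
        np.fill_diagonal(Rp, 0.0)
        col = Rp.sum(axis=0)
        Anew = np.minimum(0.0, R.diagonal()[None, :] + (col[None, :] - Rp))
        np.fill_diagonal(Anew, col)
        A = damping * A + (1.0 - damping) * Anew
        # (iii) monitor the exemplar decisions
        c = (A + R).argmax(axis=1)
        stable = stable + 1 if prev is not None and np.array_equal(c, prev) else 0
        prev = c
        if stable >= conv_iter:
            break
    exemplars = np.flatnonzero(c == idx)
    if len(exemplars) == 0:                    # degenerate: return raw decisions
        return c, exemplars
    labels = exemplars[S[:, exemplars].argmax(axis=1)]
    labels[exemplars] = exemplars              # exemplars label themselves
    return labels, exemplars

# two tight, well-separated pairs of 2-D points
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.3, 5.0]])
sq = np.sum(X**2, axis=1)
S = -(sq[:, None] + sq[None, :] - 2.0 * X @ X.T)   # negative squared distance
off = S[~np.eye(len(S), dtype=bool)]
np.fill_diagonal(S, np.median(off))                # shared preference = median
labels, exemplars = affinity_propagation(S)
```

在这组构造数据上,中位数偏好应当给出两个簇,每个紧凑点对各归一个exemplar。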

Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity.

图1A显示了应用于25个二维数据点(3)的相似性传播的动力学,使用负平方误差作为相似性。

One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the message passing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar.

亲和传播的一个优点是不需要事先指定exemplars的数量。相反,合适的exemplar数目从消息传递过程中自然产生,并取决于输入的exemplar偏好值。这使得基于“每个点作为exemplar的优选程度”的先验设定进行自动模型选择成为可能。

Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).

图1D显示了共同输入偏好值对簇数量的影响。这种关系与通过精确最小化平方误差(2)得到的关系几乎相同。
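顺带一提,scikit-learn 中有现成的 AffinityPropagation 实现;假设环境中装有 scikit-learn,可以这样观察 preference 取值对簇数的影响(数据为随手构造的两组点,数值仅作演示):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# two tight, well-separated groups of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])

# a very negative shared preference -> few clusters
low = AffinityPropagation(preference=-50.0, random_state=0).fit(X)
# a preference close to the within-group similarities -> more clusters
high = AffinityPropagation(preference=-0.001, random_state=0).fit(X)

n_low = len(low.cluster_centers_indices_)
n_high = len(high.cluster_centers_indices_)
```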

We next studied the problem of clustering images of faces using the standard optimization criterion of squared error.

接下来,我们研究了使用平方误差这一标准优化准则对人脸图像进行聚类的问题。

We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3).

我们使用相似性传播和k中心聚类来识别从Olivetti人脸数据库中提取的900幅灰度图像中的样本(3)。

Affinity propagation found exemplars with much lower squared error than the best of 100 runs of k-centers clustering (Fig. 2A), which took about the same amount of computer time.

亲和传播找到的exemplars的平方误差远低于100次k-centers聚类运行中的最佳结果(图2A),而两者花费的计算机时间大致相同。

We asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of k-centers clustering, plotted against the number of clusters.

我们询问大量随机重启k中心聚类是否能达到同样的平方误差。图2B显示了一次相似性传播产生的误差,以及10,000次k中心聚类产生的误差分布,并与聚类数进行了对比。

Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).

亲和传播一致地以低得多的误差完成聚类,所用时间少了两个数量级以上。另一个流行的优化准则是绝对像素差之和(它能更好地容忍离群的像素强度),因此我们使用这一误差度量重复了上述过程。亲和传播再次一致地取得了更低的误差(图2C)。


Fig.2. 人脸聚类。从900幅归一化人脸图像(3)中识别出使标准平方误差度量最小化的exemplars。在共同偏好值为−600时,亲和传播找到62个簇,平均平方误差为108;作为比较,100次不同随机初始化的k-centers聚类的最佳结果只达到较差的平均平方误差119。(A)顶行显示误差最高的15幅图像;中间行和底行分别显示两种方法分配的exemplars,方框标出两种方法中哪一种在平方误差上对该图像表现更好。亲和传播找到了质量更高的exemplars。(B)一次亲和传播与10,000次k-centers聚类所达到的平均平方误差,相对于簇数绘制。彩色条带显示平方误差的不同百分位数,数字给出与(A)结果对应的exemplar数。(C)使用绝对误差之和(另一种流行的优化准则)作为相似性度量,重复上述过程。

Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative.

许多任务需要在稀疏相关的数据中识别exemplars,即大多数相似度要么未知,要么是很大的负值。

To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons(data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “nonexon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of −∞, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between point i and k if s(i,k) = −∞, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27% or 15 million) of data point pairs.

为了在这种情形下考察亲和传播,我们使用从微阵列数据导出并在(4)中报告的稀疏相似度矩阵,处理了对推定外显子聚类以发现基因的任务。在那项工作中,从小鼠1号染色体的基因组中挖掘出75,066个对应于推定外显子的DNA片段(60个碱基长)。在12个组织样本中测量了它们的转录水平,并计算了每对推定外显子(数据点)之间的相似度。推定外显子之间相似度的度量基于它们在基因组中的邻近程度以及它们在12个组织中转录水平的协同程度。为了解释那些并非外显子的推定外显子(如内含子),我们加入了一个额外的人工exemplar,并使用在整个数据集上统计得到的量确定其他每个数据点与这个“非外显子exemplar”的相似度。所得的75,067 × 75,067相似度矩阵(3)中有99.73%的相似度取值为−∞,对应于不可能属于同一基因的相距很远的DNA片段。我们将亲和传播应用于这个相似度矩阵;由于当s(i,k) = −∞时点i和k之间无需交换消息,亲和传播的每次迭代只需要在数据点对的一个极小子集(0.27%,约1500万对)之间交换消息。

Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the nonexon exemplar. The reconstruction errors for affinity propagation and k centers clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took 6 min, whereas k-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the true positive (TP) rate against the false-positive (FP)rate, using the labels provided in the Ref Seq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At a FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best k-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.

图3A展示了基因簇的识别以及一些数据点被分配给非外显子exemplar的情况。图3B比较了亲和传播和k-centers聚类的重构误差。对于每一个簇数,亲和传播只运行一次,耗时6分钟;而k-centers聚类运行10,000次,耗时208小时。为了考察这些方法在检测真实基因片段方面的表现,图3C使用RefSeq数据库(5)提供的标注,绘制了真阳性(TP)率与假阳性(FP)率的关系。亲和传播取得了显著更高的TP率,尤其是在对生物学家最重要的低FP率区间。在FP率为3%时,亲和传播的TP率达到39%,而最好的k-centers聚类结果只有17%。作为比较,在相同的FP率下,层次凝聚聚类(2)的最佳TP率为19%,而(4)中描述的、利用了额外生物学知识的工程工具达到了43%的TP率。


Fig.3. 基因检测。亲和传播被用于检测组成小鼠1号染色体基因的推定外显子(数据点)。这里,平方误差不适合作为相似度的度量;相似度值来自一个代价函数,该函数衡量推定外显子在基因组中的邻近程度以及它们在12个组织样本(3)中的共表达程度。(A)显示了数据的一小部分以及亲和传播各次迭代中簇的产生过程。每幅图中,以黑色框出的100个方框对应100个数据点(来自75,066个推定外显子),每个方框中的12个色块表示相应DNA片段在12个组织样本中的转录水平。最左边的方框对应一个偏好值为无穷大的人工数据点,用于解释非外显子区域(如内含子)。连接数据点的线表示潜在的分配:灰线表示当前证据较弱的分配,实线表示当前证据较强的分配。(B)在不同的簇数下,最小化重构误差的性能比较。对于每个簇数,亲和传播耗时6分钟,而10,000次k-centers聚类在同一台计算机上耗时208小时。在每种情况下,亲和传播的重构误差都显著低于k-centers聚类。(C)外显子检测的真阳性率对假阳性率曲线[使用来自RefSeq(5)的标注]表明,在检测经生物学验证的外显子方面,亲和传播同样优于k-centers聚类。

Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metric-space clustering techniques such as k-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., s(i,k) ≠ s(k,i)] and to problems where the similarities do not satisfy the triangle inequality [i.e., s(i,k) < s(i,j) + s(j,k)].

亲和传播能够基于非标准的优化准则运行,这使它适合于使用非常规相似性度量的探索性数据分析。与k-means聚类(1)等度量空间聚类技术不同,亲和传播可以应用于数据不位于连续空间的问题。实际上,它可以应用于相似度不对称的问题[即s(i,k) ≠ s(k,i)],以及相似度不满足三角不等式的问题[即s(i,k) < s(i,j) + s(j,k)]。

To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence i to sentence k based on the cost of encoding the words in sentence i using the words in sentence k. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using l = 0.8) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.

为了在这份手稿的草稿中找出少量能概括其他句子的句子,我们将每个句子视为“词袋”(bag of words)(6),并根据用句子k中的词对句子i中的词进行编码的代价,计算句子i对句子k的相似度。我们发现所得相似度中有97%(2, 3)是不对称的。通过调整偏好值(使用l = 0.8)可以识别出不同数量的代表性例句(2);包含四个句子的解如图4A所示。

We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated, because the trip from i to k included a long stopover delay in city j so it took longer than the sum of the durations of the trips from i to j and j to k.

我们还应用亲和传播探索了这样一个问题:根据估计的商业航空旅行时间,识别最容易被大量其他城市到达的少数加拿大和美国城市。每个数据点是一个城市,相似度s(i,k)被设置为乘飞机从城市i到城市k所需时间(包括估计的中转延误(3))的负值。由于逆风,运输时间在许多情况下因飞行方向而异,因此36%的相似度是不对称的。此外,对于97%的城市对i和k,存在第三个城市j使三角不等式被违反:从i到k的行程包含在城市j的长时间中转停留,其耗时比从i到j和从j到k的行程时间之和还要长。

When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).

当“最容易到达的城市”的数量被限制为7个时(通过适当调整输入偏好),识别出图4B至E所示的城市。有趣的是,几个主要城市没有被选中:要么是繁重的国际旅行使它们不适合作为容易到达的国内目的地(如纽约、洛杉矶),要么是它们的周边地区可以通过其他目的地更高效地到达(例如,亚特兰大、费城和明尼阿波利斯覆盖了芝加哥的目的地,同时避免了潜在的机场延误)。

Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of N hidden labels, c_1,…,c_N, corresponding to the N data points. Each label indicates the exemplar to which the point belongs, so that s(i,c_i) is the similarity of data point i to its exemplar. c_i = i is a special case indicating that point i is itself an exemplar, so that s(i,c_i) is the input preference for point i. Not all configurations of the labels are valid; a configuration c is valid when for every point i, if some other point i′ has chosen i as its exemplar (i.e., c_i′ = i), then i must be an exemplar (i.e., c_i = i). The energy of a valid configuration is $E(c) = -\sum_{i=1}^{N} s(i,c_i)$. Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraints on the labels and the energy function (2).

亲和传播可以被看作一种搜索能量函数(7)最小值的方法,该能量函数依赖于与N个数据点对应的一组N个隐藏标签c_1,…,c_N。每个标签指示该点所属的exemplar,因此s(i,c_i)是数据点i与其exemplar的相似度。c_i = i是一种特殊情况,表示点i本身是一个exemplar,此时s(i,c_i)是点i的输入偏好值。并非所有标签配置都有效:当对于每个点i,如果某个其他点i′选择了i作为其exemplar(即c_i′ = i),则i必须是一个exemplar(即c_i = i)时,配置c才是有效的。一个有效配置的能量为E(c) = −Σ_{i=1}^{N} s(i,c_i)。精确地最小化该能量在计算上是不可行的,因为这个最小化问题的一个特例就是NP困难的k-median问题(8)。然而,亲和传播的更新规则对应于最小化Bethe自由能(9)近似的不动点递归。亲和传播最容易的推导方式,是将其视为因子图(10)上max-sum算法的一个实例,该因子图描述了标签上的约束和能量函数(2)。
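有效配置的能量 E(c) = −Σᵢ s(i, cᵢ) 及其有效性检查可以示意如下(energy 为自拟函数,仅作演示):

```python
import numpy as np

def energy(S, c):
    """E(c) = -sum_i s(i, c_i) for a valid configuration, +inf for an invalid one.
    Valid means: every point chosen as an exemplar labels itself (c[k] == k)."""
    c = np.asarray(c)
    if any(c[k] != k for k in c):      # some chosen exemplar does not label itself
        return np.inf
    return -S[np.arange(len(c)), c].sum()

S = np.array([[-1.0, -2.0],
              [-3.0, -1.0]])
# c = [1, 0] is invalid: point 0 picks exemplar 1, but c[1] != 1
```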

In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if s(1,2) = s(2,1) and s(1,1) = s(2,2), then the solutions c_1 = c_2 = 1 and c_1 = c_2 = 2 both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and non-exemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations, or by increasing the damping factor.

在一些退化的情况下,能量函数可能有多个最小值,对应于更新规则的多个不动点,这些可能阻碍收敛。例如,如果s(1,2) = s(2,1)且s(1,1) = s(2,2),那么解c_1 = c_2 = 1和c_1 = c_2 = 2取得相同的能量。在这种情况下,亲和传播可能会振荡,两个数据点在exemplar与非exemplar之间交替。在实践中,我们发现,通过向相似度中添加微量噪声以防止退化情形,或通过增大阻尼因子,总是可以避免振荡。
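向相似度中加入微量噪声以打破这类对称退化,可以这样示意(jitter 为自拟名称,噪声幅度的选取只是一个假设):

```python
import numpy as np

def jitter(S, scale=1e-12, seed=0):
    """Add a tiny perturbation to the similarities to break exact ties that can
    otherwise make the messages oscillate between symmetric solutions."""
    rng = np.random.default_rng(seed)
    return S + scale * np.abs(S).max() * rng.standard_normal(S.shape)

# a degenerate, perfectly symmetric similarity matrix
S = np.array([[ 0.0, -1.0],
              [-1.0,  0.0]])
J = jitter(S)   # ties broken, values essentially unchanged
```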

Affinity propagation has several advantages over related techniques. Methods such as k-centers clustering (1), k-means clustering(1), and the expectation maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques (13) randomly search for good solutions, but do not share affinity propagation’s advantage of considering many possible solutions all at once.

亲和传播相对于相关技术有几个优点。诸如k-centers聚类(1)、k-means聚类(1)和期望最大化(EM)算法(11)等方法,在每一步只存储一组相对较小的估计聚类中心。从大量簇开始、然后对其进行删减的方法(12)改进了这些技术,但它们仍然依赖随机采样,并做出无法挽回的硬性删减决定。相比之下,通过同时把所有数据点视为候选中心并逐渐识别簇,亲和传播能够避免许多由不走运的初始化和硬性决定导致的劣解。马尔可夫链蒙特卡罗技术(13)随机地搜索好的解,但不具备亲和传播同时考虑许多可能解的优势。

Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.

层次凝聚聚类(14)和谱聚类(15)解决的是一个颇为不同的问题:递归地比较成对的点以找到数据的划分。这些技术不要求一个簇内的所有点都与单一中心相似,因此并不适合许多任务。特别是,不应属于同一簇的两个点,可能会因一系列不走运的成对合并而被归为一组。

In (8), it was shown that the related metric k-median problem could be relaxed to form a linear program with a constant factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).

(8)中证明了相关的度量k-median问题可以被松弛为一个具有常数因子近似保证的线性规划。在那里,输入被假定是度量的,即非负、对称且满足三角不等式。相比之下,亲和传播可以接受一般的非度量相似度作为输入。亲和传播还提供了一种在实践中表现良好的、概念上全新的方法。线性规划松弛难以求解,需要使用复杂的软件包(如CPLEX);而亲和传播利用直观的消息更新,只需几行代码即可实现(2)。

Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages(17) in a “loopy graph” has been used to approach Shannon’s limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard two dimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.

亲和传播在精神上与最近在许多不同学科中取得破纪录成果的技术相关(16)。在“有环图”中递归传播消息(17)的方法已被用于逼近纠错解码中的香农极限(18, 19)、求解规模大一个数量级的随机可满足性问题(20)、求解NP困难的二维相位展开问题(21)的实例,以及从成对立体图像中高效估计深度(22)。然而,据我们所知,亲和传播是第一个利用这一思想来解决数据聚类这一古老而基本的问题的方法。由于它的简单性、普遍适用性和性能,我们相信亲和传播将在科学和工程中被证明具有广泛的价值。



Fig.4. 识别关键句子与空中旅行路线。亲和传播可以基于非标准优化准则来识别exemplars。(A)根据匹配的词构造这份手稿草稿中各句子之间的相似度。通过亲和传播,识别出四个例句。(B)将亲和传播应用于从加拿大和美国456个最繁忙的商业机场之间的空中旅行效率(以估计旅行时间衡量)导出的相似度——直达航班(以蓝色显示)和间接航班(未显示,包括最多一次中转的平均换乘时间)的旅行时间被用作负的相似度(3)。(C)亲和传播识别出的七个exemplars以不同颜色编码,并显示了其他城市对这些exemplars的分配。位于某个exemplar城市附近的城市,可能由于彼此缺乏直达航班而隶属于其他更远的exemplar(例如,大西洋城距费城约100公里,但在旅行时间上离亚特兰大“更近”)。(D)插图表明,由于国内航班多于国际航班,加拿大—美国边界大致分隔了多伦多簇和费城簇。(E)然而在西海岸并非如此:温哥华与西雅图之间频繁的额外航空服务将加拿大西北部的城市与西雅图连接在一起。

初版,多有不足,部分未达要求,改进中…
