AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.
Affinity Propagation is attractive because it chooses the number of clusters from the data instead of requiring it up front. Two parameters matter most: the preference, which controls how many exemplars are used (a larger preference yields more exemplars), and the damping factor, which damps the responsibility and availability messages to avoid numerical oscillations when they are updated.
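The effect of the preference can be checked with a short experiment. The sketch below reuses the synthetic blobs from the example later in this post and fits the model with two preference values. The value -5 is a hypothetical choice for comparison, and random_state is passed on the assumption that scikit-learn 0.23 or later is installed.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Same synthetic data as the worked example later in this post.
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                  random_state=0)

counts = {}
for preference in (-50, -5):  # -5 is a hypothetical value chosen for contrast
    # damping must lie in [0.5, 1); 0.5 is the scikit-learn default
    model = AffinityPropagation(preference=preference, damping=0.5,
                                random_state=0).fit(X)
    counts[preference] = len(model.cluster_centers_indices_)
    print("preference=%d -> %d exemplars" % (preference, counts[preference]))
```

If a run reports a convergence warning, raising damping towards 1 or increasing max_iter is the usual first thing to try.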
Algorithm
The messages sent between points belong to one of two categories.
The first is the responsibility 𝑟(𝑖, 𝑘), which is the accumulated evidence that sample 𝑘 should be the exemplar for sample 𝑖.
The second is the availability 𝑎(𝑖, 𝑘), which is the accumulated evidence that sample 𝑖 should choose sample 𝑘 as its exemplar, taking into account the support from all other samples for 𝑘 being an exemplar.
In this way, exemplars are chosen by samples if they are (1) similar enough to many samples and (2) chosen by many samples to be representative of themselves.
More formally, the responsibility of a sample 𝑘 to be the exemplar of sample 𝑖 is given by:
$$r(i, k) \leftarrow s(i, k) - \max [ a(i, k') + s(i, k') \forall k' \neq k ]$$
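The responsibility update can be written as a vectorized NumPy function. This is a minimal sketch rather than the scikit-learn implementation: the name update_responsibility, the three sample points and the preference of -1.0 on the diagonal are all hypothetical. The trick is that the maximum over k' ≠ k equals the row maximum for every column except the one holding that maximum, where the second-largest value is used instead.

```python
import numpy as np


def update_responsibility(S, A):
    """Return r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]."""
    n = S.shape[0]
    rows = np.arange(n)
    AS = A + S                       # a(i, k') + s(i, k') for every pair
    best = AS.argmax(axis=1)         # column of the largest value in each row
    first = AS[rows, best]           # largest value per row
    AS[rows, best] = -np.inf         # mask it to find the runner-up
    second = AS.max(axis=1)          # second-largest value per row
    R = S - first[:, None]           # valid wherever k is not the row's best column
    R[rows, best] = S[rows, best] - second
    return R


# Tiny illustration: similarity is the negative squared Euclidean distance.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(S, -1.0)            # hypothetical preference s(k, k)
A = np.zeros_like(S)                 # availabilities start at zero
print(update_responsibility(S, A))
```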
Where 𝑠(𝑖, 𝑘) is the similarity between samples 𝑖 and 𝑘. The availability of sample 𝑘 to be the exemplar of sample 𝑖 is given by:
$$a(i, k) \leftarrow \min [0, r(k, k) + \sum_{i'~s.t.~i' \notin \{i, k\}}{r(i', k)}]$$
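The availability update can be sketched in the same style. One assumption needs stating clearly: this sketch follows the Frey and Dueck (2007) formulation, in which only positive responsibilities, max(0, r(i', k)), enter the sum, and the self-availability a(k, k) is that sum without the min with zero. That is slightly more detailed than the compact formula above. The function name and the sample matrix are hypothetical.

```python
import numpy as np


def update_availability(R):
    """Availability update, assuming the Frey and Dueck (2007) formulation."""
    Rp = np.maximum(R, 0)                 # keep only positive responsibilities
    np.fill_diagonal(Rp, np.diag(R))      # r(k, k) enters the sum unclipped
    col_sums = Rp.sum(axis=0)             # r(k, k) + sum over i' != k
    A = col_sums[None, :] - Rp            # drop sample i's own contribution
    self_avail = np.diag(A).copy()        # a(k, k): sum over i' != k, no clipping
    A = np.minimum(A, 0)                  # a(i, k) <= 0 for i != k
    np.fill_diagonal(A, self_avail)
    return A


# Hypothetical responsibilities for three samples.
R = np.array([[ 0.5, -1.0, -3.0],
              [ 0.8, -0.2, -2.0],
              [-4.0, -3.5,  1.0]])
print(update_availability(R))
```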
To begin with, all values for 𝑟 and 𝑎 are set to zero, and the two updates are iterated until convergence. As discussed above, to avoid numerical oscillations when updating the messages, a damping factor 𝜆 is introduced into the iteration:
$$r_{t+1}(i, k) = \lambda\cdot r_{t}(i, k) + (1-\lambda)\cdot r_{t+1}(i, k)$$
$$a_{t+1}(i, k) = \lambda\cdot a_{t}(i, k) + (1-\lambda)\cdot a_{t+1}(i, k)$$
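where 𝑡 is the iteration number. On the right-hand side, the terms with index 𝑡+1 are the newly computed, undamped values; the left-hand side is the damped value carried into the next iteration.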
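Putting the pieces together, the whole iteration fits in a few lines of NumPy. This is a sketch under stated assumptions, not the scikit-learn code: it runs a fixed number of iterations instead of testing for convergence, it adds no noise to break ties, it sums only positive responsibilities as in Frey and Dueck (2007), and it takes as exemplars the samples with r(k, k) + a(k, k) > 0. The function name and the test data are hypothetical.

```python
import numpy as np


def affinity_propagation_sketch(S, damping=0.5, n_iter=200):
    """Damped message passing on a similarity matrix S (preferences on the diagonal)."""
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))                  # responsibilities start at zero
    A = np.zeros((n, n))                  # availabilities start at zero
    for _ in range(n_iter):
        # Responsibility: s(i, k) minus the best competing a(i, k') + s(i, k').
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new      # damped update
        # Availability: self-responsibility plus positive support from others.
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = np.diag(A_new).copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, self_avail)
        A = damping * A + (1 - damping) * A_new      # damped update
    exemplars = np.flatnonzero(np.diag(A) + np.diag(R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)             # no exemplar emerged
    labels = S[:, exemplars].argmax(axis=1)          # most similar exemplar
    labels[exemplars] = np.arange(exemplars.size)    # exemplars label themselves
    return exemplars, labels


# Two tight, well-separated groups of hypothetical points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(S, np.median(S))   # median similarity as the shared preference
exemplars, labels = affinity_propagation_sketch(S)
print(exemplars, labels)
```

With damping=0.5, each new message is the average of the previous value and the freshly computed one. Values closer to 1 change the messages more slowly, which suppresses oscillation at the cost of more iterations.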
Implementation
Example code 1
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)
# X.shape == (300, 2): coordinates of the 300 samples
# labels_true.shape == (300,): true class of each sample in X, with values 0, 1, 2
# Affinity Propagation is unsupervised, so the true labels are not used for
# fitting; they are only used to evaluate the result below
# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
# af is the fitted estimator object; the clustering result is stored on it
# Coordinates of the cluster centers:
print('cluster centers: ', af.cluster_centers_)
# cluster centers:  [[ 1.03325861  1.15123595]
#  [ 0.93494652 -0.95302339]
#  [-1.18459092 -1.11968959]]
# Indices in X of the samples chosen as cluster centers
cluster_centers_indices = af.cluster_centers_indices_
# array([160, 250, 272], dtype=int64)
# Label the algorithm assigned to each sample in X
labels = af.labels_
# labels.shape == (300,)
# Number of clusters, i.e. how many groups the algorithm split X into
n_clusters_ = len(cluster_centers_indices)
# Evaluate the quality of the clustering
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels, metric='sqeuclidean'))
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Output:
Estimated number of clusters: 3
Homogeneity: 0.872
Completeness: 0.872
V-measure: 0.872
Adjusted Rand Index: 0.912
Adjusted Mutual Information: 0.871
Silhouette Coefficient: 0.753
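Resulting plot (title "Estimated number of clusters: 3"): each sample is drawn in the color of its cluster and joined by a line to its exemplar.
>>> af
AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
                    damping=0.5, max_iter=200, preference=-50, verbose=False)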
Example code 2
>>> from sklearn.cluster import AffinityPropagation
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
>>> clustering = AffinityPropagation(preference=-3).fit(X)
>>> clustering
AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
                    damping=0.5, max_iter=200, preference=-3, verbose=False)
>>> clustering.labels_
array([0, 0, 0, 1, 1, 1])
>>> # predict: assign new samples to one of the clusters found
>>> clustering.predict([[0, 0], [4, 4]])
array([0, 1])
>>> # get_params: show the parameters of clustering
>>> clustering.get_params()
{'affinity': 'euclidean',
 'convergence_iter': 15,
 'copy': True,
 'damping': 0.5,
 'max_iter': 200,
 'preference': -3,
 'verbose': False}
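The predict call in the second example can be understood as nearest-exemplar assignment: with the default Euclidean affinity, each new sample receives the label of the closest cluster center. The sketch below checks this by hand on the blob data from the first example. The new points are hypothetical, and random_state assumes scikit-learn 0.23 or later.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                  random_state=0)
model = AffinityPropagation(preference=-50, random_state=0).fit(X)

# Hypothetical new samples to classify.
new_points = np.array([[0.0, 0.0], [1.2, 0.9], [-1.0, -1.5]])

# Euclidean distance from every new point to every exemplar.
diffs = new_points[:, None, :] - model.cluster_centers_[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)
nearest = dists.argmin(axis=1)           # index of the closest exemplar

print("manual: ", nearest)
print("predict:", model.predict(new_points))
```

The two printed arrays should agree, which also explains why predict is only meaningful once the model has converged to at least one exemplar.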