This post lists the main formulas; it works best when read alongside the explanation video.
Explanation video
Similarity and Dissimilarity Coefficients: Definitions
We start with how the two coefficients are defined. Similarity and dissimilarity coefficients measure how alike, or how different, two data are.
A typical dissimilarity coefficient is the Minkowski metric:
$$d_{i,j}=\bigg[\sum_{k=1}^m\lvert a_{ki}-a_{kj}\rvert^q\bigg]^{\frac{1}{q}}$$
where $q>0$. Here $a_{ki}$ is read as the value of the $k$-th of the $m$ attributes of datum $i$.
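To make the metric concrete, here is a minimal Python sketch. It assumes the data are stored as a list of rows `a`, with `a[k][i]` holding attribute $k$ of datum $i$ (0-based indices); the function name `minkowski` and the sample values are made up for illustration. It takes the absolute value of each difference so that odd $q$ also yields a non-negative distance.

```python
def minkowski(a, i, j, q):
    """Minkowski distance between data i and j; a[k][i] is attribute k of datum i."""
    if q <= 0:
        raise ValueError("q must be positive")
    return sum(abs(row[i] - row[j]) ** q for row in a) ** (1.0 / q)


# Hypothetical data: 3 attributes (rows), 2 data (columns).
a = [[1, 4],
     [2, 6],
     [3, 3]]

print(minkowski(a, 0, 1, 1))  # q = 1 (Manhattan): 3 + 4 + 0 = 7.0
print(minkowski(a, 0, 1, 2))  # q = 2 (Euclidean): sqrt(9 + 16 + 0) = 5.0
```

With $q=1$ the metric reduces to the Manhattan distance and with $q=2$ to the Euclidean distance.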
The similarity coefficient is defined as
$$s_{ij}=\frac{\sum_{k=1}^{m}a_{ki}a_{kj}}{\sum_{k=1}^{m}(a_{ki}+a_{kj}-a_{ki}a_{kj})}$$
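For 0/1 data the numerator counts the attributes that both data have and the denominator counts the attributes that at least one of them has, so $s_{ij}$ is the Jaccard coefficient. The sketch below computes it under the same assumed layout, `a[k][i]` for attribute $k$ of datum $i$. The function name, the sample matrix, and the choice to return 0.0 when the denominator is zero are assumptions for illustration, not part of the definition.

```python
def similarity(a, i, j):
    """Similarity s_ij between data i and j; a[k][i] is attribute k of datum i."""
    num = sum(row[i] * row[j] for row in a)
    den = sum(row[i] + row[j] - row[i] * row[j] for row in a)
    # Assumption: two all-zero data are given similarity 0.0 to avoid dividing by zero.
    return num / den if den else 0.0


# Hypothetical binary data: 4 attributes (rows), 3 data (columns).
a = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [1, 1, 0]]

print(similarity(a, 0, 1))  # 2 shared attributes out of 4 present: 0.5
print(similarity(a, 0, 2))  # no shared attribute: 0.0
```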
Cluster Representatives: Choosing the Representatives
Method 1:
$$\{r_1,r_2\}=\arg\min_{(i,j)}s_{ij}$$
$$r_k=\arg\min_{i\in\{1,2,\dots,k-1\}}\sum_{j=1}^{k-1}s_{ir_j},\qquad k=3,4,\dots,p$$
where $r_k$ denotes the index of the datum that represents the $k$-th cluster.
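Method 1 can be sketched as a greedy selection: start from the least similar pair, then keep adding the datum whose total similarity to the representatives chosen so far is smallest. The formula as printed restricts $i$ to $\{1,2,\dots,k-1\}$; the sketch assumes the intended candidate set is every datum not yet chosen as a representative, which is an interpretation rather than something the post states. The function name and the similarity matrix `s` are hypothetical, and indices are 0-based.

```python
def pick_representatives(s, p):
    """Greedy choice of p representatives from a symmetric similarity matrix s."""
    n = len(s)
    # First two representatives: the least similar pair of data.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    r1, r2 = min(pairs, key=lambda ij: s[ij[0]][ij[1]])
    reps = [r1, r2]
    # Each further representative minimizes its summed similarity to those chosen.
    while len(reps) < p:
        candidates = [i for i in range(n) if i not in reps]  # assumed candidate set
        reps.append(min(candidates, key=lambda i: sum(s[i][r] for r in reps)))
    return reps


# Hypothetical symmetric similarity matrix for 5 data.
s = [[1.0, 0.8, 0.1, 0.2, 0.5],
     [0.8, 1.0, 0.3, 0.15, 0.4],
     [0.1, 0.3, 1.0, 0.7, 0.2],
     [0.2, 0.15, 0.7, 1.0, 0.6],
     [0.5, 0.4, 0.2, 0.6, 1.0]]

print(pick_representatives(s, 3))  # [0, 2, 4]
```

Here data 0 and 2 are the least similar pair (0.1), and datum 4 has the smallest summed similarity to them (0.5 + 0.2).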
Method 2:
$$\{r_1,r_2,\dots,r_p\}=\arg\min_{r_k,\;k\in\{1,2,\dots,p\}}\bigg\{\sum_{i=1}^{p}\sum_{j<i}s_{r_ir_j}\;\bigg|\;r_i,r_j\in\{1,2,\dots,n\}\bigg\}$$
That is, all $p$ representatives are chosen jointly so that the total pairwise similarity among them is as small as possible.
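Method 2 can be illustrated with an exhaustive search, reading the formula as "pick the $p$ distinct data whose summed pairwise similarity is smallest". That reading, the function name, and the matrix `s` are assumptions for illustration. The search examines every $p$-subset of the $n$ data, so it is only practical for small instances.

```python
from itertools import combinations


def pick_representatives_exhaustive(s, p):
    """Return the p data with the smallest total pairwise similarity."""
    def total(subset):
        return sum(s[i][j] for i, j in combinations(subset, 2))

    return list(min(combinations(range(len(s)), p), key=total))


# Hypothetical symmetric similarity matrix for 5 data.
s = [[1.0, 0.8, 0.1, 0.2, 0.5],
     [0.8, 1.0, 0.3, 0.15, 0.4],
     [0.1, 0.3, 1.0, 0.7, 0.2],
     [0.2, 0.15, 0.7, 1.0, 0.6],
     [0.5, 0.4, 0.2, 0.6, 1.0]]

print(pick_representatives_exhaustive(s, 3))  # [0, 2, 4]
```

Of the ten possible triples, (0, 2, 4) has the smallest total, 0.1 + 0.5 + 0.2.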
Linear Assignment Model
$$\text{Maximize}\quad\sum_{i=1}^n\sum_{k=1}^ps_{ir_k}x_{ik}\quad\text{or}\quad\text{minimize}\quad\sum_{i=1}^n\sum_{k=1}^pd_{ir_k}x_{ik}$$
$$\text{subject to}\quad\sum_{k=1}^{p}x_{ik}=1,\qquad i=1,2,\dots,n$$
$$\sum_{i=1}^{n}x_{ik}\le u,\qquad k=1,2,\dots,p$$
$$x_{ik}\ge0,\qquad i=1,2,\dots,n;\quad k=1,2,\dots,p$$
The first constraint assigns every datum to exactly one cluster; the second limits each cluster to at most $u$ data, matching the per-cluster upper bound $u$ used in the algorithm.
where $x_{ik}$ is a binary decision variable: $x_{ik}=1$ if datum $i$ is assigned to cluster $k$, and $0$ otherwise.
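To see what the model optimizes, the sketch below solves a tiny instance by brute force rather than with a linear programming solver. It assumes each datum is placed in exactly one cluster and each cluster holds at most $u$ data, which is how the algorithm's Step 1 and Step 5 use $u$. The function name, the matrix `s`, and the representatives `reps` are hypothetical; a tuple `assign` with `assign[i] = k` stands in for the binary variables $x_{ik}$.

```python
from itertools import product


def best_assignment(s, reps, u):
    """Maximize sum_i s[i][reps[assign[i]]] with at most u data per cluster."""
    n, p = len(s), len(reps)
    best_value, best = None, None
    for assign in product(range(p), repeat=n):  # assign[i] = cluster of datum i
        if any(assign.count(k) > u for k in range(p)):
            continue  # violates the capacity constraint
        value = sum(s[i][reps[assign[i]]] for i in range(n))
        if best_value is None or value > best_value:
            best_value, best = value, assign
    return best, best_value


# Hypothetical symmetric similarity matrix for 5 data.
s = [[1.0, 0.8, 0.1, 0.2, 0.5],
     [0.8, 1.0, 0.3, 0.15, 0.4],
     [0.1, 0.3, 1.0, 0.7, 0.2],
     [0.2, 0.15, 0.7, 1.0, 0.6],
     [0.5, 0.4, 0.2, 0.6, 1.0]]

assign, value = best_assignment(s, reps=[0, 2], u=3)
print(assign)  # (0, 0, 1, 1, 0): data 0, 1, 4 join cluster 0; data 2, 3 join cluster 1
```

Enumeration grows as $p^n$, so this is only a way to check the model on toy data; a real instance would be handed to an assignment or LP solver.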
Assignment Clustering Algorithm
Step 0: Set $I=\{i\mid i=1,2,\dots,n\}$ and $K=\{k\mid k=1,2,\dots,p\}$.
Step 1: Load the number of clusters $p$ and the upper bound $u$ on the number of data per cluster.
Step 2: Load or compute the similarity coefficients between every pair of data.
Step 3: Determine the cluster representatives using (15) and (16) (Method 1) or (17) (Method 2), then remove $r_k$ ($k=1,2,\dots,p$) from $I$.
Step 4: Determine $(v,w)=\arg\max_{i\in I,\,k\in K}s_{ir_k}$.
Step 5: If the number of data in cluster $w$ is $u$, remove $w$ from $K$ and go to Step 4; otherwise, assign datum $v$ to cluster $w$ and delete $v$ from $I$.
Step 6: If $I\ne\emptyset$, go to Step 4.
Step 7: Evaluate the clustering result using one or more performance criteria.
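Steps 3 to 6 can be sketched in a few lines of Python. The sketch takes the similarity matrix and the representatives as given, and makes two assumptions the steps do not spell out: each cluster starts out containing its own representative (so the representative counts toward $u$), and the procedure stops with an error if every cluster fills up while data remain. The function name and the sample matrix are hypothetical, and indices are 0-based.

```python
def assignment_clustering(s, reps, u):
    """Greedy assignment of data to the clusters represented by reps."""
    clusters = {k: [r] for k, r in enumerate(reps)}  # assumption: rep counts toward u
    I = [i for i in range(len(s)) if i not in reps]  # Step 3: remove representatives
    K = list(range(len(reps)))
    while I:                                         # Step 6: repeat until I is empty
        if not K:
            raise ValueError("capacity u is too small to place every datum")
        # Step 4: most similar (datum, cluster representative) pair still available.
        v, w = max(((i, k) for i in I for k in K),
                   key=lambda ik: s[ik[0]][reps[ik[1]]])
        if len(clusters[w]) >= u:                    # Step 5: cluster w is full
            K.remove(w)
        else:
            clusters[w].append(v)
            I.remove(v)
    return clusters


# Hypothetical symmetric similarity matrix for 5 data.
s = [[1.0, 0.8, 0.1, 0.2, 0.5],
     [0.8, 1.0, 0.3, 0.15, 0.4],
     [0.1, 0.3, 1.0, 0.7, 0.2],
     [0.2, 0.15, 0.7, 1.0, 0.6],
     [0.5, 0.4, 0.2, 0.6, 1.0]]

print(assignment_clustering(s, reps=[0, 2], u=3))  # {0: [0, 1, 4], 1: [2, 3]}
```

On this instance the greedy pass places datum 1 with representative 0 (similarity 0.8), datum 3 with representative 2 (0.7), and datum 4 with representative 0 (0.5).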