Algorithm description
Input sample set: $X=\{x_1, x_2, ..., x_n\}$
Number of clusters: $k$
Output partition: $C=\{C_1, C_2, ..., C_k\}$
Maximum number of iterations: $N$
- Randomly choose $k$ samples from the dataset $X$ as the initial centroids $\{\mu_1, \mu_2, ..., \mu_k\}$.
- For each iteration $1, 2, ..., N$:
a) Initialize every cluster to the empty set: $C_t=\varnothing,\ t=1,2,...,k$.
b) For $i=1,2,...,n$, compute the squared Euclidean distance between sample $x_i$ and each centroid $\mu_j$:
$d_{i->j}=||x_i-\mu_j||^2$,
assign $x_i$ the label $\lambda_i$ of the centroid with the smallest $d_{i->j}$, and update
$C_{\lambda_i}=C_{\lambda_i} \bigcup \{x_i\}$
c) For $j=1,2,...,k$, recompute the centroid from all the samples in $C_j$:
$\mu_j=\frac 1 {|C_j|} \sum_{x\in C_j}x$
d) If none of the $k$ centroids changed, go to step 3 (output); otherwise repeat step 2.
- Output $C$.
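The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the names `squared_distance` and `kmeans`, the optional `init` and `seed` parameters, and the choice to keep the previous centroid when a cluster ends up empty are all assumptions that the description does not specify.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance ||p - q||^2 between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(X, k, max_iter=100, init=None, seed=None):
    """Return (clusters, centroids); clusters[j] holds 0-based indices into X."""
    rng = random.Random(seed)
    # Step 1: pick k samples as the initial centroids (or use the given ones).
    chosen = init if init is not None else rng.sample(X, k)
    centroids = [tuple(float(v) for v in c) for c in chosen]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2a: start every iteration with empty clusters.
        clusters = [[] for _ in range(k)]
        # Step 2b: assign each sample to its nearest centroid.
        for i, x in enumerate(X):
            # min() returns the first minimum, so ties go to the lower index.
            label = min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            clusters[label].append(i)
        # Step 2c: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [X[i] for i in clusters[j]]
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 2d: stop once no centroid moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Step 3: output the partition (and the centroids, for convenience).
    return clusters, centroids
```

Calling `kmeans(X, 2, init=[(0, 1), (1, 2)])` fixes the starting centroids, which makes a run reproducible; without `init`, the result depends on the random draw.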
Worked example
The sample data is:
X = {(0,1), (1,2), (1,1), (3,3), (3,2), (4,4)}
Step-by-step calculation
- Choose (0,1) and (1,2), i.e. $x_1$ and $x_2$, as the initial centroids (so $k=2$), with $C_1=\varnothing$ and $C_2=\varnothing$
- Iterate
2-1) Compute distances
Point 1:
$d_{1->1}=0$
$d_{1->2}=(0-1)^2 + (1-2)^2 = 2$
Closer to centroid 1, so it is assigned to $C_1$.
Point 2:
$d_{2->1}=(1-0)^2 + (2-1)^2 = 2$
$d_{2->2}=0$
Closer to centroid 2, so it is assigned to $C_2$.
Point 3:
$d_{3->1}=(1-0)^2 + (1-1)^2 = 1$
$d_{3->2}=(1-1)^2 + (1-2)^2 = 1$
The two distances are equal, so the tie goes to the lower-numbered cluster, $C_1$.
Point 4:
$d_{4->1}=(3-0)^2 + (3-1)^2 = 13$
$d_{4->2}=(3-1)^2 + (3-2)^2 = 5$
Closer to centroid 2, so it is assigned to $C_2$.
Point 5:
$d_{5->1}=(3-0)^2 + (2-1)^2 = 10$
$d_{5->2}=(3-1)^2 + (2-2)^2 = 4$
Closer to centroid 2, so it is assigned to $C_2$.
Point 6:
$d_{6->1}=(4-0)^2 + (4-1)^2 = 25$
$d_{6->2}=(4-1)^2 + (4-2)^2 = 13$
Closer to centroid 2, so it is assigned to $C_2$.
Update clusters
$C_1 = \varnothing \bigcup \{x_1,x_3\} = \{x_1,x_3\}$
$C_2 = \varnothing \bigcup \{x_2,x_4,x_5,x_6\} = \{x_2,x_4,x_5,x_6\}$
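The first assignment pass can be reproduced with a few lines of Python. This is a sketch that assumes the initial centroids are (0,1) and (1,2), the values the distance calculations above actually use; the helper names `squared_distance` and `assign` are made up for illustration, and ties are broken in favor of the lower-numbered cluster, as in the walkthrough.

```python
X = [(0, 1), (1, 2), (1, 1), (3, 3), (3, 2), (4, 4)]
centroids = [(0, 1), (1, 2)]  # assumed initial centroids: x_1 and x_2

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Return one list of 0-based point indices per centroid."""
    clusters = [[] for _ in centroids]
    for i, x in enumerate(points):
        distances = [squared_distance(x, mu) for mu in centroids]
        # list.index returns the first match, so a tie goes to the lower index.
        clusters[distances.index(min(distances))].append(i)
    return clusters

clusters = assign(X, centroids)
# Print 1-based names to match x_1 ... x_6 in the text.
for j, members in enumerate(clusters, start=1):
    print(f"C_{j} =", [f"x_{i + 1}" for i in members])
```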
Update centroids
$\mu_1 = ((0+1)/2, (1+1)/2) = (0.5,1)$
$\mu_2 = ((1+3+3+4)/4,(2+3+2+4)/4) = (2.75,2.75)$
$\{\mu_1, \mu_2\} = \{(0.5,1), (2.75,2.75)\}$
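The centroid update is a component-wise mean. A small sketch, with the hypothetical helper name `mean_point`, reproduces the two centroids from the clusters of the first pass:

```python
def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

C1 = [(0, 1), (1, 1)]                  # x_1, x_3
C2 = [(1, 2), (3, 3), (3, 2), (4, 4)]  # x_2, x_4, x_5, x_6

mu1 = mean_point(C1)
mu2 = mean_point(C2)
print(mu1, mu2)  # (0.5, 1.0) (2.75, 2.75)
```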
2-2) Compute distances
Point 1:
$d_{1->\mu_1}=(0-0.5)^2 + (1-1)^2 = 0.25$
$d_{1->\mu_2}=(0-2.75)^2 + (1-2.75)^2 = 10.625$
Closer to centroid 1, so it is assigned to $C_1$.
Point 2:
$d_{2->\mu_1}=(1-0.5)^2 + (2-1)^2 = 1.25$
$d_{2->\mu_2}=(1-2.75)^2 + (2-2.75)^2 = 3.625$
Closer to centroid 1, so it is assigned to $C_1$.
Point 3:
$d_{3->\mu_1}=(1-0.5)^2 + (1-1)^2 = 0.25$
$d_{3->\mu_2}=(1-2.75)^2 + (1-2.75)^2 = 6.125$
Closer to centroid 1, so it is assigned to $C_1$.
Point 4:
$d_{4->\mu_1}=(3-0.5)^2 + (3-1)^2 = 10.25$
$d_{4->\mu_2}=(3-2.75)^2 + (3-2.75)^2 = 0.125$
Closer to centroid 2, so it is assigned to $C_2$.
Point 5:
$d_{5->\mu_1}=(3-0.5)^2 + (2-1)^2 = 7.25$
$d_{5->\mu_2}=(3-2.75)^2 + (2-2.75)^2 = 0.625$
Closer to centroid 2, so it is assigned to $C_2$.
Point 6:
$d_{6->\mu_1}=(4-0.5)^2 + (4-1)^2 = 21.25$
$d_{6->\mu_2}=(4-2.75)^2 + (4-2.75)^2 = 3.125$
Closer to centroid 2, so it is assigned to $C_2$.
Update clusters
$C_1 = \{x_1,x_2,x_3\}$
$C_2 = \{x_4,x_5,x_6\}$
Update centroids
$\mu_1 = ((1+1+0)/3, (2+1+1)/3) \approx (0.67,1.33)$
$\mu_2 = ((3+3+4)/3,(3+2+4)/3) \approx (3.33,3)$
$\{\mu_1, \mu_2\} \approx \{(0.67,1.33), (3.33,3)\}$
2-3) Compute distances
Point 1:
$d_{1->\mu_1}=(0-0.67)^2 + (1-1.33)^2 \approx 0.56$ (nearest)
$d_{1->\mu_2}=(0-3.33)^2 + (1-3)^2 \approx 15.09$
Point 2:
$d_{2->\mu_1}=(1-0.67)^2 + (2-1.33)^2 \approx 0.56$ (nearest)
$d_{2->\mu_2}=(1-3.33)^2 + (2-3)^2 \approx 6.43$
Point 3:
$d_{3->\mu_1}=(1-0.67)^2 + (1-1.33)^2 \approx 0.22$ (nearest)
$d_{3->\mu_2}=(1-3.33)^2 + (1-3)^2 \approx 9.43$
Point 4:
$d_{4->\mu_1}=(3-0.67)^2 + (3-1.33)^2 \approx 8.22$
$d_{4->\mu_2}=(3-3.33)^2 + (3-3)^2 \approx 0.11$ (nearest)
Point 5:
$d_{5->\mu_1}=(3-0.67)^2 + (2-1.33)^2 \approx 5.88$
$d_{5->\mu_2}=(3-3.33)^2 + (2-3)^2 \approx 1.11$ (nearest)
Point 6:
$d_{6->\mu_1}=(4-0.67)^2 + (4-1.33)^2 \approx 18.22$
$d_{6->\mu_2}=(4-3.33)^2 + (4-3)^2 \approx 1.45$ (nearest)
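Update clusters
No change: $C_1$ and $C_2$ are the same as after the previous pass.
Update centroids
The centroids do not change either, so the iteration ends and $C$ is output.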
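The whole walkthrough can be replayed programmatically to check the hand calculations. The sketch below assumes the starting centroids (0,1) and (1,2) and records the clusters and centroids after every pass; the variable names (`history`, `updated`) are illustrative, and the loop does not guard against empty clusters because none occur for this data.

```python
X = [(0, 1), (1, 2), (1, 1), (3, 3), (3, 2), (4, 4)]
centroids = [(0.0, 1.0), (1.0, 2.0)]  # assumed initial centroids: x_1 and x_2

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

history = []  # one (clusters, centroids) pair per pass
for _ in range(10):  # small cap; this data set converges well before it
    clusters = [[] for _ in centroids]
    for i, x in enumerate(X, start=1):  # 1-based to match x_1 ... x_6
        d = [squared_distance(x, mu) for mu in centroids]
        clusters[d.index(min(d))].append(i)  # ties go to the lower index
    # No empty-cluster guard: every cluster is non-empty for this data.
    updated = [
        tuple(sum(X[i - 1][axis] for i in members) / len(members) for axis in range(2))
        for members in clusters
    ]
    history.append((clusters, updated))
    if updated == centroids:  # no centroid moved: converged
        break
    centroids = updated

for step, (clusters, mus) in enumerate(history, start=1):
    rounded = [tuple(round(v, 2) for v in mu) for mu in mus]
    print(f"2-{step}) clusters={clusters} centroids={rounded}")
```

The loop stops on the third pass, the first one in which the recomputed centroids equal the previous ones.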
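The centroids are plotted with × markers.
The plot shows that a "centroid" is the center of mass of the shape formed by a cluster's points: it is the position that minimizes the sum of squared distances to all the samples in that cluster.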
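One way to make "closest to every sample in the cluster" precise is the sum of squared distances, which the mean of the points minimizes. The following sketch checks this numerically for the final cluster $C_2$ by comparing the centroid with 1000 random nearby positions; the helper name `sse`, the search radius, and the fixed seed are arbitrary choices for illustration.

```python
import random

cluster = [(3, 3), (3, 2), (4, 4)]  # C_2 from the final partition

def sse(center, points):
    """Sum of squared Euclidean distances from center to every point."""
    return sum((px - center[0]) ** 2 + (py - center[1]) ** 2 for px, py in points)

# Centroid = component-wise mean of the cluster.
centroid = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

rng = random.Random(0)  # fixed seed so the check is repeatable
candidates = [
    (centroid[0] + rng.uniform(-1, 1), centroid[1] + rng.uniform(-1, 1))
    for _ in range(1000)
]
best_other = min(sse(c, cluster) for c in candidates)

# The centroid should score at least as well as every other candidate.
print(sse(centroid, cluster) <= best_other)
```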