Disclaimer: this article is my own set of study notes; discussion and exchange are welcome, and please credit the source when reposting.
Reference for this article
Author: Zhou Zhihua. Title: *Machine Learning*. Publisher: Tsinghua University Press.
Thanks to the original author of this book.
If this article infringes on your intellectual property or copyright, please notify me and I will deal with it and remove the article immediately.
Email: louhergetup@163.com
The k-means algorithm
Assume a sample set $D=\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \ldots, \boldsymbol{x}_{m}\right\}$ containing $m$ unlabeled samples, where each sample $\boldsymbol{x}_{i}=\left(x_{i 1} ; x_{i 2} ; \ldots ; x_{i n}\right)$ is an $n$-dimensional feature vector. A clustering algorithm partitions the sample set $D$ into $k$ disjoint clusters $\left\{C_{l} \mid l=1,2, \ldots, k\right\}$, where $C_{l^{\prime}} \cap_{l^{\prime} \neq l} C_{l}=\varnothing$ and $D=\bigcup_{l=1}^{k} C_{l}$.
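As a small illustration of this definition, the labels produced by any clustering algorithm induce exactly such a partition. The snippet below is my own toy example (the data and label values are made up): it builds the clusters as index sets from a label array and checks that they are pairwise disjoint and jointly cover $D$.

```python
import numpy as np

# Toy data: m = 6 samples with n = 2 features, plus a label array such as
# a clustering algorithm might output (values in {0, ..., k-1}).
D = np.array([[0.0, 0.1], [0.2, 0.0], [4.0, 4.1],
              [4.2, 3.9], [8.0, 8.2], [7.9, 8.1]])
labels = np.array([0, 0, 1, 1, 2, 2])
k = 3

# Build the clusters C_1, ..., C_k as sets of sample indices.
clusters = [set(np.flatnonzero(labels == l)) for l in range(k)]

# Pairwise disjoint: C_l' ∩ C_l = ∅ whenever l' != l.
assert all(clusters[a].isdisjoint(clusters[b])
           for a in range(k) for b in range(k) if a != b)

# Jointly exhaustive: the union of all clusters is the whole sample set D.
assert set().union(*clusters) == set(range(len(D)))
```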
Given the sample set $D=\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \ldots, \boldsymbol{x}_{m}\right\}$, the k-means algorithm minimizes the squared error over the resulting cluster partition $\mathcal{C}=\left\{C_{1}, C_{2}, \ldots, C_{k}\right\}$:

$$E=\sum_{i=1}^{k} \sum_{\boldsymbol{x} \in C_{i}}\left\|\boldsymbol{x}-\boldsymbol{\mu}_{i}\right\|_{2}^{2}$$
where $\boldsymbol{\mu}_{i}=\frac{1}{\left|C_{i}\right|} \sum_{\boldsymbol{x} \in C_{i}} \boldsymbol{x}$ is the mean vector of cluster $C_{i}$. Intuitively, this expression characterizes how tightly the samples within a cluster surround its mean vector: the smaller $E$ is, the higher the intra-cluster similarity.
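To make the objective concrete, here is a minimal from-scratch sketch of the standard k-means iteration (Lloyd's algorithm), which alternates between assigning each sample to its nearest mean vector and recomputing the means; each full iteration cannot increase $E$. The function names are my own, this is a sketch rather than a production implementation, and it does not handle the rare case of a cluster becoming empty.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: returns cluster labels and mean vectors."""
    rng = np.random.default_rng(seed)
    # Initialize the k mean vectors with randomly chosen samples.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample joins the cluster of its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its cluster's samples.
        new_mu = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_mu, mu):
            break  # converged: the means (and hence assignments) are stable
        mu = new_mu
    return labels, mu

def squared_error(X, labels, mu):
    """The objective E = sum_i sum_{x in C_i} ||x - mu_i||_2^2."""
    return sum(((X[labels == i] - mu[i]) ** 2).sum() for i in range(len(mu)))
```

Note that this procedure only finds a local minimum of $E$; finding the global minimum is NP-hard, which is why implementations typically restart from several random initializations and keep the best result.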
Python implementation
Experimental environment
- Operating system: Windows 7, 64-bit
- Software environment: Anaconda3
- Programming language: Python 3.7
Implementation code
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Incorrect number of clusters: ask for 2 clusters on data with 3 blobs.
y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")

# Anisotropically distributed data: a linear transform stretches the blobs,
# which violates k-means' implicit assumption of spherical clusters.
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)
plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropically Distributed Blobs")

# Blobs with unequal variance.
X_varied, y_varied = make_blobs(n_samples=n_samples,
                                cluster_std=[1.0, 2.5, 0.5],
                                random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)
plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")

# Unevenly sized blobs: keep 500, 100, and 10 samples per blob.
X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_pred = KMeans(n_clusters=3,
                random_state=random_state).fit_predict(X_filtered)
plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")

plt.show()
```
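To connect the plots above back to the objective $E$, note that scikit-learn's `KMeans` exposes the minimized squared error of the fitted model as its `inertia_` attribute. The short check below (my own addition, not part of the original script) recomputes $E$ directly from the definition and confirms it matches:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, random_state=170)
km = KMeans(n_clusters=3, random_state=170).fit(X)

# E = sum over clusters of squared distances to each cluster's mean vector.
E = sum(np.sum((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2)
        for i in range(3))
assert np.isclose(E, km.inertia_)
```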