A simple implementation of k-means and GMM, trained with EM.
A Jupyter notebook version is available in the repo.
First, import the relevant modules.
from __future__ import print_function
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Randomly generate two groups of one-dimensional data from normal distributions, with the means and standard deviations shown in the code:
np.random.seed(1111)
mu1, mu2 = 1, 3
sigma1, sigma2 = 0.7, 0.4
num1, num2 = 20, 20
data1 = np.random.randn(num1) * sigma1 + mu1
data2 = np.random.randn(num2) * sigma2 + mu2
print('cluster 1 with mean {} std {}'.format(mu1, sigma1))
print('cluster 2 with mean {} std {}'.format(mu2, sigma2))
cluster 1 with mean 1 std 0.7
cluster 2 with mean 3 std 0.4
The generated data is plotted below:
def plot(y, **kwargs):
    plt.scatter(y, np.zeros_like(y), **kwargs)
plot(data1, c='r')
plt.scatter(mu1, 0, c='r', s=100, marker='^')
plot(data2, c='g')
plt.scatter(mu2, 0, c='g', s=100, marker='^')
In many scenarios, we do not know in advance how the data was generated; that is, the data is unlabeled:
data = np.concatenate([data1, data2])
plot(data)
K-means is an iterative clustering algorithm. Given a target number of clusters $K$, it proceeds as follows:
1. Initialize $K$ centers.
2. For each center $c_i$, compute the distance $dist_{ij}$ from each data point $x_j$ to that center (e.g., Euclidean distance).
3. Assign each data point to its nearest center.
4. Re-estimate each center $c_i$ as the mean of the points assigned to it: $c_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$.
5. Repeat steps 2-4 until a stopping criterion is met.
def kmeans(x, nb_class=2, max_iter=10, verbose=False):
    # randomly initialize centers, step 1
    means = x[np.random.choice(range(len(x)), nb_class, replace=False)]
    dist = np.zeros([nb_class, len(x)])
    for i in range(max_iter):
        # calculate distances, step 2
        for j in range(nb_class):
            for k, point in enumerate(x):
                dist[j][k] = (means[j] - point) ** 2
        # update means, steps 3 & 4
        min_idx = np.argmin(dist, 0)
        for j in range(nb_class):
            ele = [x[k] for k, idx in enumerate(min_idx) if idx == j]
            means[j] = np.mean(ele)
        # total within-cluster squared distance, for monitoring convergence
        total_dist = 0
        for j in range(nb_class):
            total_dist += sum([(means[j] - x[k]) ** 2 for k, idx in enumerate(min_idx) if idx == j])
        if verbose:
            print("Iter {}, total dist {}".format(i, total_dist))
    # final assignment with the converged centers
    for j in range(nb_class):
        for k, point in enumerate(x):
            dist[j][k] = (means[j] - point) ** 2
    min_idx = np.argmin(dist, 0)
    clusters = []
    for i in range(nb_class):
        ele = [x[k] for k, idx in enumerate(min_idx) if idx == i]
        clusters.append(np.asarray(ele))
    return means, clusters
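As an aside, the nested distance loops in step 2 can be replaced by NumPy broadcasting; a minimal sketch of that alternative (the helper name `assign_clusters` is ours, not part of the original code):

```python
import numpy as np

def assign_clusters(x, means):
    # means[:, None] has shape (nb_class, 1) and x[None, :] has shape
    # (1, len(x)), so broadcasting yields a (nb_class, len(x)) matrix of
    # squared distances in one expression
    dist = (means[:, None] - x[None, :]) ** 2
    return np.argmin(dist, axis=0)

x = np.array([0.9, 1.1, 2.9, 3.1])
means = np.array([1.0, 3.0])
print(assign_clusters(x, means))  # -> [0 0 1 1]
```

The explicit loops above are kept for clarity, but on large arrays the broadcast version is both shorter and much faster.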
# original clusters
plt.subplot(2, 1, 1)
plot(data1, c='r')
plt.scatter(mu1, 0, c='r', s=100, marker='^')
plot(data2, c='g')
plt.scatter(mu2, 0, c='g', s=100, marker='^')
# k-mean clusters
centers, clusters = kmeans(data, max_iter=10, verbose=True)
print('Estimated Center 1 - val: {:.2f}'.format(centers[0]))
print('Estimated Center 2 - val: {:.2f}'.format(centers[1]))
plt.subplot(2, 1, 2)
plot(clusters[0], c='r')
plt.scatter(centers[0], 0, c='r', s=100, marker='^')
plot(clusters[1], c='g')
plt.scatter(centers[1], 0, c='g', s=100, marker='^')
Iter 0, total dist 24.6199215453
Iter 1, total dist 10.047227048
Iter 2, total dist 5.06138695227
Iter 3, total dist 5.06138695227
Iter 4, total dist 5.06138695227
Iter 5, total dist 5.06138695227
Iter 6, total dist 5.06138695227
Iter 7, total dist 5.06138695227
Iter 8, total dist 5.06138695227
Iter 9, total dist 5.06138695227
Estimated Center 1 - val: 0.70
Estimated Center 2 - val: 2.80
The k-means algorithm is very sensitive to its initial values. Here we simply pick two random data points as the initial centers; in practice a more robust initialization method may be needed, which we will not expand on here.
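One well-known robust option is k-means++ seeding, which spreads the initial centers out by sampling each new center with probability proportional to its squared distance from the nearest center chosen so far. A minimal 1-D sketch (the helper `kmeanspp_init` is hypothetical, not part of the code above):

```python
import numpy as np

def kmeanspp_init(x, nb_class, rng=np.random):
    # pick the first center uniformly at random
    centers = [x[rng.randint(len(x))]]
    for _ in range(nb_class - 1):
        # squared distance of each point to its nearest chosen center
        d2 = np.min((x[None, :] - np.asarray(centers)[:, None]) ** 2, axis=0)
        # sample the next center with probability proportional to d2;
        # already-chosen points have d2 == 0, so they cannot be re-picked
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.asarray(centers)
```

These seeds could replace the uniform `np.random.choice` at the top of `kmeans`.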
k-means makes hard decisions: each point is assigned outright to whichever center is nearest. In the example above, points in the region where the two clusters overlap get misassigned to the other cluster, with no measure of confidence. This brings us to the topic of this post, the Gaussian Mixture Model (GMM), which clusters with soft decisions instead.
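To make "soft" concrete: instead of a 0/1 assignment, each point receives a posterior probability (responsibility) under every component. A small illustration, assuming two Gaussian components with hypothetical weights, means, and stds (chosen for illustration, not estimated from the data):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) evaluated at x
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# hypothetical mixture: equal weights, components N(1, 0.7^2) and N(3, 0.4^2)
pi = np.array([0.5, 0.5])
mu = np.array([1.0, 3.0])
sigma = np.array([0.7, 0.4])

x = 2.0  # a point in the overlap region between the two clusters
joint = pi * normal_pdf(x, mu, sigma)  # pi_k * N(x | mu_k, sigma_k^2)
resp = joint / joint.sum()             # posterior p(k | x), sums to 1
print(resp)  # both entries strictly between 0 and 1, unlike a hard 0/1 split
```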
GMM assumes the data follows the distribution:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)$$