Overview
A Gaussian mixture model describes the data as a combination of several Gaussian distributions, one component per cluster.
The parameters of each Gaussian are estimated iteratively with the EM algorithm.
There are plenty of theoretical treatments of EM online, but it is worth walking through the algorithm below first and reading the formal derivation afterwards.
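Concretely, the mixture models the density of a sample $x$ as a weighted sum of $k$ Gaussian components:

$$
p(x) = \sum_{i=1}^{k} \alpha_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i), \qquad \sum_{i=1}^{k} \alpha_i = 1
$$

EM alternates between estimating each sample's posterior component memberships (E-step) and re-fitting $\alpha_i, \mu_i, \Sigma_i$ from those memberships (M-step).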
Update rules
- $\mu_i' = \dfrac{\sum_{j=1}^m \eta_{ji} \, x_j}{\sum_{j=1}^m \eta_{ji}}$
- $\Sigma_i' = \dfrac{\sum_{j=1}^m \eta_{ji} \, (x_j - \mu_i')(x_j - \mu_i')^T}{\sum_{j=1}^m \eta_{ji}}$
- $\alpha_i' = \dfrac{\sum_{j=1}^m \eta_{ji}}{m}$
- $m$ is the number of samples
- $\eta_{ji} = P(z_j = i \mid x_j)$ is the posterior probability that sample $x_j$ was generated by the $i$-th Gaussian component
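The responsibility $\eta_{ji}$ used in the updates above follows from Bayes' rule applied to the mixture:

$$
\eta_{ji} = P(z_j = i \mid x_j) = \frac{\alpha_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, \mathcal{N}(x_j \mid \mu_l, \Sigma_l)}
$$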
Iterating these updates to convergence, the final $\eta_{ji}$ values give the clustering result: each sample is assigned to the Gaussian component under which its posterior probability is highest.
Python implementation
- Load the data
```python
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()
```
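A quick sanity check on what was loaded (the iris dataset has 150 samples with 4 features each):

```python
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)
```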
- Algorithm implementation
```python
def GMMs(X, k=3, steps=10):
    def p(x, mu, sigma):
        # multivariate Gaussian density N(x | mu, sigma)
        n = len(x)
        div = (2 * np.pi) ** (n / 2) * (abs(np.linalg.det(sigma)) ** 0.5)
        expOn = -0.5 * np.dot((x - mu).T, np.dot(np.linalg.inv(sigma), (x - mu)))
        return np.exp(expOn) / div

    def init(X):
        m, n = X.shape
        # initialize means from random samples, covariances as identity
        # matrices (guaranteed positive definite), and uniform weights
        mus = X[np.random.choice(m, k, replace=False)].astype(float)
        sigmas = np.array([np.eye(n) for _ in range(k)])
        alphas = np.ones(k) / k
        return mus, sigmas, alphas

    # k Gaussian distributions
    mus, sigmas, alphas = init(X)
    # EM algorithm
    mat = np.zeros((len(X), k))
    for times in range(steps):
        # E-step: posterior probability of each component for each sample
        for j, x in enumerate(X):
            temp = 0
            for i in range(k):
                tempP = alphas[i] * p(x, mus[i], sigmas[i])
                temp += tempP
                mat[j][i] = tempP
            mat[j] /= temp
        # M-step
        for i in range(k):
            # update mus
            mus[i] = np.dot(mat[:, i].T, X) / sum(mat[:, i])
            # update sigmas
            temp = np.zeros(sigmas[0].shape)
            for j in range(len(X)):
                data = (X[j] - mus[i]).reshape(-1, 1)
                temp += mat[j][i] * np.dot(data, data.T)
            temp /= sum(mat[:, i])
            sigmas[i] = temp
            # update alphas
            alphas[i] = sum(mat[:, i]) / len(X)
    # clustering: assign each sample to its most probable component
    Ans = np.zeros(len(X))
    for j, x in enumerate(X):
        probs = np.array([alphas[i] * p(x, mus[i], sigmas[i]) for i in range(k)])
        Ans[j] = np.argmax(probs)
    return Ans

test_y = GMMs(iris.data, steps=20)
```
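For comparison, scikit-learn ships the same algorithm as `sklearn.mixture.GaussianMixture`; a minimal sketch of the equivalent call (the hyperparameters here are illustrative, not tuned):

```python
from sklearn import datasets
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
# fit a 3-component GMM with EM, then assign each sample to its most
# probable component, mirroring the argmax step in GMMs above
gm = GaussianMixture(n_components=3, max_iter=20, random_state=0)
labels = gm.fit_predict(iris.data)
print(labels.shape)  # (150,)
```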
- Plotting
```python
from sklearn.decomposition import PCA

X_reduced = PCA(n_components=2).fit_transform(iris.data)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=test_y, cmap=plt.cm.Set1)
```
- Evaluation
```python
def evaluate(y, t):
    # count sample pairs by agreement of predicted (y) and true (t) labels
    a, b, c, d = 0, 0, 0, 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j] and t[i] == t[j]:
                a += 1
            elif y[i] == y[j] and t[i] != t[j]:
                b += 1
            elif y[i] != y[j] and t[i] == t[j]:
                c += 1
            else:
                d += 1
    return a, b, c, d

def external_index(a, b, c, d, m):
    JC = a / (a + b + c)
    FMI = np.sqrt(a**2 / ((a + b) * (a + c)))
    # the Rand index divides by the number of pairs C(m, 2) = m*(m-1)/2
    RI = 2 * (a + d) / (m * (m - 1))
    return JC, FMI, RI

def evaluate_it(y, t):
    a, b, c, d = evaluate(y, t)
    return external_index(a, b, c, d, len(y))
```
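The pair counting above can be sanity-checked: a clustering identical to the ground truth up to a relabeling must score JC = FMI = 1. A self-contained sketch (`pair_confusion` is a name introduced here, mirroring `evaluate`):

```python
import numpy as np

def pair_confusion(y, t):
    # count sample pairs by agreement in predicted (y) and true (t) labels
    a = b = c = d = 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            same_y, same_t = y[i] == y[j], t[i] == t[j]
            if same_y and same_t:
                a += 1
            elif same_y:
                b += 1
            elif same_t:
                c += 1
            else:
                d += 1
    return a, b, c, d

# same partition, different label names
y = [0, 0, 1, 1, 2, 2]
t = [2, 2, 0, 0, 1, 1]
a, b, c, d = pair_confusion(y, t)
JC = a / (a + b + c)
FMI = np.sqrt(a**2 / ((a + b) * (a + c)))
print(JC, FMI)  # 1.0 1.0
```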
| Index | value |
|---|---|
| JC | 0.8187638512681605 |
| FMI | 0.9003627122239571 |
| RI | 0.921766004415011 |
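These hand-rolled indices can also be cross-checked against a library implementation; for instance, scikit-learn provides the adjusted Rand index, a chance-corrected relative of RI:

```python
from sklearn.metrics import adjusted_rand_score

# ARI is invariant to label permutations: identical partitions with
# swapped label names still score a perfect 1.0
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```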