1. EM Algorithm
The EM (Expectation-Maximization) algorithm estimates the parameters of probabilistic models that contain latent variables. It alternates between two steps:
- E-step: given the current parameter estimate and the observed data, compute the expected complete-data log-likelihood over the latent variables
- M-step: maximize that expectation with respect to the parameters to obtain the updated estimate
Suppose we have observed data $X$ and latent variables $Z$, with model parameters $\theta$. We want to maximize the log-likelihood $\log P(X|\theta)$.
E-step:
$$Q(\theta,\theta^{(t)})=E_{Z|X,\theta^{(t)}}\left[\log P(X,Z|\theta)\right]$$
M-step:
$$\theta^{(t+1)}=\arg\max_{\theta}Q(\theta,\theta^{(t)})$$
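To make the two steps concrete, here is a minimal sketch of EM for a one-dimensional mixture of two Gaussians; the data and the initial values of pi, mu, and var are illustrative assumptions, not part of the derivation above.

import numpy as np

def em_step_1d(x, pi, mu, var):
    # E-step: responsibility gamma[i, j] of component j for point x[i]
    pdf = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = pi * pdf
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: maximize Q by re-estimating the parameters from gamma
    n_j = gamma.sum(axis=0)
    pi_new = n_j / len(x)
    mu_new = (gamma * x[:, None]).sum(axis=0) / n_j
    var_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / n_j
    return pi_new, mu_new, var_new

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
pi, mu, var = np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 1.0])
for _ in range(50):
    pi, mu, var = em_step_1d(x, pi, mu, var)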
2. K-means
The idea: pick $k$ random samples as the initial cluster centers, assign every point to its nearest center, recompute each center as the mean of the points assigned to it, and repeat until the centers stop moving. The core code is as follows (the last column of X holds the cluster label):
def kmeans(X, k, epsilon=1e-5):
    center = np.zeros((k, X.shape[1] - 1))
    for i in range(k):
        # the last column of X is the label; exclude it
        center[i, :] = X[np.random.randint(0, high=X.shape[0]), :-1]
    iterations = 0
    while True:
        iterations += 1
        distance = np.zeros(k)
        # reassign every point to its nearest center
        for i in range(X.shape[0]):
            for j in range(k):
                distance[j] = np.linalg.norm(X[i, :-1] - center[j, :])
            X[i, -1] = np.argmin(distance)
        # update the centers from the new assignments
        new_center = np.zeros((k, X.shape[1] - 1))
        count = np.zeros(k)
        # each new center is the mean of the points assigned to it
        for i in range(X.shape[0]):
            new_center[int(X[i, -1]), :] += X[i, :-1]
            count[int(X[i, -1])] += 1
        for i in range(k):
            new_center[i, :] = new_center[i, :] / count[i]
        if np.linalg.norm(new_center - center) < epsilon:
            break
        else:
            center = new_center
    return X, center, iterations
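A minimal usage sketch, assuming the toy-data format used in section 4 (2-D points plus a trailing label column that kmeans overwrites):

import numpy as np

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in ([0, 0], [6, 6], [0, 6])])
X = np.hstack([pts, np.zeros((pts.shape[0], 1))])  # last column: label placeholder
X, center, iterations = kmeans(X, k=3)
print(center, iterations)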
3. GMM
3.1 Derivation
The density function of the $d$-dimensional multivariate Gaussian distribution is
$$p(x|\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
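As a quick sanity check (an illustrative snippet with made-up numbers), the formula can be evaluated directly and compared against scipy.stats.multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, sigma):
    # evaluate the density formula above directly
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_pdf(x, mu, sigma))                      # direct formula
print(multivariate_normal.pdf(x, mean=mu, cov=sigma))  # same value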
Let $X=\{x_1,x_2,\ldots,x_n\}$ be $n\times d$. Each sample $x_i$ is regarded as generated by several multivariate Gaussians, represented as a linear superposition; assuming $k$ Gaussian components,
$$p(x_i)=\sum_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)$$
$$\sum_{j=1}^{k}\pi_j=1$$
This amounts to picking one of the $k$ Gaussians to generate each sample. Introduce a $k$-dimensional binary variable $z$ in which exactly one element satisfies $z_j=1$ and all other elements are 0, i.e.
$$z_j\in\{0,1\},\qquad \sum_j z_j=1$$
The mixing weights $\pi_j$ then characterize the distribution of $z$, whose prior is
$$p(z)=\prod_{j=1}^k\pi_j^{z_j}$$
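The generative story can be sketched directly (the parameters below are illustrative assumptions): draw $z$ from a categorical distribution with probabilities $\pi$, then draw $x$ from the selected Gaussian.

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.5, 0.2])                  # assumed mixing coefficients
mu = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
sigma = np.array([np.eye(2)] * 3)

j = rng.choice(len(pi), p=pi)                   # z_j = 1: choose a component
x = rng.multivariate_normal(mu[j], sigma[j])    # sample x from that component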
Having observed $x_i$, the posterior probability of $z$ is
$$\gamma(z_{ij})\equiv p(z_j=1|x_i)=\frac{p(z_j=1)p(x_i|z_j=1)}{p(x_i)}=\frac{\pi_jp(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}$$
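In code, the responsibilities $\gamma(z_{ij})$ for all samples can be computed in one pass; a vectorized sketch assuming parameter arrays pi, mu, sigma of length k:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mu, sigma):
    # weighted densities pi_j * p(x_i | mu_j, Sigma_j), shape (n, k)
    weighted = np.column_stack([
        pi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=sigma[j])
        for j in range(len(pi))
    ])
    # normalize each row by the mixture density p(x_i)
    return weighted / weighted.sum(axis=1, keepdims=True)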
Once the posterior is known, GMM partitions the samples into $k$ clusters $C=\{C_1,C_2,\ldots,C_k\}$: each sample is assigned to the cluster $j=\arg\max\limits_j\gamma(z_{ij})$. We can therefore use maximum likelihood to estimate the parameters:
$$\ln p(X|\pi,\mu,\Sigma)=\ln\prod_{i=1}^{n}p(x_i)=\sum_{i=1}^{n}\ln\sum_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)$$
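A direct implementation of this sum can underflow when the component densities are tiny; a numerically more stable variant (an optional refinement, assuming scipy is available) evaluates it in log space:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood_stable(x, pi, mu, sigma):
    # log(pi_j) + log p(x_i | mu_j, Sigma_j), shape (n, k)
    log_w = np.column_stack([
        np.log(pi[j]) + multivariate_normal.logpdf(x, mean=mu[j], cov=sigma[j])
        for j in range(len(pi))
    ])
    # sum_i log sum_j exp(log_w[i, j])
    return logsumexp(log_w, axis=1).sum()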
Maximizing the log-likelihood with respect to $\mu_j$:
$$\frac{\partial \ln p(X|\pi,\mu,\Sigma)}{\partial \mu_j}=\sum_{i=1}^{n}\frac{\pi_jp(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}\Sigma^{-1}_j(x_i-\mu_j)=0$$
That is,
$$\sum_{i=1}^{n}\gamma(z_{ij})\Sigma^{-1}_j(x_i-\mu_j)=0$$
which gives
$$\mu_j=\frac{\sum\limits_{i=1}^{n}\gamma(z_{ij})x_i}{\sum\limits_{i=1}^{n}\gamma(z_{ij})}$$
Letting
$$n_j=\sum\limits_{i=1}^{n}\gamma(z_{ij})$$
we obtain
$$\mu_j=\frac{1}{n_j}\sum\limits_{i=1}^{n}\gamma(z_{ij})x_i$$
Similarly, for $\Sigma_j$:
$$\frac{\partial \ln p(X|\pi,\mu,\Sigma)}{\partial \Sigma_j}=\sum_{i=1}^{n}\frac{\pi_jp(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}\left(\Sigma^{-1}_j-\Sigma^{-1}_j(x_i-\mu_j)(x_i-\mu_j)^T\Sigma^{-1}_j\right)=0$$
which gives
$$\Sigma_j=\frac{1}{n_j}\sum\limits_{i=1}^{n}\gamma(z_{ij})(x_i-\mu_j)(x_i-\mu_j)^T$$
Next we solve for $\pi_j$. By the method of Lagrange multipliers, form
$$\ln p(X|\pi,\mu,\Sigma)+\lambda\left(\sum_{j=1}^{k}\pi_j-1\right)$$
Differentiating,
$$\frac{\partial\left[\ln p(X|\pi,\mu,\Sigma)+\lambda\left(\sum\limits_{j=1}^{k}\pi_j-1\right)\right]}{\partial \pi_j}=\sum_{i=1}^{n}\frac{p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}+\lambda=0$$
Multiplying both sides by $\pi_j$ and summing over $j\in\{1,2,\ldots,k\}$ gives
$$\sum_{i=1}^{n}\frac{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}+\lambda\sum_{j=1}^{k}\pi_j=0$$
Substituting the constraint $\sum\limits_{j=1}^{k}\pi_j=1$ yields $n=-\lambda$.
Substituting $\lambda=-n$ back into
$$\sum_{i=1}^{n}\frac{p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}+\lambda=0$$
and multiplying through by $\pi_j$ (so that the sum becomes $n_j$)
gives
$$\pi_j=\frac{n_j}{n}$$
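As a quick check of the derived updates (an illustrative experiment on made-up data, reusing responsibilities() and log_likelihood_stable() from the sketches above), one E/M update computed with these formulas should not decrease the log-likelihood:

import numpy as np

rng = np.random.default_rng(0)
x = np.vstack([rng.multivariate_normal(m, np.eye(2), 100) for m in ([0, 0], [5, 5])])
pi = np.array([0.5, 0.5])
mu = np.array([[1.0, 1.0], [4.0, 4.0]])
sigma = np.array([np.eye(2)] * 2)

before = log_likelihood_stable(x, pi, mu, sigma)
gamma = responsibilities(x, pi, mu, sigma)   # E-step
n_j = gamma.sum(axis=0)                      # M-step, using the derived formulas
pi = n_j / x.shape[0]
mu = (gamma.T @ x) / n_j[:, None]
sigma = np.array([
    ((x - mu[j]).T @ ((x - mu[j]) * gamma[:, [j]])) / n_j[j]
    for j in range(2)
])
after = log_likelihood_stable(x, pi, mu, sigma)
assert after >= before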
3.2 Implementation
The GMM algorithm proceeds as follows:
1. Randomly initialize the parameters $\pi_i,\mu_i,\Sigma_i,\ i\in\{1,2,\ldots,k\}$
2. E-step: compute the posterior probability that each sample was generated by each mixture component, $\gamma(z_{ij})=\frac{\pi_jp(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_jp(x_i|\mu_j,\Sigma_j)}$
3. M-step: update the parameters $\pi_j,\mu_j,\Sigma_j,\ j\in\{1,2,\ldots,k\}$ using
$$\begin{aligned} \pi_j &= \frac{n_j}{n} \\ \mu_j &= \frac{1}{n_j}\sum_{i=1}^n\gamma(z_{ij})x_i \\ \Sigma_j &= \frac{1}{n_j}\sum_{i=1}^n\gamma(z_{ij})(x_i-\mu_j)(x_i-\mu_j)^T \end{aligned}$$
4. If the parameter values no longer change, compute the labels with $j=\arg\max\limits_{j}\gamma(z_{ij})$; otherwise, return to step 2
The core E-step code is as follows
def e_step(X, mu_list, sigma_list, pi_list):
    k = mu_list.shape[0]
    gamma_z = np.zeros((X.shape[0], k))
    for i in range(X.shape[0]):
        pi_pdf_sum = 0
        pi_pdf = np.zeros(k)
        for j in range(k):
            pi_pdf[j] = pi_list[j] * multivariate_normal.pdf(
                X[i], mean=mu_list[j], cov=sigma_list[j]
            )
            pi_pdf_sum += pi_pdf[j]
        for j in range(k):
            gamma_z[i, j] = pi_pdf[j] / pi_pdf_sum
    return gamma_z
The core M-step code is as follows
def m_step(X, mu_list, gamma_z):
    k = mu_list.shape[0]
    n = X.shape[0]
    dim = X.shape[1]
    mu_list_new = np.zeros(mu_list.shape)
    sigma_list_new = np.zeros((k, dim, dim))
    pi_list_new = np.zeros(k)
    for j in range(k):
        gamma = gamma_z[:, j]
        n_j = np.sum(gamma)
        pi_list_new[j] = n_j / n
        gamma = gamma.reshape(n, 1)
        mu_list_new[j, :] = (gamma.T @ X) / n_j
        # the covariance uses the freshly updated mean, matching the derived Sigma_j
        diff = X - mu_list_new[j]
        sigma_list_new[j] = (diff.T @ (diff * gamma)) / n_j
    return mu_list_new, sigma_list_new, pi_list_new
4. Complete Code
import numpy as np
import matplotlib.pyplot as plt
import itertools
import copy
from scipy.stats import multivariate_normal
import pandas as pd

def generate_data(k, n, dimension, mu_list, sigma_list):
    """
    Generate data from k Gaussian distributions, n samples per distribution.
    The returned X carries the true class label in its last column, for plotting.
    """
    X = np.zeros((n * k, dimension + 1))
    for i in range(k):
        X[i * n : (i + 1) * n, :dimension] = np.random.multivariate_normal(
            mu_list[i], sigma_list[i], size=n
        )
        X[i * n : (i + 1) * n, dimension : dimension + 1] = i
    return X
def kmeans(X, k, epsilon=1e-5):
    center = np.zeros((k, X.shape[1] - 1))
    for i in range(k):
        # the last column of X is the label; exclude it
        center[i, :] = X[np.random.randint(0, high=X.shape[0]), :-1]
    iterations = 0
    while True:
        iterations += 1
        distance = np.zeros(k)
        # reassign every point to its nearest center
        for i in range(X.shape[0]):
            for j in range(k):
                distance[j] = np.linalg.norm(X[i, :-1] - center[j, :])
            X[i, -1] = np.argmin(distance)
        # update the centers from the new assignments
        new_center = np.zeros((k, X.shape[1] - 1))
        count = np.zeros(k)
        # each new center is the mean of the points assigned to it
        for i in range(X.shape[0]):
            new_center[int(X[i, -1]), :] += X[i, :-1]
            count[int(X[i, -1])] += 1
        for i in range(k):
            new_center[i, :] = new_center[i, :] / count[i]
        if np.linalg.norm(new_center - center) < epsilon:
            break
        else:
            center = new_center
    return X, center, iterations
def e_step(X, mu_list, sigma_list, pi_list):
    k = mu_list.shape[0]
    gamma_z = np.zeros((X.shape[0], k))
    for i in range(X.shape[0]):
        pi_pdf_sum = 0
        pi_pdf = np.zeros(k)
        for j in range(k):
            pi_pdf[j] = pi_list[j] * multivariate_normal.pdf(
                X[i], mean=mu_list[j], cov=sigma_list[j]
            )
            pi_pdf_sum += pi_pdf[j]
        for j in range(k):
            gamma_z[i, j] = pi_pdf[j] / pi_pdf_sum
    return gamma_z
def m_step(X, mu_list, gamma_z):
    k = mu_list.shape[0]
    n = X.shape[0]
    dim = X.shape[1]
    mu_list_new = np.zeros(mu_list.shape)
    sigma_list_new = np.zeros((k, dim, dim))
    pi_list_new = np.zeros(k)
    for j in range(k):
        gamma = gamma_z[:, j]
        n_j = np.sum(gamma)
        pi_list_new[j] = n_j / n
        gamma = gamma.reshape(n, 1)
        mu_list_new[j, :] = (gamma.T @ X) / n_j
        # the covariance uses the freshly updated mean, matching the derived Sigma_j
        diff = X - mu_list_new[j]
        sigma_list_new[j] = (diff.T @ (diff * gamma)) / n_j
    return mu_list_new, sigma_list_new, pi_list_new
def log_likelihood(x, mu_list, sigma_list, pi_list):
    log_l = 0
    for i in range(x.shape[0]):
        pi_times_pdf_sum = 0
        for j in range(mu_list.shape[0]):
            pi_times_pdf_sum += pi_list[j] * multivariate_normal.pdf(
                x[i], mean=mu_list[j], cov=sigma_list[j]
            )
        log_l += np.log(pi_times_pdf_sum)
    return log_l
def GMM(X, k, epsilon=1e-5):
    x = X[:, :-1]
    pi_list = np.ones(k) * (1.0 / k)
    sigma_list = np.array([0.1 * np.eye(x.shape[1])] * k)
    # pick the first center at random, then repeatedly add the point with the
    # largest total distance to the centers chosen so far
    mu_list = [x[np.random.randint(0, x.shape[0])]]
    for times in range(k - 1):
        temp_ans = []
        for i in range(x.shape[0]):
            temp_ans.append(
                np.sum([np.linalg.norm(x[i] - mu_list[j]) for j in range(len(mu_list))])
            )
        mu_list.append(x[np.argmax(temp_ans)])
    mu_list = np.array(mu_list)
    old_log_l = log_likelihood(x, mu_list, sigma_list, pi_list)
    iterations = 0
    log_l_list = pd.DataFrame(columns=("Iterations", "log likelihood"))
    while True:
        gamma_z = e_step(x, mu_list, sigma_list, pi_list)
        mu_list, sigma_list, pi_list = m_step(x, mu_list, gamma_z)
        new_log_l = log_likelihood(x, mu_list, sigma_list, pi_list)
        if iterations % 10 == 0:
            new_row = pd.DataFrame(
                {"Iterations": [iterations], "log likelihood": [old_log_l]}
            )
            log_l_list = pd.concat([log_l_list, new_row], ignore_index=True)
        if old_log_l < new_log_l and (new_log_l - old_log_l) < epsilon:
            break
        old_log_l = new_log_l
        iterations += 1
    # compute the labels from the final responsibilities
    for i in range(X.shape[0]):
        X[i, -1] = np.argmax(gamma_z[i, :])
    return X, iterations, log_l_list
def show(X, center=None, title=None):
    plt.style.use("seaborn-v0_8-dark")
    plt.scatter(X[:, 0], X[:, 1], c=X[:, 2], marker=".", s=40, cmap="YlGnBu")
    if center is not None:
        plt.scatter(center[:, 0], center[:, 1], c="r", marker="x", s=250)
    if title is not None:
        plt.title(title)
    plt.show()

def accuracy(label, predict, k):
    # try every permutation of cluster indices, since cluster ids are arbitrary
    predicts = list(itertools.permutations(range(k), k))
    counts = np.zeros(len(predicts))
    for i in range(len(predicts)):
        for j in range(label.shape[0]):
            if int(label[j]) == predicts[i][int(predict[j])]:
                counts[i] += 1
    return np.max(counts) / label.shape[0]
def get_result_kmeans(X, k, label, epsilon):
    X, center, iterations = kmeans(X, k, epsilon=epsilon)
    print(center)
    acc = accuracy(label, X[:, -1], k)
    show(
        X,
        center,
        title="epsilon="
        + str(epsilon)
        + ", iterations="
        + str(iterations)
        + ", accuracy="
        + str(acc),
    )

def get_result_gmm(X, k, label, epsilon):
    X, iterations, log_l_list = GMM(X, k, epsilon=epsilon)
    print(log_l_list)
    acc = accuracy(label, X[:, -1], k)
    show(
        X,
        title="epsilon="
        + str(epsilon)
        + ", iterations="
        + str(iterations)
        + ", accuracy="
        + str(acc),
    )

def get_result(algorithm, X, k, label, epsilons=[1, 1e-5, 1e-10]):
    if algorithm == "kmeans":
        for e in epsilons:
            get_result_kmeans(X, k, label, epsilon=e)
    elif algorithm == "gmm":
        for e in epsilons:
            get_result_gmm(X, k, label, epsilon=e)
if __name__ == "__main__":
k = 3
n = 300
dimension = 2
mu_list = np.array([[2, 6], [8, 10], [8, 2]])
sigma_list = np.array([[[1, 0], [0, 3]], [[3, 0], [0, 2]], [[3, 0], [0, 3]]])
X = generate_data(k, n, dimension, mu_list, sigma_list)
label = copy.deepcopy(X[:, -1]) # 副本 防止改变X的最后一列而导致真正的label改变
print(mu_list)
show(X, mu_list, title="Real Data Distribution")
get_result(algorithm="kmeans", X=X, k=k, label=label)
# get_result(algorithm="gmm", X=X, k=k, label=label)