HIT Machine Learning Lab 3: k-means and GMM via the EM Algorithm

1. The EM Algorithm

The EM (Expectation-Maximization) algorithm estimates the parameters of probabilistic models that contain latent variables. It alternates between two steps:

  • E-step: compute the expectation of the latent variables, given the current parameter estimates and the observed data
  • M-step: maximize that expectation to update the parameter estimates

Suppose we have data $X$, latent variables $Z$, and model parameters $\theta$, and we want to maximize the log-likelihood $\log P(X|\theta)$.
E-step:
$$Q(\theta,\theta^{(t)})=E_{Z|X,\theta^{(t)}}[\log P(X,Z|\theta)]$$
M-step:
$$\theta^{(t+1)}=\arg\max_{\theta}Q(\theta,\theta^{(t)})$$
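
In code, the skeleton is just an alternation of these two updates until the likelihood stops improving. Below is a minimal generic sketch; `e_step`, `m_step`, and `log_likelihood` are model-specific placeholders (the GMM versions appear in Section 3):

def em(X, theta, max_iter=100, tol=1e-6):
    # Generic EM loop: alternate the E-step and M-step until the
    # log-likelihood improves by less than tol.
    old_ll = log_likelihood(X, theta)
    for _ in range(max_iter):
        expectations = e_step(X, theta)   # E-step: posterior of Z given X, theta
        theta = m_step(X, expectations)   # M-step: argmax of Q(theta, theta_t)
        new_ll = log_likelihood(X, theta)
        if new_ll - old_ll < tol:         # EM never decreases the likelihood
            break
        old_ll = new_ll
    return theta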

2. K-means

Idea: randomly pick $k$ samples as the initial centers, assign every point to its nearest center, recompute each center as the mean of the points assigned to it, and repeat until the centers stop moving. (Viewed through EM, the E-step is the hard assignment of points to centers and the M-step is the mean update.)

The core code is as follows:

def kmeans(X, k, epsilon=1e-5):
    center = np.zeros((k, X.shape[1] - 1))
    for i in range(k):
        # the last column is the label, so exclude it
        center[i, :] = X[np.random.randint(0, high=X.shape[0]), :-1]
    iterations = 0
    while True:
        iterations += 1
        distance = np.zeros(k)
        # reassign each point to its nearest center
        for i in range(X.shape[0]):
            for j in range(k):
                distance[j] = np.linalg.norm(X[i, :-1] - center[j, :])
            X[i, -1] = np.argmin(distance)
        # update the centers from the new assignments
        new_center = np.zeros((k, X.shape[1] - 1))
        count = np.zeros(k)
        # each center is the mean of the points assigned to it
        for i in range(X.shape[0]):
            new_center[int(X[i, -1]), :] += X[i, :-1]
            count[int(X[i, -1])] += 1
        for i in range(k):
            if count[i] > 0:  # keep the old center if a cluster went empty
                new_center[i, :] = new_center[i, :] / count[i]
            else:
                new_center[i, :] = center[i, :]
        if np.linalg.norm(new_center - center) < epsilon:
            break
        else:
            center = new_center
    return X, center, iterations
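
A minimal usage sketch on hypothetical two-blob data (the trailing all-zeros column is the label slot the function expects, matching the layout produced by `generate_data` in the complete code below):

import numpy as np

pts = np.random.randn(100, 2)
pts[50:] += 5                              # shift half the points into a second blob
X = np.hstack([pts, np.zeros((100, 1))])   # append the label column
X, center, iterations = kmeans(X, k=2)
print(center, iterations)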

3. GMM

3.1 Derivation

The density of a $d$-dimensional multivariate Gaussian is
$$p(x|\mu,\Sigma)=\frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
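
As a quick sanity check of this formula (a sketch, not part of the lab code), a direct implementation can be compared against scipy.stats.multivariate_normal.pdf; the two printed values should match:

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, sigma):
    # literal transcription of the density above
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_pdf(x, mu, sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=sigma))
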
$X=\{x_1,x_2,...,x_n\}$ is $n\times d$. Each sample $x_i$ is assumed to be generated by a linear superposition of $k$ multivariate Gaussians:
$$p(x_i)=\sum_{j=1}^{k}\pi_j p(x_i|\mu_j,\Sigma_j),\qquad \sum_{j=1}^{k}\pi_j=1$$
Equivalently, each sample is generated by exactly one of the $k$ Gaussians. Introduce a $k$-dimensional binary variable $z$ in which one particular element $z_j=1$ and all other elements are 0:
$$z_j\in\{0,1\},\qquad \sum_j z_j=1$$
The mixing weights $\pi_j$ then characterize the distribution of $z$, i.e. its prior is
$$p(z)=\prod_{j=1}^k\pi_j^{z_j}$$
Having observed $x_i$, the posterior probability of $z$ (the responsibility) is
$$\gamma(z_{ij})\equiv p(z_j=1|x_i)=\frac{p(z_j=1)p(x_i|z_j=1)}{p(x_i)}=\frac{\pi_j p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{l=1}^{k}\pi_l p(x_i|\mu_l,\Sigma_l)}$$
Once the posteriors are known, GMM partitions the samples into $k$ clusters $C=C_1,C_2,...,C_k$, assigning each sample the cluster $j=\arg\max\limits_j\gamma(z_{ij})$. The parameters are estimated by maximizing the log-likelihood
$$\ln p(X|\pi,\mu,\Sigma)=\ln\prod_{i=1}^{n}p(x_i)=\sum_{i=1}^{n}\ln\sum_{j=1}^{k}\pi_j p(x_i|\mu_j,\Sigma_j)$$
Maximizing with respect to $\mu_j$, set the derivative to zero:
$$\frac{\partial \ln p(X|\pi,\mu,\Sigma)}{\partial \mu_j}=\sum_{i=1}^{n}\frac{\pi_j p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{l=1}^{k}\pi_l p(x_i|\mu_l,\Sigma_l)}\Sigma^{-1}_j(x_i-\mu_j)=0$$
i.e.
$$\sum_{i=1}^{n}\gamma(z_{ij})\Sigma^{-1}_j(x_i-\mu_j)=0$$
which gives
$$\mu_j=\frac{\sum\limits_{i=1}^{n}\gamma(z_{ij})x_i}{\sum\limits_{i=1}^{n}\gamma(z_{ij})}$$
Defining the effective cluster size
$$n_j=\sum\limits_{i=1}^{n}\gamma(z_{ij})$$
this becomes
$$\mu_j=\frac{1}{n_j}\sum\limits_{i=1}^{n}\gamma(z_{ij})x_i$$
Similarly, for $\Sigma_j$:
$$\frac{\partial \ln p(X|\pi,\mu,\Sigma)}{\partial \Sigma_j}=\sum_{i=1}^{n}\frac{\pi_j p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{l=1}^{k}\pi_l p(x_i|\mu_l,\Sigma_l)}\left(\Sigma^{-1}_j-\Sigma^{-1}_j(x_i-\mu_j)(x_i-\mu_j)^T\Sigma^{-1}_j\right)=0$$
which gives
$$\Sigma_j=\frac{1}{n_j}\sum\limits_{i=1}^{n}\gamma(z_{ij})(x_i-\mu_j)(x_i-\mu_j)^T$$
Next, solve for $\pi_j$ subject to the constraint $\sum_{j=1}^{k}\pi_j=1$ by the method of Lagrange multipliers, maximizing
$$\ln p(X|\pi,\mu,\Sigma)+\lambda\left(\sum_{j=1}^{k}\pi_j-1\right)$$
Differentiating with respect to $\pi_j$:
$$\frac{\partial}{\partial \pi_j}\left[\ln p(X|\pi,\mu,\Sigma)+\lambda\left(\sum_{j=1}^{k}\pi_j-1\right)\right]=\sum_{i=1}^{n}\frac{p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{l=1}^{k}\pi_l p(x_i|\mu_l,\Sigma_l)}+\lambda=0$$
Multiplying both sides by $\pi_j$ and summing over $j\in\{1,2,...,k\}$ gives
$$\sum_{i=1}^{n}\frac{\sum\limits_{j=1}^{k}\pi_j p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{j=1}^{k}\pi_j p(x_i|\mu_j,\Sigma_j)}+\lambda\sum_{j=1}^{k}\pi_j=0$$
Substituting the constraint $\sum_{j=1}^{k}\pi_j=1$ yields
$$n=-\lambda$$
Substituting $\lambda=-n$ back into the derivative condition and multiplying by $\pi_j$ gives $n_j=\sum\limits_{i=1}^{n}\gamma(z_{ij})=n\pi_j$, i.e.
$$\pi_j=\frac{n_j}{n}$$
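
A small numeric sketch of these closed-form updates (hypothetical random values, only to illustrate): given any row-normalized responsibility matrix gamma, the updates reduce to a few lines of NumPy, and the resulting weights sum to 1 by construction:

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 2, 3
X = rng.normal(size=(n, d))
gamma = rng.random((n, k))
gamma /= gamma.sum(axis=1, keepdims=True)  # each row is a posterior over components

n_j = gamma.sum(axis=0)                    # effective cluster sizes
pi = n_j / n                               # pi_j = n_j / n
mu = (gamma.T @ X) / n_j[:, None]          # mu_j = (1/n_j) sum_i gamma_ij x_i
diff = X - mu[0]
sigma_0 = (diff.T @ (diff * gamma[:, [0]])) / n_j[0]  # Sigma_j update for j = 0
print(pi.sum())                            # 1.0: the weights stay normalized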

3.2 Implementation

The GMM algorithm proceeds as follows:

1. Randomly initialize the parameters $\pi_i,\mu_i,\Sigma_i,\ i\in\{1,2,\ldots,k\}$

2. E-step: compute the posterior probability that each sample was generated by each mixture component, using $\gamma(z_{ij})=\frac{\pi_j p(x_i|\mu_j,\Sigma_j)}{\sum\limits_{l=1}^{k}\pi_l p(x_i|\mu_l,\Sigma_l)}$

3. M-step: update the parameters $\pi_j,\mu_j,\Sigma_j,\ j\in\{1,2,\ldots,k\}$ with
$$\pi_j=\frac{n_j}{n},\qquad \mu_j=\frac{1}{n_j}\sum_{i=1}^n\gamma(z_{ij})x_i,\qquad \Sigma_j=\frac{1}{n_j}\sum\limits_{i=1}^n\gamma(z_{ij})(x_i-\mu_j)(x_i-\mu_j)^T$$

4. If the parameter values no longer change, assign each sample the label $j=\arg\max\limits_{j}\gamma(z_{ij})$; otherwise, return to step 2

The E-step core code is as follows:

def e_step(X, mu_list, sigma_list, pi_list):
    k = mu_list.shape[0]
    gamma_z = np.zeros((X.shape[0], k))
    for i in range(X.shape[0]):
        pi_pdf_sum = 0
        pi_pdf = np.zeros(k)
        for j in range(k):
            # weighted density of component j at sample i
            pi_pdf[j] = pi_list[j] * multivariate_normal.pdf(
                X[i], mean=mu_list[j], cov=sigma_list[j]
            )
            pi_pdf_sum += pi_pdf[j]
        for j in range(k):
            # normalize to obtain the responsibility gamma(z_ij)
            gamma_z[i, j] = pi_pdf[j] / pi_pdf_sum
    return gamma_z

The M-step core code is as follows:

def m_step(X, mu_list, gamma_z):
    k = mu_list.shape[0]
    n = X.shape[0]
    dim = X.shape[1]
    mu_list_new = np.zeros(mu_list.shape)
    sigma_list_new = np.zeros((k, dim, dim))
    pi_list_new = np.zeros(k)
    for j in range(k):
        gamma = gamma_z[:, j]
        n_j = np.sum(gamma)           # effective cluster size
        pi_list_new[j] = n_j / n      # pi_j = n_j / n
        gamma = gamma.reshape(n, 1)
        mu_list_new[j, :] = (gamma.T @ X) / n_j  # weighted mean
        # weighted covariance, centered on the freshly updated mean
        sigma_list_new[j] = (
            (X - mu_list_new[j]).T @ np.multiply((X - mu_list_new[j]), gamma)
        ) / n_j
    return mu_list_new, sigma_list_new, pi_list_new
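
A hedged one-iteration sketch tying the two steps together (initial parameter values are placeholders, and `log_likelihood` is defined in the complete code below):

# x: feature matrix (n x d); mu_list, sigma_list, pi_list: current parameters
gamma_z = e_step(x, mu_list, sigma_list, pi_list)           # E-step
mu_list, sigma_list, pi_list = m_step(x, mu_list, gamma_z)  # M-step
# By EM's monotonicity, log_likelihood(x, mu_list, sigma_list, pi_list)
# is at least as large as it was before this iteration.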

4. Complete Code

import numpy as np
import matplotlib.pyplot as plt
import itertools
import copy
from scipy.stats import multivariate_normal
import pandas as pd


def generate_data(k, n, dimension, mu_list, sigma_list):
    """
    使用高斯分布产生一组数据 共k个高斯分布 每个分布产生n个数据
    得到带有真实类别标签的X 以便于绘图
    """
    X = np.zeros((n * k, dimension + 1))
    for i in range(k):
        X[i * n : (i + 1) * n, :dimension] = np.random.multivariate_normal(
            mu_list[i], sigma_list[i], size=n
        )
        X[i * n : (i + 1) * n, dimension : dimension + 1] = i
    return X


def kmeans(X, k, epsilon=1e-5):
    center = np.zeros((k, X.shape[1] - 1))
    for i in range(k):
        # the last column is the label, so exclude it
        center[i, :] = X[np.random.randint(0, high=X.shape[0]), :-1]
    iterations = 0
    while True:
        iterations += 1
        distance = np.zeros(k)
        # reassign each point to its nearest center
        for i in range(X.shape[0]):
            for j in range(k):
                distance[j] = np.linalg.norm(X[i, :-1] - center[j, :])
            X[i, -1] = np.argmin(distance)
        # update the centers from the new assignments
        new_center = np.zeros((k, X.shape[1] - 1))
        count = np.zeros(k)
        # each center is the mean of the points assigned to it
        for i in range(X.shape[0]):
            new_center[int(X[i, -1]), :] += X[i, :-1]
            count[int(X[i, -1])] += 1
        for i in range(k):
            if count[i] > 0:  # keep the old center if a cluster went empty
                new_center[i, :] = new_center[i, :] / count[i]
            else:
                new_center[i, :] = center[i, :]
        if np.linalg.norm(new_center - center) < epsilon:
            break
        else:
            center = new_center
    return X, center, iterations


def e_step(X, mu_list, sigma_list, pi_list):
    k = mu_list.shape[0]
    gamma_z = np.zeros((X.shape[0], k))
    for i in range(X.shape[0]):
        pi_pdf_sum = 0
        pi_pdf = np.zeros(k)
        for j in range(k):
            pi_pdf[j] = pi_list[j] * multivariate_normal.pdf(
                X[i], mean=mu_list[j], cov=sigma_list[j]
            )
            pi_pdf_sum += pi_pdf[j]
        for j in range(k):
            gamma_z[i, j] = pi_pdf[j] / pi_pdf_sum
    return gamma_z


def m_step(X, mu_list, gamma_z):
    k = mu_list.shape[0]
    n = X.shape[0]
    dim = X.shape[1]
    mu_list_new = np.zeros(mu_list.shape)
    sigma_list_new = np.zeros((k, dim, dim))
    pi_list_new = np.zeros(k)
    for j in range(k):
        gamma = gamma_z[:, j]
        n_j = np.sum(gamma)
        pi_list_new[j] = n_j / n
        gamma = gamma.reshape(n, 1)
        mu_list_new[j, :] = (gamma.T @ X) / n_j
        # weighted covariance, centered on the freshly updated mean
        sigma_list_new[j] = (
            (X - mu_list_new[j]).T @ np.multiply((X - mu_list_new[j]), gamma)
        ) / n_j
    return mu_list_new, sigma_list_new, pi_list_new


def log_likelihood(x, mu_list, sigma_list, pi_list):
    log_l = 0
    for i in range(x.shape[0]):
        pi_times_pdf_sum = 0
        for j in range(mu_list.shape[0]):
            pi_times_pdf_sum += pi_list[j] * multivariate_normal.pdf(
                x[i], mean=mu_list[j], cov=sigma_list[j]
            )
        log_l += np.log(pi_times_pdf_sum)
    return log_l


def GMM(X, k, epsilon=1e-5):
    x = X[:, :-1]
    pi_list = np.ones(k) * (1.0 / k)
    sigma_list = np.array([0.1 * np.eye(x.shape[1])] * k)
    # Pick the first center at random, then repeatedly take the sample farthest
    # (in total distance) from the current centers as the next initial center
    mu_list = [x[np.random.randint(0, x.shape[0])]]
    for times in range(k - 1):
        temp_ans = []
        for i in range(x.shape[0]):
            temp_ans.append(
                np.sum([np.linalg.norm(x[i] - mu_list[j]) for j in range(len(mu_list))])
            )
        mu_list.append(x[np.argmax(temp_ans)])
    mu_list = np.array(mu_list)

    old_log_l = log_likelihood(x, mu_list, sigma_list, pi_list)
    iterations = 0
    log_l_list = pd.DataFrame(columns=("Iterations", "log likelihood"))
    while True:
        gamma_z = e_step(x, mu_list, sigma_list, pi_list)
        mu_list, sigma_list, pi_list = m_step(x, mu_list, gamma_z)
        new_log_l = log_likelihood(x, mu_list, sigma_list, pi_list)
        if iterations % 10 == 0:
            new_row = pd.DataFrame(
                {"Iterations": [iterations], "log likelihood": [old_log_l]}
            )
            log_l_list = pd.concat([log_l_list, new_row], ignore_index=True)
        if old_log_l < new_log_l and (new_log_l - old_log_l) < epsilon:
            break
        old_log_l = new_log_l
        iterations += 1
    # assign each sample the component with the largest responsibility
    for i in range(X.shape[0]):
        X[i, -1] = np.argmax(gamma_z[i, :])
    return X, iterations, log_l_list


def show(X, center=None, title=None):
    plt.style.use("seaborn-v0_8-dark")
    plt.scatter(X[:, 0], X[:, 1], c=X[:, 2], marker=".", s=40, cmap="YlGnBu")
    if center is not None:
        plt.scatter(center[:, 0], center[:, 1], c="r", marker="x", s=250)
    if title is not None:
        plt.title(title)
    plt.show()


def accuracy(label, predict, k):
    predicts = list(itertools.permutations(range(k), k))  # try every cluster-to-label mapping
    counts = np.zeros(len(predicts))
    for i in range(len(predicts)):
        for j in range(label.shape[0]):
            if int(label[j]) == predicts[i][int(predict[j])]:
                counts[i] += 1
    return np.max(counts) / label.shape[0]


def get_result_kmeans(X, k, label, epsilon):
    X, center, iterations = kmeans(X, k, epsilon=epsilon)
    print(center)
    acc = accuracy(label, X[:, -1], k)
    show(
        X,
        center,
        title="epsilon="
        + str(epsilon)
        + ", iterations="
        + str(iterations)
        + ", accuracy="
        + str(acc),
    )


def get_result_gmm(X, k, label, epsilon):
    X, iterations, log_l_list = GMM(X, k, epsilon=epsilon)
    print(log_l_list)
    acc = accuracy(label, X[:, -1], k)
    show(
        X,
        title="epsilon="
        + str(epsilon)
        + ", iterations="
        + str(iterations)
        + ", accuracy="
        + str(acc),
    )


def get_result(algorithm, X, k, label, epsilons=[1, 1e-5, 1e-10]):
    if algorithm == "kmeans":
        for e in epsilons:
            get_result_kmeans(X, k, label, epsilon=e)
    elif algorithm == "gmm":
        for e in epsilons:
            get_result_gmm(X, k, label, epsilon=e)


if __name__ == "__main__":
    k = 3
    n = 300
    dimension = 2
    mu_list = np.array([[2, 6], [8, 10], [8, 2]])
    sigma_list = np.array([[[1, 0], [0, 3]], [[3, 0], [0, 2]], [[3, 0], [0, 3]]])
    X = generate_data(k, n, dimension, mu_list, sigma_list)
    label = copy.deepcopy(X[:, -1])  # copy, since kmeans/GMM overwrite X's last column
    print(mu_list)
    show(X, mu_list, title="Real Data Distribution")
    get_result(algorithm="kmeans", X=X, k=k, label=label)
    # get_result(algorithm="gmm", X=X, k=k, label=label)
