GMM Parameter Estimation: A Hands-On Python Example

This post tunes a Gaussian Mixture Model (GMM) by comparing the error rate and the Bayesian Information Criterion (BIC) obtained with different covariance types, and shows how to pick the covariance_type that gives the best-performing model.


```python
#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt

mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False


def expand(a, b, rate=0.05):
    # Widen the interval [a, b] by `rate` on each side (used for plot margins).
    d = (b - a) * rate
    return a-d, b+d


def accuracy_rate(y1, y2):
    # GMM cluster labels are arbitrary: with two components, flip the labels
    # whenever that gives the higher agreement with the ground truth.
    acc = np.mean(y1 == y2)
    return acc if acc > 0.5 else 1-acc


if __name__ == '__main__':
    np.random.seed(0)
    # Class 0: 500 points drawn from N((1, 2), diag(1, 2)) and then linearly
    # transformed by m, which makes the two features correlated (full covariance).
    cov1 = np.diag((1, 2))
    N1 = 500
    N2 = 300
    N = N1 + N2
    x1 = np.random.multivariate_normal(mean=(1, 2), cov=cov1, size=N1)
    m = np.array(((1, 1), (1, 3)))
    x1 = x1.dot(m)
    # Class 1: 300 points from an axis-aligned Gaussian centred at (-1, 10).
    x2 = np.random.multivariate_normal(mean=(-1, 10), cov=cov1, size=N2)
    x = np.vstack((x1, x2))
    y = np.array([0]*N1 + [1]*N2)

    types = ('spherical', 'diag', 'tied', 'full')
    err = np.empty(len(types))
    bic = np.empty(len(types))
    # Fit a 2-component GMM for each covariance type; record its error rate and BIC.
    for i, cov_type in enumerate(types):
        gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0)
        gmm.fit(x)
        err[i] = 1 - accuracy_rate(gmm.predict(x), y)
        bic[i] = gmm.bic(x)
    print('error rate:', err.ravel())
    print('BIC:', bic.ravel())
    xpos = np.arange(4)
    ax = plt.axes()
    b1 = ax.bar(xpos-0.3, err, width=0.3, color='#77E0A0')
    ax2 = ax.twinx()                      # second y-axis for the BIC values
    b2 = ax2.bar(xpos, bic, width=0.3, color='#FF8080')
    ax.grid(True)
    bic_min, bic_max = expand(bic.min(), bic.max())
    ax2.set_ylim((bic_min, bic_max))
    ax.set_xticks(xpos)
    ax.set_xticklabels(types)
    ax.legend([b1[0], b2[0]], ('error rate', 'BIC'))
    ax.set_title('Error rate and BIC for different covariance types', fontsize=18)
    plt.show()

    # Refit using the covariance type with the lowest BIC.
    optimal = bic.argmin()
    gmm = GaussianMixture(n_components=2, covariance_type=types[optimal], random_state=0)
    gmm.fit(x)
    print('means =\n', gmm.means_)
    print('covariances =\n', gmm.covariances_)
    y_hat = gmm.predict(x)

    cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g'])
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)
    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    # Align the arbitrary GMM component labels with the true classes before plotting:
    # make component 0 the one with the smaller mean on the first feature.
    if gmm.means_[0][0] > gmm.means_[1][0]:
        z = grid_hat == 0
        grid_hat[z] = 1
        grid_hat[~z] = 0
    plt.figure(figsize=(9, 7), facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=30, c=y, marker='o', cmap=cm_dark, edgecolors='k')

    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.title('GMM tuning: covariance_type=%s' % types[optimal], fontsize=20)
    plt.grid()
    plt.show()
```
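
For reference, `gmm.bic(x)` evaluates the standard BIC formula, \(\mathrm{BIC} = k \ln N - 2 \ln \hat{L}\), where \(k\) is the number of free parameters and \(\hat{L}\) the maximized likelihood. Below is a minimal sketch of the same computation; the parameter count assumes `covariance_type='full'`, and the snippet is not part of the original script.

```python
# Hypothetical cross-check of gmm.bic(x), assuming the refit model uses covariance_type='full'.
n_samples, n_features = x.shape
k = 2  # number of mixture components
# Free parameters: (k - 1) mixing weights + k means + k full covariance matrices.
n_params = (k - 1) + k * n_features + k * n_features * (n_features + 1) // 2
log_likelihood = gmm.score(x) * n_samples   # score() returns the mean per-sample log-likelihood
bic_manual = n_params * np.log(n_samples) - 2 * log_likelihood
print('manual BIC:', bic_manual, '  sklearn BIC:', gmm.bic(x))
```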

 

 

### Estimating GMM Parameters with the EM Algorithm

#### Parameter initialization

To apply the EM algorithm to a Gaussian mixture model (GMM), the parameters must first be initialized: initial mean vectors, covariance matrices, and mixture weights are chosen, typically at random. These starting values influence the final result, so it is advisable to run the algorithm several times from different initializations and keep the best solution [^2].

#### E-step: computing posterior probabilities

In the E-step, the current parameter values are used to compute the probability that each observation belongs to each latent Gaussian component. Concretely, Bayes' theorem is applied to evaluate how likely it is that a given data point was generated by a particular Gaussian. This can be written as:

\[
\gamma(z_{nk}) = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}
\]

where \(\gamma(z_{nk})\) is the posterior probability (responsibility) that sample \(n\) belongs to component \(k\), \(\pi_k\) is the prior probability or mixing proportion, and \(N(\cdot)\) denotes the multivariate normal density [^3].

#### M-step: updating the model parameters

The M-step re-estimates the parameters on the basis of the responsibilities obtained in the E-step: each component's mean \(\mu\), its covariance \(\Sigma\) (or standard deviation \(\sigma\) in the univariate case), and its mixing weight \(\pi\). This is achieved by maximizing the responsibility-weighted log-likelihood:

\[
\hat{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n \\
\hat{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T \\
\hat{\pi}_k = \frac{N_k}{N}
\]

where \(N_k = \sum_{n=1}^{N} \gamma(z_{nk})\) is the effective number of samples assigned to component \(k\).

#### Iterating until convergence

The two steps above make up one iteration. Training alternates between them until a preset stopping criterion is met, for example when the change in the log-likelihood between two consecutive iterations falls below a small threshold.

The following Python snippet shows a simple implementation of this process:

```python
import numpy as np
from scipy.stats import multivariate_normal


def em_algorithm(X, n_components, max_iter=100, tol=1e-8):
    """Expectation-Maximization for a Gaussian Mixture Model with full covariances."""
    n_samples, n_features = X.shape
    rng = np.random.default_rng(0)

    # Initialize parameters: random data points as means, the sample covariance
    # for every component, and uniform mixing weights.
    means = X[rng.choice(n_samples, size=n_components, replace=False)]
    covariances = np.array([np.cov(X.T)] * n_components)
    weights = np.ones(n_components) / n_components

    prev_log_likelihood = -np.inf
    for _ in range(max_iter):
        # E-step: weighted component densities and responsibilities gamma(z_nk).
        densities = np.column_stack([
            weights[k] * multivariate_normal(means[k], covariances[k]).pdf(X)
            for k in range(n_components)
        ])                                           # shape (n_samples, n_components)
        totals = densities.sum(axis=1, keepdims=True)
        responsibilities = densities / totals

        # Stop when the log-likelihood no longer improves.
        log_likelihood = np.log(totals).sum()
        if abs(log_likelihood - prev_log_likelihood) < tol:
            break
        prev_log_likelihood = log_likelihood

        # M-step: update weights, means and covariances from the responsibilities.
        nk = responsibilities.sum(axis=0) + 1e-30    # effective counts; avoid division by zero
        weights = nk / n_samples
        means = responsibilities.T.dot(X) / nk[:, None]
        for k in range(n_components):
            diff = X - means[k]
            covariances[k] = (responsibilities[:, k, None] * diff).T.dot(diff) / nk[k]

    return {'means': means, 'covariances': covariances, 'weights': weights}

# Example usage with synthetic data generation omitted here...
```
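
The usage example omitted above might look like the following sketch. It reuses the synthetic data `x` generated in the script at the top of this post and compares the hand-rolled estimates with sklearn's `GaussianMixture`; the cross-check is illustrative and not part of the original article.

```python
# Illustrative usage, assuming `x` is the (800, 2) synthetic data set built earlier
# and that GaussianMixture has already been imported from sklearn.mixture.
result = em_algorithm(x, n_components=2)
print('EM means:\n', result['means'])
print('EM weights:', result['weights'])

# Cross-check against sklearn's full-covariance implementation.
ref = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(x)
print('sklearn means:\n', ref.means_)
print('sklearn weights:', ref.weights_)
```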