Implementing a Gaussian Mixture Model (EM Algorithm)

Input: observed data $X_{n\times d}$, number of components $c$, stopping tolerance $tol$
Output: model parameters ($pi_{c\times 1}$, $mu_{c\times d}$, $sigma_{c\times d\times d}$)
"""

$c$: number of mixture components

$n$: total number of samples

$d$: data dimension

$pi_{c\times 1}$: weight of each component

$mu_{c\times d}$: mean matrix; $mu[k,:]$ is the mean of the $k$-th Gaussian

$sigma_{c\times d\times d}$: covariance tensor; $sigma[k,:,:]$ is the covariance matrix of the $k$-th Gaussian

$Q_{n\times c}$: latent variables; $Q[i,k]$ is the responsibility of the $k$-th Gaussian for the $i$-th sample

"""
Process:

  1. Initialize the parameters ($pi_{c\times 1}$, $mu_{c\times d}$, $sigma_{c\times d\times d}$) and $Q_{n\times c}$

  2. Compute the likelihood lower bound under the round-$t$ model:

    $loss^t=\sum_{i=1}^n\sum_{k=1}^c \Big( Q[i,k]\,\log\big(pi[k]\,\phi(X_i \mid mu[k,:], sigma[k,:,:])\big) - Q[i,k]\,\log Q[i,k] \Big)$

    If $\vert loss^t - loss^{t-1}\vert < tol$, stop and output the current model parameters; otherwise go to (3).

  3. E-step:
    for i = 1:n
      for k = 1:c
        compute the responsibility of the $k$-th Gaussian for the $i$-th sample:
        $Q[i,k]\leftarrow \frac{pi[k]\,\phi(X_i \mid mu[k,:], sigma[k,:,:])}{\sum_{j=1}^c pi[j]\,\phi(X_i \mid mu[j,:], sigma[j,:,:])}$

  4. M-step:
    Update the weights: for k = 1:c
      $pi[k]\leftarrow \frac{\sum_{i=1}^n Q[i,k]}{n}$
    Update the means: for k = 1:c
      $mu[k,:]\leftarrow \frac{\sum_{i=1}^n Q[i,k]\,X_i}{\sum_{i=1}^n Q[i,k]}$
    Update the covariances: for k = 1:c
      $sigma[k,:,:]\leftarrow \frac{\sum_{i=1}^n Q[i,k]\,(X_i-mu[k,:])(X_i-mu[k,:])^T}{\sum_{i=1}^n Q[i,k]}$
    Go to (2).
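The pseudocode above can also be sketched in vectorized NumPy. This is a minimal self-contained illustration, not the reference implementation that follows; the names `gaussian_pdf` and `em_step` are ours:

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Multivariate Gaussian density evaluated at each row of X."""
    d = mu.shape[0]
    diff = X - mu                                    # (n, d)
    inv = np.linalg.inv(sigma)
    expo = -0.5 * np.einsum("nd,de,ne->n", diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(expo) / norm

def em_step(X, pi, mu, sigma):
    """One E-step followed by one M-step, vectorized over samples."""
    n, d = X.shape
    c = pi.shape[0]
    # E-step: Q[i, k] proportional to pi[k] * phi(X_i | mu[k], sigma[k])
    dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], sigma[k]) for k in range(c)], axis=1)
    Q = dens / dens.sum(axis=1, keepdims=True)       # (n, c), each row sums to 1
    # M-step: closed-form updates from the responsibilities
    Nk = Q.sum(axis=0)                               # effective sample count per component
    pi_new = Q.sum(axis=0) / n
    mu_new = (Q.T @ X) / Nk[:, None]
    sigma_new = np.empty((c, d, d))
    for k in range(c):
        diff = X - mu_new[k]
        sigma_new[k] = (Q[:, k, None] * diff).T @ diff / Nk[k]
    return Q, pi_new, mu_new, sigma_new

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
pi = np.ones(2) / 2
mu = X[rng.choice(len(X), 2, replace=False)]
sigma = np.array([np.eye(2)] * 2)
for _ in range(20):
    Q, pi, mu, sigma = em_step(X, pi, mu, sigma)
```

The per-sample loops in steps (3) and (4) collapse into matrix products here; the reference code below keeps the explicit loops, which map more directly onto the formulas.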


The code is as follows:
Reference: https://numpy-ml.readthedocs.io/en/latest/numpy_ml.gmm.html#

import numpy as np
from numpy.testing import assert_allclose

class GMM(object):
    def __init__(self, C=3, seed=None):
        self.C = C        # number of mixture components
        self.N = None     # total number of samples
        self.d = None     # dimension of each sample
        self.seed = seed
        if self.seed:
            np.random.seed(self.seed)
            
    def _initialize_params(self):
        C, d = self.C, self.d
        # initialize the weight of each Gaussian component
        rr = np.random.rand(C)
        self.pi = rr / rr.sum()
        # initialize the latent variables: probability of each sample belonging to each component
        self.Q = np.zeros((self.N, C))
        # initialize the mean vectors and covariance matrices
        self.mu = np.random.uniform(-5, 10, C * d).reshape(C, d)
        self.sigma = np.array([np.identity(d) for _ in range(C)])

        self.best_pi = None
        self.best_mu = None
        self.best_sigma = None
        self.best_elbo = -np.inf
        
    def likelihood_lower_bound(self):
        N = self.N
        C = self.C

        eps = np.finfo(float).eps
        expec1, expec2 = 0.0, 0.0
        # compute the likelihood lower bound under the current parameters
        for i in range(N):
            x_i = self.X[i]

            for k in range(C):
                pi_k = self.pi[k]              # prior probability of component k
                z_nk = self.Q[i, k]            # responsibility of component k for sample i

                mu_k = self.mu[k, :]           # mean of component k
                sigma_k = self.sigma[k, :, :]  # covariance of component k

                log_pi_k = np.log(pi_k + eps)  # log prior probability
                log_p_x_i = log_gaussian_pdf(x_i, mu_k, sigma_k)  # log density of sample i under component k

                prob = z_nk * (log_p_x_i + log_pi_k)  # weighted log-likelihood term

                expec1 += prob
                expec2 += z_nk * np.log(z_nk + eps)

        loss = expec1 - expec2
        return loss
        
    def fit(self, X, max_iter=100, tol=1e-3, verbose=False):

        self.X = X
        self.N = X.shape[0]
        self.d = X.shape[1]
        self._initialize_params()
        prev_vlb = -np.inf
        for _iter in range(max_iter):
            try:
                self._E_step()
                self._M_step()
                vlb = self.likelihood_lower_bound()

                if verbose:
                    print("{}. Lower bound: {}".format(_iter + 1, vlb))
                converged = _iter > 0 and np.abs(vlb - prev_vlb) <= tol
                if np.isnan(vlb) or converged:
                    break
                prev_vlb = vlb
                if vlb > self.best_elbo:
                    self.best_elbo = vlb
                    self.best_mu = self.mu
                    self.best_pi = self.pi
                    self.best_sigma = self.sigma
            except np.linalg.LinAlgError:
                print("Singular matrix: components collapsed")
                return -1
        return 0

    def _E_step(self):

        for i in range(self.N):

            x_i = self.X[i, :]
            denom_vals = []
            for k in range(self.C):
                pi_k = self.pi[k]
                mu_k = self.mu[k, :]
                sigma_k = self.sigma[k, :, :]

                log_pi_k = np.log(pi_k)
                log_p_xi = log_gaussian_pdf(x_i, mu_k, sigma_k)

                denom_vals.append(log_p_xi + log_pi_k)

            log_denom = logsumexp(denom_vals)  # log of the normalizing constant
            q_i = np.exp([num - log_denom for num in denom_vals])
            assert_allclose(np.sum(q_i), 1, err_msg="{}".format(np.sum(q_i)))
            # update Q
            self.Q[i, :] = q_i

    def _M_step(self):

        C, N, X = self.C, self.N, self.X
        denoms = np.sum(self.Q, axis=0)

        # update the prior (weight) of each component
        self.pi = denoms / N

        nums_mu = [np.dot(self.Q[:, c], X) for c in range(C)]
        # update the means
        for ix, (num, den) in enumerate(zip(nums_mu, denoms)):
            self.mu[ix, :] = num / den if den > 0 else np.zeros_like(num)

        # update the covariance matrices
        for k in range(C):
            mu_k = self.mu[k, :]
            n_k = denoms[k]

            outer = np.zeros((self.d, self.d))
            for i in range(N):
                qik = self.Q[i, k]
                xi = self.X[i, :]
                outer += qik * np.outer(xi - mu_k, xi - mu_k)

            outer = outer / n_k if n_k > 0 else outer
            self.sigma[k, :, :] = outer

        assert_allclose(np.sum(self.pi), 1, err_msg="{}".format(np.sum(self.pi)))

'''Two helper functions'''
def log_gaussian_pdf(x_i, mu, sigma):
    # log density: -0.5 * (n*log(2*pi) + log|sigma| + (x - mu)^T sigma^{-1} (x - mu))
    n = len(mu)
    a = n * np.log(2 * np.pi)  # natural log, not log2
    _, b = np.linalg.slogdet(sigma)
    y = np.linalg.solve(sigma, x_i - mu)
    c = np.dot(x_i - mu, y)
    return -0.5 * (a + b + c)

def logsumexp(log_probs, axis=None):
    # numerically stable log(sum(exp(log_probs)))
    _max = np.max(log_probs)
    ds = log_probs - _max
    exp_sum = np.exp(ds).sum(axis=axis)
    return _max + np.log(exp_sum)
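The two helpers can be sanity-checked in isolation. The sketch below re-declares them as a standalone snippet (using the natural log, which the $-\frac{1}{2}(n\log 2\pi + \log\vert\Sigma\vert + \cdots)$ form requires) and exercises the overflow case that motivates `logsumexp`:

```python
import numpy as np

def log_gaussian_pdf(x_i, mu, sigma):
    """Log density of a multivariate Gaussian (natural log throughout)."""
    n = len(mu)
    a = n * np.log(2 * np.pi)
    _, b = np.linalg.slogdet(sigma)
    y = np.linalg.solve(sigma, x_i - mu)
    c = np.dot(x_i - mu, y)
    return -0.5 * (a + b + c)

def logsumexp(log_probs, axis=None):
    """Numerically stable log(sum(exp(log_probs)))."""
    _max = np.max(log_probs)
    ds = log_probs - _max
    return _max + np.log(np.exp(ds).sum(axis=axis))

# At x = mu for a standard 2-D Gaussian the density is 1/(2*pi),
# so the log density should equal -log(2*pi).
val = log_gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))

# logsumexp stays finite where the naive form overflows to inf
naive = np.log(np.sum(np.exp(np.array([1000.0, 1000.0]))))  # overflows
stable = logsumexp(np.array([1000.0, 1000.0]))              # 1000 + log(2)
```

The max-shift trick in `logsumexp` is what lets the E-step normalize responsibilities in log space without underflow or overflow.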