# Deriving the PLSA Model with EM

## Review of the EM Algorithm

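- E-step: compute the posterior probability of the latent variables, given the observed data and the current parameter estimates.
- M-step: maximize the expectation of the complete-data log-likelihood, treating the posteriors from the E-step as known values, to obtain new parameter values.
- Iterate the two steps until convergence.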

## Introduction to the PLSA Model

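PLSA (probabilistic latent semantic analysis) is used in information retrieval, filtering, natural language processing, and related fields. It models both the word distribution of each topic and the topic distribution of each document, can be viewed as a probabilistic matrix factorization, and learns its parameters with the EM algorithm.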

The model generates each observed document-word pair $(d_i, w_j)$ in three steps:

1. Select a document $d_i$ with probability $p(d_i)$.
2. Select a topic $z_k$ with probability $p(z_k | d_i)$.
3. Generate a word $w_j$ with probability $p(w_j | z_k)$.
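The three generative steps can be sketched as a small sampler. This is only an illustration: the sizes `N`, `K`, `M`, the randomly drawn parameters, and the helper name `sample_pair` are assumptions made for the example, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: N documents, K topics, M vocabulary words.
N, K, M = 4, 2, 5

# Randomly chosen model parameters; every distribution sums to 1.
p_d = np.full(N, 1.0 / N)                   # p(d_i), uniform here by assumption
p_z_d = rng.dirichlet(np.ones(K), size=N)   # p(z_k | d_i), shape (N, K)
p_w_z = rng.dirichlet(np.ones(M), size=K)   # p(w_j | z_k), shape (K, M)

def sample_pair(rng):
    """Draw one (document, word) pair by following the three steps."""
    i = rng.choice(N, p=p_d)        # step 1: select a document
    k = rng.choice(K, p=p_z_d[i])   # step 2: select a topic for that document
    j = rng.choice(M, p=p_w_z[k])   # step 3: generate a word from that topic
    return i, j                     # the topic k stays hidden

# Accumulate the co-occurrence counts n(d_i, w_j) over many draws.
counts = np.zeros((N, M), dtype=int)
for _ in range(1000):
    i, j = sample_pair(rng)
    counts[i, j] += 1
```

Only the pair $(d_i, w_j)$ is returned; the topic index is discarded, which is what makes $z_k$ a latent variable that EM has to infer.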

E-step: compute the posterior probability of the latent variable.

$p(z_k | d_i, w_j) = \frac{p(w_j | z_k) \, p(z_k | d_i)}{\sum_{l=1}^{K} p(w_j | z_l) \, p(z_l | d_i)}$
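A minimal NumPy sketch of this posterior, assuming the parameters are stored as dense arrays `p_w_z` of shape `(K, M)` and `p_z_d` of shape `(N, K)`. The function name `e_step`, the array layout, and the small clamp in the denominator are choices made for this example.

```python
import numpy as np

def e_step(p_w_z, p_z_d):
    """Return p(z_k | d_i, w_j) as an array of shape (N, M, K).

    p_w_z: shape (K, M), row k holds p(w_j | z_k)
    p_z_d: shape (N, K), row i holds p(z_k | d_i)
    """
    # joint[i, j, k] = p(w_j | z_k) * p(z_k | d_i)
    joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
    # Normalize over the topic axis; the clamp only guards against 0/0.
    denom = joint.sum(axis=2, keepdims=True)
    return joint / np.maximum(denom, 1e-12)

# Toy parameters (assumed values): K = 2 topics, M = 2 words, N = 1 document.
p_w_z = np.array([[0.5, 0.5],
                  [0.9, 0.1]])
p_z_d = np.array([[0.5, 0.5]])
post = e_step(p_w_z, p_z_d)
```

For every document-word pair the posterior sums to 1 over the topics, so `post.sum(axis=2)` is an array of ones.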

M-step: maximize the expectation of the complete-data log-likelihood.

With $n(d_i, w_j)$ denoting the number of times word $w_j$ occurs in document $d_i$, the log-likelihood of the observed data is

$l = \sum_i \sum_j n(d_i, w_j) \log p(d_i, w_j)$

$= \sum_i \sum_j n(d_i, w_j) \log \left( p(w_j | d_i) \, p(d_i) \right)$

$= \sum_i \sum_j n(d_i, w_j) \log p(w_j | d_i) + \sum_i \sum_j n(d_i, w_j) \log p(d_i)$
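The log-likelihood can be evaluated numerically as a sanity check. The sketch below assumes the standard PLSA mixture $p(w_j | d_i) = \sum_k p(w_j | z_k) \, p(z_k | d_i)$; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def log_likelihood(counts, p_d, p_z_d, p_w_z):
    """Sum of n(d_i, w_j) * log p(d_i, w_j) over all observed pairs."""
    # p_w_d[i, j] = sum_k p(z_k | d_i) * p(w_j | z_k)
    p_w_d = p_z_d @ p_w_z
    # p_dw[i, j] = p(w_j | d_i) * p(d_i)
    p_dw = p_d[:, None] * p_w_d
    # Pairs that never occur contribute nothing, so skip them (avoids log 0).
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(p_dw[mask])))

# Toy data (assumed values): 2 documents, 2 topics, 2 words.
counts = np.array([[2, 0],
                   [1, 1]])
p_d = np.array([0.5, 0.5])
p_z_d = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
p_w_z = np.array([[1.0, 0.0],
                  [0.5, 0.5]])
ll = log_likelihood(counts, p_d, p_z_d, p_w_z)
```

Tracking this value across iterations is a common way to monitor convergence, since EM never decreases it.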

The term in $p(d_i)$ does not involve the parameters $p(w_j | z_k)$ and $p(z_k | d_i)$, so it can be dropped. Taking the expectation over the latent topics under the E-step posterior gives

$E(l) = \sum_i \sum_j n(d_i, w_j) \sum_k p(z_k | d_i, w_j) \log p(w_j, z_k | d_i)$

$= \sum_i \sum_j n(d_i, w_j) \sum_k p(z_k | d_i, w_j) \log \left( p(z_k | d_i) \, p(w_j | z_k) \right)$

The parameters must satisfy the normalization constraints

$\sum_{j=1}^{M} p(w_j | z_k) = 1$

$\sum_{k=1}^{K} p(z_k | d_i) = 1$

Adding Lagrange multipliers $\tau_k$ and $\rho_i$ for these constraints gives the Lagrangian

$Lag = \sum_i \sum_j n(d_i, w_j) \sum_k p(z_k | d_i, w_j) \log \left( p(z_k | d_i) \, p(w_j | z_k) \right) + \sum_{k=1}^{K} \tau_k \left( 1 - \sum_{j=1}^{M} p(w_j | z_k) \right) + \sum_{i=1}^{N} \rho_i \left( 1 - \sum_{k=1}^{K} p(z_k | d_i) \right)$

Setting the derivative with respect to $p(w_j | z_k)$ to zero:

$\frac{\partial Lag}{\partial p(w_j | z_k)} = \frac{\sum_i n(d_i, w_j) \, p(z_k | d_i, w_j)}{p(w_j | z_k)} - \tau_k = 0$

$\sum_i n(d_i, w_j) \, p(z_k | d_i, w_j) = \tau_k \, p(w_j | z_k)$

Summing both sides over all words $w_m$, $m = 1, \dots, M$, and using the normalization constraint:

$\sum_{m=1}^{M} \sum_i n(d_i, w_m) \, p(z_k | d_i, w_m) = \sum_{m=1}^{M} \tau_k \, p(w_m | z_k)$

$\sum_{m=1}^{M} \sum_i n(d_i, w_m) \, p(z_k | d_i, w_m) = \tau_k \sum_{m=1}^{M} p(w_m | z_k) = \tau_k$

Substituting $\tau_k$ back gives

$\sum_i n(d_i, w_j) \, p(z_k | d_i, w_j) = \left( \sum_{m=1}^{M} \sum_i n(d_i, w_m) \, p(z_k | d_i, w_m) \right) p(w_j | z_k)$

$p(w_j | z_k) = \frac{\sum_i n(d_i, w_j) \, p(z_k | d_i, w_j)}{\sum_{m=1}^{M} \sum_i n(d_i, w_m) \, p(z_k | d_i, w_m)}$
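The closed-form update for $p(w_j | z_k)$ could be written in NumPy roughly as follows. The function name `update_p_w_z`, the `(N, M, K)` layout of the posterior array, and the toy inputs are assumptions for this sketch.

```python
import numpy as np

def update_p_w_z(counts, post):
    """M-step update for p(w_j | z_k).

    counts: shape (N, M), counts[i, j] = n(d_i, w_j)
    post:   shape (N, M, K), post[i, j, k] = p(z_k | d_i, w_j)
    Returns an array of shape (K, M) whose rows sum to 1.
    """
    # numer[k, j] = sum_i n(d_i, w_j) * p(z_k | d_i, w_j)
    numer = np.einsum("ij,ijk->kj", counts, post)
    # tau[k] = sum over all words of numer[k, :], the Lagrange multiplier
    tau = numer.sum(axis=1, keepdims=True)
    return numer / np.maximum(tau, 1e-12)

# Toy inputs (assumed values): 2 documents, 2 words, 2 topics,
# with a uniform posterior over the topics.
counts = np.array([[2.0, 1.0],
                   [0.0, 3.0]])
post = np.full((2, 2, 2), 0.5)
p_w_z = update_p_w_z(counts, post)
```

With a uniform posterior both topics receive the same word distribution, proportional to the column sums of `counts`.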

Setting the derivative with respect to $p(z_k | d_i)$ to zero:

$\frac{\partial Lag}{\partial p(z_k | d_i)} = \frac{\sum_j n(d_i, w_j) \, p(z_k | d_i, w_j)}{p(z_k | d_i)} - \rho_i = 0$

$\sum_j n(d_i, w_j) \, p(z_k | d_i, w_j) = \rho_i \, p(z_k | d_i)$

Summing both sides over all topics $z_l$, $l = 1, \dots, K$, and using the normalization constraint:

$\sum_{l=1}^{K} \sum_j n(d_i, w_j) \, p(z_l | d_i, w_j) = \sum_{l=1}^{K} \rho_i \, p(z_l | d_i)$

$\sum_{l=1}^{K} \sum_j n(d_i, w_j) \, p(z_l | d_i, w_j) = \rho_i \sum_{l=1}^{K} p(z_l | d_i) = \rho_i$

Substituting $\rho_i$ back gives

$\sum_j n(d_i, w_j) \, p(z_k | d_i, w_j) = \left( \sum_{l=1}^{K} \sum_j n(d_i, w_j) \, p(z_l | d_i, w_j) \right) p(z_k | d_i)$

$p(z_k | d_i) = \frac{\sum_j n(d_i, w_j) \, p(z_k | d_i, w_j)}{\sum_{l=1}^{K} \sum_j n(d_i, w_j) \, p(z_l | d_i, w_j)}$
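A corresponding sketch for the $p(z_k | d_i)$ update, again assuming dense arrays with the posterior stored as shape `(N, M, K)`. The function name `update_p_z_d` and the toy inputs are hypothetical.

```python
import numpy as np

def update_p_z_d(counts, post):
    """M-step update for p(z_k | d_i).

    counts: shape (N, M), counts[i, j] = n(d_i, w_j)
    post:   shape (N, M, K), post[i, j, k] = p(z_k | d_i, w_j)
    Returns an array of shape (N, K) whose rows sum to 1.
    """
    # numer[i, k] = sum_j n(d_i, w_j) * p(z_k | d_i, w_j)
    numer = np.einsum("ij,ijk->ik", counts, post)
    # rho[i] = sum over all topics of numer[i, :], the Lagrange multiplier
    rho = numer.sum(axis=1, keepdims=True)
    return numer / np.maximum(rho, 1e-12)

# Toy inputs (assumed values): every pair assigns 0.25 to topic 0
# and 0.75 to topic 1.
counts = np.array([[2.0, 1.0],
                   [0.0, 3.0]])
post = np.empty((2, 2, 2))
post[:, :, 0] = 0.25
post[:, :, 1] = 0.75
p_z_d = update_p_z_d(counts, post)
```

Because the posterior sums to 1 over the topics, the multiplier $\rho_i$ equals $\sum_j n(d_i, w_j)$, the total word count of document $d_i$, so the denominator can also be computed directly from the counts.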

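Each M-step updates these two sets of parameters, $p(w_j | z_k)$ and $p(z_k | d_i)$; the E-step and M-step then alternate until convergence.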
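Putting the pieces together, one possible EM loop for PLSA is sketched below. It is an illustration under stated assumptions rather than a reference implementation: the function name `plsa_em`, the Dirichlet initialization, the stopping tolerance, and the dense `(N, M, K)` posterior array (which only suits small corpora) are all choices made for this example.

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, tol=1e-6, seed=0):
    """Fit p(z_k | d_i) and p(w_j | z_k) to a count matrix of shape (N, M)."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    # Random initialization (an assumption; other schemes are possible).
    p_z_d = rng.dirichlet(np.ones(K), size=N)   # shape (N, K)
    p_w_z = rng.dirichlet(np.ones(M), size=K)   # shape (K, M)
    history = []
    for _ in range(n_iter):
        # E-step: post[i, j, k] = p(z_k | d_i, w_j)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)

        # M-step: weighted[i, j, k] = n(d_i, w_j) * p(z_k | d_i, w_j)
        weighted = counts[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T           # sum over documents, (K, M)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = weighted.sum(axis=1)             # sum over words, (N, K)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)

        # Log-likelihood of the words given their documents; the p(d_i)
        # term is omitted because these parameters do not affect it.
        p_w_d = p_z_d @ p_w_z
        ll = float(np.sum(counts * np.log(np.maximum(p_w_d, 1e-12))))
        converged = bool(history) and abs(ll - history[-1]) < tol
        history.append(ll)
        if converged:
            break
    return p_z_d, p_w_z, history

# Toy corpus (assumed values): two groups of documents using different words.
counts = np.array([[5, 4, 0, 0],
                   [4, 6, 0, 0],
                   [0, 0, 5, 5],
                   [0, 1, 4, 6]])
p_z_d, p_w_z, history = plsa_em(counts, K=2)
```

The recorded log-likelihood values should be non-decreasing from one iteration to the next, which is a convenient check that the updates are implemented correctly. EM only finds a local optimum, so different seeds can give different solutions.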