General EM
EM
- General EM
- Objective: maximize the log-likelihood $logP(X|\theta)=log\Sigma_zP(x,z|\theta)$
- Used for: the log-likelihood of incomplete data
- The data do not reveal Z; only the posterior distribution $P(z|x,\theta^{old})$ is known
- So consider the expectation $Q(\theta,\theta^{old})=E_{p(z|x,\theta^{old})}(log P(x,z|\theta))$
- Maximize this expectation: $\theta^{new}=argmax_\theta Q(\theta,\theta^{old})$
- E-step: compute $P(z|x,\theta^{old})$
- M-step: $\theta^{new}=argmax_\theta Q(\theta,\theta^{old})$
- Why does this heuristic procedure still optimize the likelihood?
- $Q(\theta,\theta^{old})=E_{p(z|x,\theta^{old})}(log P(x,z|\theta))$ is the expected complete-data log-likelihood; maximizing it over $\theta$ can never decrease $log p(x|\theta)$, which the convergence argument below makes precise
- Comparing complete and incomplete data
- Incomplete data: $logp(x)=\Sigma_ilog \Sigma_zp(x_i|z)p(z)=\Sigma_ilog \Sigma_{k=1}^K\pi_kN(x_i|\mu_k,\Sigma_k)$
- With incomplete data, the parameters are coupled inside the log, so there is no closed-form solution
- Complete data
- $logp(x,z|\theta)=logp(z|\theta)p(x|z,\theta)=\Sigma_i\Sigma_k z_{ik}(log\pi_k+logN(x_i|\mu_k,\Sigma_k))$
- $E_z(logp(x,z|\theta))=\Sigma_i\Sigma_kE(z_{ik})(log\pi_k+logN(x_i|\mu_k,\Sigma_k))=\Sigma_i\Sigma_k\gamma(z_{ik})(log\pi_k+logN(x_i|\mu_k,\Sigma_k))$
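The expected complete-data log-likelihood above can be evaluated directly once the responsibilities are known. Below is a minimal NumPy sketch for a 1-D mixture of two Gaussians; all parameter values and data are made up for illustration:

```python
import numpy as np

def log_gauss(x, mu, var):
    # log N(x | mu, var) for scalar components
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# toy 1-D data and current parameters theta_old (illustrative values)
x = np.array([-1.0, -0.5, 0.2, 1.1, 1.3])
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 1.0])
var = np.array([0.5, 0.5])

# log(pi_k) + log N(x_i | mu_k, var_k), shape (N, K)
log_joint = np.log(pi) + log_gauss(x[:, None], mu, var)

# E-step: gamma(z_ik) = E(z_ik) under p(z | x, theta_old)
gamma = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
gamma /= gamma.sum(axis=1, keepdims=True)

# Q(theta, theta_old) at theta = theta_old:
# sum_i sum_k gamma_ik (log pi_k + log N(x_i | mu_k, var_k))
Q = np.sum(gamma * log_joint)
print(Q)
```

Note that Q is always at most the incomplete-data log-likelihood, since it differs from it by the (non-negative) entropy of the posterior.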
| | E-step | M-step | Objective | Differentiate w.r.t. |
|---|---|---|---|---|
| General (z) | $P(z\|x,\theta^{old})$ | $\theta^{new}=argmax_\theta Q(\theta,\theta^{old})$ | $Q(\theta,\theta^{old})=E_{p(z\|x,\theta^{old})}(log P(x,z\|\theta))$ | $\theta$ |
| GMM (z) | $\gamma(z_{ik})=p(z_{ik}=1\|x_i)=\frac{\pi_kN(x_i\|\mu_k,\Sigma_k)}{\Sigma_{k=1}^K\pi_kN(x_i\|\mu_k,\Sigma_k)}$ | $\mu_k=\frac{\Sigma_i\gamma(z_{ik})x_i}{\Sigma_i\gamma(z_{ik})}$, $\pi_k=\frac{\Sigma_i\gamma(z_{ik})}{N}$, $\Sigma_k=\frac{\Sigma_i\gamma(z_{ik})(x_i-\mu_k)(x_i-\mu_k)^T}{\Sigma_i\gamma(z_{ik})}$ | $p(x;\theta)=\Pi_i^N\Sigma_{k=1}^K\pi_kN(x_i\|\mu_k,\Sigma_k)$, where $\Sigma_k\pi_k=1$, $0\leq\pi_k\leq1$ | $log p(x\|\theta)$ |
| HMM (y) | $\xi(y_t,y_{t+1})=P(y_t,y_{t+1}\|x)=\frac{\alpha(y_t)P(x_{t+1}\|y_{t+1})\beta(y_{t+1})a_{y_{t+1},y_t}}{p(x)}$, $\gamma_t^i=P(y_t^i=1\|x,\theta^{old})$, $E(n_{ij}\|x,\theta^{old})=\Sigma_{t=1}^T\gamma_t^ix_t^j$, $E(m_{ij}\|x,\theta^{old})=\Sigma_{t=1}^{T-1}\xi_{t,t+1}^{ij}$ | $\hat{a}_{ij}=\frac{m_{ij}}{\Sigma_{k=1}^N m_{ik}}$, $\hat{\eta}_{ij}=\frac{n_{ij}}{\Sigma_{k=1}^L n_{ik}}$, $\hat{\pi}_i=\gamma_1^i$ | $Q(\theta,\theta^{old})=E_{p(y\|x,\theta^{old})}(log P(x,y\|\theta))$, s.t. $\Sigma_{i=1}^N\pi_i=1$ | $L=Q(\theta,\theta^{old})+\lambda(\Sigma_{i=1}^N\pi_i-1)$ |
EM convergence guarantee
- Goal: maximize $P(x|\theta)=\Sigma_zp(x,z|\theta)$
- Directly optimizing $P(x|\theta)$ is hard, but optimizing the complete-data $p(x,z|\theta)$ is easy
- Proof
- Decomposition
- For any distribution q(z), the following decomposition holds
- $lnp(x|\theta)=L(q,\theta)+KL(q||p)$, where $L(q,\theta)=\Sigma_zq(z)ln(\frac{p(x,z|\theta)}{q(z)})$ and $KL(q||p)=-\Sigma_zq(z)ln(\frac{p(z|x,\theta)}{q(z)})$; since $KL(q||p)\geq0$, $L(q,\theta)$ is a lower bound on $lnp(x|\theta)$
- E-step: maximize $L(q,\theta)$ over $q$; the maximum is $q(z)=P(z|x,\theta^{old})$, where the KL term vanishes
- M-step: the lower bound becomes $L(q,\theta)=\Sigma_zP(z|x,\theta^{old})ln(\frac{p(x,z|\theta)}{q(z)})=Q(\theta,\theta^{old})+const$, which is exactly the expectation from before; maximize it over $\theta$
- So each iteration raises the lower bound
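The decomposition can be verified numerically for a toy model with a discrete latent variable. A small sketch, with made-up probabilities for a single observation x:

```python
import numpy as np

# one observation x of a discrete latent-variable model with z in {0, 1, 2};
# p_xz[z] = p(x, z | theta), illustrative numbers
p_xz = np.array([0.10, 0.25, 0.05])
p_x = p_xz.sum()               # p(x | theta) = sum_z p(x, z | theta)
post = p_xz / p_x              # posterior p(z | x, theta)

q = np.array([0.5, 0.3, 0.2])  # an arbitrary distribution q(z)

L_q = np.sum(q * np.log(p_xz / q))   # L(q, theta) = sum_z q(z) ln(p(x,z|theta)/q(z))
KL = -np.sum(q * np.log(post / q))   # KL(q || p(z|x,theta)) >= 0

# the identity ln p(x|theta) = L(q, theta) + KL(q||p) holds for any q
print(np.log(p_x), L_q + KL)
```

Setting q to the posterior makes the KL term vanish, so the bound becomes tight, which is exactly what the E-step exploits.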
https://www.bilibili.com/video/av31906558?from=search&seid=2112421761429235163
GMM (clustering)
5.2 GMM: Gaussian mixture models and EM
- Probabilistic interpretation: assume K clusters, each following a Gaussian distribution; pick a cluster k with probability $\pi_k$, then sample a point from that cluster's distribution; repeating this generates the observed data
- Likelihood of the N sample points x
- $p(x;\theta)=\Pi_i^N\Sigma_{k=1}^K\pi_kN(x_i|\mu_k,\Sigma_k)$, where $\Sigma_k\pi_k=1$, $0\leq\pi_k\leq1$
- Introduce a latent variable indicating cluster membership, a K-dimensional one-hot vector
- $p(z_k=1)=\pi_k$
- $p(x_i|z)=\Pi_k^KN(x_i|\mu_k,\Sigma_k)^{z_k}$
- $p(x_i|z_k=1)=N(x_i|\mu_k,\Sigma_k)$
- $p(x_i)=\Sigma_zp(x_i|z)p(z)=\Sigma_{k=1}^K\pi_kN(x_i|\mu_k,\Sigma_k)$
- Responsibility (can be read as how much component k explains $x_i$):
- $\gamma(z_{ik})=p(z_{ik}=1|x_i)=\frac{p(z_{ik}=1)p(x_i|z_k=1)}{\Sigma_{k=1}^Kp(z_{ik}=1)p(x_i|z_k=1)}=\frac{\pi_kN(x_i|\mu_k,\Sigma_k)}{\Sigma_{k=1}^K\pi_kN(x_i|\mu_k,\Sigma_k)}$
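The responsibility formula vectorizes directly. A small sketch for a 1-D two-component mixture, with illustrative parameters:

```python
import numpy as np

def responsibilities(x, pi, mu, var):
    """gamma[i, k] = pi_k N(x_i | mu_k, var_k) / sum_j pi_j N(x_i | mu_j, var_j)."""
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    num = pi * dens                              # numerator: pi_k N(x_i | mu_k, var_k)
    return num / num.sum(axis=1, keepdims=True)  # normalize over components k

x = np.array([-2.0, 0.1, 2.5])
g = responsibilities(x, pi=np.array([0.5, 0.5]),
                     mu=np.array([-2.0, 2.0]), var=np.array([1.0, 1.0]))
print(g.round(3))
```

Each row of `g` sums to 1; a point sitting on one component's mean is assigned almost entirely to that component.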
Parameter learning: maximum likelihood via EM
- Maximum likelihood estimation
- Hard: the sum sits inside the log, so all parameters are coupled
- Condition satisfied at the maximum of the likelihood: differentiate $log(P(x|\theta))=log(\Sigma_zP(x,z|\theta))$ with respect to $\mu_k$
- $0=-\Sigma_{i=1}^N\frac{\pi_kN(x_i|\mu_k,\Sigma_k)}{\Sigma_{k=1}^K\pi_kN(x_i|\mu_k,\Sigma_k)}\Sigma_k^{-1}(x_i-\mu_k)$
- $\mu_k=\frac{\Sigma_i\gamma(z_{ik})x_i}{\Sigma_i\gamma(z_{ik})}$
- $\pi_k=\frac{\Sigma_i\gamma(z_{ik})}{N}$
- $\Sigma_k=\frac{\Sigma_i\gamma(z_{ik})(x_i-\mu_k)(x_i-\mu_k)^T}{\Sigma_i\gamma(z_{ik})}$
- These are not closed-form solutions, since $\gamma(z_{ik})$ itself depends on the parameters → EM
- E-step: given the current parameter estimates, compute the posterior $\gamma(z_{ik})=E(z_{ik})$
- M-step: given the posterior $\gamma(z_{ik})$, re-estimate the parameters $\mu_k,\pi_k,\Sigma_k$
- Iterating converges to a local optimum of the likelihood
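The E/M iteration above can be sketched as a complete EM loop for a 1-D two-component GMM; the synthetic data and initialization below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data from two Gaussians centered at -2 and 2
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])
N, K = len(x), 2
pi, mu, var = np.full(K, 1.0 / K), np.array([-1.0, 1.0]), np.ones(K)

def loglik(x, pi, mu, var):
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens @ pi).sum()

ll_hist = [loglik(x, pi, mu, var)]
for _ in range(50):
    # E-step: responsibilities gamma(z_ik) from the current parameters
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    g = pi * dens
    g /= g.sum(axis=1, keepdims=True)
    # M-step: closed-form updates, now with gamma held fixed
    Nk = g.sum(axis=0)
    mu = (g * x[:, None]).sum(axis=0) / Nk
    var = (g * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / N
    ll_hist.append(loglik(x, pi, mu, var))

print(sorted(mu))
```

Because each iteration maximizes the lower bound, the log-likelihood `ll_hist` never decreases, and the estimated means end up near the true cluster centers.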
HMM
4.1.3 Learning: parameter estimation
- Maximum likelihood estimation: the EM algorithm
- Maximize $P(x|\theta)$
- Parameters: the transition matrix $A$, the initial distribution $\pi$, and the parameters of the output distribution
- $P(x|\theta)=\Sigma_{y_1,y_2,...,y_T} P(x,y)=\Sigma_{y_1,y_2,...,y_T}\pi_{y_1}\Pi_{t=1}^{T-1}a_{y_{t+1},y_t}\Pi_{t=1}^{T}P(x_t|y_t,\eta)$
- Assume $P(x_t|y_t,\eta)=\Pi_{i=1}^M \Pi_{j=1}^L[\eta_{ij}]^{y_t^ix_t^j}$
- M-step
- $\hat{a}_{ij}=\frac{m_{ij}}{\Sigma_{k=1}^N m_{ik}}$, $\hat{\eta}_{ij}=\frac{n_{ij}}{\Sigma_{k=1}^L n_{ik}}$, $\hat{\pi}_i=\gamma_1^i$
- E-step
- $\xi(y_t,y_{t+1})=P(y_t,y_{t+1}|x)=\frac{P(x|y_t,y_{t+1})P(y_{t+1}|y_t)P(y_t)}{p(x)}=\frac{P(x_1,...,x_t|y_t)P(x_{t+1}|y_{t+1})P(x_{t+2},...,x_T|y_{t+1})P(y_{t+1}|y_t)P(y_t)}{p(x)}=\frac{\alpha(y_t)P(x_{t+1}|y_{t+1})\beta(y_{t+1})a_{y_{t+1},y_t}}{p(x)}$
- $\gamma_t^i=P(y_t^i=1|x,\theta^{old})$
- $E(n_{ij}|x,\theta^{old})=\Sigma_{t=1}^T\gamma_t^ix_t^j$, $E(m_{ij}|x,\theta^{old})=\Sigma_{t=1}^{T-1}\xi_{t,t+1}^{ij}$
- $Q(\theta,\theta^{old})=E_{p(y|x,\theta^{old})}(log P(x,y|\theta))=\Sigma_y p(y|x,\theta^{old})logP(x,y|\theta)=\Sigma_y((log \pi_{y_1}+\Sigma_{t=1}^{T-1}log a_{y_{t+1},y_t}+\Sigma_{t=1}^TlogP(x_t|y_t))p(y|x,\theta^{old}))$, s.t. $\Sigma_{i=1}^N\pi_i=1$
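The E-step quantities $\gamma$ and $\xi$ come from the forward-backward recursions. A minimal NumPy sketch for a tiny discrete HMM; all parameter values are illustrative, and here `A[i, j]` denotes $P(y_{t+1}=j|y_t=i)$:

```python
import numpy as np

A = np.array([[0.7, 0.3],
              [0.2, 0.8]])        # A[i, j] = P(y_{t+1} = j | y_t = i)
eta = np.array([[0.9, 0.1],
                [0.3, 0.7]])      # eta[i, j] = P(x_t = j | y_t = i)
pi0 = np.array([0.6, 0.4])        # initial state distribution
obs = [0, 0, 1, 1, 0]             # observed symbol sequence
T, S = len(obs), len(pi0)

# forward: alpha[t, i] = p(x_1..x_t, y_t = i)
alpha = np.zeros((T, S))
alpha[0] = pi0 * eta[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * eta[:, obs[t]]

# backward: beta[t, i] = p(x_{t+1}..x_T | y_t = i)
beta = np.ones((T, S))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (eta[:, obs[t + 1]] * beta[t + 1])

px = alpha[-1].sum()               # p(x)
gamma = alpha * beta / px          # gamma[t, i] = P(y_t = i | x)
# xi[t, i, j] = alpha_t(i) * A[i, j] * eta[j, x_{t+1}] * beta_{t+1}(j) / p(x)
xi = (alpha[:-1, :, None] * A[None]
      * (eta[:, obs[1:]].T * beta[1:])[:, None, :]) / px
```

Summing `xi` over the destination state recovers `gamma`, which is a handy sanity check when implementing Baum-Welch.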
- Drawbacks
- Only captures the relations between adjacent states and between a state and its own output (limited context)
- The learning objective and the prediction objective do not match
- We only need p(y|x), but the model only gives p(x,y): a generative model