Estimating Beta Mixture Model Parameters with the EM Algorithm

[1] exploits the loss gap between clean and noisy samples caused by network memorization to separate clean data from noisy data. Given the normalised losses of a batch of samples $\{l_i\in[0,1]\}_{i=1}^n$, a beta mixture model (BMM) is fitted to the two modes; an earlier work [3] used a Gaussian mixture model (GMM) instead, and both estimate the parameters with the EM algorithm. The BMM fitting in [1] likely follows [9].

With [1] as the running example, this post records notes on the EM algorithm, drawing on [4-6].
(Figure: normalised loss distribution, showing two modes)
[1] decides from the loss value whether an image-text pair is aligned (i.e. clean) or partially-/mis-aligned (i.e. noisy). Concretely, the normalised losses of a batch of image-text pairs are computed; their distribution, as in the figure above, has two modes, each fitted with one beta component, so the overall density is

$$
\begin{aligned}
p(l) &= \sum_{k=1}^{K=2} p(z=k)\cdot p(l\mid z=k) \\
&= \sum_{k=1}^{K} \lambda_k\, p(l\mid k) \\
&= \sum_{k=1}^{K} \lambda_k\, \mathrm{B}(l;\alpha_k,\beta_k),
\end{aligned}
$$

which is Eq. (4) of [1]. The latent variable $z\in\{1,2\}$ indicates which mode $l$ belongs to (1 = clean, 2 = noisy); see [7] for the beta distribution. After EM has estimated the parameters, $p(z_i=1\mid l_i)$ is compared against a threshold $\delta$ to label the $i$-th image-text pair as clean or noisy.
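A minimal sketch of this decision rule, assuming already-fitted parameters (the values, the threshold, and the names `posterior_clean`, `lambdas`, `alphas`, `betas` are all made up for illustration, not taken from [1]'s code), using `scipy.stats.beta`:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Hypothetical, already-fitted parameters: component 1 = clean (low loss),
# component 2 = noisy (high loss); lambdas are the mixing weights p(z=k).
lambdas = np.array([0.7, 0.3])
alphas  = np.array([2.0, 6.0])
betas   = np.array([8.0, 2.0])
delta   = 0.5  # threshold on p(z=1 | l)

def posterior_clean(l):
    """p(z=1 | l) for normalised losses l in (0, 1)."""
    l = np.asarray(l, dtype=float)
    # weighted likelihoods lambda_k * B(l; alpha_k, beta_k), shape (K, n)
    wl = lambdas[:, None] * beta_dist.pdf(l[None, :], alphas[:, None], betas[:, None])
    return wl[0] / wl.sum(axis=0)

losses = np.array([0.05, 0.2, 0.8])
print(posterior_clean(losses) > delta)   # low losses -> clean, high loss -> noisy
```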

Following [4-6], the log likelihood is

$$
\begin{aligned}
LL &= \sum_{i=1}^n \log p(l_i) \\
&= \sum_i \log \sum_{k=1}^{K=2} p(l_i, z_i=k) \\
&= \sum_i \log \sum_k Q_i(k)\cdot \frac{p(l_i,k)}{Q_i(k)} && (1) \\
&\ge \sum_i \sum_k Q_i(k) \log \frac{p(l_i,k)}{Q_i(k)} && (2)
\end{aligned}
$$

Jensen's inequality is used to obtain a lower bound because a sum of logs is easier to differentiate than the log of a sum. When estimating the parameters by maximum likelihood we want the bound to be tight, i.e. (1) = (2), so that maximizing the lower bound is equivalent to maximizing the log likelihood.

For (1) = (2), one sufficient condition is $\frac{p(l_i,k)}{Q_i(k)} = c$ for a constant $c$: then $(1) = \sum_i \log(\text{expectation of } c) = \sum_i \log c$ and $(2) = \sum_i [\text{expectation of } \log c] = \sum_i \log c$, hence (1) = (2). In that case

$$
\begin{aligned}
\frac{p(l_i,k)}{Q_i(k)} &= c && (3)\\
p(l_i,k) &= c \cdot Q_i(k) \\
p(l_i) = \sum_k p(l_i,k) &= c \cdot \sum_k Q_i(k) = c \cdot 1 = c.
\end{aligned}
$$

Substituting back into (3) and rearranging gives $Q_i(k) = \frac{p(l_i,k)}{c} = \frac{p(l_i,k)}{p(l_i)} = p(k\mid l_i)$. That is, setting $Q_i(k) = p(k\mid l_i)$ makes (1) = (2), so maximizing the lower bound is the same as maximizing the log likelihood.
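A quick numeric sanity check of this tightness claim, using the same made-up parameters as in the sketch above: with $Q_i(k)$ set to the posterior $p(k\mid l_i)$, the Jensen lower bound (2) should match the log likelihood (1) up to floating-point error. The variable names and the synthetic losses are illustrative.

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
losses = rng.beta(2, 8, size=50)          # synthetic normalised losses in (0, 1)

lambdas = np.array([0.7, 0.3])
alphas  = np.array([2.0, 6.0])
betas   = np.array([8.0, 2.0])

# joint p(l_i, z_i=k) = lambda_k * B(l_i; alpha_k, beta_k), shape (K, n)
joint = lambdas[:, None] * beta_dist.pdf(losses[None, :], alphas[:, None], betas[:, None])
Q = joint / joint.sum(axis=0, keepdims=True)   # posterior p(k | l_i)

ll    = np.log(joint.sum(axis=0)).sum()        # (1): log likelihood
bound = (Q * np.log(joint / Q)).sum()          # (2): Jensen lower bound with Q = posterior
print(np.isclose(ll, bound))                   # True: the bound is tight
```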

With that in hand, the EM algorithm goes through its usual incantation:

  1. Guess an initial set of parameters: $\lambda_k,\alpha_k,\beta_k = \lambda_k^0,\alpha_k^0,\beta_k^0 \quad (k=1,\dots,K)$
  2. E step: with $\lambda_k,\alpha_k,\beta_k$ fixed, set $Q_i(k) := p(k\mid l_i)=\frac{p(k)\, p(l_i\mid k)}{\sum_{k'}p(k')\, p(l_i\mid k')} = \frac{\lambda_k\, \mathrm{B}(l_i;\alpha_k,\beta_k)}{\sum_{k'} \lambda_{k'}\, \mathrm{B}(l_i;\alpha_{k'},\beta_{k'})}$
  3. M step: with $Q_i(k)$ fixed, set $\lambda_k,\alpha_k,\beta_k = \arg\max_{\lambda,\alpha,\beta} LL$

The E and M steps are iterated for a number of rounds. In the M step, $\lambda_k$ is found with a Lagrange multiplier (see [6]) under the constraint $\sum_k \lambda_k = 1$, so the Lagrangian is

$$
\begin{aligned}
L &= LL + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr) \\
&= \sum_i\sum_k Q_i(k)\log\frac{p(l_i,k)}{Q_i(k)} + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr) \\
&= \sum_i\sum_k Q_i(k)\log p(l_i,k) - \sum_i\sum_k \underbrace{Q_i(k)\log Q_i(k)}_{\text{constant}} + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr) \\
&= \text{const} + \sum_i\sum_k Q_i(k)\log \Bigl[\lambda_k \cdot \underbrace{\mathrm{B}(l_i;\alpha_k,\beta_k)}_{\text{independent of }\lambda_k}\Bigr] + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr).
\end{aligned}
$$

Setting the partial derivatives to zero:

$$
\left\{\begin{aligned}
\frac{\partial L}{\partial \lambda_k} &= \frac{1}{\lambda_k}\sum_i Q_i(k) - \gamma = 0 \\
\frac{\partial L}{\partial \gamma} &= 1 - \sum_k \lambda_k = 0
\end{aligned}\right.
$$

Summing the first equation over $k$ and using the second gives

$$
\left\{\begin{aligned}
\gamma &= n \\
\lambda_k &= \frac{1}{n}\sum_i Q_i(k) = \frac{1}{n}\sum_i \frac{\lambda_k\, \mathrm{B}(l_i;\alpha_k,\beta_k)}{\sum_{k'} \lambda_{k'}\, \mathrm{B}(l_i;\alpha_{k'},\beta_{k'})}.
\end{aligned}\right.
$$

$\alpha_k, \beta_k$ are instead recovered by inverting the mean and variance formulas of the beta distribution (see [7]): for $\mathrm{B}(x;\alpha,\beta)$,

$$
\left\{\begin{aligned}
\mu &= \mathbb{E}X = \frac{\alpha}{\alpha + \beta} \\
\sigma^2 &= \operatorname{Var}X = \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}
\end{aligned}\right.
$$

which inverts to

$$
\left\{\begin{aligned}
\alpha &= \mu \left[ \frac{\mu(1 - \mu)}{\sigma^2} - 1 \right] \\
\beta &= \Bigl(\frac{1}{\mu} - 1\Bigr)\alpha
\end{aligned}\right.
$$

where, in the M step, $\mu$ and $\sigma^2$ are taken per component as the $Q_i(k)$-weighted sample mean and variance of the $l_i$ (a weighted method of moments rather than an exact maximiser). A compact sketch of the whole loop follows.
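Putting the E step and these M-step updates together, here is a minimal sketch of the fitting loop, assuming the losses are already normalised to (0, 1). It mirrors the common BMM recipe above but is not copied from [1]'s code; the names `fit_bmm`, `n_iters`, `eps` and the initial values are illustrative.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def fit_bmm(losses, n_iters=10, eps=1e-6):
    """Fit a 2-component beta mixture to normalised losses with EM.

    Returns (lambdas, alphas, betas); component 0 is initialised toward the
    low-loss (clean) mode, component 1 toward the high-loss (noisy) mode.
    """
    l = np.clip(np.asarray(losses, dtype=float), eps, 1 - eps)

    # 1. initial guess
    lambdas = np.array([0.5, 0.5])
    alphas  = np.array([1.0, 2.0])
    betas   = np.array([2.0, 1.0])

    for _ in range(n_iters):
        # 2. E step: responsibilities Q_i(k) = p(k | l_i)
        wl = lambdas[:, None] * beta_dist.pdf(l[None, :], alphas[:, None], betas[:, None])
        Q = wl / wl.sum(axis=0, keepdims=True)

        # 3. M step
        lambdas = Q.mean(axis=1)                        # lambda_k = (1/n) sum_i Q_i(k)
        for k in range(2):
            w = Q[k] / Q[k].sum()                       # normalised weights for component k
            mu = (w * l).sum()                          # weighted mean
            var = max((w * (l - mu) ** 2).sum(), eps)   # weighted variance
            alphas[k] = mu * (mu * (1 - mu) / var - 1)  # invert beta mean/variance formulas
            betas[k]  = (1 / mu - 1) * alphas[k]

    return lambdas, alphas, betas
```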

This corresponds to BetaMixture1D in [1]'s code, which calls $Q_i(\cdot)$ responsibilities and $p(l_i,z_i=k)=p(k)\,p(l_i\mid k)=\lambda_k\, \mathrm{B}(l_i;\alpha_k,\beta_k)$ weighted_likelihood.
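For completeness, a hypothetical end-to-end usage of the two sketches above (fit, then threshold the clean posterior); the synthetic data, the threshold, and the component-selection heuristic are illustrative, and this is not [1]'s actual API.

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Synthetic normalised losses: a low-loss (clean) mode and a high-loss (noisy) mode.
rng = np.random.default_rng(0)
losses = np.concatenate([rng.beta(2, 8, 800), rng.beta(6, 2, 200)])

lambdas, alphas, betas = fit_bmm(losses)                # fit_bmm from the sketch above

# Treat the fitted component with the smaller mean loss as the clean one.
clean_k = int(np.argmin(alphas / (alphas + betas)))

wl = lambdas[:, None] * beta_dist.pdf(losses[None, :], alphas[:, None], betas[:, None])
p_clean = wl[clean_k] / wl.sum(axis=0)                  # p(z = clean | l_i)
clean_mask = p_clean > 0.5                              # threshold delta = 0.5
print(f"kept {clean_mask.mean():.0%} of pairs as clean")
```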

References

  1. (CVPR 2023) BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency - paper, code
  2. (ICML 2017) A Closer Look at Memorization in Deep Networks - paper
  3. (NIPS 2021) Learning with Noisy Correspondence for Cross-modal Matching - paper, code
  4. EM算法原理总结
  5. 【机器学习】EM——期望最大(非常详细)
  6. 高斯混合模型(GMM)与EM算法的推导
  7. Β分布
  8. 话说有可以求beta分布或者gamma分布的参数的最大似然估计的方法嘛?
  9. (ICML 2019) Unsupervised Label Noise Modeling and Loss Correction - paper, code