Machine Learning Fundamentals: The Expectation-Maximization (EM) Algorithm

Introduction

One thing EM and MLE have in common is that both require the parametric form of the probability density function to be known in advance.

If there are no hidden variables, the parameters can be estimated directly with MLE. If some data are missing, or latent variables are involved, MLE can no longer be applied directly, and the EM algorithm is needed instead.

The so-called hidden variables refer to two situations: 1. some of the data in the dataset are incomplete; or 2. a mathematical model can be simplified by introducing unobserved variables, e.g., a Gaussian mixture model introduces component-weight variables.

When to Use the EM Algorithm

In fact, even with hidden variables we are merely adding a few extra parameters to estimate, so in theory MLE could still be used: write down the log-likelihood, then take derivatives to maximize it.

The more important reason for introducing EM is that, once hidden variables are added, the log-likelihood can still be written down, but its derivatives become cumbersome and the function is hard to optimize directly.
Moreover, with hidden variables, estimating the model parameters requires knowing the hidden variables, while estimating the hidden variables in turn requires knowing the model parameters, which looks like a deadlock. In short, whenever the Q function is easier to optimize than the likelihood l, the EM algorithm can be used. [1]

So how does EM iterate? We first initialize one of the two (the model parameters or the hidden variables), use it to estimate the other, then use that estimate to re-estimate the first, and keep alternating until we arrive at a final result. The convergence of this strategy has been proven.

Core Idea

The core idea of the EM algorithm is to estimate the unknown parameters from the given data by applying maximum likelihood estimation (MLE) iteratively [1].

Basic Procedure

The rough flow of the EM algorithm is: 1. using the current parameter estimate, compute the expectation of the log-likelihood (this expectation is still a function of the parameters); 2. compute the new parameters that maximize this expectation; 3. repeat the first two steps until convergence. A high-level sketch of this loop is given below.
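
As an illustration only (not from the original text), the loop can be sketched in Python as follows, where e_step and m_step are hypothetical placeholders that a concrete model would have to supply:

    # Generic EM loop (illustrative skeleton; e_step/m_step are hypothetical
    # callbacks that a concrete model, e.g. a Gaussian mixture, must provide).
    def em(data, theta_init, e_step, m_step, max_iter=100, tol=1e-6):
        theta = theta_init
        prev_ll = float("-inf")
        for _ in range(max_iter):
            q, ll = e_step(data, theta)   # E-step: posterior over latent z, current log-likelihood
            theta = m_step(data, q)       # M-step: parameters maximizing the expected log-likelihood
            if abs(ll - prev_ll) < tol:   # stop once the log-likelihood stops improving
                break
            prev_ll = ll
        return theta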

Example Analysis

Derivation

Review of MLE

Recalling MLE, the log-likelihood can be written in the following form:

l(\theta) = \sum_{i=1}^{n}\log p(x_i|\theta)

where X = \{x_1, x_2, ..., x_n\} is the observed data.
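
For instance (a standard textbook example, not taken from this post), for a univariate Gaussian with known variance \sigma^2 and unknown mean \mu, maximizing the log-likelihood has a closed-form solution:

l(\mu) = \sum_{i=1}^{n}\log \mathcal{N}(x_i|\mu, \sigma^2), \qquad
\frac{\partial l(\mu)}{\partial \mu} = \sum_{i=1}^{n}\frac{x_i-\mu}{\sigma^2} = 0
\;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i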

EM

Defining the Likelihood Function

Compared with MLE, EM introduces the hidden random variables Z = \{z_1, z_2, ..., z_n\}, so the log-likelihood has to be expressed in the following form:

\begin{aligned}
l(\theta) &= \sum_{i=1}^{n}\log p(x_i|\theta) \\
&= \sum_{i=1}^{n} \log \sum_{z_i=1}^{k}p(z_i|\theta)\,p(x_i|z_i, \theta) \\
&= \sum_{i=1}^{n}\log \sum_{z_i=1}^{k} p(x_i, z_i|\theta)
\end{aligned}
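
As a concrete instance (a standard illustration, not spelled out in the original), in a Gaussian mixture model with k components we have p(z_i = j|\theta) = \pi_j and p(x_i|z_i = j, \theta) = \mathcal{N}(x_i|\mu_j, \sigma_j^2), so

l(\theta) = \sum_{i=1}^{n}\log \sum_{j=1}^{k}\pi_j\,\mathcal{N}(x_i|\mu_j, \sigma_j^2)

and it is precisely the sum inside the logarithm that makes direct differentiation messy.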

Approximating the Likelihood Function

Background: without the hidden variables Z, the likelihood above could be maximized directly with MLE. The problem is that there is a summation inside the \log, which makes this likelihood very hard to solve directly. One idea is to use a trick to get rid of the summation inside the \log.

Background facts:

  1. For a discrete random variable X, the expectation is E(X) = \sum_{x\in X} xP(x).
  2. If Y = g(X), then E(Y) = \sum_{x\in X}g(x)P(x).
  3. Since \log(x) is concave, \log(E(X)) \ge E(\log(X)) (Jensen's inequality), as shown in the figure below:

[Figure: the concave \log curve, illustrating \log(E(X)) \ge E(\log(X))]
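
A quick numerical sanity check of this inequality (an illustration added here, not part of the original derivation):

    # Verify numerically that log(E[X]) >= E[log(X)] for a positive random variable X.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.5, 5.0, size=100_000)  # samples of a positive random variable
    lhs = np.log(x.mean())                   # log(E[X])
    rhs = np.log(x).mean()                   # E[log(X)]
    print(lhs, rhs, lhs >= rhs)              # the first value is never smaller than the second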

Method: removing the summation inside the \log requires an approximation; the bounding step is as follows:

\begin{aligned}
l(\theta) &= \sum_{i=1}^{n}\log \sum_{z_i=1}^{k} p(x_i, z_i|\theta) \\
&= \sum_{i=1}^{n}\log \sum_{z_i=1}^{k} Q_i(z_i)\,\frac{p(x_i, z_i|\theta)}{Q_i(z_i)} \\
&\ge \sum_{i=1}^{n}\sum_{z_i=1}^{k} Q_i(z_i) \log \frac{p(x_i, z_i|\theta)}{Q_i(z_i)}
\end{aligned}

Derivation: following the background facts above, we introduce a distribution Q_i(z_i) over the random variable z_i, and set

g(z_i) = \frac{p(x_i, z_i|\theta)}{Q_i(z_i)}

Then the term \sum_{z_i=1}^{k} Q_i(z_i)\,\frac{p(x_i, z_i|\theta)}{Q_i(z_i)} in the second line of the likelihood can be viewed as E(g(z_i)), i.e.
E_{z_i}\left(\frac{p(x_i, z_i|\theta)}{Q_i(z_i)}\right)

By the property of the logarithm above, \log(E(X)) \ge E(\log(X)), so:

\sum_{i=1}^{n}\log\left( E_{z_i}\left(\frac{p(x_i, z_i|\theta)}{Q_i(z_i)}\right)\right) \ge \sum_{i=1}^{n} E_{z_i}\left(\log\left(\frac{p(x_i, z_i|\theta)}{Q_i(z_i)}\right)\right)
that is,
\sum_{i=1}^{n}\log \sum_{z_i=1}^{k} Q_i(z_i)\,\frac{p(x_i, z_i|\theta)}{Q_i(z_i)} \ge \sum_{i=1}^{n}\sum_{z_i=1}^{k} Q_i(z_i) \log \frac{p(x_i, z_i|\theta)}{Q_i(z_i)}
This completes the derivation.

Solving the Likelihood Function

The right-hand side of this inequality is called a lower bound of l(\theta). We still need to determine Q_i(z_i) before we can maximize this lower bound. The derivation proceeds as follows:

The lower bound is closest to the original function when the inequality becomes an equality, which requires:

\frac{p(x_i, z_i|\theta)}{Q_i(z_i)} = c \quad (\text{a constant})

Also, since Q_i(z_i) is a probability distribution, \sum_{z_i=1}^{k}Q_i(z_i)=1. Substituting \frac{p(x_i, z_i|\theta)}{Q_i(z_i)} = c into the term \sum_{z_i=1}^{k}Q_i(z_i)\,\frac{p(x_i, z_i|\theta)}{Q_i(z_i)} of the log-likelihood and using this normalization, we get:

\sum_{z_i=1}^{k} p(x_i, z_i|\theta) = c

Therefore:

\begin{aligned}
Q_i(z_i) &= \frac{p(x_i, z_i|\theta)}{c} \\
&= \frac{p(x_i, z_i|\theta)}{\sum_{z_i=1}^{k} p(x_i, z_i|\theta)} \\
&= \frac{p(x_i, z_i|\theta)}{p(x_i|\theta)} \\
&= p(z_i|x_i, \theta)
\end{aligned}

Once this Q is determined, we can compute the lower bound of l(\theta).
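
To make this concrete, here is how Q_i(z_i) = p(z_i|x_i, \theta) could be computed for a one-dimensional Gaussian mixture (a sketch under that assumed model; the function name and parameters are not from the original post):

    # E-step responsibilities for a 1-D Gaussian mixture: Q_i(z_i = j) = p(z_i = j | x_i, theta).
    import numpy as np
    from scipy.stats import norm

    def responsibilities(x, weights, means, stds):
        # joint p(x_i, z_i = j | theta) = pi_j * N(x_i | mu_j, sigma_j^2), shape (n, k)
        joint = weights * norm.pdf(x[:, None], loc=means, scale=stds)
        # normalize over components, i.e. divide by p(x_i | theta) = sum_j p(x_i, z_i = j | theta)
        return joint / joint.sum(axis=1, keepdims=True)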

Looking back at the inequality above: the so-called E-step is really just computing an expectation, and this expectation is exactly the lower bound of l(\theta), namely:

\sum_{i=1}^{n} E_{z_i}\left(\log\left(\frac{p(x_i, z_i|\theta)}{Q_i(z_i)}\right)\right) = \sum_{i=1}^{n}\sum_{z_i=1}^{k} Q_i(z_i) \log \frac{p(x_i, z_i|\theta)}{Q_i(z_i)}

We then compute:

\arg\max_{\theta}\sum_{i=1}^{n}\sum_{z_i=1}^{k} Q_i(z_i) \log \frac{p(x_i, z_i|\theta)}{Q_i(z_i)}

This step is called the M-step.

Summary

Based on the analysis above, we can summarize the EM algorithm as follows (a concrete sketch for a Gaussian mixture is given after the list):

  1. Randomly initialize the model parameters \theta^{0}.
  2. (E-step) For j = 0, 1, 2, 3, ..., compute the posterior of the latent variables under the current parameters, and the resulting expected log-likelihood:
    Q_i(z_i) = p(z_i|x_i, \theta^{j})

l(\theta, \theta^{j}) = \sum_{i=1}^{n}\sum_{z_i=1}^{k} Q_i(z_i) \log \frac{p(x_i, z_i|\theta)}{Q_i(z_i)}
  3. (M-step) Maximize l(\theta, \theta^{j}):
\theta^{j+1} = \arg\max_{\theta} l(\theta, \theta^{j})
  4. Repeat steps 2 and 3 until convergence.
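
Putting the two steps together, below is a minimal end-to-end sketch of EM for a one-dimensional Gaussian mixture (an assumed example model; none of the names or numbers come from the original post):

    # Minimal EM for a 1-D Gaussian mixture with k components (illustrative sketch only).
    import numpy as np
    from scipy.stats import norm

    def em_gmm(x, k=2, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # step 1: random initialization of theta^0 = (weights, means, stds)
        weights = np.full(k, 1.0 / k)
        means = rng.choice(x, size=k, replace=False)
        stds = np.full(k, x.std())
        for _ in range(n_iter):
            # step 2 (E-step): Q_i(z_i = j) = p(z_i = j | x_i, theta^j)
            joint = weights * norm.pdf(x[:, None], loc=means, scale=stds)
            resp = joint / joint.sum(axis=1, keepdims=True)
            # step 3 (M-step): closed-form maximizers of the expected complete-data log-likelihood
            nk = resp.sum(axis=0)
            weights = nk / len(x)
            means = (resp * x[:, None]).sum(axis=0) / nk
            stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
        return weights, means, stds

    # Usage: recover the two components of a synthetic mixture.
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
    print(em_gmm(data))

Here a fixed number of iterations stands in for the convergence check in step 4; a real implementation would monitor the log-likelihood instead.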

Further Thoughts

References

  1. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
  2. EM算法详解
  3. EM算法原理及推导
  4. 徐亦达机器学习:Expectation Maximization EM算法
  5. 如何感性地理解EM算法?
  6. 复旦-机器学习课程 第十讲 EM 算法