Probabilistic Generative Models

A probabilistic discriminative model models the conditional probability $p(Y|X)$ directly. Logistic regression, for example, computes the values of $p(y=1|x)$ and $p(y=0|x)$ and uses them to decide whether the predicted class is 0 or 1. A probabilistic generative model, by contrast, only cares which of $p(y=0|x)$ and $p(y=1|x)$ is larger; it compares the two rather than computing the exact value of $p(y|x)$. Introduce Bayes' rule:
$$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$
The denominator $p(x)$ is the marginal probability of the sample; it does not depend on $y$ and can be treated as a constant. Hence $p(y|x)\propto p(x|y)p(y)$, i.e. the posterior is proportional to the joint probability. The generative model's prediction is therefore:
$$\hat{y}=\arg\max_{y\in\{0,1\}}p(y|x)=\arg\max_{y\in\{0,1\}}p(x|y)p(y)$$
Principles of the Gaussian Discriminant Model
Consider $p(y)$ first. $y$ takes the value 1 or 0 (a binary classification problem), so the random variable $y$ follows a Bernoulli distribution:

| $y$ | $1$ | $0$ |
|---|---|---|
| $p$ | $\phi$ | $1-\phi$ |
That is, $p(y=1)=\phi$ and $p(y=0)=1-\phi$; the two cases combine into a single expression:

$$p(y)=\phi^{y}(1-\phi)^{1-y}$$
Next, consider $p(x|y)$. We first make a strong assumption: given the class of a sample, the sample is distributed according to a Gaussian. This is where the "Gaussian" in the model's name comes from. That is:
$$p(x|y=1)=\mathcal{N}(\mu_{1},\Sigma),\qquad p(x|y=0)=\mathcal{N}(\mu_{0},\Sigma)$$
This makes the model's assumption precise: the class-conditional distributions are Gaussians with different means (mean vectors) but a shared variance (covariance matrix). The two conditional densities can be written as one expression:
$$p(x|y)=\mathcal{N}(\mu_{1},\Sigma)^{y}\,\mathcal{N}(\mu_{0},\Sigma)^{1-y}$$
Parameter Estimation for the Gaussian Discriminant Model
From $p(y)$ and $p(x|y)$, build the likelihood of $p(x|y)p(y)$ and estimate the model's parameters by maximum likelihood. The log-likelihood is:
$$L(\theta)=\log\prod_{i=1}^{N}p(x_{i}|y_{i})p(y_{i})=\sum_{i=1}^{N}\log\big(p(x_{i}|y_{i})p(y_{i})\big)=\sum_{i=1}^{N}\big(\log p(x_{i}|y_{i})+\log p(y_{i})\big)$$
Substituting the assumed distributions:
$$L(\theta)=\sum_{i=1}^{N}\Big(\log\big[\mathcal{N}(\mu_{1},\Sigma)^{y_{i}}\mathcal{N}(\mu_{0},\Sigma)^{1-y_{i}}\big]+\log\big[\phi^{y_{i}}(1-\phi)^{1-y_{i}}\big]\Big)$$
The parameters to estimate are $\theta=(\phi,\mu_{1},\mu_{0},\Sigma)$. Let $N_{1}$ be the number of samples with $y=1$ and $N_{0}$ the number with $y=0$, so $N_{0}+N_{1}=N$.

Estimate $\phi$ first; it appears only in the second term of the log-likelihood, so:
$$\phi_{mle}=\arg\max_{\phi}\sum_{i=1}^{N}\log\big[\phi^{y_{i}}(1-\phi)^{1-y_{i}}\big]=\arg\max_{\phi}\sum_{i=1}^{N}\big(y_{i}\log\phi+(1-y_{i})\log(1-\phi)\big)$$
Take the derivative and set it to zero:

$$\frac{\partial}{\partial\phi}\sum_{i=1}^{N}\big(y_{i}\log\phi+(1-y_{i})\log(1-\phi)\big)=0\;\Rightarrow\;\phi_{mle}=\frac{N_{1}}{N}$$
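As a quick numerical sanity check (a minimal sketch with made-up labels, not part of the original derivation), the closed form $\phi_{mle}=N_{1}/N$ does maximize the Bernoulli log-likelihood:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])  # hypothetical binary labels
phi_mle = y.mean()                # closed form: N1 / N

def loglik(phi):
    # Bernoulli log-likelihood: sum_i y_i*log(phi) + (1 - y_i)*log(1 - phi)
    return np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))

# the MLE is at least as good as any other value of phi on a grid
grid = np.linspace(0.01, 0.99, 99)
assert all(loglik(phi_mle) >= loglik(p) for p in grid)
```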
Next, estimate $\mu_{1}$. It appears only in the first term of the log-likelihood, which decomposes as:
$$\sum_{i=1}^{N}\log\big[\mathcal{N}(\mu_{1},\Sigma)^{y_{i}}\mathcal{N}(\mu_{0},\Sigma)^{1-y_{i}}\big]=\sum_{i=1}^{N}\Big(\log\big[\mathcal{N}(\mu_{1},\Sigma)^{y_{i}}\big]+\log\big[\mathcal{N}(\mu_{0},\Sigma)^{1-y_{i}}\big]\Big)$$
Only the first of these terms involves $\mu_{1}$:
$$\mu_{1}=\arg\max_{\mu_{1}}\sum_{i=1}^{N}\log\big[\mathcal{N}(\mu_{1},\Sigma)^{y_{i}}\big]=\arg\max_{\mu_{1}}\sum_{i=1}^{N}y_{i}\log\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x_{i}-\mu_{1})^{T}\Sigma^{-1}(x_{i}-\mu_{1})\Big)$$
where $\mathcal{N}$ denotes a $D$-dimensional Gaussian density. Dropping the terms that do not involve $\mu_{1}$, this simplifies to:
$$\mu_{1}=\arg\max_{\mu_{1}}\sum_{i=1}^{N}y_{i}\Big(-\frac{1}{2}(x_{i}-\mu_{1})^{T}\Sigma^{-1}(x_{i}-\mu_{1})\Big)$$
Again, compute the derivative:
$$\frac{\partial}{\partial\mu_{1}}\sum_{i=1}^{N}y_{i}\Big(-\frac{1}{2}(x_{i}-\mu_{1})^{T}\Sigma^{-1}(x_{i}-\mu_{1})\Big)=\frac{\partial}{\partial\mu_{1}}\Big(-\frac{1}{2}\sum_{i=1}^{N}y_{i}\big(x_{i}^{T}\Sigma^{-1}x_{i}-x_{i}^{T}\Sigma^{-1}\mu_{1}-\mu_{1}^{T}\Sigma^{-1}x_{i}+\mu_{1}^{T}\Sigma^{-1}\mu_{1}\big)\Big)$$
Here $x_{i}^{T}\Sigma^{-1}x_{i}$ is constant with respect to $\mu_{1}$, so it vanishes under differentiation and can be dropped. The terms $x_{i}^{T}\Sigma^{-1}\mu_{1}$ and $\mu_{1}^{T}\Sigma^{-1}x_{i}$ are transposes of each other, and since both are scalars they are equal. The derivative is therefore:
$$\frac{\partial}{\partial\mu_{1}}\Big(-\frac{1}{2}\sum_{i=1}^{N}y_{i}\big(-2\mu_{1}^{T}\Sigma^{-1}x_{i}+\mu_{1}^{T}\Sigma^{-1}\mu_{1}\big)\Big)=-\frac{1}{2}\sum_{i=1}^{N}y_{i}\big(-2\Sigma^{-1}x_{i}+2\Sigma^{-1}\mu_{1}\big)=0$$
which gives:

$$\sum_{i=1}^{N}y_{i}(\mu_{1}-x_{i})=0\;\Rightarrow\;\mu_{1}=\frac{\sum_{i=1}^{N}y_{i}x_{i}}{N_{1}}$$
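In code, the estimate $\mu_{1}=\frac{\sum_{i}y_{i}x_{i}}{N_{1}}$ is just the mean of the class-1 samples (a minimal numpy sketch with made-up data; the variable names are mine):

```python
import numpy as np

# hypothetical 2-D samples with binary labels
X = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 4.0], [3.5, 3.0]])
y = np.array([0, 0, 1, 1])

N1 = y.sum()
mu1 = (y[:, None] * X).sum(axis=0) / N1                   # sum_i y_i x_i / N1
mu0 = ((1 - y)[:, None] * X).sum(axis=0) / (len(y) - N1)  # same for class 0

# equivalent to plain per-class means
assert np.allclose(mu1, X[y == 1].mean(axis=0))
assert np.allclose(mu0, X[y == 0].mean(axis=0))
```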
The estimate of $\mu_{0}$ is derived in exactly the same way. Finally, estimate the covariance matrix $\Sigma$. First define the set of samples in each class:
$$C_{1}=\{x_{i}\mid y_{i}=1\},\qquad |C_{1}|=N_{1}$$

$$C_{0}=\{x_{i}\mid y_{i}=0\},\qquad |C_{0}|=N_{0}$$
The terms of the log-likelihood then simplify:
$$\sum_{i=1}^{N}\log\big[\mathcal{N}(\mu_{1},\Sigma)^{y_{i}}\big]=\sum_{i=1}^{N}y_{i}\log\mathcal{N}(\mu_{1},\Sigma)=\sum_{x_{i}\in C_{1}}\log\mathcal{N}(\mu_{1},\Sigma)$$
and similarly:

$$\sum_{i=1}^{N}\log\big[\mathcal{N}(\mu_{0},\Sigma)^{1-y_{i}}\big]=\sum_{x_{i}\in C_{0}}\log\mathcal{N}(\mu_{0},\Sigma)$$
We need the derivative (gradient) with respect to the covariance matrix:
$$\frac{\partial}{\partial\Sigma}\Big(\sum_{x_{i}\in C_{1}}\log\mathcal{N}(\mu_{1},\Sigma)+\sum_{x_{i}\in C_{0}}\log\mathcal{N}(\mu_{0},\Sigma)\Big)$$
First simplify the generic form:
$$\sum_{i=1}^{N}\log\mathcal{N}(\mu,\Sigma)=\sum_{i=1}^{N}\log\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x_{i}-\mu)^{T}\Sigma^{-1}(x_{i}-\mu)\Big)$$

$$=-\sum_{i=1}^{N}\frac{D}{2}\log(2\pi)-\sum_{i=1}^{N}\frac{1}{2}\log|\Sigma|-\sum_{i=1}^{N}\frac{1}{2}(x_{i}-\mu)^{T}\Sigma^{-1}(x_{i}-\mu)$$
Here we borrow a concept from linear algebra: the trace. For an $n\times n$ matrix $A$, the trace $\mathrm{tr}(A)$ is the sum of its diagonal entries. Since $(x-\mu)^{T}\Sigma^{-1}(x-\mu)$ evaluates to a scalar, and a scalar can be viewed as a $1\times 1$ matrix, we have:
$$(x-\mu)^{T}\Sigma^{-1}(x-\mu)=\mathrm{tr}\big((x-\mu)^{T}\Sigma^{-1}(x-\mu)\big)$$
The trace also satisfies the cyclic property: $\mathrm{tr}(AB)=\mathrm{tr}(BA)$.
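This cyclic property is easy to verify numerically (a throwaway check with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# tr(AB) = tr(BA) holds even for rectangular A and B,
# as long as both products are defined
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```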
Applying the cyclic property of the trace:

$$\sum_{i=1}^{N}(x_{i}-\mu)^{T}\Sigma^{-1}(x_{i}-\mu)=\mathrm{tr}\Big[\Sigma^{-1}\sum_{i=1}^{N}(x_{i}-\mu)(x_{i}-\mu)^{T}\Big]$$
Note that this matches the expression for the sample covariance:

$$S=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)(x_{i}-\mu)^{T}$$
Therefore:

$$\sum_{i=1}^{N}(x_{i}-\mu)^{T}\Sigma^{-1}(x_{i}-\mu)=N\,\mathrm{tr}(S\Sigma^{-1})$$
Substituting back into the generic form (with $C$ collecting the constant terms):

$$\sum_{i=1}^{N}\log\mathcal{N}(\mu,\Sigma)=C-\frac{N}{2}\log|\Sigma|-\frac{N}{2}\,\mathrm{tr}(S\Sigma^{-1})$$
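This identity can be checked numerically against scipy's multivariate normal (a sketch; the data, mean, and covariance here are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))  # hypothetical samples, N = 50, D = 3
mu = X.mean(axis=0)
Sigma = np.cov(X.T, bias=True) + 0.1 * np.eye(3)  # any valid covariance

N, D = X.shape
S = (X - mu).T @ (X - mu) / N     # sample scatter matrix around mu

direct = multivariate_normal(mu, Sigma).logpdf(X).sum()
C = -N * D / 2 * np.log(2 * np.pi)
closed = (C - N / 2 * np.log(np.linalg.det(Sigma))
            - N / 2 * np.trace(S @ np.linalg.inv(Sigma)))
assert np.isclose(direct, closed)
```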
Substituting this general form into the covariance-dependent terms of the log-likelihood, with $S_{1}$ and $S_{0}$ the sample covariances of class 1 and class 0 respectively:

$$\sum_{x_{i}\in C_{1}}\log\mathcal{N}(\mu_{1},\Sigma)+\sum_{x_{i}\in C_{0}}\log\mathcal{N}(\mu_{0},\Sigma)=-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}N_{1}\,\mathrm{tr}(S_{1}\Sigma^{-1})-\frac{1}{2}N_{0}\,\mathrm{tr}(S_{0}\Sigma^{-1})+C$$
Differentiating requires two standard matrix-calculus identities (stated here without proof, for symmetric $S$ and $\Sigma$): $\frac{\partial\log|\Sigma|}{\partial\Sigma}=\Sigma^{-1}$ and $\frac{\partial\,\mathrm{tr}(S\Sigma^{-1})}{\partial\Sigma}=-\Sigma^{-1}S\Sigma^{-1}$. Taking the derivative of the covariance terms and setting it to zero:

$$\frac{\partial\big(N\log|\Sigma|+N_{1}\,\mathrm{tr}(S_{1}\Sigma^{-1})+N_{0}\,\mathrm{tr}(S_{0}\Sigma^{-1})\big)}{\partial\Sigma}=N\Sigma^{-1}-N_{1}\Sigma^{-1}S_{1}\Sigma^{-1}-N_{0}\Sigma^{-1}S_{0}\Sigma^{-1}=0$$

Multiplying on the left and right by $\Sigma$ gives $N\Sigma-N_{1}S_{1}-N_{0}S_{0}=0$,
that is:

$$\Sigma=\frac{1}{N}(N_{1}S_{1}+N_{0}S_{0})$$
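Putting the four closed-form estimates together, here is a minimal fit-and-predict sketch of the whole model (variable and function names are my own, not from the original):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA: phi, mu0, mu1, and the shared Sigma."""
    N = len(y)
    N1 = y.sum()
    N0 = N - N1
    phi = N1 / N
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    # within-class covariances S1, S0 pooled as Sigma = (N1 S1 + N0 S0) / N
    d1 = X[y == 1] - mu1
    d0 = X[y == 0] - mu0
    S1 = d1.T @ d1 / N1
    S0 = d0.T @ d0 / N0
    Sigma = (N1 * S1 + N0 * S0) / N
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Pick argmax_y p(x|y)p(y); log p(x|y) computed up to a shared constant."""
    inv = np.linalg.inv(Sigma)
    def score(mu, prior):
        d = X - mu
        # -1/2 (x - mu)^T Sigma^{-1} (x - mu) + log prior, per row
        return -0.5 * np.einsum('ij,jk,ik->i', d, inv, d) + np.log(prior)
    return (score(mu1, phi) > score(mu0, 1 - phi)).astype(int)

# example: fit on toy data and recover the training labels
X = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 4.0], [3.5, 3.0]])
y = np.array([0, 0, 1, 1])
params = fit_gda(X, y)
assert predict_gda(X, *params).tolist() == y.tolist()
```

On well-separated data like this the decision rule recovers the labels exactly; the dropped normalization constant cancels because both classes share the same $\Sigma$.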