Prerequisites
Covariance
Covariance measures the relationship between two random variables. In short, it describes how the two variables vary together, i.e. the trend of their joint variation. The formula for covariance is:
$$\mathrm{cov}(X,Y)=E[(X-E(X))(Y-E(Y))]$$
Here $X$ and $Y$ are two random variables, and $E(X)$ and $E(Y)$ are their expected values. A positive covariance indicates a positive correlation between $X$ and $Y$: when one variable increases, the other tends to increase as well. A negative covariance indicates a negative correlation: when one variable increases, the other tends to decrease. A covariance close to zero indicates no linear relationship between $X$ and $Y$.
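As a quick numerical check of this formula, the sketch below computes the sample covariance of two toy sequences with the unbiased $1/(m-1)$ factor and compares it against NumPy's built-in estimate (the data values are arbitrary illustrations):

```python
import numpy as np

# Sample covariance: cov(X, Y) = E[(X - E[X])(Y - E[Y])],
# estimated from data with the unbiased 1/(m-1) factor.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # y grows with x, so covariance is positive

m = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (m - 1)

# np.cov returns the full 2x2 covariance matrix;
# the off-diagonal entry is cov(x, y).
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])
```

Because `y` here is an increasing function of `x`, the computed covariance comes out positive, matching the interpretation above.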
Covariance matrix
The covariance matrix describes the relationships among multiple random variables. It is a symmetric matrix in which each element is the covariance between the corresponding pair of random variables.
Suppose there are $n$ random variables $X_1, X_2, \dots, X_n$, each with $m$ samples. Arrange the samples in an $m \times n$ matrix $X$, where each row is a sample and each column is a random variable. The element $c_{i,j}$ of the covariance matrix $C$ is the covariance between $X_i$ and $X_j$, defined as:
$$c_{i,j} = \frac{1}{m-1}\sum_{k=1}^m (x_{k,i} - \bar{x}_i)(x_{k,j} - \bar{x}_j)$$
where $\bar{x}_i$ is the sample mean of $X_i$. The diagonal elements $c_{i,i}$ of $C$ are variances: $c_{i,i}$ is the variance of $X_i$.
$$
\begin{aligned}
C &= \begin{pmatrix}
\mathrm{cov}(X_1,X_1) & \mathrm{cov}(X_1,X_2) & \dots & \mathrm{cov}(X_1,X_n)\\
\mathrm{cov}(X_2,X_1) & \mathrm{cov}(X_2,X_2) & \dots & \mathrm{cov}(X_2,X_n)\\
\vdots & \vdots & \ddots & \vdots\\
\mathrm{cov}(X_n,X_1) & \mathrm{cov}(X_n,X_2) & \dots & \mathrm{cov}(X_n,X_n)
\end{pmatrix}\\
&= \frac{1}{m-1}\begin{pmatrix}
\sum_{k=1}^{m}(x_{k,1}-\bar{x}_1)(x_{k,1}-\bar{x}_1) & \dots & \sum_{k=1}^{m}(x_{k,1}-\bar{x}_1)(x_{k,n}-\bar{x}_n)\\
\vdots & \ddots & \vdots\\
\sum_{k=1}^{m}(x_{k,n}-\bar{x}_n)(x_{k,1}-\bar{x}_1) & \dots & \sum_{k=1}^{m}(x_{k,n}-\bar{x}_n)(x_{k,n}-\bar{x}_n)
\end{pmatrix}\\
&= \frac{1}{m-1}\sum_{k=1}^{m}(x_k-\bar{x})(x_k-\bar{x})^T
\end{aligned}
$$
In this formula $m$ is the number of samples, $\bar{x}$ is the sample mean (a column vector), and $x_k$ is the $k$-th sample (also a column vector). The coefficient $\frac{1}{m-1}$ is introduced to make the sample covariance estimate unbiased.
Specifically, if the coefficient were $\frac{1}{m}$ we would obtain the maximum-likelihood estimate, which is biased: because the sample mean $\bar{x}$ is itself estimated from the same data, the sum of squared deviations has only $m-1$ degrees of freedom rather than $m$. Dividing by $m-1$ instead of $m$ (Bessel's correction) removes this bias.
Note that this correction is needed only when estimating the covariance from a sample; if the population mean is known exactly, dividing by $m$ is appropriate.
An unbiased estimator is one whose expected value equals the true value of the parameter being estimated, i.e. the expected difference between the estimate and the true value is zero. With the $\frac{1}{m}$ coefficient, the expectation of the sample covariance matrix is systematically smaller than the population covariance matrix; replacing the coefficient with $\frac{1}{m-1}$ makes the expectation of the sample covariance matrix equal to the population covariance matrix, eliminating the estimation bias.
The covariance matrix describes the linear relationships among multiple random variables. If an element of $C$ is zero, the corresponding pair of random variables has no linear relationship; if an element is positive, the corresponding variables are positively correlated; if negative, they are negatively correlated.
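The matrix form $C=\frac{1}{m-1}\sum_k(x_k-\bar{x})(x_k-\bar{x})^T$ above can be checked numerically. The sketch below (random toy data, not from the text) computes it with one centered matrix product and compares against `np.cov`, whose default normalization is the same unbiased $1/(m-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                  # m samples, n variables
X = rng.normal(size=(m, n))   # rows are samples, columns are variables

# Manual computation: C = 1/(m-1) * sum_k (x_k - x_bar)(x_k - x_bar)^T,
# vectorized as D^T D / (m-1) with D the centered data matrix.
x_bar = X.mean(axis=0)
D = X - x_bar                 # centered data, shape (m, n)
C = D.T @ D / (m - 1)         # (n, n) covariance matrix

# np.cov expects variables in rows by default, hence rowvar=False.
assert np.allclose(C, np.cov(X, rowvar=False))
assert np.allclose(C, C.T)    # covariance matrices are symmetric
```

The symmetry assertion reflects the property stated above: $c_{i,j}=c_{j,i}$ by construction.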
Multivariate Gaussian distribution
The multivariate Gaussian distribution (also called the multivariate normal distribution) is one of the most important probability distributions in statistics, with wide applications in probability theory, statistics, and related fields. This post introduces its definition and properties.
The multivariate Gaussian is an elegant distribution over multidimensional vectors. One important property is that any affine transformation of a multivariate Gaussian is again Gaussian, which makes it well suited to modeling multivariate data and multivariate time series.
The multivariate Gaussian distribution is defined as follows:
For a vector $x \in \mathbb{R}^n$, the probability density is:
$$p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Big)$$
where $n$ is the dimension of the vector, $\mu$ is the mean vector of $x$, $\Sigma$ is the covariance matrix of $x$, and $|\Sigma|$ is the determinant of $\Sigma$.
The mean of the multivariate Gaussian is $\mu$ and its covariance matrix is $\Sigma$; together these two parameters uniquely determine the distribution.
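The density formula above is straightforward to implement directly. The sketch below (function name `mvn_pdf` is my own, not from the text) evaluates it and checks a known special case: for a standard 2-D Gaussian, the density at the mean is $\frac{1}{(2\pi)^{2/2}}=\frac{1}{2\pi}$:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    # solve(Sigma, diff) computes Sigma^{-1} diff without forming the inverse
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Standard 2-D Gaussian: density at the mean is 1 / (2*pi).
mu = np.zeros(2)
Sigma = np.eye(2)
assert np.isclose(mvn_pdf(mu, mu, Sigma), 1 / (2 * np.pi))
```

Using `np.linalg.solve` rather than explicitly inverting $\Sigma$ is the usual numerically safer choice for the quadratic form.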
The principle and derivation of Gaussian Discriminant Analysis
Introduction
Gaussian Discriminant Analysis (GDA) is a supervised learning algorithm, typically used for binary classification. It models each class in the dataset as a Gaussian distribution and uses Bayes' theorem to estimate posterior probabilities. At prediction time, the algorithm chooses the class with the highest posterior probability.
Model
In GDA we assume two classes $y \in \{0,1\}$, each modeled as a Gaussian distribution. Let $\mu_0$ and $\mu_1$ be the mean vectors of the two classes, and let $\Sigma$ be their shared covariance matrix.
The Gaussian density for class $y=0$ is:
$$p(x|y=0)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\Big)$$
The Gaussian density for class $y=1$ is:
$$p(x|y=1)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\Big)$$
where $n$ is the number of features and $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$.
Parameter estimation
Suppose we have a training set $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$, where each $x^{(i)}$ is an $n$-dimensional real vector. We can estimate the Gaussian parameters by maximum likelihood. To simplify the derivation, we assume the two classes share a covariance matrix, i.e. $\Sigma=\Sigma_0=\Sigma_1$. In this case the model is called shared-covariance Gaussian Discriminant Analysis.
So we have:
$$
\begin{aligned}
y &\sim \mathrm{Bernoulli}(\phi)\\
x|y=0 &\sim N(\mu_0,\Sigma)\\
x|y=1 &\sim N(\mu_1,\Sigma)
\end{aligned}
$$
Written out:
$$
\begin{aligned}
p(y) &= \phi^y(1-\phi)^{1-y}\\
p(x|y=1) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\Big)\\
p(x|y=0) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\Big)
\end{aligned}
$$
The likelihood function is:
$$
\begin{aligned}
L(\phi,\mu_0,\mu_1,\Sigma) &= \prod^m_{i=1}p(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)\\
&= \prod^m_{i=1}p(x^{(i)}|y^{(i)})\,p(y^{(i)})
\end{aligned}
$$
Our goal is to maximize this likelihood. To make the computation easier, we work with the log-likelihood instead:
$$
\begin{aligned}
l(\phi,\mu_0,\mu_1,\Sigma) &= \log\prod_{i=1}^{m}p(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)\\
&= \log\prod_{i=1}^{m}p(x^{(i)}|y^{(i)};\phi,\mu_0,\mu_1,\Sigma)\,p(y^{(i)};\phi)
\end{aligned}
$$
Now suppose $m_0$ samples have $y=0$ and $m_1$ samples have $y=1$, so that $m_0+m_1=m$.
Then:
$$
\begin{aligned}
l(\phi,\mu_1,\mu_0,\sigma^2) &= \log\Bigg[\bigg(\prod_{i=1}^{m_1}\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\Big(-\frac{(x^{(i)}-\mu_1)^2}{2\sigma^2}\Big)\phi\bigg)\bigg(\prod_{j=1}^{m_0}\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\Big(-\frac{(x^{(j)}-\mu_0)^2}{2\sigma^2}\Big)(1-\phi)\bigg)\Bigg]\\
&= \sum_{i=1}^{m}\log\Big(\frac{1}{(2\pi\sigma^2)^{1/2}}\Big) - \sum_{i=1}^{m_1}\frac{(x^{(i)}-\mu_1)^2}{2\sigma^2} - \sum_{i=1}^{m_0}\frac{(x^{(i)}-\mu_0)^2}{2\sigma^2} + m_1\log\phi + m_0\log(1-\phi)\\
&= -\frac{m}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{m_1}\frac{(x^{(i)}-\mu_1)^2}{2\sigma^2} - \sum_{i=1}^{m_0}\frac{(x^{(i)}-\mu_0)^2}{2\sigma^2} + m_1\log\phi + m_0\log(1-\phi)
\end{aligned}
$$
The derivation above treats $x$ as one-dimensional, using the univariate Gaussian.
Now take the partial derivative of $l$ with respect to each parameter and set it to zero to obtain each estimate:
$$
\frac{\partial l}{\partial \phi} = \frac{m_1}{\phi} - \frac{m_0}{1-\phi} = 0 \Longrightarrow \phi=\frac{m_1}{m}=\frac{1}{m}\sum^{m}_{i=1}\mathbb{1}\{y^{(i)}=1\}
$$
$$
\begin{aligned}
\frac{\partial l}{\partial \mu_1} &= -\frac{1}{2\sigma^2}\sum_{i=1}^{m_1}2(x^{(i)}-\mu_1)\cdot(-1)\\
&= \frac{1}{\sigma^2}\sum_{i=1}^{m_1}(x^{(i)}-\mu_1)\\
&= \frac{1}{\sigma^2}\bigg(\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}x^{(i)} - \sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}\mu_1\bigg)\\
&= 0
\end{aligned}
\Longrightarrow \mu_1 = \frac{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}x^{(i)}}{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}}
$$
Similarly:
$$\mu_0 = \frac{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=0\}x^{(i)}}{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=0\}}$$
$$
\begin{aligned}
\frac{\partial l}{\partial \sigma^2} &= -\frac{m}{2}\cdot\frac{2\pi}{2\pi\sigma^2} - \sum_{i=1}^{m_1}\frac{(x^{(i)}-\mu_1)^2}{2}\big(-(\sigma^2)^{-2}\big) - \sum_{i=1}^{m_0}\frac{(x^{(i)}-\mu_0)^2}{2}\big(-(\sigma^2)^{-2}\big)\\
&= -\frac{m}{2\sigma^2} + \sum_{i=1}^{m_1}\frac{(x^{(i)}-\mu_1)^2}{2\sigma^4} + \sum_{i=1}^{m_0}\frac{(x^{(i)}-\mu_0)^2}{2\sigma^4}\\
&= 0
\end{aligned}
$$
$$
\begin{aligned}
\frac{m}{2\sigma^2} &= \sum_{i=1}^{m_1}\frac{(x^{(i)}-\mu_1)^2}{2\sigma^4} + \sum_{i=1}^{m_0}\frac{(x^{(i)}-\mu_0)^2}{2\sigma^4}\\
m\sigma^2 &= \sum_{i=1}^{m_1}(x^{(i)}-\mu_1)^2 + \sum_{i=1}^{m_0}(x^{(i)}-\mu_0)^2\\
&= \sum_{i=1}^{m}(x^{(i)}-\mu_{y^{(i)}})^2
\end{aligned}
\Longrightarrow \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu_{y^{(i)}})^2
$$
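The one-dimensional estimates just derived ($\phi=m_1/m$, the per-class sample means, and the pooled $\sigma^2$) can be verified numerically. The sketch below draws synthetic 1-D data from two known Gaussians (all constants are arbitrary illustrations) and checks that the closed-form estimates recover the true parameters:

```python
import numpy as np

# Synthetic 1-D data: 4000 samples with y=0, 6000 with y=1.
rng = np.random.default_rng(1)
true_mu0, true_mu1, true_sigma = -1.0, 3.0, 0.5
x0 = rng.normal(true_mu0, true_sigma, size=4000)
x1 = rng.normal(true_mu1, true_sigma, size=6000)
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(4000), np.ones(6000)])

phi = y.mean()                                   # phi = m1 / m
mu0, mu1 = x[y == 0].mean(), x[y == 1].mean()    # per-class sample means
# Pooled MLE: sigma^2 = (1/m) * sum_i (x^(i) - mu_{y^(i)})^2
sigma2 = np.mean((x - np.where(y == 1, mu1, mu0)) ** 2)

assert abs(phi - 0.6) < 1e-12
assert abs(mu0 - true_mu0) < 0.05 and abs(mu1 - true_mu1) < 0.05
assert abs(sigma2 - true_sigma**2) < 0.02
```

With this many samples the estimates land close to the true values, as the assertions check.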
For the more general multivariate case, proceeding as in the one-dimensional case, the log-likelihood becomes:
$$
\begin{aligned}
l(\phi,\mu_0,\mu_1,\Sigma) &= \log\Bigg[\bigg(\prod_{i=1}^{m_1}\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x^{(i)}-\mu_1)^T\Sigma^{-1}(x^{(i)}-\mu_1)\Big)\phi\bigg)\bigg(\prod_{j=1}^{m_0}\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x^{(j)}-\mu_0)^T\Sigma^{-1}(x^{(j)}-\mu_0)\Big)(1-\phi)\bigg)\Bigg]\\
&= -\sum_{i=1}^m\log\Big((2\pi)^{n/2}|\Sigma|^{1/2}\Big) - \sum_{i=1}^{m_1}\frac{1}{2}(x^{(i)}-\mu_1)^T\Sigma^{-1}(x^{(i)}-\mu_1) - \sum_{i=1}^{m_0}\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0) + m_1\log\phi + m_0\log(1-\phi)\\
&= -\frac{mn}{2}\log(2\pi) - \frac{m}{2}\log|\Sigma| - \sum_{i=1}^{m_1}\frac{1}{2}(x^{(i)}-\mu_1)^T\Sigma^{-1}(x^{(i)}-\mu_1) - \sum_{i=1}^{m_0}\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0) + m_1\log\phi + m_0\log(1-\phi)
\end{aligned}
$$
The derivation for $\phi$ is the same as in the one-dimensional case. To compute $\nabla_{\mu_1} l$ we use the identity $\nabla_x\, x^TAx = 2Ax$ (valid for symmetric $A$, which holds here since $\Sigma^{-1}$ is symmetric):
$$
\begin{aligned}
\nabla_{\mu_1} l &= -\sum_{i=1}^{m_1}\frac{1}{2}\cdot 2\Sigma^{-1}(x^{(i)}-\mu_1)(-1)\\
&= \Sigma^{-1}\sum_{i=1}^{m_1}(x^{(i)}-\mu_1)\\
&= \Sigma^{-1}\bigg(\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}x^{(i)} - \sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}\mu_1\bigg)\\
&= 0
\end{aligned}
\Longrightarrow \mu_1 = \frac{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}x^{(i)}}{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}}
$$
Similarly:
$$\mu_0 = \frac{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=0\}x^{(i)}}{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=0\}}$$
We also use the following identities (for symmetric $A$):
$$
\begin{aligned}
\nabla_{A}\log|A| &= \frac{1}{|A|}\nabla_{A}|A| = A^{-1}\\
\nabla_{A}\, x^TAx &= xx^T
\end{aligned}
$$
$$
\begin{aligned}
\nabla_{\Sigma^{-1}}\Big(-\frac{m}{2}\log|\Sigma|\Big) &= \nabla_{\Sigma^{-1}}\Big(-\frac{m}{2}\log\frac{1}{|\Sigma^{-1}|}\Big)\\
&= \nabla_{\Sigma^{-1}}\Big(\frac{m}{2}\log|\Sigma^{-1}|\Big)\\
&= \frac{m}{2}(\Sigma^{-1})^{-1}\\
&= \frac{m}{2}\Sigma
\end{aligned}
$$
$$
\begin{aligned}
&\quad \nabla_{\Sigma^{-1}}\bigg(-\sum_{i=1}^{m_1}\frac{1}{2}(x^{(i)}-\mu_1)^T\Sigma^{-1}(x^{(i)}-\mu_1) - \sum_{i=1}^{m_0}\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)\bigg)\\
&= -\sum_{i=1}^{m_1}\frac{1}{2}(x^{(i)}-\mu_1)(x^{(i)}-\mu_1)^T - \sum_{i=1}^{m_0}\frac{1}{2}(x^{(i)}-\mu_0)(x^{(i)}-\mu_0)^T\\
&= -\frac{1}{2}\sum_{i=1}^{m}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T
\end{aligned}
$$
Combining the two parts and setting the sum to zero:
$$
\begin{aligned}
\frac{m}{2}\Sigma - \frac{1}{2}\sum_{i=1}^{m}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T &= 0\\
\Sigma &= \frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T
\end{aligned}
$$
Putting everything together:
$$
\begin{aligned}
\phi &= \frac{1}{m}\sum^{m}_{i=1}\mathbb{1}\{y^{(i)}=1\}\\
\mu_1 &= \frac{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}x^{(i)}}{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=1\}}\\
\mu_0 &= \frac{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=0\}x^{(i)}}{\sum_{i=1}^{m}\mathbb{1}\{y^{(i)}=0\}}\\
\Sigma &= \frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T
\end{aligned}
$$
Substituting these estimated parameters, we compute the two posteriors $p(y^{(i)}=0|x^{(i)})$ and $p(y^{(i)}=1|x^{(i)})$ and assign $x^{(i)}$ to the class with the larger one.
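The whole procedure can be sketched in a few lines of NumPy: fit the closed-form estimates $\phi$, $\mu_0$, $\mu_1$, $\Sigma$ derived above, then predict by comparing the log joint densities $\log p(x|y)p(y)$ (the shared Gaussian normalizing constant cancels between the two classes, so it is omitted). The function names and the toy data are my own illustration, not from the text:

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form shared-covariance GDA estimates, per the formulas above."""
    m = len(y)
    phi = np.mean(y == 1)                        # fraction of y = 1 samples
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    D = X - np.where(y[:, None] == 1, mu1, mu0)  # rows: x^(i) - mu_{y^(i)}
    Sigma = D.T @ D / m                          # MLE uses 1/m here
    return phi, mu0, mu1, Sigma

def gda_predict(X, phi, mu0, mu1, Sigma):
    """Pick the class with the larger log p(x|y) + log p(y)."""
    Si = np.linalg.inv(Sigma)
    def log_joint(mu, prior):
        D = X - mu
        # Quadratic form -(1/2)(x-mu)^T Si (x-mu) per row, plus log prior;
        # the shared normalizing constant cancels in the comparison.
        return -0.5 * np.einsum('ij,jk,ik->i', D, Si, D) + np.log(prior)
    return (log_joint(mu1, phi) > log_joint(mu0, 1 - phi)).astype(int)

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, -2], size=(100, 2)),
               rng.normal(loc=[2, 2], size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

params = gda_fit(X, y)
pred = gda_predict(X, *params)
accuracy = (pred == y).mean()  # near-perfect on well-separated blobs
```

With shared $\Sigma$ the resulting decision boundary is linear in $x$, which is why this model is closely related to logistic regression.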