Linear Classification
Linear classification:
- Hard classification, $\hat{y} \in \{0, 1\}$: linear discriminant analysis (Fisher), perceptron
- Soft classification, $\hat{y} \in [0, 1]$: generative (Gaussian Discriminant Analysis), discriminative (Logistic Regression)
Soft classification outputs a probability:
$$P(y=1)=p$$
Discriminative models learn $P(Y|X)$ directly; generative models solve for it via Bayes' theorem,
$$P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}$$
given the prior $P(Y)$ and the class-conditional likelihood $P(X|Y)$.
Note
The sample set X contains N samples, each with feature dimension p (i.e., each sample is a column vector of length p):
$$X = (x_1, x_2, ..., x_N)^T \in \mathbb{R}^{N \times p} \\ x_i = (x_i^1, x_i^2, ..., x_i^p)^T \in \mathbb{R}^{p \times 1}\\ Y = (y_1, y_2,...,y_N)^T \in \mathbb{R}^N\\ y_i \in \{+1, -1\}$$
Samples $(x_i, y_i)_{i=1}^N$; class-1 sample set $X_{c1}=\{x_i \mid y_i=1\}$ with $|X_{c1}|=N_1$; class-2 sample set $X_{c2}=\{x_i \mid y_i=-1\}$ with $|X_{c2}|=N_2$; and $N_1 + N_2 = N$.
Perceptron
Idea
Error-driven learning
Model
$$f(x)=\mathrm{sign}(w^Tx), \quad x \in \mathbb{R}^p, w \in \mathbb{R}^p\\ \mathrm{sign}(a)=\begin{cases} 1, & a \ge 0\\ -1, & a < 0 \end{cases}$$
Strategy
Loss function -> the number of misclassified samples:
$$L(w)=\sum_{i=1}^N I[y_iw^Tx_i<0]$$
A sample point $(x_i, y_i)$ should satisfy:
$$w^Tx_i > 0, \quad y_i = 1\\ w^Tx_i < 0, \quad y_i = -1$$
That is, a correctly classified sample satisfies:
$$y_iw^Tx_i > 0$$
Since the indicator function $I(\cdot)$ is not differentiable, rewrite the loss function over the set $D$ of misclassified samples:
$$L(w)=\sum_{x_i \in D} -y_iw^Tx_i$$
Then, for a misclassified sample:
$$\frac{\partial L(w)}{\partial w} = -y_ix_i$$
Algorithm
SGD
$$w^{t+1} = w^t + \lambda y_ix_i$$
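The error-driven update above can be sketched in code. This is a minimal illustration; the toy dataset, learning rate, and epoch cap are assumptions, not from the notes:

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron training: apply w <- w + lr * y_i * x_i on misclassified points."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified: sign(w^T x) disagrees with y
                w = w + lr * yi * xi    # SGD step along -dL/dw = y_i x_i
                errors += 1
        if errors == 0:                 # converged: every sample satisfies y_i w^T x_i > 0
            break
    return w

# Linearly separable toy data with labels in {+1, -1}
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

On separable data like this, the loop terminates with zero errors, so every sample ends up on the correct side of the hyperplane.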
Fisher's Linear Discriminant
Linear discriminant analysis
Idea
From a dimensionality-reduction viewpoint: project the data from the p-dimensional space onto a lower-dimensional space, then classify -> find a suitable projection direction.
Small within-class scatter, large between-class scatter.
Projection of a sample point onto the discriminant direction:
$$z_i = w^Tx_i$$
Mean of the $z_i$:
$$\overline{z} = \frac{1}{N}\sum_{i=1}^N z_i$$
Covariance of the $z_i$:
$$S_z = \frac{1}{N}\sum_{i=1}^N (z_i - \overline{z})(z_i - \overline{z})^T= \frac{1}{N}\sum_{i=1}^N (w^Tx_i- \overline{z})(w^Tx_i - \overline{z})^T$$
For the class-1 samples:
$$\overline{z}_1 = \frac{1}{N_1}\sum_{i=1}^{N_1}z_i\\S_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} (w^Tx_i- \overline{z}_1)(w^Tx_i - \overline{z}_1)^T$$
For the class-2 samples:
$$\overline{z}_2 = \frac{1}{N_2}\sum_{i=1}^{N_2}z_i\\S_2 = \frac{1}{N_2}\sum_{i=1}^{N_2} (w^Tx_i- \overline{z}_2)(w^Tx_i - \overline{z}_2)^T$$
Model: objective (loss) function
Between-class: $(\overline{z}_1 -\overline{z}_2)^2$
Within-class: $S_1 + S_2$
Objective function:
$$J(w)=\frac{(\overline{z}_1-\overline{z}_2)^2}{S_1 +S_2}$$
Moreover:
$$\overline{z}_1 -\overline{z}_2=\frac{1}{N_1}\sum_{i=1}^{N_1} w^Tx_i - \frac{1}{N_2}\sum_{i=1}^{N_2} w^Tx_i \\=w^T\left(\frac{1}{N_1}\sum_{i=1}^{N_1} x_i - \frac{1}{N_2}\sum_{i=1}^{N_2} x_i\right)\\=w^T(\overline{X}_{c1}-\overline{X}_{c2})$$
$$S_1 =\frac{1}{N_1}\sum_{i=1}^{N_1} (w^Tx_i- \overline{z}_1)(w^Tx_i - \overline{z}_1)^T\\=\frac{1}{N_1}\sum_{i=1}^{N_1} \left(w^Tx_i- \frac{1}{N_1}\sum_{j=1}^{N_1} w^Tx_j\right)\left(w^Tx_i- \frac{1}{N_1}\sum_{j=1}^{N_1} w^Tx_j\right)^T\\=\frac{1}{N_1}\sum_{i=1}^{N_1} w^T\left(x_i-\frac{1}{N_1}\sum_{j=1}^{N_1} x_j\right)\left(x_i-\frac{1}{N_1}\sum_{j=1}^{N_1} x_j\right)^Tw\\=\frac{1}{N_1}\sum_{i=1}^{N_1} w^T(x_i-\overline{X}_{c1})(x_i-\overline{X}_{c1})^Tw\\=w^T\cdot \frac{1}{N_1}\sum_{i=1}^{N_1} (x_i-\overline{X}_{c1})(x_i-\overline{X}_{c1})^T \cdot w\\=w^T\cdot S_{c1} \cdot w$$
Similarly:
$$S_2 =w^T\cdot S_{c2} \cdot w$$
Therefore:
$$J(w)=\frac{(\overline{z}_1-\overline{z}_2)^2}{S_1 +S_2}=\frac{w^T(\overline{X}_{c1}-\overline{X}_{c2})(\overline{X}_{c1}-\overline{X}_{c2})^Tw}{w^T\cdot (S_{c1}+S_{c2}) \cdot w}=\frac{w^T \cdot S_b \cdot w}{w^T\cdot S_w \cdot w}$$
Between-class scatter:
$$S_b = (\overline{X}_{c1}-\overline{X}_{c2})(\overline{X}_{c1}-\overline{X}_{c2})^T$$
Within-class scatter:
$$S_w = S_{c1}+S_{c2}$$
Solution
$$\hat{w} = \argmax_w J(w)=\argmax_w \frac{w^T \cdot S_b \cdot w}{w^T\cdot S_w \cdot w}$$
$$\frac{\partial J(w)}{\partial w}=2S_bw(w^TS_ww)^{-1}-2w^TS_bw(w^TS_ww)^{-2}S_ww=0$$
$$S_bw(w^TS_ww)-w^TS_bwS_ww=0$$
Since $w^T \in \mathbb{R}^{1\times p}$, $w \in \mathbb{R}^{p\times 1}$, $S_b \in \mathbb{R}^{p \times p}$, and $S_w \in \mathbb{R}^{p \times p}$, both $w^TS_ww \in \mathbb{R}$ and $w^TS_bw \in \mathbb{R}$ are scalars.
Therefore:
$$S_ww = \frac{w^TS_ww}{w^TS_bw} S_b w$$
$$w = \frac{w^TS_ww}{w^TS_bw} S_w^{-1}S_b w= \frac{w^TS_ww}{w^TS_bw} S_w^{-1} (\overline{X}_{c1}-\overline{X}_{c2})(\overline{X}_{c1}-\overline{X}_{c2})^Tw$$
Since $(\overline{X}_{c1}-\overline{X}_{c2}) \in \mathbb{R}^{p\times 1}$ and $(\overline{X}_{c1}-\overline{X}_{c2})^T \in \mathbb{R}^{1\times p}$, the product $(\overline{X}_{c1}-\overline{X}_{c2})^Tw \in \mathbb{R}$ is a scalar.
Therefore:
$$\hat{w} = \lambda S_w^{-1} (\overline{X}_{c1}-\overline{X}_{c2})$$
$$\lambda = \frac{w^TS_ww}{w^TS_bw}(\overline{X}_{c1}-\overline{X}_{c2})^Tw$$
Since $\lambda$ is a scalar, it only rescales $w$; the discriminant direction is $S_w^{-1} (\overline{X}_{c1}-\overline{X}_{c2})$.
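The closed-form direction $\hat{w} \propto S_w^{-1}(\overline{X}_{c1}-\overline{X}_{c2})$ can be sketched as follows. The synthetic Gaussian clusters and the unit-norm convention are illustrative assumptions:

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher direction w ∝ S_w^{-1} (mean(X1) - mean(X2)); rows are samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_w = S_c1 + S_c2 (biased per-class covariances, as in the derivation)
    S1 = (X1 - m1).T @ (X1 - m1) / len(X1)
    S2 = (X2 - m2).T @ (X2 - m2) / len(X2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)   # S_w^{-1} (X̄_c1 - X̄_c2), up to the scalar λ
    return w / np.linalg.norm(w)       # normalize: only the direction matters

rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 2.0], 0.5, size=(50, 2))   # class 1 cluster
X2 = rng.normal([-2.0, -2.0], 0.5, size=(50, 2)) # class 2 cluster
w = lda_direction(X1, X2)
```

Projecting both classes onto `w` should give well-separated 1-D scores ("small within-class, large between-class").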
Logistic Regression
Probabilistic discriminative model -> p(y|x)
Idea of probabilistic discriminative models:
Directly solve for $p(y|x)$ by modeling the conditional distribution $p(y|x)$:
$$\hat{y} = \argmax_{y \in \{0,1\}} p(y|x)$$
Sigmoid function:
$$\sigma(z) = \frac{1}{1+\exp(-z)}$$
$p(y|x;w)$:
$$p(y=1|x;w)=\sigma(w^Tx)\\p(y=0|x;w)=1-p(y=1|x;w)=1-\sigma(w^Tx)\\p(y|x;w)=p(y=1|x;w)^{y}\,p(y=0|x;w)^{1-y}$$
With i.i.d. data, maximum likelihood estimation (MLE) gives:
$$\hat{w}=\argmax_w \log P(Y|X)\\=\argmax_w \log\prod_{i=1}^Np(y_i|x_i;w)\\=\argmax_w\sum_{i=1}^N\log p(y_i|x_i;w)\\=\argmax_w \sum_{i=1}^N y_i\log p(y_i=1|x_i;w)+(1-y_i)\log p(y_i=0|x_i;w)$$
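This MLE has no closed form; one common approach (an assumption here, not stated in the notes) is gradient ascent on the log-likelihood, whose gradient is $\sum_{i=1}^N (y_i - \sigma(w^Tx_i))x_i$. The toy data, learning rate, and step count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Maximize Σ_i y_i log σ(w^T x_i) + (1-y_i) log(1-σ(w^T x_i)) by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of the log-likelihood: Σ_i (y_i - σ(w^T x_i)) x_i
        w += lr * (y - sigmoid(X @ w)) @ X
    return w

# Toy data with labels in {0, 1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w = fit_logistic(X, y)
pred = (sigmoid(X @ w) >= 0.5).astype(int)
```

Thresholding $\sigma(w^Tx)$ at $0.5$ implements $\hat{y} = \argmax_{y} p(y|x)$ for the two-class case.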
Gaussian Discriminant Analysis
Tags: soft classification, continuous data, probabilistic generative model
Gaussian discriminant analysis: probabilistic generative model -> p(x, y)
Idea:
Compare $p(y=1|x)$ with $p(y=0|x)$ to classify.
Using Bayes' theorem:
$$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$$
When comparing $p(y=1|x)$ with $p(y=0|x)$, the denominator $p(x)$ is the same for both, so:
$$p(y|x) \propto p(x|y)p(y)$$
Since $p(x|y)p(y)=p(x,y)$, a probabilistic generative model in fact models the joint distribution $p(x,y)$ directly. Here $p(y)$ is the prior, $p(x|y)$ is the likelihood, and $p(y|x)$ is the posterior.
Therefore:
$$\hat{y}=\argmax_{y \in \{0,1\}} p(y|x)=\argmax_{y \in \{0,1\}} p(x|y)p(y)$$
Gaussian discriminant analysis assumes the prior $p(y) \sim Bernoulli(\phi)$ and the likelihoods $p(x|y=1) \sim \mathcal{N}(\mu_1, \Sigma)$, $p(x|y=0) \sim \mathcal{N}(\mu_2, \Sigma)$.
Prior, a Bernoulli distribution:
$$p(y=1) = \phi,\qquad p(y=0) = 1-\phi$$
Therefore:
$$p(y) = \phi^y \cdot (1-\phi)^{1-y}$$
Similarly, the likelihood (writing $\mathcal{N}(\mu, \Sigma)$ for the Gaussian density at $x$):
$$p(x|y) = \mathcal{N}(\mu_1, \Sigma)^y \cdot \mathcal{N}(\mu_2, \Sigma)^{1-y}$$
Log-likelihood:
$$\mathcal{L}(\theta)=\log \prod_{i=1}^N p(x_i, y_i)\\=\sum_{i=1}^N \log (p(x_i|y_i)p(y_i))\\=\sum_{i=1}^N \log p(x_i|y_i)+\log p(y_i)\\=\sum_{i=1}^N \log \left[ \mathcal{N}(\mu_1, \Sigma)^{y_i}\cdot \mathcal{N}(\mu_2, \Sigma)^{1-y_i}\right]+\log \left[ \phi^{y_i}\cdot (1-\phi)^{1-y_i}\right]\\=\sum_{i=1}^N \log \mathcal{N}(\mu_1, \Sigma)^{y_i} + \log \mathcal{N}(\mu_2, \Sigma)^{1-y_i}+\log \left[ \phi^{y_i}\cdot (1-\phi)^{1-y_i}\right]$$
$$\theta = (\mu_1, \mu_2, \Sigma, \phi)$$
$$\hat{\theta}=\argmax_\theta \mathcal{L}(\theta)$$
Solving for $\phi$:
$$\hat{\phi}=\argmax_\phi \sum_{i=1}^N \log \left[ \phi^{y_i}\cdot (1-\phi)^{1-y_i}\right]\\=\argmax_\phi \sum_{i=1}^N y_i\log\phi + (1-y_i) \log(1-\phi)$$
Taking the derivative with respect to $\phi$ and setting it to zero:
$$\sum_{i=1}^N \frac{y_i}{\phi}-\frac{1-y_i}{1-\phi}=0 \\ \sum_{i=1}^N (1-\phi)y_i-(1-y_i)\phi=0 \\ \sum_{i=1}^N (y_i - \phi) = 0$$
Therefore:
$$\hat{\phi}=\frac{1}{N}\sum_{i=1}^N y_i$$
Solving for $\mu_1$:
$$\hat{\mu_1}=\argmax_{\mu_1} \sum_{i=1}^N \log \mathcal{N}(\mu_1, \Sigma)^{y_i}\\=\argmax_{\mu_1} \sum_{i=1}^N y_i \log \mathcal{N}(\mu_1, \Sigma)\\=\argmax_{\mu_1} \sum_{i=1}^N y_i \log \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\right)\\=\argmax_{\mu_1} \sum_{i=1}^N y_i \left[ -\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)-\frac{p}{2}\log 2\pi -\frac{1}{2}\log |\Sigma|\right]\\=\argmax_{\mu_1} \sum_{i=1}^N y_i \left[ -\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\right]$$
$$\sum_{i=1}^N y_i \left[ -\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\right]\\=-\frac{1}{2}\sum_{i=1}^N y_i \left[ (x_i^T\Sigma^{-1}-\mu_1^T\Sigma^{-1})(x_i-\mu_1)\right]\\=-\frac{1}{2}\sum_{i=1}^N y_i \left[x_i^T\Sigma^{-1}x_i-2x_i^T\Sigma^{-1}\mu_1+\mu_1^T\Sigma^{-1}\mu_1\right]\\=-\frac{1}{2}\sum_{i=1}^N y_i \left[x_i^T\Sigma^{-1}x_i-2\mu_1^T\Sigma^{-1}x_i+\mu_1^T\Sigma^{-1}\mu_1\right]$$
Taking the derivative with respect to $\mu_1$:
$$-\frac{1}{2}\sum_{i=1}^N y_i \left[ 2\Sigma^{-1}\mu_1-2\Sigma^{-1}x_i\right]=0\\\sum_{i=1}^N y_i (\mu_1-x_i)=0$$
Therefore:
$$\hat{\mu_1}=\frac{\sum_{i=1}^N y_ix_i}{\sum_{i=1}^N y_i}$$
Similarly, solving for $\mu_2$:
$$\hat{\mu_2}=\argmax_{\mu_2} \sum_{i=1}^N \log \mathcal{N}(\mu_2, \Sigma)^{1-y_i}\\=\argmax_{\mu_2} \sum_{i=1}^N (1-y_i) \log \mathcal{N}(\mu_2, \Sigma)\\=\argmax_{\mu_2} \sum_{i=1}^N (1-y_i) \log \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu_2)^T\Sigma^{-1}(x_i-\mu_2)\right)\\=\argmax_{\mu_2} \sum_{i=1}^N (1-y_i) \left[ -\frac{1}{2}(x_i-\mu_2)^T\Sigma^{-1}(x_i-\mu_2)-\frac{p}{2}\log 2\pi -\frac{1}{2}\log |\Sigma|\right]\\=\argmax_{\mu_2} \sum_{i=1}^N (1-y_i) \left[ -\frac{1}{2}(x_i-\mu_2)^T\Sigma^{-1}(x_i-\mu_2)\right]$$
$$\sum_{i=1}^N (1-y_i) \left[ -\frac{1}{2}(x_i-\mu_2)^T\Sigma^{-1}(x_i-\mu_2)\right]\\=-\frac{1}{2}\sum_{i=1}^N (1-y_i) \left[ (x_i^T\Sigma^{-1}-\mu_2^T\Sigma^{-1})(x_i-\mu_2)\right]\\=-\frac{1}{2}\sum_{i=1}^N (1-y_i) \left[x_i^T\Sigma^{-1}x_i-2x_i^T\Sigma^{-1}\mu_2+\mu_2^T\Sigma^{-1}\mu_2\right]\\=-\frac{1}{2}\sum_{i=1}^N (1-y_i) \left[x_i^T\Sigma^{-1}x_i-2\mu_2^T\Sigma^{-1}x_i+\mu_2^T\Sigma^{-1}\mu_2\right]$$
Taking the derivative with respect to $\mu_2$:
$$-\frac{1}{2}\sum_{i=1}^N (1-y_i) \left[ 2\Sigma^{-1}\mu_2-2\Sigma^{-1}x_i\right]=0\\\sum_{i=1}^N (1-y_i) (\mu_2-x_i)=0$$
Therefore:
$$\hat{\mu_2}=\frac{\sum_{i=1}^N (1-y_i)x_i}{\sum_{i=1}^N (1-y_i)}$$
Solving for $\Sigma$:
$$\hat{\Sigma}=\argmax_\Sigma \sum_{i=1}^N \log \mathcal{N}(\mu_1, \Sigma)^{y_i} + \log \mathcal{N}(\mu_2, \Sigma)^{1-y_i}\\= \argmax_\Sigma \sum_{i=1}^N y_i\log \mathcal{N}(\mu_1, \Sigma)+ (1-y_i) \log \mathcal{N}(\mu_2, \Sigma)$$
Let $C_1=\{x_i \mid y_i=1, i=1, 2,...,N\}$ and $C_2=\{x_i \mid y_i=0, i=1, 2,...,N\}$; then $|C_1|=N_1$, $|C_2|=N_2$, and $N_1+N_2=N$.
Therefore:
$$\hat{\Sigma}= \argmax_\Sigma \left[ \sum_{x_i\in C_1}\log \mathcal{N}(\mu_1, \Sigma)+\sum_{x_i\in C_2}\log \mathcal{N}(\mu_2, \Sigma) \right]$$
Since (with $C$ absorbing the constant terms):
$$\sum_{i=1}^N \log \mathcal{N}(\mu, \Sigma)=\sum_{i=1}^N \log \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right)\\=\sum_{i=1}^N -\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)-\frac{p}{2}\log 2\pi -\frac{1}{2}\log |\Sigma|\\=\sum_{i=1}^N C-\frac{1}{2}\log |\Sigma|-\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\\=C-\frac{1}{2}\sum_{i=1}^N \log |\Sigma|-\frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\\=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T\Sigma^{-1}(x_i-\mu)$$
Since $x_i\in\mathbb{R}^{p\times1}$ and $\Sigma^{-1}\in\mathbb{R}^{p\times p}$, the quadratic form $(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\in\mathbb{R}$ is a scalar.
In addition, the following trace identities hold (the derivative identities as used here assume the symmetric matrices of this problem):
$$trace(AB)=trace(BA)$$
$$trace(ABC)=trace(CAB)=trace(BCA)$$
$$\frac{\partial\, trace(AB)}{\partial A}=B^T$$
$$\frac{\partial |A|}{\partial A}=|A|\cdot A^{-1}$$
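The cyclic trace identities can be sanity-checked numerically; the random matrix shapes below are arbitrary assumptions for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))
C = rng.normal(size=(3, 3))

# trace(AB) = trace(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# trace(ABC) = trace(CAB) = trace(BCA) -- cyclic permutations only
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
```

Note that only cyclic permutations are allowed: $trace(ABC) \ne trace(BAC)$ in general.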
So:
$$\sum_{i=1}^N \log \mathcal{N}(\mu, \Sigma)=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\\=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\sum_{i=1}^N trace((x_i-\mu)^T\Sigma^{-1}(x_i-\mu))\\=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\sum_{i=1}^N trace((x_i-\mu)(x_i-\mu)^T\Sigma^{-1})\\=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2} trace\left(\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T\Sigma^{-1}\right)$$
With the covariance matrix $S$:
$$S=\frac{1}{N} \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T$$
So:
$$\sum_{i=1}^N \log \mathcal{N}(\mu, \Sigma)=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2} trace\left(\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T\Sigma^{-1}\right)\\=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2} trace(NS\cdot\Sigma^{-1})\\=C-\frac{N}{2}\log |\Sigma|-\frac{N}{2} trace(S\cdot\Sigma^{-1})$$
Then, with $S_1$ and $S_2$ the per-class covariance matrices:
$$\hat{\Sigma}= \argmax_\Sigma \left[ \sum_{x_i\in C_1}\log \mathcal{N}(\mu_1, \Sigma)+\sum_{x_i\in C_2}\log \mathcal{N}(\mu_2, \Sigma) \right]\\=\argmax_\Sigma -\frac{N_1}{2}\log |\Sigma|-\frac{N_1}{2} trace(S_1\cdot\Sigma^{-1}) -\frac{N_2}{2}\log |\Sigma|-\frac{N_2}{2} trace(S_2\cdot\Sigma^{-1})+C\\=\argmax_\Sigma -\frac{N}{2}\log |\Sigma|-\frac{N_1}{2} trace(S_1\cdot\Sigma^{-1}) -\frac{N_2}{2} trace(S_2\cdot\Sigma^{-1}) +C$$
Taking the derivative with respect to $\Sigma$ and setting it to zero:
$$\frac{N}{2|\Sigma|} \cdot |\Sigma| \Sigma^{-1} + \frac{N_1}{2}S_1^T(-\Sigma^{-2})+\frac{N_2}{2} S_2^T(-\Sigma^{-2})=0$$
$$N\Sigma^{-1}-N_1S_1^T\Sigma^{-2}-N_2S_2^T\Sigma^{-2}=0$$
Multiplying on the right by $\Sigma$:
$$NI-N_1S_1^T\Sigma^{-1}-N_2S_2^T\Sigma^{-1}=0$$
Since $S=\frac{1}{N} \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T$ is symmetric, $S_1^T=S_1$ and $S_2^T=S_2$.
Then:
$$N\Sigma-N_1S_1-N_2S_2=0$$
So:
$$\hat{\Sigma}=\frac{N_1S_1+N_2S_2}{N}$$
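All four closed-form GDA estimators derived above ($\hat{\phi}$, $\hat{\mu_1}$, $\hat{\mu_2}$, $\hat{\Sigma}$) can be sketched together; the synthetic two-cluster dataset below is an assumption for illustration:

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form MLE for GDA with labels y in {0, 1} and a shared covariance."""
    N = len(y)
    phi = y.mean()                      # phî = (1/N) Σ y_i
    mu1 = X[y == 1].mean(axis=0)        # mû1 = Σ y_i x_i / Σ y_i
    mu2 = X[y == 0].mean(axis=0)        # mû2 = Σ (1-y_i) x_i / Σ (1-y_i)
    D1, D2 = X[y == 1] - mu1, X[y == 0] - mu2
    S1 = D1.T @ D1 / len(D1)            # per-class (biased) covariances S_1, S_2
    S2 = D2.T @ D2 / len(D2)
    Sigma = (len(D1) * S1 + len(D2) * S2) / N   # Sigmâ = (N_1 S_1 + N_2 S_2) / N
    return phi, mu1, mu2, Sigma

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([3.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([-3.0, 0.0], 1.0, size=(200, 2))])
y = np.array([1] * 200 + [0] * 200)
phi, mu1, mu2, Sigma = gda_fit(X, y)
```

With enough samples, the estimates should be close to the generating parameters (class balance 0.5, means near (3, 0) and (-3, 0)).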
Naive Bayes Classifier
Tags: naive Bayes, soft classification, discrete data, probabilistic generative model, the simplest probabilistic graphical model (directed graph)
Idea: the naive Bayes assumption (i.e., the conditional independence assumption) -> $x_i \bot x_j \mid y$ (the motivation is to simplify computation), so:
$$p(x|y)=\prod_{j=1}^p p(x_j|y)$$
When $x_j$ is discrete, $x_j \sim$ Categorical; when $x_j$ is continuous, $x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$ (Gaussian). In general, naive Bayes is used for discrete data.
For a single trial with outcome $0/1$ (e.g. a coin flip), the corresponding distribution is Bernoulli; with outcome $1,2,3,...,K$ (e.g. a die roll), it is Categorical. For N trials, outcome $0/1$ corresponds to the Binomial distribution, and outcome $1,2,3,...,K$ to the Multinomial distribution.
Goal: given $x$, with $y=0/1$:
$$\hat{y}=\argmax_y p(y|x)=\argmax_{y\in \{0,1\}} p(x|y)p(y)$$
For binary classification, $p(y)\sim$ Bernoulli; for multi-class classification, $p(y)\sim$ Categorical.
The parameters can be solved by maximum likelihood estimation (MLE).
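A minimal sketch of categorical naive Bayes with MLE (empirical frequencies): the prior is $\hat{p}(y=c)$ and each conditional $\hat{p}(x_j=v \mid y=c)$ is a relative frequency. The tiny dataset and the choice to score unseen feature values as 0 (no smoothing) are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def nb_fit(X, y):
    """MLE for categorical naive Bayes: prior p(y) and per-feature p(x_j | y)."""
    prior = {c: float(np.mean(y == c)) for c in set(y)}
    cond = {}                                   # cond[(j, c)][v] = p(x_j = v | y = c)
    for c in set(y):
        Xc = X[y == c]
        for j in range(X.shape[1]):
            counts = Counter(Xc[:, j])
            cond[(j, c)] = {v: n / len(Xc) for v, n in counts.items()}
    return prior, cond

def nb_predict(x, prior, cond):
    """argmax_y p(y) Π_j p(x_j | y); unseen feature values get probability 0."""
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= cond[(j, c)].get(v, 0.0)
        return s
    return max(prior, key=score)

X = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [0, 0]])
y = np.array([1, 1, 0, 0, 0])
prior, cond = nb_fit(X, y)
```

In practice one would add Laplace smoothing so that unseen feature values do not zero out the whole product; that refinement is omitted here to keep the MLE estimator exactly as derived.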