[Machine Learning] [Whiteboard Derivation Series] [Part 4]


Linear classification:

  1. Hard classification $\hat{y} \in \{0, 1\}$: linear discriminant analysis (Fisher), perceptron
  2. Soft classification $\hat{y} \in [0, 1]$: generative (Gaussian Discriminant Analysis), discriminative (Logistic Regression)

Soft classification outputs a probability: $P(y=1)=p$.
A discriminative model learns $P(Y|X)$ directly; a generative model solves for $P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}$ via Bayes' theorem, given the prior $P(Y)$ and the class-conditional likelihood $P(X|Y)$.

Note

The sample set $X$ contains $N$ samples, each with feature dimension $p$ (i.e., each sample is a column vector of length $p$):
$$X = (x_1, x_2, ..., x_N)^T \in \mathbb{R}^{N \times p}, \quad x_i = (x_i^1, x_i^2, ..., x_i^p)^T \in \mathbb{R}^{p \times 1}, \quad Y = (y_1, y_2, ..., y_N)^T \in \mathbb{R}^N, \quad y_i \in \{+1, -1\}$$
Samples $\{(x_i, y_i)\}_{i=1}^N$; class-1 sample set $X_{c1}=\{x_i \mid y_i=1\}$, $|X_{c1}|=N_1$; class-2 sample set $X_{c2}=\{x_i \mid y_i=-1\}$, $|X_{c2}|=N_2$; and $N_1 + N_2 = N$.

Perceptron

Idea

Error-driven.

Model

$$f(x)=\mathrm{sign}(w^Tx), \quad x \in \mathbb{R}^p,\ w \in \mathbb{R}^p, \qquad \mathrm{sign}(a)=\begin{cases} 1, & a \geq 0 \\ -1, & a < 0 \end{cases}$$

Strategy

Loss function → the number of misclassified samples:
$$L(w)=\sum_{i=1}^N I[y_iw^Tx_i<0]$$
For a sample $(x_i, y_i)$ we should have:
$$w^Tx_i > 0 \text{ if } y_i = 1, \qquad w^Tx_i < 0 \text{ if } y_i = -1$$
That is, a correctly classified sample satisfies:
$$y_iw^Tx_i > 0$$
Since the indicator function $I(\cdot)$ is not differentiable, rewrite the loss over the set $D$ of misclassified samples:
$$L(w)=\sum_{x_i \in D} -y_iw^Tx_i$$
Then, for each misclassified sample:
$$\frac{\partial L(w)}{\partial w} = -y_ix_i$$

Algorithm

SGD: for each misclassified sample $(x_i, y_i)$, update
$$w^{t+1} = w^t + \lambda y_ix_i$$
where $\lambda$ is the learning rate.
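The error-driven update above can be sketched as follows; this is a minimal illustration assuming NumPy, labels in $\{+1, -1\}$, and no bias term (absorb the bias by appending a constant 1 feature). The function name is illustrative.

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Error-driven SGD for the perceptron.
    X: (N, p) samples; y: (N,) labels in {+1, -1}.
    For each misclassified sample, apply w <- w + lr * y_i * x_i."""
    N, p = X.shape
    w = np.zeros(p)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:    # misclassified (treat the boundary as a mistake)
                w = w + lr * yi * xi  # w^{t+1} = w^t + lambda * y_i * x_i
                mistakes += 1
        if mistakes == 0:             # converged: every sample satisfies y_i w^T x_i > 0
            break
    return w
```

For linearly separable data this loop terminates with a separating $w$; otherwise it stops after `max_epochs` passes.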

Fisher's linear discriminant

Linear discriminant analysis (LDA)

Idea

From a dimensionality-reduction viewpoint: project the data from the $p$-dimensional space to a lower-dimensional space, then classify → find a suitable projection direction.
Small within-class variance, large between-class separation.

Projection of a sample onto the direction $w$:
$$z_i = w^Tx_i$$
Mean of the $z_i$:
$$\overline{z} = \frac{1}{N}\sum_{i=1}^Nz_i$$
Variance of the $z_i$:
$$S_z = \frac{1}{N}\sum_{i=1}^N (z_i - \overline{z})(z_i - \overline{z})^T= \frac{1}{N}\sum_{i=1}^N (w^Tx_i- \overline{z})(w^Tx_i - \overline{z})^T$$
For class-1 samples:
$$\overline{z}_1 = \frac{1}{N_1}\sum_{i=1}^{N_1}z_i, \qquad S_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} (w^Tx_i- \overline{z}_1)(w^Tx_i - \overline{z}_1)^T$$
For class-2 samples:
$$\overline{z}_2 = \frac{1}{N_2}\sum_{i=1}^{N_2}z_i, \qquad S_2 = \frac{1}{N_2}\sum_{i=1}^{N_2} (w^Tx_i- \overline{z}_2)(w^Tx_i - \overline{z}_2)^T$$

Model: objective function

Between-class:
$$(\overline{z}_1 -\overline{z}_2)^2$$
Within-class:
$$S_1 + S_2$$
Objective function:
$$J(w)=\frac{(\overline{z}_1-\overline{z}_2)^2}{S_1 +S_2}$$
Furthermore:
$$\overline{z}_1 -\overline{z}_2=\frac{1}{N_1}\sum_{i=1}^{N_1} w^Tx_i - \frac{1}{N_2}\sum_{i=1}^{N_2} w^Tx_i =w^T\Big(\frac{1}{N_1}\sum_{i=1}^{N_1} x_i - \frac{1}{N_2}\sum_{i=1}^{N_2} x_i\Big)=w^T(\overline{X}_{c1}-\overline{X}_{c2})$$
$$S_1 =\frac{1}{N_1}\sum_{i=1}^{N_1} (w^Tx_i- \overline{z}_1)(w^Tx_i - \overline{z}_1)^T =\frac{1}{N_1}\sum_{i=1}^{N_1} w^T(x_i-\overline{X}_{c1})(x_i-\overline{X}_{c1})^Tw \\ =w^T\cdot \frac{1}{N_1}\sum_{i=1}^{N_1} (x_i-\overline{X}_{c1})(x_i-\overline{X}_{c1})^T \cdot w =w^T S_{c1} w$$
Likewise:
$$S_2 =w^T S_{c2} w$$
Therefore:
$$J(w)=\frac{(\overline{z}_1-\overline{z}_2)^2}{S_1 +S_2}=\frac{w^T(\overline{X}_{c1}-\overline{X}_{c2})(\overline{X}_{c1}-\overline{X}_{c2})^Tw}{w^T (S_{c1}+S_{c2}) w}=\frac{w^T S_b w}{w^T S_w w}$$
Between-class scatter:
$$S_b = (\overline{X}_{c1}-\overline{X}_{c2})(\overline{X}_{c1}-\overline{X}_{c2})^T$$
Within-class scatter:
$$S_w = S_{c1}+S_{c2}$$

Solution

$$\hat{w} = \argmax_w J(w)=\argmax_w \frac{w^T S_b w}{w^T S_w w}$$
$$\frac{\partial J(w)}{\partial w}=2S_bw(w^TS_ww)^{-1}-2(w^TS_bw)(w^TS_ww)^{-2}S_ww=0$$
$$S_bw(w^TS_ww)-(w^TS_bw)S_ww=0$$
Since $w^T \in \mathbb{R}^{1\times p}$, $w \in \mathbb{R}^{p\times 1}$, and $S_b, S_w \in \mathbb{R}^{p \times p}$, both $w^TS_ww$ and $w^TS_bw$ are scalars.
Therefore:
$$S_ww = \frac{w^TS_ww}{w^TS_bw} S_b w$$
$$w = \frac{w^TS_ww}{w^TS_bw} S_w^{-1}S_b w= \frac{w^TS_ww}{w^TS_bw} S_w^{-1} (\overline{X}_{c1}-\overline{X}_{c2})(\overline{X}_{c1}-\overline{X}_{c2})^Tw$$
Since $(\overline{X}_{c1}-\overline{X}_{c2}) \in \mathbb{R}^{p\times 1}$ and $(\overline{X}_{c1}-\overline{X}_{c2})^T \in \mathbb{R}^{1\times p}$, the factor $(\overline{X}_{c1}-\overline{X}_{c2})^Tw$ is also a scalar.

Only the direction of $w$ matters, so absorb all scalar factors into $\lambda$:
$$\hat{w} = \lambda S_w^{-1} (\overline{X}_{c1}-\overline{X}_{c2}), \qquad \lambda = \frac{w^TS_ww}{w^TS_bw}(\overline{X}_{c1}-\overline{X}_{c2})^Tw$$
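The closed-form direction $\hat{w} \propto S_w^{-1}(\overline{X}_{c1}-\overline{X}_{c2})$ can be sketched as follows; a minimal NumPy illustration assuming $S_w$ is invertible (the function name is an assumption, not a library API).

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher discriminant direction w ∝ S_w^{-1} (mean1 - mean2).
    X1: (N1, p) class-1 samples; X2: (N2, p) class-2 samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter S_w = S_c1 + S_c2 (1/N_k-normalized, as in the derivation)
    S1 = (X1 - m1).T @ (X1 - m1) / len(X1)
    S2 = (X2 - m2).T @ (X2 - m2) / len(X2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)  # S_w^{-1} (m1 - m2); overall scale is irrelevant
    return w / np.linalg.norm(w)      # return a unit direction
```

Classification then thresholds the projection $w^Tx$, e.g. at the midpoint of the projected class means.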

Logistic Regression

Probabilistic discriminative model → $p(y|x)$

Idea of a probabilistic discriminative model:
solve for $p(y|x)$ directly, i.e. model the conditional distribution $p(y|x)$:
$$\hat{y} = \argmax_{y \in \{0,1\}} p(y|x)$$

Sigmoid function:
$$\sigma(z) = \frac{1}{1+\exp(-z)}$$
$p(y|x;w)$:
$$p(y=1|x;w)=\sigma(w^Tx), \qquad p(y=0|x;w)=1-\sigma(w^Tx), \qquad p(y|x;w)=p(y=1|x;w)^{y}\,p(y=0|x;w)^{1-y}$$
With i.i.d. data, maximum likelihood estimation (MLE):
$$\hat{w}=\argmax_w \log P(Y|X)=\argmax_w \log\prod_{i=1}^Np(y_i|x_i;w)=\argmax_w\sum_{i=1}^N\log p(y_i|x_i;w)\\=\argmax_w \sum_{i=1}^N \Big[ y_i\log p(y_i=1|x_i;w)+(1-y_i)\log p(y_i=0|x_i;w) \Big]$$
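The MLE above has no closed form, but the log-likelihood is concave with gradient $\sum_i (y_i - \sigma(w^Tx_i))x_i$, so gradient ascent works. A minimal NumPy sketch (function names and hyperparameters are illustrative; a bias is assumed to be absorbed as a constant-1 feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, lr=0.1, n_iter=2000):
    """Maximize sum_i [y_i log σ(w^T x_i) + (1-y_i) log(1-σ(w^T x_i))]
    by gradient ascent. X: (N, p); y: (N,) labels in {0, 1}.
    Gradient of the mean log-likelihood: X^T (y - σ(Xw)) / N."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w += lr * X.T @ (y - sigmoid(X @ w)) / len(y)
    return w
```

Prediction is $\hat{y} = 1$ iff $\sigma(w^Tx) > 0.5$, i.e. iff $w^Tx > 0$.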

Gaussian Discriminant Analysis

Tags: soft classification, continuous data, probabilistic generative model

Gaussian discriminant analysis: probabilistic generative model → $p(x, y)$

Idea:
compare $p(y=1|x)$ with $p(y=0|x)$ to classify.
By Bayes' theorem:
$$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$$
When comparing $p(y=1|x)$ with $p(y=0|x)$, the denominator $p(x)$ is the same, so:
$$p(y|x) \propto p(x|y)p(y)$$
Since $p(x|y)p(y)=p(x,y)$, a generative model directly models the joint distribution $p(x,y)$. Here $p(y)$ is the prior, $p(x|y)$ is the likelihood, and $p(y|x)$ is the posterior.

Therefore:
$$\hat{y}=\argmax_{y \in \{0,1\}} p(y|x)=\argmax_{y \in \{0,1\}} p(x|y)p(y)$$
GDA assumes the prior $y \sim \mathrm{Bernoulli}(\phi)$ and the likelihoods $x|y=1 \sim \mathcal{N}(\mu_1, \Sigma)$, $x|y=0 \sim \mathcal{N}(\mu_2, \Sigma)$.

Prior: the Bernoulli distribution
$$p(y=1) = \phi, \qquad p(y=0) = 1-\phi$$
which can be written compactly as:
$$p(y) = \phi^y \cdot (1-\phi)^{1-y}$$
Likewise, the likelihood:
$$p(x|y) = \mathcal{N}(\mu_1, \Sigma)^y \cdot \mathcal{N}(\mu_2, \Sigma)^{1-y}$$

Log-likelihood:
$$\mathcal{L}(\theta)=\log \prod_{i=1}^N p(x_i, y_i)=\sum_{i=1}^N \big[\log p(x_i|y_i)+\log p(y_i)\big]\\=\sum_{i=1}^N \log \mathcal{N}(\mu_1, \Sigma)^{y_i} + \log \mathcal{N}(\mu_2, \Sigma)^{1-y_i}+\log \big[ \phi^{y_i}\cdot (1-\phi)^{1-y_i}\big]$$
$$\theta = (\mu_1, \mu_2, \Sigma, \phi), \qquad \hat{\theta}=\argmax_\theta \mathcal{L}(\theta)$$

Solving for $\phi$:
$$\hat{\phi}=\argmax_\phi \sum_{i=1}^N \log \big[ \phi^{y_i}\cdot (1-\phi)^{1-y_i}\big]=\argmax_\phi \sum_{i=1}^N \big[ y_i\log\phi + (1-y_i) \log(1-\phi) \big]$$
Setting the derivative w.r.t. $\phi$ to zero:
$$\sum_{i=1}^N \Big[ \frac{y_i}{\phi}-\frac{1-y_i}{1-\phi} \Big]=0 \;\Rightarrow\; \sum_{i=1}^N \big[ (1-\phi)y_i-(1-y_i)\phi \big]=0 \;\Rightarrow\; \sum_{i=1}^N (y_i - \phi) = 0$$
Therefore:
$$\hat{\phi}=\frac{1}{N}\sum_{i=1}^N y_i = \frac{N_1}{N}$$

Solving for $\mu_1$:
$$\hat{\mu}_1=\argmax_{\mu_1} \sum_{i=1}^N y_i \log \mathcal{N}(\mu_1, \Sigma)\\=\argmax_{\mu_1} \sum_{i=1}^N y_i \log \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\Big)\\=\argmax_{\mu_1} \sum_{i=1}^N y_i \Big[ -\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\Big]$$
Expanding the quadratic form:
$$\sum_{i=1}^N y_i \Big[ -\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\Big]=-\frac{1}{2}\sum_{i=1}^N y_i \Big[x_i^T\Sigma^{-1}x_i-2\mu_1^T\Sigma^{-1}x_i+\mu_1^T\Sigma^{-1}\mu_1\Big]$$
Setting the derivative w.r.t. $\mu_1$ to zero:
$$-\frac{1}{2}\sum_{i=1}^N y_i \big[ 2\Sigma^{-1}\mu_1-2\Sigma^{-1}x_i\big]=0 \;\Rightarrow\; \sum_{i=1}^N y_i (\mu_1-x_i)=0$$
Therefore:
$$\hat{\mu}_1=\frac{\sum_{i=1}^N y_ix_i}{\sum_{i=1}^N y_i}$$

Likewise, solving for $\mu_2$ (the same derivation with weight $1-y_i$ in place of $y_i$):
$$\hat{\mu}_2=\argmax_{\mu_2} \sum_{i=1}^N (1-y_i) \log \mathcal{N}(\mu_2, \Sigma)=\argmax_{\mu_2} \sum_{i=1}^N (1-y_i) \Big[ -\frac{1}{2}(x_i-\mu_2)^T\Sigma^{-1}(x_i-\mu_2)\Big]$$
Setting the derivative w.r.t. $\mu_2$ to zero:
$$\sum_{i=1}^N (1-y_i) (\mu_2-x_i)=0$$
Therefore:
$$\hat{\mu}_2=\frac{\sum_{i=1}^N (1-y_i)x_i}{\sum_{i=1}^N (1-y_i)}$$
Solving for $\Sigma$:
$$\hat{\Sigma}=\argmax_\Sigma \sum_{i=1}^N \big[ y_i\log \mathcal{N}(\mu_1, \Sigma)+ (1-y_i) \log \mathcal{N}(\mu_2, \Sigma)\big]$$
Let $C_1=\{x_i \mid y_i=1\}$ and $C_2=\{x_i \mid y_i=0\}$, with $|C_1|=N_1$, $|C_2|=N_2$, $N_1+N_2=N$.

Therefore:
$$\hat{\Sigma}= \argmax_\Sigma \Big[ \sum_{x_i\in C_1}\log \mathcal{N}(\mu_1, \Sigma)+\sum_{x_i\in C_2}\log \mathcal{N}(\mu_2, \Sigma) \Big]$$
Since:
$$\sum_{i=1}^N \log \mathcal{N}(\mu, \Sigma)=\sum_{i=1}^N \log \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\Big)\\=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\sum_{i=1}^N (x_i-\mu)^T\Sigma^{-1}(x_i-\mu)$$
Because $x_i\in\mathbb{R}^{p\times1}$ and $\Sigma^{-1}\in\mathbb{R}^{p\times p}$, each quadratic form $(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)$ is a scalar.

In addition, the trace identities:
$$\mathrm{trace}(AB)=\mathrm{trace}(BA), \qquad \mathrm{trace}(ABC)=\mathrm{trace}(CAB)=\mathrm{trace}(BCA)$$
$$\frac{\partial\, \mathrm{trace}(AB)}{\partial A}=B^T, \qquad \frac{\partial |A|}{\partial A}=|A|\cdot A^{-1} \quad (A \text{ symmetric})$$
So:
$$\sum_{i=1}^N \log \mathcal{N}(\mu, \Sigma)=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\sum_{i=1}^N \mathrm{trace}\big((x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\big)\\=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\, \mathrm{trace}\Big(\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T\,\Sigma^{-1}\Big)$$
With the sample covariance matrix
$$S=\frac{1}{N} \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T$$
this becomes:
$$\sum_{i=1}^N \log \mathcal{N}(\mu, \Sigma)=C-\frac{N}{2}\log |\Sigma|-\frac{1}{2}\, \mathrm{trace}(NS\,\Sigma^{-1})=C-\frac{N}{2}\log |\Sigma|-\frac{N}{2}\, \mathrm{trace}(S\,\Sigma^{-1})$$
Then:
$$\hat{\Sigma}= \argmax_\Sigma \Big[ \sum_{x_i\in C_1}\log \mathcal{N}(\mu_1, \Sigma)+\sum_{x_i\in C_2}\log \mathcal{N}(\mu_2, \Sigma) \Big]=\argmax_\Sigma \Big[-\frac{N}{2}\log |\Sigma|-\frac{N_1}{2}\, \mathrm{trace}(S_1\Sigma^{-1}) -\frac{N_2}{2}\, \mathrm{trace}(S_2\Sigma^{-1}) + C\Big]$$
Setting the derivative w.r.t. $\Sigma$ to zero, using the identities above together with $\frac{\partial\, \mathrm{trace}(S\Sigma^{-1})}{\partial \Sigma} = -\Sigma^{-1}S^T\Sigma^{-1}$:
$$-\frac{N}{2}\Sigma^{-1}+\frac{N_1}{2}\Sigma^{-1}S_1^T\Sigma^{-1}+\frac{N_2}{2}\Sigma^{-1}S_2^T\Sigma^{-1}=0$$
$$N\Sigma-N_1S_1^T-N_2S_2^T=0$$
Since $S=\frac{1}{N} \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T$ is symmetric, $S_1^T=S_1$ and $S_2^T=S_2$.

Therefore:
$$N\Sigma-N_1S_1-N_2S_2=0$$
$$\hat{\Sigma}=\frac{N_1S_1+N_2S_2}{N}$$
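The four MLE formulas ($\hat{\phi}$, $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\Sigma}$) can be sketched together; a minimal NumPy illustration with labels in $\{0, 1\}$ (the function name is an assumption).

```python
import numpy as np

def gda_fit(X, y):
    """MLE for Gaussian discriminant analysis with a shared covariance.
    X: (N, p); y: (N,) labels in {0, 1}.
    Returns (phi, mu1, mu0, Sigma)."""
    X1, X0 = X[y == 1], X[y == 0]
    N, N1, N0 = len(X), len(X1), len(X0)
    phi = N1 / N                          # phi_hat = (1/N) sum_i y_i = N1/N
    mu1 = X1.mean(axis=0)                 # mu1_hat = sum_i y_i x_i / sum_i y_i
    mu0 = X0.mean(axis=0)                 # mu2_hat, written mu0 for the y=0 class
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1   # per-class covariance
    S0 = (X0 - mu0).T @ (X0 - mu0) / N0
    Sigma = (N1 * S1 + N0 * S0) / N       # Sigma_hat = (N1 S1 + N2 S2) / N
    return phi, mu1, mu0, Sigma
```

Prediction then evaluates $\mathcal{N}(x; \mu_k, \Sigma)\,p(y=k)$ for $k \in \{0, 1\}$ and takes the argmax.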

Naive Bayes Classifier

Tags: naive Bayes, soft classification, discrete data, probabilistic generative model, the simplest probabilistic graphical model (directed graph)

Idea: the naive Bayes assumption (a.k.a. the conditional independence assumption) → $x_i \bot x_j \mid y$ (the motivation is to simplify computation), so
$$p(x|y)=\prod_{j=1}^p p(x_j|y)$$
When $x_j$ is discrete, $x_j \sim$ Categorical; when $x_j$ is continuous, $x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$ (Gaussian). In general, naive Bayes is used for discrete data.

For a single trial, an outcome in $0/1$ (e.g. a coin flip) corresponds to the Bernoulli distribution, and an outcome in $1,2,3,...,K$ (e.g. a die roll) to the Categorical distribution; for $N$ trials, an outcome in $0/1$ corresponds to the Binomial distribution, and an outcome in $1,2,3,...,K$ to the Multinomial distribution.

Goal: given $x$, with $y \in \{0, 1\}$,
$$\hat{y}=\argmax_y p(y|x)=\argmax_{y\in \{0,1\}} p(x|y)p(y)$$
For binary classification, $p(y)\sim$ Bernoulli; for multi-class, $p(y)\sim$ Categorical.

The parameters can be estimated by maximum likelihood (MLE).
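A minimal sketch for binary features and binary labels (a Bernoulli naive Bayes). Note one addition beyond the plain MLE described above: Laplace smoothing `alpha` is included so that unseen feature values do not produce zero probabilities; function names are illustrative.

```python
import numpy as np

def nb_fit(X, y, alpha=1.0):
    """Estimate p(y) and p(x_j = 1 | y) from binary data.
    X: (N, p) 0/1 features; y: (N,) labels in {0, 1}.
    alpha > 0 adds Laplace smoothing (alpha = 0 recovers the plain MLE)."""
    priors, cond = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)                                # p(y = c)
        cond[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)  # p(x_j = 1 | y = c)
    return priors, cond

def nb_predict(x, priors, cond):
    """argmax_y p(y) * prod_j p(x_j | y), computed in log space for stability."""
    scores = {}
    for c in (0, 1):
        log_lik = np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1 - cond[c]))
        scores[c] = np.log(priors[c]) + log_lik
    return max(scores, key=scores.get)
```

The conditional-independence assumption is what lets the likelihood factor into the per-feature product inside `nb_predict`.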
