4.1 Intuition

Conditional probability

Example 4.1:

My girlfriend and my mother have both fallen into a river. A passerby holds out three beans: two red and one green. If I draw a red bean I save my girlfriend; if I draw the green bean I save my mother. The passerby and I each draw one bean. He finds that his is the green one and offers to swap the remaining bean for mine. Should I swap? Is my girlfriend's chance of survival the same either way?
Intuitively:

Whether or not I swap, the probability that I drew a red bean should be 1/3. Then the passerby tells me his bean is green; with one bean ruled out, the probability that mine is red becomes 1/2. So swapping or not, the probability is 1/2.
Now compute:

If I swap, I am effectively re-drawing from the two remaining beans, so the probability is:

$$P(A \mid B)=\frac{P(B \mid A) P(A)}{P(B)}=\frac{1 \cdot \frac{1}{3}}{\frac{2}{3}}=\frac{1}{2}$$
If I do not swap, the probability is still the one from the original simultaneous draw:

$$P(A \mid B)=\frac{P(B \mid A) P(A)}{P(B)}=\frac{1 \cdot \frac{1}{3}}{1}=\frac{1}{3}$$
Let $A$ denote that I drew a red bean and $B$ that the passerby drew the green bean. The difference lies in the order of the draws: swapping means making a second draw on top of the first. Conclusion: to save the girlfriend, it is better to swap with the passerby; to save Mom, better not to swap.
Conditional probability:

$P(A \mid B)$ is the probability that $A$ occurs given that $B$ has occurred.

$$P(A \mid B)=\frac{P(A B)}{P(B)}=\frac{P(B \mid A) P(A)}{P(B)}$$
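As a quick sanity check, the formula can be evaluated numerically. The sketch below is plain Python; it simply applies Bayes' rule to the (given) numbers from Example 4.1:

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Swap case from Example 4.1: P(B|A) = 1, P(A) = 1/3, P(B) = 2/3.
print(bayes(1.0, 1/3, 2/3))   # 0.5
# No-swap case: P(B|A) = 1, P(A) = 1/3, P(B) = 1.
print(bayes(1.0, 1/3, 1.0))   # 0.333...
```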
Parameter estimation

Example 4.2:

Suppose we have a handwritten-digit dataset of 100 records. Records 0-9 are the digit 0 written by 10 different people, records 10-19 are the digit 1 written by 10 people, $\cdots\cdots$, and records 90-99 are the digit 9 written by 10 people. Xiaohong writes a digit $X$; how do we decide which digit it is?
How naive Bayes works:

$$P(Y=0 \mid X)=?, \quad P(Y=1 \mid X)=?, \quad \cdots\cdots, \quad P(Y=9 \mid X)=?$$

The digit with the highest probability is the answer.
For this handwritten dataset, let $C_{k}$ denote the digit classes, with $C_{0}$ for digit $0$, $\cdots\cdots$. The decision quantity above can then be written as $P\left(Y=C_{k} \mid X=x\right)$.
$$\begin{aligned} P\left(Y=C_{k} \mid X=x\right) &=\frac{P\left(X=x \mid Y=C_{k}\right) P\left(Y=C_{k}\right)}{P(X=x)} \\ &=\frac{P\left(X=x \mid Y=C_{k}\right) P\left(Y=C_{k}\right)}{\sum_{k} P\left(X=x, Y=C_{k}\right)} \\ &=\frac{P\left(X=x \mid Y=C_{k}\right) P\left(Y=C_{k}\right)}{\sum_{k} P\left(X=x \mid Y=C_{k}\right) P\left(Y=C_{k}\right)} \\ &=\frac{P\left(Y=C_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=C_{k}\right)}{\sum_{k} P\left(Y=C_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=C_{k}\right)} \end{aligned}$$
In addition:

$$\begin{aligned} P\left(X=x \mid Y=C_{k}\right) &=P\left(X^{(1)}=x^{(1)} \mid Y=C_{k}\right) P\left(X^{(2)}=x^{(2)} \mid Y=C_{k}\right) \cdots P\left(X^{(n)}=x^{(n)} \mid Y=C_{k}\right) \\ &=\prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=C_{k}\right) \end{aligned}$$
The meaning of "naive": the features are assumed independent given the class.
$$\begin{aligned} f(x)=\underset{C_{k}}{\operatorname{argmax}}\, P\left(Y=C_{k} \mid X=x\right) &=\underset{C_{k}}{\operatorname{argmax}} \frac{P\left(Y=C_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=C_{k}\right)}{\sum_{k} P\left(Y=C_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=C_{k}\right)} \\ &=\underset{C_{k}}{\operatorname{argmax}}\, P\left(Y=C_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=C_{k}\right) \end{aligned}$$

(The denominator is the same for every class, so it can be dropped from the argmax.)
Moreover, the prior is estimated by counting:

$$P\left(Y=C_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=C_{k}\right)}{N}, \quad k=1,2, \ldots, K$$
where $I$ is the indicator function:

$$I(x)= \begin{cases}1, & \text{if } x \text{ is true} \\ 0, & \text{if } x \text{ is false}\end{cases}$$
Suppose the $j$-th feature $x^{(j)}$ takes values in the set $\left\{a_{j 1}, a_{j 2}, \ldots, a_{j S_{j}}\right\}$. Then
$$\begin{gathered} P\left(X^{(j)}=a_{j l} \mid Y=C_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=C_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=C_{k}\right)} \\ j=1,2, \ldots, n ; \quad l=1,2, \ldots, S_{j} ; \quad k=1,2, \ldots, K \end{gathered}$$
Note the two counting steps here:

- Count how many samples fall in each class. For example, if the dataset has 100 samples and the digit 5 appears 20 times, then $P\left(Y=C_{5}\right)=\dfrac{20}{100}=0.2$.
- Among the samples of class $C_{k}$, count the fraction for which $x_{i}^{(j)}=a_{j l}$.
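The two counting steps can be sketched directly. This is a minimal illustration on a tiny single-feature dataset invented for the example (not the 100-record digit dataset):

```python
from collections import Counter

# Toy dataset: (value of feature x^(1), label). Invented for illustration.
data = [(1, 0), (2, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 1), (1, 0)]

N = len(data)
label_counts = Counter(y for _, y in data)

# P(Y = c_k): fraction of samples with label c_k.
prior = {c: n / N for c, n in label_counts.items()}

# P(X^(1) = a | Y = c_k): among samples of class c_k,
# the fraction whose feature value equals a.
pair_counts = Counter(data)
cond = {(a, c): pair_counts[(a, c)] / label_counts[c]
        for (a, c) in pair_counts}

print(prior)          # {0: 0.5, 1: 0.5}
print(cond[(1, 0)])   # 0.75
```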
Algorithm 4.1: Naive Bayes

Input: training data $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i}=\left(x_{i}^{(1)}, x_{i}^{(2)}, \ldots, x_{i}^{(n)}\right)^{T}$, $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_{i}^{(j)} \in\left\{a_{j 1}, a_{j 2}, \ldots, a_{j S_{j}}\right\}$, $a_{j l}$ is the $l$-th possible value of the $j$-th feature, $j=1,2, \ldots, n$, $l=1,2, \ldots, S_{j}$, $y_{i} \in\left\{c_{1}, c_{2}, \ldots, c_{K}\right\}$; and an instance $x$.

Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:

$$\begin{gathered} P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \ldots, K \\ P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}, \quad j=1,2, \ldots, n ; l=1,2, \ldots, S_{j} ; k=1,2, \ldots, K \end{gathered}$$
(2) For a given instance $x=\left(x^{(1)}, x^{(2)}, \ldots, x^{(n)}\right)^{T}$, compute
$$P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right), \quad k=1,2, \ldots, K$$
(3) Determine the class of instance $x$:

$$y=\underset{c_{k}}{\operatorname{argmax}}\, P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
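Steps (1)-(3) fit in a short sketch. The two-feature dataset below is invented for illustration; the code is a minimal unsmoothed implementation, not a production one:

```python
from collections import Counter, defaultdict

def fit(X, y):
    """Step (1): estimate P(Y=c_k) and P(X^(j)=a | Y=c_k) by counting."""
    N = len(y)
    prior = {c: n / N for c, n in Counter(y).items()}
    cond = defaultdict(dict)  # cond[c][(j, a)] = P(X^(j)=a | Y=c)
    for c in prior:
        idx = [i for i in range(N) if y[i] == c]
        for j in range(len(X[0])):
            counts = Counter(X[i][j] for i in idx)
            for a, n in counts.items():
                cond[c][(j, a)] = n / len(idx)
    return prior, cond

def predict(x, prior, cond):
    """Steps (2)-(3): score each class, then take the argmax."""
    scores = {}
    for c, p in prior.items():
        s = p
        for j, a in enumerate(x):
            s *= cond[c].get((j, a), 0.0)  # unseen value -> probability 0
        scores[c] = s
    return max(scores, key=scores.get)

# Toy data, invented for illustration.
X = [(1, 'S'), (1, 'M'), (1, 'M'), (2, 'S'), (2, 'M'), (2, 'L')]
y = [0, 0, 1, 0, 1, 1]
prior, cond = fit(X, y)
print(predict((2, 'M'), prior, cond))   # 1
```

Class 1 wins here because both $P(X^{(1)}=2 \mid Y=1)$ and $P(X^{(2)}=\mathrm{M} \mid Y=1)$ are $2/3$, versus $1/3$ each for class 0.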
Summary

- Conditional probability: $P(A \mid B)=\dfrac{P(A B)}{P(B)}=\dfrac{P(B \mid A) P(A)}{P(B)}$
- The derivation from the conditional-probability formula down to the parameter-estimation steps is worth memorizing (substitute step by step until the expression grows complicated, then simplify at the end because the denominator is a constant).
- Many later derivations follow the same pattern: expand through substitutions, then simplify (inflate first, then deflate).
4.2 Bayesian Estimation

The estimate we obtained in the previous section,
$$\begin{gathered} P\left(X^{(j)}=a_{j l} \mid Y=C_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=C_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=C_{k}\right)} \\ j=1,2, \ldots, n ; \quad l=1,2, \ldots, S_{j} ; \quad k=1,2, \ldots, K \end{gathered}$$
has a problem: the denominator can be 0!
Example:

Suppose the dataset has 100 samples, of which 10 belong to digit 0, 0 belong to digit 1, $\cdots\cdots$. When we try to compute $P\left(X^{(j)}=a_{j l} \mid Y=C_{1}\right)$ $\cdots\cdots$ the formula above cannot be used directly.
We therefore modify the formula slightly:

$$P\left(X^{(j)}=a_{j l} \mid Y=C_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=C_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=C_{k}\right)+S_{j} \lambda}$$
where $S_{j}$ is the number of possible values the $j$-th feature $x^{(j)}$ can take.
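The fix is a one-line change to the counting estimate. In the sketch below, `count_a_and_c` and `count_c` stand for the two sums in the formula and `S_j` for the number of possible feature values (the function name and defaults are illustrative; `lam=1` is the classic Laplace-smoothing choice):

```python
def smoothed_cond_prob(count_a_and_c, count_c, S_j, lam=1.0):
    """P(X^(j)=a_jl | Y=C_k) with additive (lam=1: Laplace) smoothing."""
    return (count_a_and_c + lam) / (count_c + S_j * lam)

# A class with no samples (count_c = 0) no longer gives 0/0: with
# S_j = 3 possible values, each value gets the uniform probability 1/3.
print(smoothed_cond_prob(0, 0, 3))    # 0.3333...
# With data, the estimate is pulled slightly toward uniform.
print(smoothed_cond_prob(4, 10, 3))   # (4+1)/(10+3) ~ 0.3846
```

Note that for fixed $k$ and $j$ the smoothed probabilities still sum to 1 over the $S_j$ values, so the result remains a valid distribution.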
4.3 Maximizing the Posterior Probability

Maximizing the posterior probability is equivalent to minimizing the expected risk. Suppose naive Bayes uses the 0-1 loss function:
$$L(Y, f(x))= \begin{cases}1, & Y \neq f(x) \\ 0, & Y=f(x)\end{cases}$$
The expected risk is then:

$$\begin{aligned} R_{\exp }(f) &=E[L(Y, f(x))] \\ &=E_{x} \sum_{k=1}^{K}\left[L\left(C_{k}, f(x)\right)\right] P\left(C_{k} \mid X=x\right) \end{aligned}$$
It suffices to minimize pointwise, for each $X=x$:

$$\begin{aligned} f(x) &=\underset{y \in \mathcal{Y}}{\operatorname{argmin}} \sum_{k=1}^{K}\left[L\left(C_{k}, y\right)\right] P\left(C_{k} \mid X=x\right) \\ &=\underset{y \in \mathcal{Y}}{\operatorname{argmin}} \sum_{k=1}^{K} P\left(y \neq C_{k} \mid X=x\right) \\ &=\underset{y \in \mathcal{Y}}{\operatorname{argmin}}\left(1-P\left(y=C_{k} \mid X=x\right)\right) \\ &=\underset{y \in \mathcal{Y}}{\operatorname{argmax}}\, P\left(y=C_{k} \mid X=x\right) \end{aligned}$$
Thus the expected-risk-minimization criterion becomes the posterior-probability-maximization criterion, which is exactly the principle naive Bayes adopts.
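A tiny numeric check of this equivalence: under 0-1 loss, the expected risk of predicting $y$ is $1-P(Y=y \mid X=x)$, so minimizing the risk picks the same class as maximizing the posterior. (The posterior values below are made up.)

```python
# Hypothetical posterior P(C_k | X = x) over three classes.
posterior = {0: 0.2, 1: 0.5, 2: 0.3}

# Expected 0-1 loss of predicting y: sum of P(C_k | x) over k != y,
# which equals 1 - P(y | x).
risk = {y: sum(p for c, p in posterior.items() if c != y)
        for y in posterior}

best_by_risk = min(risk, key=risk.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_risk, best_by_posterior)   # 1 1
```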