4 Naive Bayes
4.1.1 Learning and Classification with Naive Bayes
Let the input space $\mathcal{X} \subseteq \mathbf{R}^n$ be a set of $n$-dimensional vectors, and let the output space be the set of class labels $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random variable defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X, Y)$ is the joint probability distribution of $X$ and $Y$. The training set

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

is generated i.i.d. from $P(X, Y)$.

Naive Bayes learns the joint distribution $P(X, Y)$ from the training set. Concretely, it learns the prior probability distribution

$$P(Y = c_k), \quad k = 1, 2, \ldots, K$$

and the conditional probability distribution

$$P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, \ldots, X^{(n)} = x^{(n)} \mid Y = c_k), \quad k = 1, 2, \ldots, K$$

which together determine the joint distribution.

The conditional distribution $P(X = x \mid Y = c_k)$ has exponentially many parameters, so estimating it directly is infeasible: if $x^{(j)}$ can take $S_j$ values, $j = 1, 2, \ldots, n$, and $Y$ can take $K$ values, the number of parameters is $K \prod_{j=1}^{n} S_j$.

To get around this, naive Bayes makes the conditional independence assumption; the method is called "naive" because this is a strong assumption:

$$P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, \ldots, X^{(n)} = x^{(n)} \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)
$$
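To make the savings concrete, here is a quick count of both parameterizations under made-up sizes (the class count `K` and feature cardinalities `S` below are hypothetical, chosen only for illustration):

```python
from math import prod

# Hypothetical sizes: K classes; feature j takes S[j] possible values.
K = 3
S = [4, 5, 6]

# Without the independence assumption: one entry per joint feature
# configuration per class, i.e. K * prod_j S_j parameters.
full_params = K * prod(S)

# With conditional independence: one table per (feature, class) pair,
# i.e. K * sum_j S_j conditional probabilities to estimate.
naive_params = K * sum(S)

print(full_params, naive_params)  # 360 45
```

The gap widens rapidly with the number of features, since the full table grows multiplicatively while the naive table grows additively.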
With this assumption, the posterior probability is, by Bayes' theorem,

$$P(Y = c_k \mid X = x) = \frac{P(X = x \mid Y = c_k) P(Y = c_k)}{\sum_k P(X = x \mid Y = c_k) P(Y = c_k)}$$

Substituting the independence assumption gives

$$P(Y = c_k \mid X = x) = \frac{P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)}$$

so the classifier is

$$y = f(x) = \underset{c_k}{\arg\max} \frac{P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)}$$

Since the denominator is the same for every $c_k$, this simplifies to

$$y = f(x) = \underset{c_k}{\arg\max}\, P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
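The decision rule can be sketched directly as code; the priors and conditional tables below are hypothetical numbers chosen for illustration, not estimates from any dataset:

```python
def nb_classify(x, prior, cond):
    """Return argmax_k P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k).

    prior: dict class -> P(Y=c_k)
    cond:  dict class -> list (one dict per feature j) mapping
           value -> P(X^(j)=value | Y=c_k)
    """
    best, best_score = None, -1.0
    for c, p in prior.items():
        score = p
        for j, v in enumerate(x):
            score *= cond[c][j].get(v, 0.0)  # unseen value -> probability 0 under MLE
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical two-feature, two-class tables.
prior = {"c1": 0.6, "c2": 0.4}
cond = {
    "c1": [{"a": 0.7, "b": 0.3}, {"s": 0.9, "t": 0.1}],
    "c2": [{"a": 0.2, "b": 0.8}, {"s": 0.5, "t": 0.5}],
}
print(nb_classify(("b", "t"), prior, cond))  # c2: 0.4*0.8*0.5 beats c1: 0.6*0.3*0.1
```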
4.1.2 Definition of Posterior Probability Maximization
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk. Choose the 0-1 loss function

$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$

The expected risk is

$$R_{exp}(f) = E[L(Y, f(X))] = E_X \sum_{k=1}^{K} L(c_k, f(X))\, P(c_k \mid X)$$
To minimize the expected risk it suffices to minimize pointwise at each $X = x$, which gives

$$\begin{aligned} f(x) &= \underset{y \in \mathcal{Y}}{\arg\min} \sum_{k=1}^{K} L(c_k, y)\, P(c_k \mid X = x) \\ &= \underset{y \in \mathcal{Y}}{\arg\min} \sum_{k=1}^{K} P(y \neq c_k \mid X = x) \\ &= \underset{y \in \mathcal{Y}}{\arg\min}\, \big(1 - P(y = c_k \mid X = x)\big) \\ &= \underset{y \in \mathcal{Y}}{\arg\max}\, P(y = c_k \mid X = x) \end{aligned}$$

Thus the expected-risk minimization criterion yields the posterior probability maximization criterion, which is exactly the criterion naive Bayes adopts.
4.2 Parameter Estimation for Naive Bayes
4.2.1 Maximum Likelihood Estimation
The maximum likelihood estimate of the prior probability is

$$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \ldots, K$$

Proof:
First identify the parameters: they are $p(y = c_k)$ and $p(x^{(j)} = a_{jl} \mid y = c_k)$; let $\psi$ denote them collectively. The log-likelihood is

$$\begin{aligned} L(\psi) &= \log \prod_{i=1}^{N} p(x_i, y_i; \psi) \\ &= \log \prod_{i=1}^{N} p(x_i \mid y_i; \psi)\, p(y_i; \psi) \\ &= \log \prod_{i=1}^{N} \Big(\prod_{j=1}^{n} p(x_i^{(j)} \mid y_i; \psi)\Big) p(y_i; \psi) \\ &= \sum_{i=1}^{N} \Big[\log p(y_i; \psi) + \sum_{j=1}^{n} \log p(x_i^{(j)} \mid y_i; \psi)\Big] \end{aligned}$$

Substituting the parameters,

$$L(\psi) = \sum_{i=1}^{N} \Big[\sum_{k=1}^{K} I(y_i = c_k) \log p(y = c_k) + \sum_{k=1}^{K} \sum_{j=1}^{n} \sum_{l=1}^{S_j} I(x_i^{(j)} = a_{jl},\, y_i = c_k) \log p(x^{(j)} = a_{jl} \mid y = c_k)\Big]$$

The parameters $p(y = c_k)$ are subject to the constraint that they sum to one, and maximizing under a constraint calls for the method of Lagrange multipliers. Only the first term above involves $p(y = c_k)$, so when estimating the prior we only need to consider that part.
Estimating the prior
Form the Lagrangian

$$F = \sum_{i=1}^{N} \Big[\sum_{k=1}^{K} I(y_i = c_k) \log p(y = c_k) + \lambda \Big(1 - \sum_{k=1}^{K} p(y = c_k)\Big)\Big]$$

Note that the constraint term added here is not a single copy of $1 - \sum_{k=1}^{K} p(y = c_k)$ but rather $\sum_{i=1}^{N} \big(1 - \sum_{k=1}^{K} p(y = c_k)\big)$, i.e. one copy per sample. Since each copy equals zero this makes no difference, but the $N$-fold version makes the equations below easier to solve. Taking partial derivatives,

$$\begin{cases} \frac{\partial F}{\partial p(y = c_1)} = \sum_{i=1}^{N} \Big[\frac{I(y_i = c_1)}{p(y = c_1)} - \lambda\Big] = 0 \\ \frac{\partial F}{\partial p(y = c_2)} = \sum_{i=1}^{N} \Big[\frac{I(y_i = c_2)}{p(y = c_2)} - \lambda\Big] = 0 \\ \cdots \\ \frac{\partial F}{\partial p(y = c_K)} = \sum_{i=1}^{N} \Big[\frac{I(y_i = c_K)}{p(y = c_K)} - \lambda\Big] = 0 \\ \frac{\partial F}{\partial \lambda} = \sum_{i=1}^{N} \Big(1 - \sum_{k=1}^{K} p(y = c_k)\Big) = 0 \end{cases}$$

From the first $K$ equations,

$$\begin{cases} p(y = c_1) = \frac{\sum_{i=1}^{N} I(y_i = c_1)}{N \lambda} \\ p(y = c_2) = \frac{\sum_{i=1}^{N} I(y_i = c_2)}{N \lambda} \\ \cdots \\ p(y = c_K) = \frac{\sum_{i=1}^{N} I(y_i = c_K)}{N \lambda} \end{cases} \tag{2}$$

Since $\sum_{k=1}^{K} p(y = c_k) = 1$, summing these gives

$$1 = \frac{\sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i = c_k)}{N \lambda} = \frac{N}{N \lambda} \implies \lambda = 1$$

Substituting back into (2) yields

$$p(y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \ldots, K$$
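The resulting estimator is just frequency counting; a minimal sketch with an invented label list:

```python
from collections import Counter

# MLE of the prior: P(Y=c_k) = (number of samples with y_i = c_k) / N.
labels = ["c1", "c1", "c2", "c1", "c2", "c3"]  # toy data
N = len(labels)
prior = {c: cnt / N for c, cnt in Counter(labels).items()}

print(prior["c1"])  # 3 of 6 samples are c1, so 0.5
```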
Estimating the conditional probabilities
Form the Lagrangian

$$G = \sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{j=1}^{n} \Big[\sum_{l=1}^{S_j} I(x_i^{(j)} = a_{jl},\, y_i = c_k) \log p(x^{(j)} = a_{jl} \mid y = c_k) + \lambda_{kj} \Big(1 - \sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k)\Big)\Big]$$

As before, each pair $(k, j)$ carries a constraint $\sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k) = 1$, so there are $K \times n$ constraints in total. Differentiating,

$$\begin{cases} \frac{\partial G}{\partial p(x^{(j)} = a_{jl} \mid y = c_k)} = \sum_{i=1}^{N} \Big[\frac{I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{p(x^{(j)} = a_{jl} \mid y = c_k)} - \lambda_{kj}\Big] = 0 \\ \frac{\partial G}{\partial \lambda_{kj}} = \sum_{i=1}^{N} \Big(1 - \sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k)\Big) = 0 \end{cases} \tag{3}$$

From the first equation,

$$p(x^{(j)} = a_{jl} \mid y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{N \lambda_{kj}} \tag{4}$$

From the second equation,

$$\sum_{l=1}^{S_j} p(x^{(j)} = a_{jl} \mid y = c_k) = 1 \tag{5}$$

Combining the two,

$$1 = \sum_{l=1}^{S_j} \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{N \lambda_{kj}} = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N \lambda_{kj}} \implies N \lambda_{kj} = \sum_{i=1}^{N} I(y_i = c_k)$$

Substituting into (4) gives

$$p(x^{(j)} = a_{jl} \mid y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}$$

This completes the proof.
4.2.2 Learning and Classification Algorithm
Input: training data $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j = 1, 2, \ldots, n$, $l = 1, 2, \ldots, S_j$, $y_i \in \{c_1, c_2, \ldots, c_K\}$; and an instance $x$;
Output: the class of the instance $x$.
(1) Compute the prior and conditional probabilities

$$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \ldots, K$$

$$P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}, \quad j = 1, \ldots, n;\ \ l = 1, \ldots, S_j;\ \ k = 1, \ldots, K$$

(2) For the given instance $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T$, compute

$$P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k), \quad k = 1, 2, \ldots, K$$

(3) Determine the class of $x$:

$$y = \underset{c_k}{\arg\max}\, P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
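Steps (1)-(3) can be sketched end to end as follows; the tiny training set is invented for illustration (it is not the book's worked example), and `fit`/`predict` are hypothetical helper names:

```python
from collections import Counter, defaultdict

def fit(X, y):
    """Step (1): estimate P(Y=c_k) and P(X^(j)=a_jl | Y=c_k) by counting."""
    N, n = len(y), len(X[0])
    class_counts = Counter(y)
    prior = {c: class_counts[c] / N for c in class_counts}
    cond = {c: [defaultdict(float) for _ in range(n)] for c in class_counts}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond[yi][j][v] += 1.0
    for c in cond:
        for j in range(n):
            for v in cond[c][j]:
                cond[c][j][v] /= class_counts[c]
    return prior, cond

def predict(x, prior, cond):
    """Steps (2)-(3): score every class and return the argmax."""
    scores = {}
    for c, p in prior.items():
        s = p
        for j, v in enumerate(x):
            s *= cond[c][j].get(v, 0.0)  # unseen value -> 0 under MLE
        scores[c] = s
    return max(scores, key=scores.get)

# Toy data: two features (one numeric-coded, one categorical), labels +1/-1.
X = [(1, "S"), (1, "M"), (1, "M"), (2, "S"), (2, "L"), (2, "L")]
y = [-1, -1, 1, -1, 1, 1]
prior, cond = fit(X, y)
print(predict((2, "S"), prior, cond))  # -1
```

Note that the class $+1$ never sees the value `"S"`, so its score for this instance is exactly zero; this is the failure mode that motivates the Bayesian estimation of the next section.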
4.2.3 Bayesian Estimation
Maximum likelihood estimation can produce probability estimates equal to zero, which distorts the posterior computation and biases the classification. The remedy is Bayesian estimation:

$$P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda}$$

where $\lambda \geq 0$. This is equivalent to adding a positive count $\lambda$ to each possible value of the random variable; $\lambda = 0$ recovers maximum likelihood estimation. The common choice $\lambda = 1$ is called Laplace smoothing (Laplacian smoothing). For any $l = 1, 2, \ldots, S_j$ and $k = 1, 2, \ldots, K$ we have

$$P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) > 0, \qquad \sum_{l=1}^{S_j} P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) = 1$$

so the Bayesian estimate is indeed a probability distribution. Likewise, the Bayesian estimate of the prior probability is

$$P_{\lambda}(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K \lambda}, \quad k = 1, 2, \ldots, K$$
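A minimal sketch of both smoothed estimates (the toy counts are invented, and `smoothed_prior`/`smoothed_cond` are hypothetical names):

```python
from collections import Counter

def smoothed_prior(y, classes, lam=1.0):
    """P_lambda(Y=c_k) = (count(c_k) + lam) / (N + K*lam)."""
    counts, N, K = Counter(y), len(y), len(classes)
    return {c: (counts[c] + lam) / (N + K * lam) for c in classes}

def smoothed_cond(values, domain, lam=1.0):
    """P_lambda(X^(j)=a_jl | Y=c_k) over one feature within one class.

    values: the j-th feature of the samples labelled c_k
    domain: all S_j possible values of feature j
    """
    counts, n_c, S_j = Counter(values), len(values), len(domain)
    return {v: (counts[v] + lam) / (n_c + S_j * lam) for v in domain}

# With lam=1 (Laplace smoothing), a value never observed in class c_k
# still gets positive probability instead of the zero MLE would assign.
cond = smoothed_cond(["S", "M", "S"], ["S", "M", "L"], lam=1.0)
print(cond["L"])  # (0 + 1) / (3 + 3) = 1/6 rather than 0
```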
Summary
- Naive Bayes is a typical generative learning method: it learns the joint distribution $P(X, Y)$ from the training data and then derives the posterior distribution $P(Y \mid X)$.
- Its basic assumption is conditional independence. This eliminates a large number of parameters and greatly simplifies learning and prediction, so the method is efficient and easy to implement; the drawback is that its classification accuracy is not always high.
Exercises
Revisiting Bayesian estimation
Idea: assume the probability $P_{\lambda}(Y = c_k)$ follows a Dirichlet distribution; use Bayes' formula to show the posterior is also Dirichlet, then take the expectation of the parameters.
Proof steps:
- Assumptions
Following the basic setup of naive Bayes, with training data $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, assume:
(1) the random variable $Y$ takes the value $c_k$ exactly $m_k$ times, i.e. $m_k = \sum_{i=1}^{N} I(y_i = c_k)$, so $\sum_{k=1}^{K} m_k = N$ (there are $N$ labels in total);
(2) $P_{\lambda}(Y = c_k) = u_k$, where the random variables $u_k$ follow a Dirichlet distribution with parameter $\lambda$.
Supplementary notes:
- The Dirichlet distribution
From section 2.2.1 of PRML (Pattern Recognition and Machine Learning): multiplying the likelihood (2.34) by the prior (2.38) gives the posterior distribution of the parameters $u_k$, of the form

$$p(u \mid D, \alpha) \propto p(D \mid u)\, p(u \mid \alpha) \propto \prod_{k=1}^{K} u_k^{\alpha_k + m_k - 1}$$

From appendix B.4 of the same book: the Dirichlet distribution is a multivariate distribution over $K$ random variables $u_k$, $k = 1, 2, \ldots, K$, subject to the constraints

$$0 \leqslant u_k \leqslant 1, \quad \sum_{k=1}^{K} u_k = 1$$

Writing $u = (u_1, \ldots, u_K)^T$ and $\alpha = (\alpha_1, \ldots, \alpha_K)^T$,

$$Dir(u \mid \alpha) = C(\alpha) \prod_{k=1}^{K} u_k^{\alpha_k - 1}, \qquad E(u_k) = \frac{\alpha_k}{\sum_{k=1}^{K} \alpha_k}$$

- Why assume the probability of $Y = c_k$ follows a Dirichlet distribution?
Answer:
(1) First, per PRML appendix B.4, the Dirichlet distribution is the generalization of the Beta distribution.
(2) The Beta distribution is the conjugate prior of the binomial distribution, and the Dirichlet is the conjugate prior of the multinomial; a Dirichlet can be viewed as a "distribution over distributions".
(3) Because both are conjugate priors, the prior and posterior belong to the same family, so as more data is observed the estimate is adjusted within that family and the computed probability approaches the true value.
(4) Hence, for an event of unknown probability, a Beta or Dirichlet distribution can represent the distribution of that probability.
Thanks to the answers found online for this explanation.
- The prior:

$$P(u) = P(u_1, u_2, \ldots, u_K) = C(\lambda) \prod_{k=1}^{K} u_k^{\lambda - 1}$$

- The likelihood: writing $m = (m_1, m_2, \ldots, m_K)^T$, the likelihood function is

$$P(m \mid u) = u_1^{m_1} \cdot u_2^{m_2} \cdots u_K^{m_K} = \prod_{k=1}^{K} u_k^{m_k}$$

- The posterior: by Bayes' formula, the posterior distribution of $u$ is

$$P(u \mid m) = \frac{P(m \mid u)\, P(u)}{P(m)}$$

- Combining with the assumptions above,

$$P(u \mid m, \lambda) \propto P(m \mid u)\, P(u \mid \lambda) \propto \prod_{k=1}^{K} u_k^{\lambda + m_k - 1}$$

which shows that the posterior $P(u \mid m, \lambda)$ is also a Dirichlet distribution.
- The expectation of the random variable $u_k$: from the posterior $P(u \mid m, \lambda)$,

$$E(u_k) = \frac{\alpha_k}{\sum_{k=1}^{K} \alpha_k}$$

where $\alpha_k = \lambda + m_k$, hence

$$E(u_k) = \frac{\lambda + m_k}{\sum_{k=1}^{K} (\lambda + m_k)} = \frac{\lambda + m_k}{K \lambda + N} = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K \lambda}$$

This proves formula (4.11); the proof of formula (4.10) is similar.
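The expectation just derived can be checked numerically: the sketch below samples from the posterior Dirichlet via the standard Gamma construction (the counts `m` and `lam` are made-up toy values) and compares the Monte Carlo mean with $(m_k + \lambda)/(N + K\lambda)$:

```python
import random

random.seed(0)
lam, m = 1.0, [5, 2, 3]            # toy lambda and class counts m_k; here N = 10, K = 3
alpha = [lam + mk for mk in m]     # posterior parameters alpha_k = lambda + m_k

def dirichlet_sample(alpha):
    # A Dir(alpha) draw: normalize independent Gamma(alpha_k, 1) draws.
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

draws = [dirichlet_sample(alpha) for _ in range(20000)]
mc_mean = [sum(d[k] for d in draws) / len(draws) for k in range(len(alpha))]
exact = [(mk + lam) / (sum(m) + len(m) * lam) for mk in m]  # (m_k + lambda)/(N + K*lambda)

print(mc_mean, exact)  # the two vectors agree to roughly two decimal places
```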