Naive Bayes
Output space: $y \in \{c_{1}, c_{2}, ..., c_{K}\}$
Input space: assume each feature $x^{(j)}$ can take $S_{j}$ distinct values, where $j = 1, 2, ..., n$.
Conditional independence assumption:
$$P(X = x \mid Y = c_{k}) = P(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, ..., X^{(n)} = x^{(n)} \mid Y = c_{k}) = \prod_{j = 1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_{k}) \quad (1.1)$$
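The factorization in (1.1) can be sketched numerically. The per-feature conditional tables below are hypothetical toy numbers, chosen only to show that the joint class-conditional probability is just a product of per-feature terms:

```python
import math

# cond[ck][j][v] = P(X^(j) = v | Y = ck); hypothetical values, two features.
cond = {
    "c1": [{"A": 0.7, "B": 0.3}, {"S": 0.4, "M": 0.6}],
    "c2": [{"A": 0.2, "B": 0.8}, {"S": 0.9, "M": 0.1}],
}

def likelihood(x, ck):
    """P(X = x | Y = ck) under the naive independence assumption (1.1)."""
    return math.prod(cond[ck][j][v] for j, v in enumerate(x))

print(likelihood(("A", "S"), "c1"))  # 0.7 * 0.4
```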
This assumption is what gives the naive Bayes method its name.
By Bayes' theorem, the posterior probability is:
$$P(Y = c_{k} \mid X = x) = \frac{P(X = x \mid Y = c_{k}) \cdot P(Y = c_{k})}{\sum_{k} P(X = x \mid Y = c_{k}) \cdot P(Y = c_{k})} \quad (1.2)$$
Substituting (1.1) into (1.2) gives:
$$P(Y = c_{k} \mid X = x) = \frac{P(Y = c_{k}) \cdot \prod_{j = 1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_{k})}{\sum_{k} P(Y = c_{k}) \cdot \prod_{j = 1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_{k})} \quad (1.3)$$
This is the basic formula of naive Bayes classification. The naive Bayes classifier can therefore be written as:
$$y = f(x) = \arg \underset{c_{k}}{\max} \ \frac{P(Y = c_{k}) \cdot \prod_{j = 1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_{k})}{\sum_{k} P(Y = c_{k}) \cdot \prod_{j = 1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_{k})} \quad (1.4)$$
Since the denominator above is the same for every $c_{k}$, it can be dropped:
$$y = f(x) = \arg \underset{c_{k}}{\max} \ P(Y = c_{k}) \cdot \prod_{j = 1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_{k}) \quad (1.5)$$
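The decision rule (1.5) can be sketched directly: pick the class maximizing the prior times the product of per-feature conditionals. The prior and conditional tables are hypothetical toy numbers, not estimates learned from data:

```python
import math

prior = {"c1": 0.6, "c2": 0.4}  # P(Y = ck); hypothetical values
# cond[ck][j][v] = P(X^(j) = v | Y = ck); hypothetical values, two features.
cond = {
    "c1": [{"A": 0.7, "B": 0.3}, {"S": 0.4, "M": 0.6}],
    "c2": [{"A": 0.2, "B": 0.8}, {"S": 0.9, "M": 0.1}],
}

def predict(x):
    """y = argmax_{ck} P(Y=ck) * prod_j P(X^(j)=x^(j) | Y=ck), i.e. rule (1.5)."""
    return max(
        prior,
        key=lambda ck: prior[ck] * math.prod(cond[ck][j][v] for j, v in enumerate(x)),
    )

# Scores for ("A", "M"): c1 -> 0.6*0.7*0.6 = 0.252, c2 -> 0.4*0.2*0.1 = 0.008
print(predict(("A", "M")))  # "c1"
```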
The Meaning of Posterior Probability Maximization
Start from expected risk minimization, choosing the 0-1 loss function:
$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases} \quad (2.1)$$
where $f(X)$ is the classification decision function.
The expected risk is:
$$R_{exp}(f) = E[L(Y, f(X))] \quad (2.2)$$
The expectation above is taken with respect to the joint distribution $P(X, Y)$. Conditioning on $X$ gives:
$$\begin{aligned} R_{exp}(f) &= E[L(Y, f(X))] \\ &= \sum_{x} \sum_{k} L(c_{k}, f(x)) \cdot P(Y = c_{k}, X = x) \\ &= \sum_{x} \sum_{k} L(c_{k}, f(x)) \cdot P(Y = c_{k} \mid X = x) \cdot P(X = x) \\ &= \sum_{x} P(X = x) \sum_{k} L(c_{k}, f(x)) \cdot P(Y = c_{k} \mid X = x) \\ &= E_{X}\Big[\sum_{k} L(c_{k}, f(x)) \cdot P(Y = c_{k} \mid X = x)\Big] \end{aligned} \quad (2.3)$$
To minimize the expected risk, it suffices to minimize the inner sum pointwise for each $X = x$:
$$\begin{aligned} f(x) &= \arg \underset{y}{\min} \sum_{k} L(c_{k}, y) \cdot P(Y = c_{k} \mid X = x) \\ &= \arg \underset{y}{\min} \sum_{c_{k} \neq y} P(Y = c_{k} \mid X = x) \\ &= \arg \underset{y}{\min} \ (1 - P(Y = y \mid X = x)) \\ &= \arg \underset{y}{\max} \ P(Y = y \mid X = x) \end{aligned} \quad (2.4)$$
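The equivalence in (2.4) can be checked numerically for a single fixed $x$: the class with the largest posterior is exactly the class with the smallest expected 0-1 loss. The posterior values below are made up for illustration:

```python
# P(Y = ck | X = x) at one fixed x; hypothetical values summing to 1.
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}

def expected_01_loss(y):
    """sum_k L(ck, y) * P(Y=ck | X=x) = 1 - P(Y=y | X=x) for the 0-1 loss."""
    return sum(p for ck, p in posterior.items() if ck != y)

best_by_posterior = max(posterior, key=posterior.get)
best_by_risk = min(posterior, key=expected_01_loss)
print(best_by_posterior, best_by_risk)  # the same class both ways
```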
Thus the expected risk minimization criterion yields the posterior probability maximization criterion, which is exactly the principle naive Bayes adopts. In summary, maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss.
Parameter Estimation
In naive Bayes, learning means estimating the prior probability $P(Y = c_{k})$ and the conditional probabilities $P(X^{(j)} = x^{(j)} \mid Y = c_{k})$.
Maximum Likelihood Estimation
The maximum likelihood estimate of the prior probability $P(Y = c_{k})$ is:
$$P(Y = c_{k}) = \frac{\sum_{i = 1}^{N} I(y_{i} = c_{k})}{N} \quad k = 1, 2, ..., K \quad (3.1)$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j,1}, a_{j,2}, ..., a_{j,S_{j}}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)} = a_{j,l} \mid Y = c_{k})$ is:
$$P(X^{(j)} = a_{j,l} \mid Y = c_{k}) = \frac{\sum_{i = 1}^{N} I(x_{i}^{(j)} = a_{j,l}, y_{i} = c_{k})}{\sum_{i = 1}^{N} I(y_{i} = c_{k})} \quad j = 1, 2, ..., n; \quad l = 1, 2, ..., S_{j}; \quad k = 1, 2, ..., K \quad (3.2)$$
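Both maximum likelihood estimates are simple counting, as a minimal sketch on a hypothetical five-sample, two-feature dataset shows: (3.1) is a class frequency and (3.2) is a per-class feature-value frequency:

```python
from collections import Counter

# Toy dataset: x_i = (x^(1), x^(2)) with label y_i; hypothetical values.
X = [("A", "S"), ("A", "M"), ("B", "M"), ("B", "S"), ("A", "S")]
y = ["c1", "c1", "c2", "c2", "c1"]
N = len(y)

class_count = Counter(y)                              # sum_i I(y_i = ck)
prior = {ck: n / N for ck, n in class_count.items()}  # (3.1)

def cond_mle(j, a, ck):
    """(3.2): #{i : x_i^(j) = a and y_i = ck} / #{i : y_i = ck}."""
    joint = sum(1 for xi, yi in zip(X, y) if xi[j] == a and yi == ck)
    return joint / class_count[ck]

print(prior["c1"])             # 3/5
print(cond_mle(0, "A", "c1"))  # 3/3
```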
Bayesian Estimation
The Bayesian estimate of the prior probability:
$$P_{\lambda}(Y = c_{k}) = \frac{\sum_{i = 1}^{N} I(y_{i} = c_{k}) + \lambda}{N + K \cdot \lambda} \quad (3.3)$$
The Bayesian estimate of the conditional probability:
$$P_{\lambda}(X^{(j)} = a_{j,l} \mid Y = c_{k}) = \frac{\sum_{i = 1}^{N} I(x_{i}^{(j)} = a_{j,l}, y_{i} = c_{k}) + \lambda}{\sum_{i = 1}^{N} I(y_{i} = c_{k}) + S_{j} \cdot \lambda} \quad (3.4)$$
where $\lambda \geq 0$.
When $\lambda = 0$, this reduces to the maximum likelihood estimate;
$\lambda = 1$ is the common choice, known as Laplace smoothing.
Clearly:
$$P_{\lambda}(X^{(j)} = a_{j,l} \mid Y = c_{k}) > 0 \quad j = 1, 2, ..., n; \quad k = 1, 2, ..., K; \quad l = 1, 2, ..., S_{j}$$

$$\sum_{l = 1}^{S_{j}} P_{\lambda}(X^{(j)} = a_{j,l} \mid Y = c_{k}) = 1 \quad j = 1, 2, ..., n; \quad k = 1, 2, ..., K$$

That is, the smoothed estimates are strictly positive yet still normalize to valid probability distributions.
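The smoothed estimates (3.3)-(3.4) and the two properties above can be sketched with $\lambda = 1$ on the same kind of hypothetical toy data as before; note how a feature value never seen in a class still gets a nonzero probability:

```python
from collections import Counter

# Toy dataset: hypothetical values; feature 0 takes {A, B}, feature 1 takes {S, M}.
X = [("A", "S"), ("A", "M"), ("B", "M"), ("B", "S"), ("A", "S")]
y = ["c1", "c1", "c2", "c2", "c1"]
classes = ["c1", "c2"]             # K = 2
values = [["A", "B"], ["S", "M"]]  # S_j possible values of feature j
N, lam = len(y), 1.0               # lambda = 1: Laplace smoothing

class_count = Counter(y)

def prior_smoothed(ck):
    """(3.3): (count(ck) + lambda) / (N + K * lambda)."""
    return (class_count[ck] + lam) / (N + len(classes) * lam)

def cond_smoothed(j, a, ck):
    """(3.4): (joint count + lambda) / (class count + S_j * lambda)."""
    joint = sum(1 for xi, yi in zip(X, y) if xi[j] == a and yi == ck)
    return (joint + lam) / (class_count[ck] + len(values[j]) * lam)

# "B" never occurs with c1, yet its smoothed probability is (0+1)/(3+2) > 0,
# and the estimates still sum to 1 over each feature's value set.
print(cond_smoothed(0, "B", "c1"))                        # 1/5
print(sum(cond_smoothed(0, a, "c1") for a in values[0]))  # 1.0
```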
References
Li Hang, 统计学习方法 (Statistical Learning Methods), 1st edition, Chapter 4.