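Problem description
We have a weather dataset with the following features and a class label $Y$ (whether it is suitable to go out):
- Data structure:
  - $X_1$: weather (Sunny, Overcast, Rainy)
  - $X_2$: temperature (High, Medium, Low)
  - $X_3$: humidity (High, Low)
  - $Y$: suitable to go out (Yes, No)
The training data is as follows: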
Weather ($X_1$) | Temperature ($X_2$) | Humidity ($X_3$) | Go out ($Y$) |
---|---|---|---|
Sunny | High | High | No |
Sunny | High | High | No |
Overcast | High | High | Yes |
Rainy | Medium | High | Yes |
Rainy | Low | Low | Yes |
Rainy | Low | Low | No |
Overcast | Low | Low | Yes |
Sunny | Medium | High | No |
Overcast | Low | Low | No |
Rainy | Medium | Low | Yes |
We need to classify a new sample:
- $X = (\text{Sunny}, \text{High}, \text{High})$
- Question: is this sample suitable for going out?
1. Naive Bayes calculation steps
1.1 Compute the prior probabilities $P(Y)$
Count the number of training samples in each class:
- $P(Y = \text{Yes}) = \frac{\text{number of Yes samples}}{\text{total number of samples}} = \frac{5}{10} = 0.5$
- $P(Y = \text{No}) = \frac{\text{number of No samples}}{\text{total number of samples}} = \frac{5}{10} = 0.5$
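The prior calculation can be sketched in a few lines of Python. This is only an illustration: the dictionary `class_counts` is a hypothetical name, and its values are the class counts stated above rather than anything computed from raw data.

```python
from fractions import Fraction

# Class counts as stated in the text: 5 "Yes" and 5 "No" out of 10 samples.
class_counts = {"Yes": 5, "No": 5}
total = sum(class_counts.values())

# Prior P(Y = c) = (number of samples in class c) / (total number of samples).
priors = {label: Fraction(count, total) for label, count in class_counts.items()}

for label, p in priors.items():
    print(f"P(Y = {label}) = {p} = {float(p)}")
```

Using `Fraction` keeps the ratios exact, which makes it easy to compare them with the hand calculation.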
1.2 Compute the conditional probabilities $P(X_i \mid Y)$
For each feature $X_1, X_2, X_3$, compute the conditional probability of each feature value under each class:
Conditional probabilities for weather ($X_1$):
- $P(X_1 = \text{Sunny} \mid Y = \text{Yes}) = \frac{\text{number of Sunny samples among Yes}}{\text{number of Yes samples}} = \frac{0}{5} = 0$
- $P(X_1 = \text{Sunny} \mid Y = \text{No}) = \frac{\text{number of Sunny samples among No}}{\text{number of No samples}} = \frac{3}{5} = 0.6$
- $P(X_1 = \text{Rainy} \mid Y = \text{Yes}) = \frac{3}{5} = 0.6$
- $P(X_1 = \text{Rainy} \mid Y = \text{No}) = \frac{1}{5} = 0.2$
- $P(X_1 = \text{Overcast} \mid Y = \text{Yes}) = \frac{2}{5} = 0.4$
- $P(X_1 = \text{Overcast} \mid Y = \text{No}) = \frac{1}{5} = 0.2$
Conditional probabilities for temperature ($X_2$):
- $P(X_2 = \text{High} \mid Y = \text{Yes}) = \frac{1}{5} = 0.2$
- $P(X_2 = \text{High} \mid Y = \text{No}) = \frac{2}{5} = 0.4$
- $P(X_2 = \text{Medium} \mid Y = \text{Yes}) = \frac{2}{5} = 0.4$
- $P(X_2 = \text{Medium} \mid Y = \text{No}) = \frac{1}{5} = 0.2$
- $P(X_2 = \text{Low} \mid Y = \text{Yes}) = \frac{2}{5} = 0.4$
- $P(X_2 = \text{Low} \mid Y = \text{No}) = \frac{2}{5} = 0.4$
Conditional probabilities for humidity ($X_3$):
- $P(X_3 = \text{High} \mid Y = \text{Yes}) = \frac{2}{5} = 0.4$
- $P(X_3 = \text{High} \mid Y = \text{No}) = \frac{3}{5} = 0.6$
- $P(X_3 = \text{Low} \mid Y = \text{Yes}) = \frac{3}{5} = 0.6$
- $P(X_3 = \text{Low} \mid Y = \text{No}) = \frac{2}{5} = 0.4$
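The counting behind these conditional probabilities can be reproduced with a short Python sketch. The names `rows`, `feature_names`, `value_counts`, and `conditional` are illustrative and not part of any library. The rows are an assumed transcription of the training table, chosen so that they reproduce the counts used in this section (five Yes and five No samples).

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumed transcription of the training table: (weather, temperature, humidity, label).
# Row 9 is taken as (Overcast, Low, Low, No), which is what the 5/5 class split
# and the conditional probabilities listed above imply.
rows = [
    ("Sunny", "High", "High", "No"),
    ("Sunny", "High", "High", "No"),
    ("Overcast", "High", "High", "Yes"),
    ("Rainy", "Medium", "High", "Yes"),
    ("Rainy", "Low", "Low", "Yes"),
    ("Rainy", "Low", "Low", "No"),
    ("Overcast", "Low", "Low", "Yes"),
    ("Sunny", "Medium", "High", "No"),
    ("Overcast", "Low", "Low", "No"),
    ("Rainy", "Medium", "Low", "Yes"),
]
feature_names = ["weather", "temperature", "humidity"]

# Number of samples per class.
class_counts = Counter(row[-1] for row in rows)

# value_counts[(feature, label)][value] = number of samples of class `label`
# whose `feature` takes `value`.
value_counts = defaultdict(Counter)
for *features, label in rows:
    for name, value in zip(feature_names, features):
        value_counts[(name, label)][value] += 1


def conditional(name, value, label):
    """Relative-frequency estimate of P(feature = value | Y = label)."""
    return Fraction(value_counts[(name, label)][value], class_counts[label])


for index, name in enumerate(feature_names):
    for value in sorted({row[index] for row in rows}):
        for label in ("Yes", "No"):
            p = conditional(name, value, label)
            print(f"P({name} = {value} | Y = {label}) = {float(p):.1f}")
```

A value that never occurs within a class, such as Sunny among the Yes samples, gets a count of zero and therefore a conditional probability of zero.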
2. Classification
Posterior probability formula
By the naive Bayes formula:
$$
P(Y \mid X) \propto P(Y) \cdot P(X_1 \mid Y) \cdot P(X_2 \mid Y) \cdot P(X_3 \mid Y)
$$
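The proportionality follows from Bayes' theorem combined with the conditional independence assumption. The step is spelled out below as a brief sketch: the second line is where independence is assumed, and the evidence term $P(X)$ is dropped in the last line because it is the same for every class and does not affect which class scores highest.

$$
\begin{aligned}
P(Y \mid X) &= \frac{P(Y)\, P(X_1, X_2, X_3 \mid Y)}{P(X)} \\
&= \frac{P(Y)\, P(X_1 \mid Y)\, P(X_2 \mid Y)\, P(X_3 \mid Y)}{P(X)} \\
&\propto P(Y) \cdot P(X_1 \mid Y) \cdot P(X_2 \mid Y) \cdot P(X_3 \mid Y)
\end{aligned}
$$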
Compute $P(Y = \text{Yes} \mid X)$ and $P(Y = \text{No} \mid X)$ separately:
2.1 Compute $P(Y = \text{Yes} \mid X)$
$$
P(Y = \text{Yes} \mid X) \propto P(Y = \text{Yes}) \cdot P(X_1 = \text{Sunny} \mid Y = \text{Yes}) \cdot P(X_2 = \text{High} \mid Y = \text{Yes}) \cdot P(X_3 = \text{High} \mid Y = \text{Yes})
$$
Substituting the values:
$$
P(Y = \text{Yes} \mid X) \propto 0.5 \cdot 0 \cdot 0.2 \cdot 0.4 = 0
$$
2.2 Compute $P(Y = \text{No} \mid X)$
$$
P(Y = \text{No} \mid X) \propto P(Y = \text{No}) \cdot P(X_1 = \text{Sunny} \mid Y = \text{No}) \cdot P(X_2 = \text{High} \mid Y = \text{No}) \cdot P(X_3 = \text{High} \mid Y = \text{No})
$$
Substituting the values:
$$
P(Y = \text{No} \mid X) \propto 0.5 \cdot 0.6 \cdot 0.4 \cdot 0.6 = 0.072
$$
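The two products can be checked numerically. The sketch below copies the probabilities from sections 1.1 and 1.2 for the sample (Sunny, High, High); the names `prior`, `likelihoods`, and `scores` are illustrative, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

# Probabilities copied from sections 1.1 and 1.2 for X = (Sunny, High, High).
prior = {"Yes": 0.5, "No": 0.5}
likelihoods = {
    "Yes": [0.0, 0.2, 0.4],  # P(Sunny|Yes), P(High temp|Yes), P(High humidity|Yes)
    "No": [0.6, 0.4, 0.6],   # P(Sunny|No),  P(High temp|No),  P(High humidity|No)
}

# Unnormalized score for each class: P(Y) times the product of P(X_i | Y).
scores = {label: prior[label] * prod(likelihoods[label]) for label in prior}

for label, score in scores.items():
    print(f"score({label}) = {score:.3f}")
```

A single zero factor forces the whole product for Yes to zero, regardless of the other terms.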
3. Result
Since
$$
P(Y = \text{Yes} \mid X) \propto 0, \quad P(Y = \text{No} \mid X) \propto 0.072
$$
the score for No is the larger of the two, so the predicted class is:
$$
\hat{Y} = \text{No}
$$
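The values 0 and 0.072 are unnormalized scores rather than probabilities. As a small sketch (variable names are illustrative), dividing each score by their sum turns them into posteriors that add up to 1, and the prediction is the class with the largest value.

```python
# Unnormalized scores from sections 2.1 and 2.2.
scores = {"Yes": 0.0, "No": 0.072}

# The sum of the scores plays the role of P(X); dividing by it normalizes the posteriors.
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}

# Predict the class with the largest posterior (equivalently, the largest score).
prediction = max(posteriors, key=posteriors.get)

print(posteriors)
print(prediction)
```

Normalizing does not change the ranking, which is why the hand calculation can skip it. Here the normalized posterior for No comes out as 1 only because the score for Yes is exactly zero.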
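Summary
- The key step in multi-feature naive Bayes is computing, for each feature, its conditional probability under each class.
- The conditional independence assumption simplifies the joint probability by factoring it into a product of per-feature conditional probabilities.
- In this example, based on the training data and the conditional probabilities, the new sample is classified as "No".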