Exercise 7.1
Use maximum likelihood estimation to estimate the class-conditional probabilities of the first three attributes of watermelon dataset 3.0.
For the attribute 色泽 (color), Table 4.3 gives
- good melons (好瓜): 3 青绿 (green), 4 乌黑 (dark), 1 浅白 (light)
- bad melons (坏瓜): 3 青绿, 2 乌黑, 4 浅白
Let $P(青绿 \mid 好瓜) = \zeta_1$, $P(乌黑 \mid 好瓜) = \zeta_2$, and $P(浅白 \mid 好瓜) = 1 - \zeta_1 - \zeta_2$. By Eq. (7.9),
$$L(D_{好瓜} \mid \theta_{好瓜}) = \prod_{\bm{x} \in D_{好瓜}} P(\bm{x} \mid \theta_{好瓜}) = \zeta_1^3 \cdot \zeta_2^4 \cdot (1 - \zeta_1 - \zeta_2)^1,$$
so
$$LL(\theta_{好瓜}) = \log P(D_{好瓜} \mid \theta_{好瓜}) = 3\log\zeta_1 + 4\log\zeta_2 + \log(1 - \zeta_1 - \zeta_2).$$
Setting the partial derivatives with respect to $\zeta_1$ and $\zeta_2$ to zero yields $\zeta_1 = 3/8$ and $\zeta_2 = 4/8$, and hence $\zeta_3 := 1 - \zeta_1 - \zeta_2 = 1/8$.
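As a sanity check, the maximizer of $LL(\theta_{好瓜})$ can also be found numerically; a minimal brute-force sketch in plain Python (grid resolution $1/1000$, no closed form assumed):

```python
import math

def ll(z1, z2):
    # Log-likelihood for 色泽 among good melons: 3 青绿, 4 乌黑, 1 浅白
    return 3 * math.log(z1) + 4 * math.log(z2) + math.log(1 - z1 - z2)

# Exhaustive search over a 1/1000 grid of the probability simplex
grid = [i / 1000 for i in range(1, 1000)]
best = max(((z1, z2) for z1 in grid for z2 in grid if z1 + z2 < 1),
           key=lambda p: ll(*p))
print(best)  # (0.375, 0.5), i.e. zeta_1 = 3/8, zeta_2 = 4/8
```

Since the log-likelihood is strictly concave and the true maximizer $(3/8, 1/2)$ lies exactly on the grid, the search recovers it.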
Similarly, let $P(青绿 \mid 坏瓜) = \eta_1$, $P(乌黑 \mid 坏瓜) = \eta_2$, and $P(浅白 \mid 坏瓜) = 1 - \eta_1 - \eta_2$. By Eq. (7.9),
$$L(D_{坏瓜} \mid \theta_{坏瓜}) = \prod_{\bm{x} \in D_{坏瓜}} P(\bm{x} \mid \theta_{坏瓜}) = \eta_1^3 \cdot \eta_2^2 \cdot (1 - \eta_1 - \eta_2)^4,$$
so
$$LL(\theta_{坏瓜}) = \log P(D_{坏瓜} \mid \theta_{坏瓜}) = 3\log\eta_1 + 2\log\eta_2 + 4\log(1 - \eta_1 - \eta_2).$$
Setting the partial derivatives with respect to $\eta_1$ and $\eta_2$ to zero yields $\eta_1 = 3/9$ and $\eta_2 = 2/9$, and hence $\eta_3 := 1 - \eta_1 - \eta_2 = 4/9$.
These results agree with direct counting, i.e. Eq. (7.17): the conditional probability $P(x_i \mid c)$ can be estimated as
$$P(x_i \mid c) = \frac{|D_{c, x_i}|}{|D_c|}.$$
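The counting estimate is easy to verify directly from the color tallies above; a minimal sketch:

```python
from collections import Counter
from fractions import Fraction

# 色泽 (color) values per class, as tallied from Table 4.3 above
good = ['青绿'] * 3 + ['乌黑'] * 4 + ['浅白'] * 1
bad = ['青绿'] * 3 + ['乌黑'] * 2 + ['浅白'] * 4

def mle(values):
    """MLE for a categorical attribute: the relative frequency |D_{c,x_i}| / |D_c|."""
    n = len(values)
    return {v: Fraction(c, n) for v, c in Counter(values).items()}

print(mle(good))  # 青绿: 3/8, 乌黑: 1/2, 浅白: 1/8
print(mle(bad))   # 青绿: 1/3, 乌黑: 2/9, 浅白: 4/9
```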
Exercise 7.3
Implement a naive Bayes classifier with Laplacian correction, train it on watermelon dataset 3.0, and classify the "测1" (test 1) sample on p.151.
Rough version
import math
import numpy as np
data_ = [
['青绿','蜷缩','浊响','清晰','凹陷','硬滑',0.697,0.460,'是'],
['乌黑','蜷缩','沉闷','清晰','凹陷','硬滑',0.774,0.376,'是'],
['乌黑','蜷缩','浊响','清晰','凹陷','硬滑',0.634,0.264,'是'],
['青绿','蜷缩','沉闷','清晰','凹陷','硬滑',0.608,0.318,'是'],
['浅白','蜷缩','浊响','清晰','凹陷','硬滑',0.556,0.215,'是'],
['青绿','稍蜷','浊响','清晰','稍凹','软粘',0.403,0.237,'是'],
['乌黑','稍蜷','浊响','稍糊','稍凹','软粘',0.481,0.149,'是'],
['乌黑','稍蜷','浊响','清晰','稍凹','硬滑',0.437,0.211,'是'],
['乌黑','稍蜷','沉闷','稍糊','稍凹','硬滑',0.666,0.091,'否'],
['青绿','硬挺','清脆','清晰','平坦','软粘',0.243,0.267,'否'],
['浅白','硬挺','清脆','模糊','平坦','硬滑',0.245,0.057,'否'],
['浅白','蜷缩','浊响','模糊','平坦','软粘',0.343,0.099,'否'],
['青绿','稍蜷','浊响','稍糊','凹陷','硬滑',0.639,0.161,'否'],
['浅白','稍蜷','沉闷','稍糊','凹陷','硬滑',0.657,0.198,'否'],
['乌黑','稍蜷','浊响','清晰','稍凹','软粘',0.360,0.370,'否'],
['浅白','蜷缩','浊响','模糊','平坦','硬滑',0.593,0.042,'否'],
['青绿','蜷缩','沉闷','稍糊','稍凹','硬滑',0.719,0.103,'否'],
]
is_discrete = [True] * 6 + [False] * 2

# Collect the set of values observed for each of the 8 attributes
set_list = [set() for i in range(8)]
for d in data_:
    for i in range(8):
        set_list[i].add(d[i])
features_list = [list(s) for s in set_list]

data = np.mat(data_)
labels = np.unique(data[:, -1].A)
cnt_labels = [0] * len(labels)
for i in range(data.shape[0]):
    if data[i, -1] == labels[0]:
        cnt_labels[0] += 1
    elif data[i, -1] == labels[1]:
        cnt_labels[1] += 1

def train_discrete(data, labels, cnt_labels, features_list, xi):
    # Laplacian correction: counts start at 1, denominator adds N_i
    prob = np.ones([len(labels), len(features_list[xi])])
    for i in range(data.shape[0]):
        tmp = features_list[xi].index(data[i, xi])
        if data[i, -1] == labels[0]:
            prob[0, tmp] += 1
        elif data[i, -1] == labels[1]:
            prob[1, tmp] += 1
    for i in range(len(labels)):
        prob[i] = prob[i] / (cnt_labels[i] + len(features_list[xi]))
    return prob

def train_continuous(data, labels, xi):
    # Per-class mean and variance for the Gaussian density of Eq. (7.18)
    vec0, vec1 = [], []
    for i in range(data.shape[0]):
        if data[i, -1] == labels[0]:
            vec0.append(data[i, xi])
        elif data[i, -1] == labels[1]:
            vec1.append(data[i, xi])
    vec0, vec1 = np.array(vec0).astype(float), np.array(vec1).astype(float)
    u0, u1 = np.mean(vec0), np.mean(vec1)
    s0, s1 = np.var(vec0), np.var(vec1)
    return np.mat([[u0, s0], [u1, s1]])

param = []
for i in range(8):
    if is_discrete[i]:
        param.append(train_discrete(data, labels, cnt_labels, features_list, i))
    else:
        param.append(train_continuous(data, labels, i))

# Class priors with Laplacian correction
p0 = (cnt_labels[0] + 1) / (len(data_) + 2)
p1 = (cnt_labels[1] + 1) / (len(data_) + 2)

# Classify the "测1" sample (identical to the first training sample)
d = data_[0]
for i in range(len(d) - 1):
    if is_discrete[i]:
        ind = features_list[i].index(d[i])
        p0 *= param[i][0, ind]
        p1 *= param[i][1, ind]
    else:
        p0 *= 1 / math.sqrt(2 * math.pi * param[i][0, 1]) * math.exp(-(d[i] - param[i][0, 0]) ** 2 / (2 * param[i][0, 1]))
        p1 *= 1 / math.sqrt(2 * math.pi * param[i][1, 1]) * math.exp(-(d[i] - param[i][1, 0]) ** 2 / (2 * param[i][1, 1]))
print(p0, p1)
if p0 > p1:
    print(labels[0])
else:
    print(labels[1])
print()

# Training accuracy over the whole dataset
err = 0
for d in data_:
    p0 = (cnt_labels[0] + 1) / (len(data_) + 2)
    p1 = (cnt_labels[1] + 1) / (len(data_) + 2)
    for i in range(len(d) - 1):
        if is_discrete[i]:
            ind = features_list[i].index(d[i])
            p0 *= param[i][0, ind]
            p1 *= param[i][1, ind]
        else:
            p0 *= 1 / math.sqrt(2 * math.pi * param[i][0, 1]) * math.exp(
                -(d[i] - param[i][0, 0]) ** 2 / (2 * param[i][0, 1]))
            p1 *= 1 / math.sqrt(2 * math.pi * param[i][1, 1]) * math.exp(
                -(d[i] - param[i][1, 0]) ** 2 / (2 * param[i][1, 1]))
    plabel = labels[0] if p0 > p1 else labels[1]
    if plabel != d[-1]:
        err += 1
print(1 - err / len(data_))
The training error is 0.1765. The result for "测1" matches the one given in the linked solution. Note that numpy's and pandas' var functions differ: pandas divides by $N-1$ by default (its ddof parameter defaults to 1), while numpy divides by $N$; with var(ddof=0), the linked code produces the same log-likelihoods as the code above.
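The numpy/pandas variance difference noted above is easy to demonstrate (assuming numpy and pandas are installed):

```python
import numpy as np
import pandas as pd

x = [0.697, 0.774, 0.634]        # densities of the first three good melons
print(np.var(x))                 # population variance (ddof=0), used in this post
print(pd.Series(x).var())        # sample variance: pandas defaults to ddof=1
print(pd.Series(x).var(ddof=0))  # forcing ddof=0 reproduces numpy's value
```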
Improved version
import numpy as np
import pandas as pd
from sklearn.utils.multiclass import type_of_target
from collections import namedtuple
import json
columns_ = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '密度', '含糖率', '好瓜']
data_ = [
['青绿','蜷缩','浊响','清晰','凹陷','硬滑',0.697,0.460,'是'],
['乌黑','蜷缩','沉闷','清晰','凹陷','硬滑',0.774,0.376,'是'],
['乌黑','蜷缩','浊响','清晰','凹陷','硬滑',0.634,0.264,'是'],
['青绿','蜷缩','沉闷','清晰','凹陷','硬滑',0.608,0.318,'是'],
['浅白','蜷缩','浊响','清晰','凹陷','硬滑',0.556,0.215,'是'],
['青绿','稍蜷','浊响','清晰','稍凹','软粘',0.403,0.237,'是'],
['乌黑','稍蜷','浊响','稍糊','稍凹','软粘',0.481,0.149,'是'],
['乌黑','稍蜷','浊响','清晰','稍凹','硬滑',0.437,0.211,'是'],
['乌黑','稍蜷','沉闷','稍糊','稍凹','硬滑',0.666,0.091,'否'],
['青绿','硬挺','清脆','清晰','平坦','软粘',0.243,0.267,'否'],
['浅白','硬挺','清脆','模糊','平坦','硬滑',0.245,0.057,'否'],
['浅白','蜷缩','浊响','模糊','平坦','软粘',0.343,0.099,'否'],
['青绿','稍蜷','浊响','稍糊','凹陷','硬滑',0.639,0.161,'否'],
['浅白','稍蜷','沉闷','稍糊','凹陷','硬滑',0.657,0.198,'否'],
['乌黑','稍蜷','浊响','清晰','稍凹','软粘',0.360,0.370,'否'],
['浅白','蜷缩','浊响','模糊','平坦','硬滑',0.593,0.042,'否'],
['青绿','蜷缩','沉闷','稍糊','稍凹','硬滑',0.719,0.103,'否'],
]
labels = ['是', '否']
def Train_nb(x, y, labels, columns):
    p_size = len(x)
    p_labels = []
    x_c = []
    for label in labels:
        tx_c = x[y == label]  # tx_c is D_c, the subset of class c
        x_c.append(tx_c)
        p_labels.append(len(tx_c))
    p_xi_cs = []
    PItem = namedtuple("PItem", ['is_continuous', 'data', 'n_i'])
    for i in range(len(x_c)):
        d_c = x_c[i]
        p_xi_c = []
        for column in columns[:-1]:  # all attributes except the label column '好瓜'
            d_c_col = d_c.loc[:, column]  # values of this attribute within D_c
            if type_of_target(d_c_col) == 'continuous':  # continuous attribute
                imean = np.mean(d_c_col)
                ivar = np.var(d_c_col)
                p_xi_c.append(PItem(True, [imean, ivar], None))
            else:
                # n_i is the number of possible values of the attribute,
                # counted over the WHOLE dataset x, not just D_c
                n_i = len(pd.unique(x.loc[:, column]))
                p_xi_c.append(PItem(False, d_c_col.value_counts().to_json(), n_i))
        p_xi_cs.append(p_xi_c)
    return p_size, p_labels, p_xi_cs

def Predict_nb(sample, labels, p_size, p_labels, p_xi_cs):
    res = None
    p_best = None
    for i in range(len(labels)):
        # Class prior with Laplacian correction, Eq. (7.19)
        p_tmp = np.log((p_labels[i] + 1) / (p_size + len(labels)))
        p_xi_c = p_xi_cs[i]
        for j in range(len(sample)):
            pitem = p_xi_c[j]
            if not pitem.is_continuous:
                # Laplacian-corrected conditional probability, Eq. (7.20)
                jdata = json.loads(pitem.data)
                if sample[j] in jdata:
                    p_tmp += np.log((jdata[sample[j]] + 1) / (p_labels[i] + pitem.n_i))
                else:
                    p_tmp += np.log(1 / (p_labels[i] + pitem.n_i))
            else:
                # Gaussian density for continuous attributes, Eq. (7.18)
                [imean, ivar] = pitem.data
                p_tmp += np.log(1 / np.sqrt(2 * np.pi * ivar) * np.exp(-(sample[j] - imean) ** 2 / (2 * ivar)))
        if p_best is None or p_tmp > p_best:
            res = labels[i]
            p_best = p_tmp
        print(p_tmp, end=", ")
    print()
    return res

if __name__ == '__main__':
    data = pd.DataFrame(data=data_, columns=columns_)
    x = data.iloc[:, :-1]
    y = data.iloc[:, -1]
    p_size, p_labels, p_xi_cs = Train_nb(x, y, labels, columns_)
    err = 0
    for i in range(len(data)):
        if Predict_nb(x.iloc[i, :], labels, p_size, p_labels, p_xi_cs) != y[i]:
            err += 1
    print(err)
    print(err / len(data_))
Adapted from the linked solution, using pandas and sklearn; the utility libraries are a pleasure to work with. The linked code has two small logical errors:
① it uses the wrong variance;
② in Eq. (7.20), $N_i$ should be the number of possible values of the $i$-th attribute, not the number of values of that attribute appearing in $D_c$.
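The $N_i$ correction for Eq. (7.20) can be isolated into a tiny helper (a sketch; the helper name is mine, not from the linked code):

```python
def laplace_cond_prob(count_c_xi, count_c, n_i):
    """Eq. (7.20): Laplacian-corrected estimate of P(x_i | c).
    n_i must be the number of POSSIBLE values of attribute i over the whole
    dataset, not just the values observed in D_c."""
    return (count_c_xi + 1) / (count_c + n_i)

# 色泽=青绿 among the 8 good melons: 3 occurrences, 3 possible colors
print(laplace_cond_prob(3, 8, 3))  # 4/11 ≈ 0.3636
```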
Once again, many thanks to @我是韩小琦; Python's data tools really are a joy to use!
Exercise 7.4
In practice, when using Eq. (7.15) to decide the class, if the data dimensionality is very high, the product $\prod_{i=1}^{d} P(x_i \mid c)$ is usually extremely close to zero, causing numerical underflow. Describe possible ways to prevent underflow.
If the dimensionality is very high, one feasible approach is dimensionality reduction. Another is already mentioned on p.149 of the book (the step from Eq. (7.9) to Eq. (7.10)): work with the log-likelihood,
$$\log\Big( P(c) \prod_{i=1}^{d} P(x_i \mid c) \Big) = \log P(c) + \sum_{i=1}^{d} \log P(x_i \mid c).$$
When using the log form of the naive Bayes rule (7.15), be sure to apply Laplacian smoothing; otherwise a $\log 0$ can occur and crash the program.
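A quick illustration of both the underflow and the log-sum fix (hypothetical numbers: 1000 attributes, each with $P(x_i \mid c) = 0.1$):

```python
import math

probs = [0.1] * 1000  # hypothetical per-attribute likelihoods
direct = 1.0
for p in probs:
    direct *= p       # 1e-1000 is below the smallest float64, so this hits 0.0

log_sum = sum(math.log(p) for p in probs)  # the log form stays comfortably finite

print(direct)   # 0.0 -- silent underflow
print(log_sum)  # about -2302.585
```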
Exercise 7.7
Given a binary classification task with $d$ binary attributes, suppose that estimating any prior probability term requires at least 30 examples. Then estimating the prior $P(c)$ in the naive Bayes classifier, Eq. (7.15), requires $30 \times 2 = 60$ examples. Estimate the number of examples needed to estimate the prior terms $P(c, x_i)$ in AODE, Eq. (7.23) (consider the best and worst cases separately).
When $d = 1$: estimating $P(c=0, x_1=0)$, $P(c=1, x_1=0)$, $P(c=0, x_1=1)$ and $P(c=1, x_1=1)$ requires $30 + 30 + 30 + 30 = 120$ examples.
When $d = 2$: in the best case, among the 60 examples with $c=0$ there happen to be exactly 30 with $x_2=0$ and 30 with $x_2=1$, and likewise for the 60 examples with $c=1$; then no new examples are needed and 120 still suffice. In the worst case, all 120 examples used for $x_1$ share the same value of $x_2$ (say $x_2=0$), so estimating $P(c, x_2=1)$ (or $P(c, x_2=0)$) requires 60 new examples, for a total of 180.
When $d \geq 3$: by the same reasoning, the best case requires 120 examples and the worst case requires $60 \times (d+1)$.
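The counting argument above can be summarized in a one-line function (a sketch of the closed forms derived here):

```python
def aode_examples_needed(d, worst=False):
    """Examples needed to estimate all priors P(c, x_i) in AODE (Eq. 7.23),
    assuming each term needs 30 examples, per the best/worst-case argument above."""
    if d < 1:
        raise ValueError("need at least one attribute")
    return 60 * (d + 1) if worst else 120

print(aode_examples_needed(1))              # 120 (best and worst coincide)
print(aode_examples_needed(2, worst=True))  # 180
```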
Acknowledgements
Exercise 7.3 is based on:
Machine Learning (Zhou Zhihua), watermelon book, Chapter 7 Exercise 7.3 — Python implementation
Thanks to @Zetrue_Li
Machine Learning (Zhou Zhihua) exercise solutions — Chapter 7, Bayes classifiers
Thanks to @我是韩小琦
Extracting data from a pandas DataFrame with the .loc[,] and .iloc[,] accessors
Thanks to @马尔代夫Maldives