Introduction
This article walks through an example of using a naive Bayes model for classification, with the Car Evaluation dataset as the experiment.
All features in this example take discrete values.
The core of the Bayes model: given a feature vector X, determine the class y that the vector belongs to.
We turn this into the problem of choosing the yi that maximizes the conditional probability p(yi|X). To pick the largest p(yi|X), we first have to compute these conditional probabilities (otherwise there is nothing to compare).
According to Bayes' theorem:

p(yi|X) = p(X,yi)/p(X) = p(X|yi)*p(yi)/p(X) = p(yi)*p(x1,x2,...,xn|yi)/p(X)
Here we need p(X|yi); since X is a multi-dimensional vector, this is a joint conditional probability. As for p(X), it takes the same value for every candidate class yi, so it does not affect the comparison.
Thus it suffices to compute:

p(yi)*p(x1,x2,...,xn|yi)
To evaluate this, we simplify the joint conditional probability; the "naive" part is the assumption that the individual features are mutually independent.
Since the features are independent, we have:
p(x1,x2,...,xn|yi) = p(x1|yi)*p(x2|yi)*...*p(xn|yi)

which makes the computation tractable.
When a Bayes model is used in practice, the probability distributions are computed offline. In total we need to estimate m + m*n distributions, where m is the number of classes and n is the feature dimension.
So we first compute these conditional probabilities, applying smoothing along the way so that a sparse dataset does not produce zero probabilities.
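As a minimal sketch of the decision rule described above (with made-up toy numbers, not taken from the car dataset):

```python
import math

# Toy priors and per-feature conditional tables (made-up numbers, only to
# illustrate the decision rule described in the text)
priors = {"acc": 0.3, "unacc": 0.7}
cond = {
    "acc":   [{"low": 0.6, "high": 0.4}, {"med": 0.5, "big": 0.5}],
    "unacc": [{"low": 0.2, "high": 0.8}, {"med": 0.7, "big": 0.3}],
}

def score(X, y):
    # p(y) * prod_i p(x_i | y)
    return priors[y] * math.prod(cond[y][i][x] for i, x in enumerate(X))

# Pick the class with the largest score
best = max(priors, key=lambda y: score(["low", "big"], y))
```

For X = ["low", "big"] the scores are 0.3*0.6*0.5 = 0.09 for acc and 0.7*0.2*0.3 = 0.042 for unacc, so the rule picks acc.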
Analyzing the dataset
The dataset comes from: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation
Each record has the form:
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
...
The fields are: (buying price buying, maintenance price maint, number of doors door, seating capacity persons, trunk size lug_boot, safety level safety, user acceptability)
Buying price and maintenance price take four levels: vhigh (very high), high, med (medium), low
Number of doors: 2, 3, 4, 5more (5more means 5 or more)
Seats: 2, 4, more
Trunk: small, med, big
Acceptability: unacc, acc, good, vgood — four classes
We treat the last field, acceptability, as the label.
Writing the data preprocessing code
The main job is a class that reads the raw data file and splits the dataset into a test set and a training set.
The style mimics the MNIST data-loading class.
import random
import copy

class CarData(object):
    def __init__(self, datapath="car.data.txt", sperate_rate=0.1):
        x = []
        y = []
        with open(datapath) as fp:
            lines = fp.readlines()
            for each in lines:
                each = each.replace("\n", '')
                _da = each.split(',')
                _y = _da[-1]
                _x = _da[:-1]
                x.append(_x)
                y.append(_y)
        test_index = set()
        train_index = set()
        data_num = len(y)
        test_num = int(data_num * sperate_rate)
        while len(test_index) < test_num:
            test_index.add(random.randint(0, data_num - 1))
        for i in range(len(x)):
            if i not in test_index:
                train_index.add(i)
        self.test_index = copy.deepcopy(test_index)    # indices of test samples
        self.train_index = copy.deepcopy(train_index)  # indices of training samples
        self.data_num = data_num                       # total dataset size
        self.train_num = data_num - test_num           # training set size
        self.test_num = test_num                       # test set size
        self.x = copy.deepcopy(x)                      # feature vectors
        self.y = copy.deepcopy(y)                      # labels; (x[i], y[i]) correspond one-to-one

    def test(self):
        return self._sample(self.test_index)

    def train(self):
        return self._sample(self.train_index)

    def _sample(self, sampleSet):
        _x = []
        _y = []
        for each in sampleSet:
            _x.append(copy.deepcopy(self.x[each]))
            _y.append(copy.deepcopy(self.y[each]))
        return _x, _y
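The random split used above can be sketched in isolation. This toy version is my own illustration (not part of the original class); it shows how the test indices are drawn and how the remainder becomes the training set:

```python
import random

def split_indices(n, rate, seed=0):
    # Draw roughly n*rate distinct test indices; the rest are training indices.
    rng = random.Random(seed)
    test_num = int(n * rate)
    test_index = set()
    while len(test_index) < test_num:
        test_index.add(rng.randint(0, n - 1))
    train_index = set(range(n)) - test_index
    return train_index, test_index

# The full Car Evaluation dataset has 1728 rows; a 0.1 split keeps 172 for testing.
train_idx, test_idx = split_indices(1728, 0.1)
```

Because test indices are drawn with `randint` into a set until the set is full, duplicates are simply re-drawn, so the two index sets always partition the data.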
Computing the probability distributions the Bayes model needs
# coding:utf-8
__author__ = 'jmh081701'
import numpy
from tool.genData import CarData  # the data-loading class above

data = CarData("..//tool//car.data.txt")
train_x, train_y = data.train()
# Accumulate the statistics for the probabilities
pre_p = {}
# e.g. {'acc': 54, 'unacc': 344, ...}
likehood = {}
# e.g. {'acc': {1: {'vhigh': 300, 'high': 100, ...}, 2: {'3': 2332, ...}}
Here I use dictionaries to hold the statistics.
pre_p holds each class's prior counts; pre_p['acc'] = 54 would mean class acc appears 54 times in the training set.
likehood holds the likelihood statistics for each class.
likehood[y][i][x] is the number of times that, within class y, the i-th feature takes the value x.
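For instance (toy counts of my own, with the same nesting as described above), a lookup into this structure reads:

```python
# Toy likehood structure (made-up counts, same shape as in the text)
likehood = {"acc": {0: {"vhigh": 300, "high": 100}, 1: {"3": 23}}}
# Times that, within class 'acc', feature 0 takes the value 'vhigh':
n = likehood["acc"][0]["vhigh"]
```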
Iterate over the training set and accumulate the counts:
for simple_index in range(data.train_num):
    i = simple_index
    if train_y[i] not in pre_p:
        pre_p.setdefault(train_y[i], 1)      # class train_y[i] seen for the first time
        likehood.setdefault(train_y[i], {})
    else:
        pre_p[train_y[i]] += 1               # accumulate the count of class train_y[i]
    for vector_subindex in range(len(train_x[i])):
        if vector_subindex not in likehood[train_y[i]]:
            likehood[train_y[i]].setdefault(vector_subindex, {})
        if train_x[i][vector_subindex] not in likehood[train_y[i]][vector_subindex]:
            likehood[train_y[i]][vector_subindex].setdefault(train_x[i][vector_subindex], 1)
        else:
            likehood[train_y[i]][vector_subindex][train_x[i][vector_subindex]] += 1
Let's print pre_p and likehood to inspect the counts.
{'unacc': 1097, 'vgood': 52, 'good': 62, 'acc': 345}
{'unacc': {0: {'vhigh': 318, 'low': 234, 'med': 237, 'high': 308}, 1: {'vhigh': 334, 'low': 242, 'med': 239, 'high': 282}, 2: {'3': 276, '5more': 262, '4': 268, '2': 291}, 3: {'4': 280, '2': 524, 'more': 293}, 4: {'small': 405, 'big': 339, 'med': 353}, 5: {'low': 519, 'med': 329, 'high': 249}}, 'vgood': {0: {'low': 31, 'med': 21}, 1: {'low': 21, 'med': 20, 'high': 11}, 2: {'3': 10, '5more': 18, '4': 16, '2': 8}, 3: {'4': 25, 'more': 27}, 4: {'big': 29, 'med': 23}, 5: {'high': 52}}, 'good': {0: {'low': 42, 'med': 20}, 1: {'low': 40, 'med': 22}, 2: {'3': 17, '5more': 15, '4': 16, '2': 14}, 3: {'4': 32, 'more': 30}, 4: {'small': 19, 'big': 23, 'med': 20}, 5: {'med': 35, 'high': 27}}, 'acc': {0: {'vhigh': 66, 'low': 81, 'med': 105, 'high': 93}, 1: {'vhigh': 67, 'low': 80, 'med': 106, 'high': 92}, 2: {'3': 87, '5more': 95, '4': 90, '2': 73}, 3: {'4': 174, 'more': 171}, 4: {'small': 98, 'big': 129, 'med': 118}, 5: {'med': 158, 'high': 187}}}
This says class unacc has 1097 samples; among all unacc samples, the first feature takes the value vhigh 318 times and 'low' 234 times, the second feature takes vhigh 334 times, and so on.
OK, now let's turn these counts into probabilities.
First, count how many classes there are:
class_number = len(pre_p)
Collect the distinct values each feature dimension can take:
vector_sub_dimension_values = {0: set(), 1: set(), 2: set()}
vector_len = len(train_x[0])
for each in likehood:
    for vector_subindex in range(vector_len):
        if vector_subindex not in vector_sub_dimension_values:
            vector_sub_dimension_values.setdefault(vector_subindex, set())
        for item in likehood[each][vector_subindex]:
            vector_sub_dimension_values[vector_subindex].add(item)
Print vector_sub_dimension_values to check:
{0: {'vhigh', 'low', 'med', 'high'}, 1: {'vhigh', 'low', 'med', 'high'}, 2: {'3', '5more', '4', '2'}, 3: {'4', '2', 'more'}, 4: {'small', 'big', 'med'}, 5: {'low', 'med', 'high'}}
This shows that across the whole dataset the first feature has four distinct values, the second has four distinct values, and so on, which matches the dataset analysis in the previous section.
Compute the conditional probabilities:
for class_item in likehood:
    for vector_subindex in range(vector_len):
        for item in vector_sub_dimension_values[vector_subindex]:
            N = len(vector_sub_dimension_values[vector_subindex])
            if item not in likehood[class_item][vector_subindex]:
                likehood[class_item][vector_subindex].setdefault(item, 1.0 / (pre_p[class_item] + N))
            else:
                likehood[class_item][vector_subindex][item] = (likehood[class_item][vector_subindex][item] + 1.0) / (pre_p[class_item] + N)
Compute each class's prior probability:
for class_item in pre_p:
    pre_p[class_item] = (pre_p[class_item] + 1.0) / (data.train_num + len(pre_p))
Note that the computation above applies add-one (Laplace) smoothing, so every value of every feature is treated as appearing at least once under each class.
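To make the smoothing concrete, here is a standalone sketch with toy counts (my own illustration): with add-one smoothing, a value seen c times among n samples of a class, for a feature with N possible values, gets probability (c+1)/(n+N), and an unseen value gets 1/(n+N) instead of zero.

```python
def laplace_smooth(count, class_total, num_values):
    # Add-one smoothed estimate of p(feature = value | class)
    return (count + 1.0) / (class_total + num_values)

# Toy numbers: a class with 10 samples, a feature with 3 possible values,
# observed counts 7, 3, 0 for the three values.
probs = [laplace_smooth(c, 10, 3) for c in (7, 3, 0)]
```

The smoothed estimates still sum to 1 over the feature's values, and the unseen value receives a small positive probability rather than zero.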
Now let's print these probabilities:
print(pre_p)
The prior probability of each class:
{'unacc': 0.7038461538461539, 'vgood': 0.03397435897435897, 'good': 0.04038461538461539, 'acc': 0.22179487179487178}
And the conditional probabilities:
print(likehood)
{'unacc': {0: {'vhigh': 0.28973660308810173, 'low': 0.2134423251589464, 'med': 0.2161671207992734, 'high': 0.28065395095367845}, 1: {'vhigh': 0.3042688465031789, 'low': 0.22070844686648503, 'med': 0.21798365122615804, 'high': 0.25703905540417804}, 2: {'3': 0.25158946412352406, '5more': 0.23887375113533152, '4': 0.24432334241598547, '2': 0.2652134423251589}, 3: {'4': 0.25545454545454543, '2': 0.4772727272727273, 'more': 0.2672727272727273}, 4: {'small': 0.3690909090909091, 'big': 0.3090909090909091, 'med': 0.32181818181818184}, 5: {'low': 0.4727272727272727, 'med': 0.3, 'high': 0.22727272727272727}}, 'vgood': {0: {'vhigh': 0.017857142857142856, 'low': 0.5714285714285714, 'med': 0.39285714285714285, 'high': 0.017857142857142856}, 1: {'vhigh': 0.017857142857142856, 'low': 0.39285714285714285, 'med': 0.375, 'high': 0.21428571428571427}, 2: {'3': 0.19642857142857142, '5more': 0.3392857142857143, '4': 0.30357142857142855, '2': 0.16071428571428573}, 3: {'4': 0.4727272727272727, 'more': 0.509090909090909, '2': 0.01818181818181818}, 4: {'small': 0.01818181818181818, 'big': 0.5454545454545454, 'med': 0.43636363636363634}, 5: {'low': 0.01818181818181818, 'med': 0.01818181818181818, 'high': 0.9636363636363636}}, 'good': {0: {'vhigh': 0.015151515151515152, 'low': 0.6515151515151515, 'med': 0.3181818181818182, 'high': 0.015151515151515152}, 1: {'vhigh': 0.015151515151515152, 'low': 0.6212121212121212, 'med': 0.3484848484848485, 'high': 0.015151515151515152}, 2: {'3': 0.2727272727272727, '5more': 0.24242424242424243, '4': 0.25757575757575757, '2': 0.22727272727272727}, 3: {'4': 0.5076923076923077, 'more': 0.47692307692307695, '2': 0.015384615384615385}, 4: {'small': 0.3076923076923077, 'big': 0.36923076923076925, 'med': 0.3230769230769231}, 5: {'low': 0.015384615384615385, 'med': 0.5538461538461539, 'high': 0.4307692307692308}}, 'acc': {0: {'vhigh': 0.19197707736389685, 'low': 0.2349570200573066, 'med': 0.3037249283667622, 'high': 0.2693409742120344}, 1: {'vhigh': 
0.19484240687679083, 'low': 0.23209169054441262, 'med': 0.30659025787965616, 'high': 0.2664756446991404}, 2: {'3': 0.2521489971346705, '5more': 0.27507163323782235, '4': 0.2607449856733524, '2': 0.21203438395415472}, 3: {'4': 0.5028735632183908, 'more': 0.4942528735632184, '2': 0.0028735632183908046}, 4: {'small': 0.28448275862068967, 'big': 0.3735632183908046, 'med': 0.34195402298850575}, 5: {'low': 0.0028735632183908046, 'med': 0.45689655172413796, 'high': 0.5402298850574713}}}
The function that computes the (unnormalized) posterior score p(yi|X):
def posterior(X, yi):
    p = pre_p[yi]
    for vector_subindex in range(len(X)):
        p *= likehood[yi][vector_subindex][X[vector_subindex]]
    return p
Define the Bayes decision function:
def bayes_stragty(X):
    _label = None
    _maxp = -numpy.inf
    for class_item in pre_p:
        _pi = posterior(X, class_item)
        if _pi > _maxp:
            _maxp = _pi
            _label = class_item
    return _label
It iterates over all classes, computes each class's score, and outputs the class with the largest score as the prediction for X.
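One practical caveat: a product of many small probabilities can underflow for high-dimensional X. A common variant (a sketch of my own, not in the original code) compares log-probabilities instead, which preserves the argmax:

```python
import math

def log_score(priors, cond, X, y):
    # log p(y) + sum_i log p(x_i | y); same argmax as the product form
    return math.log(priors[y]) + sum(math.log(cond[y][i][x]) for i, x in enumerate(X))

# Toy distributions (made-up numbers) just to exercise the function
priors = {"acc": 0.3, "unacc": 0.7}
cond = {
    "acc":   [{"low": 0.6, "high": 0.4}],
    "unacc": [{"low": 0.2, "high": 0.8}],
}
best = max(priors, key=lambda y: log_score(priors, cond, ["low"], y))
```

Since smoothing guarantees every conditional probability is strictly positive, the logarithm is always defined.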
Let's test it:
test_x, test_y = data.test()
right_cnt = 0
total_cnt = data.test_num
for i in range(data.test_num):
    predict_y = bayes_stragty(test_x[i])
    if predict_y == test_y[i]:
        right_cnt += 1.0
print({'right_rate': right_cnt / total_cnt, 'error_rate': 1 - right_cnt / total_cnt})
The final accuracy and error rate:
{'error_rate': 0.12209302325581395, 'right_rate': 0.877906976744186}
87.7% accuracy — decent, nothing spectacular.
Complete code:
The CarData class in genData.py:
import random
import copy

class CarData(object):
    # Sample record: vhigh,vhigh,2,2,small,low,unacc
    # Fields: (buying price buying, maintenance price maint, number of doors door,
    #          seats persons, trunk size lug_boot, safety level safety, acceptability)
    # Buying and maintenance price levels: vhigh (very high), high, med (medium), low
    # Number of doors: 2, 3, 4, 5more (5more means 5 or more)
    # Seats: 2, 4, more
    # Trunk: small, med, big
    # Safety: low, med, high
    # Acceptability: unacc, acc, good, vgood (four classes)
    def __init__(self, datapath="car.data.txt", sperate_rate=0.1):
        x = []
        y = []
        with open(datapath) as fp:
            lines = fp.readlines()
            for each in lines:
                each = each.replace("\n", '')
                _da = each.split(',')
                _y = _da[-1]
                _x = _da[:-1]
                x.append(_x)
                y.append(_y)
        test_index = set()
        train_index = set()
        data_num = len(y)
        test_num = int(data_num * sperate_rate)
        while len(test_index) < test_num:
            test_index.add(random.randint(0, data_num - 1))
        for i in range(len(x)):
            if i not in test_index:
                train_index.add(i)
        self.test_index = copy.deepcopy(test_index)    # indices of test samples
        self.train_index = copy.deepcopy(train_index)  # indices of training samples
        self.data_num = data_num                       # total dataset size
        self.train_num = data_num - test_num           # training set size
        self.test_num = test_num                       # test set size
        self.x = copy.deepcopy(x)                      # feature vectors
        self.y = copy.deepcopy(y)                      # labels; (x[i], y[i]) correspond one-to-one

    def test(self):
        return self._sample(self.test_index)

    def train(self):
        return self._sample(self.train_index)

    def _sample(self, sampleSet):
        _x = []
        _y = []
        for each in sampleSet:
            _x.append(copy.deepcopy(self.x[each]))
            _y.append(copy.deepcopy(self.y[each]))
        return _x, _y
bayes.py, the Bayes model:
# coding:utf-8
__author__ = 'jmh081701'
import numpy
# This file demonstrates a naive Bayes model for classification, using the Car Evaluation dataset.
# All features in this example are discrete.
# Core of the Bayes model:
#   given a feature vector X, determine the class y that X belongs to.
# We turn this into choosing the yi that maximizes p(yi|X). To pick the maximum,
# we first need to compute these conditional probabilities.
# By Bayes' theorem: p(yi|X)=p(X,yi)/p(X)=p(X|yi)*p(yi)/p(X)=p(yi)*p(x1,x2,...,xn|yi)/p(X)
# We need p(X|yi); X is multi-dimensional, so this is a joint conditional probability.
# p(X) is the same for every class, so it can be ignored when comparing.
# Thus it suffices to compute p(yi)*p(x1,x2,...,xn|yi).
# The "naive" simplification assumes the features are mutually independent, so
# p(x1,x2,...,xn|yi) = p(x1|yi)*p(x2|yi)*...*p(xn|yi), which makes the computation feasible.
# In practice the distributions are computed offline: m + m*n of them, where m is the
# number of classes and n the feature dimension.
# Smoothing is applied so that a sparse dataset does not yield zero probabilities.
from tool.genData import CarData

data = CarData("..//tool//car.data.txt")
train_x, train_y = data.train()

# Accumulate the statistics for the probabilities
pre_p = {}
# e.g. {'acc': 54, 'unacc': 344, ...}
likehood = {}
# e.g. {'acc': {1: {'vhigh': 300, 'high': 100, ...}, 2: {'3': 2332, ...}}
for simple_index in range(data.train_num):
    i = simple_index
    if train_y[i] not in pre_p:
        pre_p.setdefault(train_y[i], 1)      # class train_y[i] seen for the first time
        likehood.setdefault(train_y[i], {})
    else:
        pre_p[train_y[i]] += 1               # accumulate the count of class train_y[i]
    for vector_subindex in range(len(train_x[i])):
        if vector_subindex not in likehood[train_y[i]]:
            likehood[train_y[i]].setdefault(vector_subindex, {})
        if train_x[i][vector_subindex] not in likehood[train_y[i]][vector_subindex]:
            likehood[train_y[i]][vector_subindex].setdefault(train_x[i][vector_subindex], 1)
        else:
            likehood[train_y[i]][vector_subindex][train_x[i][vector_subindex]] += 1

# Turn the counts into probabilities.
# Number of classes:
class_number = len(pre_p)
# Distinct values taken by each feature dimension:
vector_sub_dimension_values = {0: set(), 1: set(), 2: set()}
vector_len = len(train_x[0])
for each in likehood:
    for vector_subindex in range(vector_len):
        if vector_subindex not in vector_sub_dimension_values:
            vector_sub_dimension_values.setdefault(vector_subindex, set())
        for item in likehood[each][vector_subindex]:
            vector_sub_dimension_values[vector_subindex].add(item)
print(vector_sub_dimension_values)
print(pre_p)
print(likehood)

# Conditional probabilities with add-one smoothing
for class_item in likehood:
    for vector_subindex in range(vector_len):
        for item in vector_sub_dimension_values[vector_subindex]:
            N = len(vector_sub_dimension_values[vector_subindex])
            if item not in likehood[class_item][vector_subindex]:
                likehood[class_item][vector_subindex].setdefault(item, 1.0 / (pre_p[class_item] + N))
            else:
                likehood[class_item][vector_subindex][item] = (likehood[class_item][vector_subindex][item] + 1.0) / (pre_p[class_item] + N)

# Class priors
for class_item in pre_p:
    pre_p[class_item] = (pre_p[class_item] + 1.0) / (data.train_num + len(pre_p))
print(pre_p)
print(likehood)

# Unnormalized posterior score p(yi|X)
def posterior(X, yi):
    p = pre_p[yi]
    for vector_subindex in range(len(X)):
        p *= likehood[yi][vector_subindex][X[vector_subindex]]
    return p

# Bayes decision: pick the class with the largest score
def bayes_stragty(X):
    _label = None
    _maxp = -numpy.inf
    for class_item in pre_p:
        _pi = posterior(X, class_item)
        if _pi > _maxp:
            _maxp = _pi
            _label = class_item
    return _label

# Run on the test set:
test_x, test_y = data.test()
right_cnt = 0
total_cnt = data.test_num
for i in range(data.test_num):
    predict_y = bayes_stragty(test_x[i])
    if predict_y == test_y[i]:
        right_cnt += 1.0
print({'right_rate': right_cnt / total_cnt, 'error_rate': 1 - right_cnt / total_cnt})