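After all, it is Meihu Lake (眉湖) in the sixth month, and its scenery is unlike that of any other season.
(A snapshot taken on the way to the library.)
Main text: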
We all know a little about Bayesian theory, but what is naive Bayes, and what makes it "naive"? See the figure below:
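For reference, the usual textbook statement of the "naive" assumption is conditional independence of the features given the class. The notation here is an assumption and may differ from the figure: X^{(j)} is the j-th feature, c_k the k-th class, and n the number of features.

```latex
P(X = x \mid Y = c_k)
  = P\bigl(X^{(1)} = x^{(1)}, \dots, X^{(n)} = x^{(n)} \mid Y = c_k\bigr)
  = \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_k\bigr)
```

In other words, once the class is known, each feature is treated as if it told us nothing about the others, which is rarely true of real data but makes the model very cheap to estimate.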
The naive Bayes algorithm proceeds as follows:
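In its common textbook form, the procedure boils down to three steps: estimate the prior P(Y = c_k) and the per-feature conditionals P(X^{(j)} = x^{(j)} | Y = c_k) by counting over the training set, evaluate their product for each class on a new sample x, and pick the class with the largest value. A sketch of that decision rule, with notation that is assumed rather than taken from the figure (n features, classes c_k):

```latex
y = \arg\max_{c_k} \; P(Y = c_k) \prod_{j=1}^{n} P\bigl(X^{(j)} = x^{(j)} \mid Y = c_k\bigr)
```

The denominator of Bayes' rule is the same for every class, so it can be dropped without changing which class wins.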
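In practice, however, the dataset we use may be biased or may contain too few examples. Put simply, suppose we roll a die and tally how often each face comes up in order to judge whether the die has been tampered with. If we roll it only six times, find that three never comes up, and conclude that the die is rigged, that conclusion is clearly unreliable. To avoid this situation, we make sure before rolling that no face starts out with a frequency of 0.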
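The die example can be made concrete with a small sketch. The six rolls below are made-up illustrative data, and adding 1 to every count (add-one smoothing) is one common way to keep every face's estimated frequency above 0.

```python
import numpy as np

# Hypothetical outcome of six rolls in which face 3 never shows up.
rolls = np.array([1, 2, 2, 4, 5, 6])

# Count occurrences of faces 1..6 (bincount index 0 is "face 0", so drop it).
counts = np.bincount(rolls, minlength=7)[1:]

# Raw relative frequencies: face 3 gets exactly 0.
raw_freq = counts / counts.sum()

# Add-one smoothing: pretend every face was seen once before rolling.
smoothed_freq = (counts + 1) / (counts.sum() + 6)

print("raw     :", raw_freq)
print("smoothed:", smoothed_freq)
```

The smoothed frequencies still sum to 1, and the effect of the extra pseudo-counts fades as the number of real rolls grows.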
In naive Bayes, we handle this as follows:
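The standard way to write this correction is the smoothed (Bayesian) estimate below. The notation is assumed: N is the number of training samples, K the number of classes, S_j the number of values feature j can take, I(·) the indicator function, and λ ≥ 0 the smoothing constant. Setting λ = 1 gives Laplace smoothing.

```latex
P_\lambda\bigl(X^{(j)} = a \mid Y = c_k\bigr)
  = \frac{\sum_{i=1}^{N} I\bigl(x_i^{(j)} = a,\ y_i = c_k\bigr) + \lambda}
         {\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda}
\qquad
P_\lambda(Y = c_k)
  = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K \lambda}
```

For MNIST with raw pixel values as features, this would mean S_j = 256 (pixel values 0 to 255) and K = 10 (digits 0 to 9).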
As before, we use the MNIST dataset to train and test the algorithm. The code is as follows:
import time
import numpy as np


def calculate_py(train_set):
    # Count how many training samples belong to each digit class (0-9).
    py = [0] * 10
    for i in range(train_set.shape[0]):
        py[train_set[i][0]] += 1
    return py


def calculate_pxy(train_set, py):
    # pxy_array[k][j][v] estimates P(pixel j == v | label == k).
    # Counts start at 1 (add-one smoothing), so no probability is ever 0.
    pxy_array = np.ones((10, 784, 256))
    for i in range(train_set.shape[0]):
        for j in range(train_set.shape[1] - 1):
            pxy_array[train_set[i][0]][j][train_set[i][j + 1]] += 1
    for i in range(10):
        # Each pixel can take 256 values, hence the +256 in the denominator.
        pxy_array[i] = pxy_array[i] / (py[i] + 256)
    return pxy_array


def predict(pxy_array, test_set, py_i):
    # For every class, multiply the prior by the 784 per-pixel likelihoods.
    # Note: a product of this many factors below 1 can underflow to 0.
    calculate_result = np.ones((test_set.shape[0], 10))
    for i in range(test_set.shape[0]):
        for k in range(10):
            calculate_result[i][k] *= py_i[k]
            for j in range(test_set.shape[1] - 1):
                calculate_result[i][k] *= pxy_array[k][j][test_set[i][j + 1]]
    predict_result = np.argmax(calculate_result, axis=1)
    return predict_result


def calculate_accuracy_rate(test_set, predict_result):
    # Fraction of test samples whose predicted label matches column 0.
    cnt = 0
    for i in range(test_set.shape[0]):
        if test_set[i][0] == predict_result[i]:
            cnt += 1
    return cnt / test_set.shape[0]


start = time.time()
train_set = []
test_set = []
# Each CSV row is a label followed by 784 pixel values in the range 0-255.
with open("D:\\bpnetwork\\mnist_train.csv", 'r') as train_file:
    for line in train_file:
        train_set.append(list(map(int, line.split(','))))
train_set = np.array(train_set)
with open("D:\\bpnetwork\\mnist_test.csv", 'r') as test_file:
    for line in test_file:
        test_set.append(list(map(int, line.split(','))))
test_set = np.array(test_set)

py = calculate_py(train_set)
pxy_array = calculate_pxy(train_set, py)
# Smoothed class priors: (count + 1) / (N + 10), one pseudo-count per class.
py_i = [(x + 1) / (train_set.shape[0] + 10) for x in py]
predict_result = predict(pxy_array, test_set, py_i)
r = calculate_accuracy_rate(test_set, predict_result)
print('the accuracy rate:', r, '\n')
print('the time:', time.time() - start, 's')
The output of the run is shown above. The accuracy is extremely poor; I could probably do better by guessing. So where did things go wrong?
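The logic of the algorithm itself is sound, so the problem most likely lies in the implementation. Note that most entries of pxy_array are less than 1. After multiplying 784 of them together, the product can underflow and be stored as exactly 0, which obviously has a major impact on accuracy: when every class scores 0, argmax cannot tell them apart. So, drawing on earlier experience, we should take logarithms and turn the product into a sum. This time we use base 2.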
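A small sketch of the underflow, using made-up per-pixel probabilities (0.01 and 0.02) rather than values from the trained model: both products collapse to 0 in 64-bit floats, while the sums of base-2 logarithms stay finite and still rank the classes correctly.

```python
import numpy as np

# Two hypothetical classes, each with 784 identical per-pixel likelihoods.
class_probs = np.array([np.full(784, 0.01),   # class 0: less likely
                        np.full(784, 0.02)])  # class 1: more likely

# Direct products underflow to exactly 0.0 for both classes.
products = class_probs.prod(axis=1)

# Sums of log2 values remain ordinary finite negative numbers.
log_sums = np.log2(class_probs).sum(axis=1)

print("products:", products, "-> argmax:", np.argmax(products))
print("log sums:", log_sums, "-> argmax:", np.argmax(log_sums))
```

Because the logarithm is monotonically increasing, the class with the largest log-sum is also the class with the largest true product, so the prediction rule is unchanged.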
import time
import numpy as np


def calculate_py(train_set):
    # Count how many training samples belong to each digit class (0-9).
    py = [0] * 10
    for i in range(train_set.shape[0]):
        py[train_set[i][0]] += 1
    return py


def calculate_pxy(train_set, py):
    # pxy_array[k][j][v] estimates P(pixel j == v | label == k).
    # Counts start at 1 (add-one smoothing), so no probability is ever 0.
    pxy_array = np.ones((10, 784, 256))
    for i in range(train_set.shape[0]):
        for j in range(train_set.shape[1] - 1):
            pxy_array[train_set[i][0]][j][train_set[i][j + 1]] += 1
    for i in range(10):
        # Each pixel can take 256 values, hence the +256 in the denominator.
        pxy_array[i] = pxy_array[i] / (py[i] + 256)
    return pxy_array


def predict(pxy_array, test_set, py_i):
    # pxy_array and py_i hold log2 probabilities here, so we add instead of
    # multiplying. The running sum therefore starts from 0, not 1.
    calculate_result = np.zeros((test_set.shape[0], 10))
    for i in range(test_set.shape[0]):
        for k in range(10):
            calculate_result[i][k] += py_i[k]
            for j in range(test_set.shape[1] - 1):
                calculate_result[i][k] += pxy_array[k][j][test_set[i][j + 1]]
    predict_result = np.argmax(calculate_result, axis=1)
    return predict_result


def calculate_accuracy_rate(test_set, predict_result):
    # Fraction of test samples whose predicted label matches column 0.
    cnt = 0
    for i in range(test_set.shape[0]):
        if test_set[i][0] == predict_result[i]:
            cnt += 1
    return cnt / test_set.shape[0]


start = time.time()
train_set = []
test_set = []
# Each CSV row is a label followed by 784 pixel values in the range 0-255.
with open("D:\\bpnetwork\\mnist_train.csv", 'r') as train_file:
    for line in train_file:
        train_set.append(list(map(int, line.split(','))))
train_set = np.array(train_set)
with open("D:\\bpnetwork\\mnist_test.csv", 'r') as test_file:
    for line in test_file:
        test_set.append(list(map(int, line.split(','))))
test_set = np.array(test_set)

py = calculate_py(train_set)
pxy_array = calculate_pxy(train_set, py)
# Move the conditional probabilities into log2 space.
pxy_array = np.log2(pxy_array)
# Smoothed class priors: (count + 1) / (N + 10), then log2.
py_i = np.array([(x + 1) / (train_set.shape[0] + 10) for x in py])
py_i = np.log2(py_i)
predict_result = predict(pxy_array, test_set, py_i)
r = calculate_accuracy_rate(test_set, predict_result)
print('the accuracy rate:', r, '\n')
print('the time:', time.time() - start, 's')