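Theoretical derivation: https://blog.csdn.net/ACM_hades/article/details/89677342
Dataset
- Dataset: MNIST. Each image is 28×28 pixels and there are 10 classes. We use the raw pixel values as features, so every sample has 28×28=784 features.
- Naive Bayes is best suited to low-dimensional features, but MNIST has hundreds of features, and the product of that many factors goes beyond what a Python float can represent.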
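- Because a Python float has a limited range, the product of 784 floating-point factors becomes Inf once the factors are scaled above 1 (and can underflow to 0 if they are raw probabilities). A Python int, by contrast, has arbitrary precision and can be multiplied without overflow, so we can exploit this property and rework the prior and conditional probabilities. From the decision function

  $$y = f(x) = \arg\max_{c_k} P(Y=c_k) \prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)$$

  we can see that multiplying every prior $P(Y=c_k)$ by the same factor $N$, and every conditional probability $P(X^{(j)}=x^{(j)} \mid Y=c_k)$ by the same factor $M$, does not change which class attains the maximum.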
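- Prior probability: every prior has the same denominator $N$ (the number of training samples), so there is no need to divide by $N$; the numerator, i.e. the number of samples in each class, is used directly.
- Conditional probability: the conditional probability formula is shown in the figure below. After computing each probability we multiply it by 10000, which maps it into [0, 10000]. To keep any value from being 0, we then add 1, so the final values lie in [1, 10001].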
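The overflow problem and the scaling argument above can be checked with a few lines of Python. This is a minimal sketch, not part of the original code: the factor value 5001 and the toy `priors` and `conds` tables are made-up illustrations, and `best_class` is a hypothetical helper.

```python
import math

# 784 factors of 5001, a mid-range value from [1, 10001] (illustrative only).
factors = [5001] * 784

# Float product: exceeds the float64 maximum (about 1.8e308) and becomes inf.
float_product = 1.0
for f in factors:
    float_product *= float(f)

# Int product: Python ints have arbitrary precision, so the result is exact.
int_product = 1
for f in factors:
    int_product *= f

print(math.isinf(float_product))  # True
print(len(str(int_product)))      # number of decimal digits of the exact product

# Scaling every prior by N and every conditional probability by M
# leaves the winning class unchanged.
priors = [0.3, 0.7]               # hypothetical P(Y=c_k)
conds = [[0.9, 0.8], [0.2, 0.6]]  # hypothetical P(X^(j)=x^(j) | Y=c_k)
N, M = 1000, 10000

def best_class(priors, conds):
    """Index of the class with the largest prior * product of conditionals."""
    scores = [p * math.prod(c) for p, c in zip(priors, conds)]
    return scores.index(max(scores))

unscaled = best_class(priors, conds)
scaled = best_class([p * N for p in priors],
                    [[q * M for q in c] for c in conds])
print(unscaled == scaled)         # True
```

Note that the extra `+ 1` used in the article is not a pure rescaling; it is a small offset that keeps every factor non-zero.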
Code
# encoding=utf-8
import time
from collections import defaultdict

import cv2
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


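class Naive_bayes(object):
    def __init__(self):
        self.class_num = 10      # number of digit classes
        self.feature_len = 784   # 28 x 28 pixels per image

    # Binarize every image in place
    def binaryzation(self, img):
        for i in range(len(img)):
            # cv2.threshold needs uint8 input
            cv_img = img[i].astype(np.uint8)
            # Map the 0-255 values to 0/1.
            # THRESH_BINARY_INV: pixels > 50 become 0, all others become 1.
            _, binary = cv2.threshold(cv_img, 50, 1, cv2.THRESH_BINARY_INV)
            img[i] = binary.ravel()
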
    # Training: count class frequencies and per-pixel value frequencies
    def Train(self, trainset, train_labels):
        trainset = list(trainset)
        train_labels = list(train_labels)
        # Group the training images by label
        data = defaultdict(list)
        for i in range(len(train_labels)):
            data[train_labels[i]].append(trainset[i])
        self.prior_probability = np.zeros(self.class_num)  # prior, stored as class counts
        self.conditional_probability = np.zeros((self.class_num, self.feature_len, 2))  # conditional probability
        for label in data:
            self.prior_probability[label] = len(data[label])
            # For each pixel, the number of images of this class in which it equals 1
            temp = list(np.sum(np.array(data[label]), axis=0))
            for j in range(self.feature_len):
                self.conditional_probability[label][j][1] += temp[j]
                self.conditional_probability[label][j][0] += (self.prior_probability[label] - temp[j])
        # Map the probabilities into [1, 10001]
        for i in range(self.class_num):
            for j in range(self.feature_len):
                # After binarization a pixel only takes the values 0 and 1
                pix_0 = self.conditional_probability[i][j][0]
                pix_1 = self.conditional_probability[i][j][1]
                # Scaled conditional probabilities for pixel values 0 and 1
                probability_0 = (float(pix_0) / float(pix_0 + pix_1)) * 10000 + 1
                probability_1 = (float(pix_1) / float(pix_0 + pix_1)) * 10000 + 1
                self.conditional_probability[i][j][0] = probability_0
                self.conditional_probability[i][j][1] = probability_1

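    # Prediction: multiply the prior by the scaled conditional probability of every pixel
    def Predict(self, testset):
        predict = []
        for img in testset:
            # Must be a list of Python ints, otherwise the product overflows
            temp = []
            for j in range(self.class_num):
                temp.append(int(self.prior_probability[j]))
            for i in range(len(img)):
                # Scores of all classes for the observed value of pixel i
                temp_1 = self.conditional_probability[:, i, img[i]]
                for j in range(self.class_num):
                    temp[j] *= int(temp_1[j])
            max_label = np.argmax(temp)
            predict.append(max_label)
        return np.array(predict)

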
if __name__ == '__main__':
    Model = Naive_bayes()
    print('Start read data')
    S = time.time()
    raw_data = pd.read_csv('../data/train.csv')  # read the data
    data = raw_data.values  # convert to a NumPy array
    print("data shape:", data.shape)
    imgs = data[:, 1:]   # pixel columns
    labels = data[:, 0]  # first column is the label
    Model.binaryzation(imgs)
    print("imgs shape:", imgs.shape)
    print("labels shape:", labels.shape)
    # Use about 2/3 of the data for training and 1/3 for testing
    train_features, test_features, train_labels, test_labels = train_test_split(
        imgs, labels, test_size=0.33, random_state=23323)
    print("train data count :%d" % len(train_labels))
    print("test data count :%d" % len(test_labels))
    print('read data cost ', time.time() - S, ' second')
    print('Start training')
    S = time.time()
    Model.Train(train_features, train_labels)
    print('training cost ', time.time() - S, ' second')
    print('Start predicting')
    S = time.time()
    test_predict = Model.Predict(test_features)
    print('predicting cost ', time.time() - S, ' second')
    score = accuracy_score(test_labels, test_predict)
    print("The accruacy socre is ", score)
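The binarization step in the code above relies on OpenCV. For readers who want to see what it computes, here is a rough NumPy equivalent. It is only a sketch: it assumes that `cv2.THRESH_BINARY_INV` with threshold 50 and maximum value 1 sends pixels greater than 50 to 0 and all other pixels to 1, and the helper name `binarize_inv` is hypothetical. Unlike `binaryzation`, it returns a new array instead of modifying its input.

```python
import numpy as np

def binarize_inv(imgs, thresh=50):
    """Return 1 where pixel <= thresh and 0 elsewhere (inverse binary threshold)."""
    imgs = np.asarray(imgs)
    return (imgs <= thresh).astype(np.uint8)

# One tiny "image" with values around the threshold.
sample = np.array([[0, 50, 51, 255]])
binary = binarize_inv(sample)
print(binary)  # [[1 1 0 0]]
```

With this inverse threshold the dark background becomes 1 and the bright digit strokes become 0, which is why the feature value 1 marks background pixels in this model.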
Output:
Start read data
data shape: (42000, 785)
imgs shape: (42000, 784)
labels shape: (42000,)
train data count :28140
test data count :13860
read data cost 4.07903265953064 second
Start training
training cost 0.21043634414672852 second
Start predicting
predicting cost 93.11626148223877 second
The accruacy socre is 0.8331168831168831
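Prediction dominates the run time in the log above, since every image needs 784×10 big-integer multiplications. A common alternative, which is not what the code above does, is to add logarithms instead of multiplying factors; the sums stay well within float range and the whole test set can be scored with matrix products. The sketch below assumes tables shaped like `prior_probability` (one positive score per class) and `conditional_probability` (classes × features × 2), plus 0/1 features. The function name `predict_log` and the toy numbers are hypothetical.

```python
import numpy as np

def predict_log(prior, cond, X):
    """Return argmax_k of log prior[k] + sum_j log cond[k, j, X[i, j]] per sample.

    prior: (K,) positive scores; cond: (K, D, 2) positive scores;
    X: (n, D) array of 0/1 features.
    """
    X = np.asarray(X)
    log_prior = np.log(prior)        # shape (K,)
    log_c0 = np.log(cond[:, :, 0])   # shape (K, D), score when the pixel is 0
    log_c1 = np.log(cond[:, :, 1])   # shape (K, D), score when the pixel is 1
    # Each pixel contributes log_c1 if it is 1 and log_c0 if it is 0.
    scores = X @ log_c1.T + (1 - X) @ log_c0.T + log_prior   # shape (n, K)
    return np.argmax(scores, axis=1)

# Toy check with 2 classes and 3 features (all values are made up).
prior = np.array([3.0, 5.0])
cond = np.array([[[9001.0, 1001.0], [8001.0, 2001.0], [1001.0, 9001.0]],
                 [[1001.0, 9001.0], [2001.0, 8001.0], [9001.0, 1001.0]]])
X = np.array([[0, 0, 1], [1, 1, 0]])
labels = predict_log(prior, cond, X)
print(labels)  # [0 1]
```

Because the original `Predict` truncates each factor with `int()` and works with exact integers, the two approaches may disagree on samples whose top two class scores are nearly tied.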