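Data Preparation
Naive Bayes handles multi-class classification, so the original handwritten-digit dataset is used again. During implementation, however, it turned out that for naive Bayes, too high a dimensionality or too wide a value range (each MNIST row has 784 dimensions, and each dimension takes one of 256 values, 0 to 255) makes the computed probabilities vanishingly small or even zero, especially once Laplace smoothing is applied.
The image data is therefore binarized, which compresses the value range of each pixel from 256 values to 2. This reduces the probability dispersion to some extent, at the cost of some information. The binarization is done directly in the code, so no new dataset is generated in advance.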
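
To make the binarization step concrete, here is a minimal sketch using plain NumPy. The helper name `binarize` and the sample pixel values are illustrative assumptions; the threshold of 127 matches the one used with `Binarizer` in the code further down, where values strictly greater than the threshold become 1.

```python
import numpy as np


def binarize(images, threshold=127):
    """Map pixel values greater than `threshold` to 1 and all others to 0."""
    return (np.asarray(images) > threshold).astype(np.int64)


# Hypothetical row of grayscale pixel values in the 0..255 range.
pixels = np.array([[0, 12, 127, 128, 255]])
binary = binarize(pixels)
print(binary)  # [[0 0 0 1 1]]
```

After this step every feature takes one of only two values, which is why the classifier below is constructed with `Sj=2`.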
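Naive Bayes Algorithm
Naive Bayes is called "naive" because it makes a conditional independence assumption on the conditional probability distribution.
It builds on Bayes' theorem and the assumption that features are conditionally independent: given a training set, it first learns the joint probability distribution of input and output under that assumption; then, for a given input $x$, it uses the learned model and Bayes' theorem to find the output $y$ with the largest posterior probability.
The steps of naive Bayes based on maximum likelihood estimation are as follows: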
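
For reference, a sketch of these steps in the standard notation of 《统计学习方法》. The symbols $N$ (number of training samples), $a_{jl}$ (the $l$-th value of feature $j$), $S_j$ (number of values of feature $j$), and the indicator $I(\cdot)$ are assumptions taken from that book, not defined in this post.

Step 1, estimate the prior and the conditional probabilities by counting:

$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\quad k=1,\dots,K$$

$$P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)},\quad j=1,\dots,n;\ l=1,\dots,S_j$$

Step 2, for a given input $x$, compute $P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$ for every class $c_k$.

Step 3, output the class with the largest value:

$$y=\mathop{\arg\max}\limits_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$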
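Because a maximum likelihood estimate of a probability can be 0, which leads to wrong classifications, Bayesian estimation is generally used instead: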
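
As a sketch in the same book notation (again an assumption, with $\lambda \ge 0$ as the smoothing parameter), the smoothed estimates are:

$$P_\lambda(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\lambda}$$

$$P_\lambda(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$$

Setting $\lambda=0$ recovers maximum likelihood estimation, and $\lambda=1$ is Laplace smoothing. These correspond to the `_lambda`, `Sj`, and `K` parameters in the code.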
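The steps based on Bayesian estimation and on maximum likelihood estimation are essentially the same; the only difference is whether the probabilities are smoothed. The code for naive Bayes based on Bayesian estimation is as follows: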
# @Author: phd
# @Date: 2019/7/10
# @Site: github.com/phdsky
# @Description: NULL

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Binarizer


def calc_accuracy(y_pred, y_truth):
    assert len(y_pred) == len(y_truth)
    n = len(y_pred)

    hit_count = 0
    for i in range(0, n):
        if y_pred[i] == y_truth[i]:
            hit_count += 1

    print("Predicting accuracy %f\n" % (hit_count / n))


class NaiveBayes(object):
    def __init__(self, _lambda, Sj, K):
        self._lambda = _lambda  # Smoothing parameter (1 = Laplace smoothing)
        self.Sj = Sj  # Number of values per feature (assumed the same for every feature)
        self.K = K  # Number of classes

    # Use Bayesian estimation rather than maximum likelihood estimation,
    # so that no estimated probability is exactly 0
    def train(self, X_train, y_train):
        # Count occurrences for the prior and the conditional probabilities
        N = len(y_train)
        D = X_train.shape[1]  # Dimension

        prior = np.full(self.K, 0)
        condition = np.full((self.K, D, self.Sj), 0)
        # conditional_probability = np.full((self.K, D, self.Sj), 0.)

        for i in range(0, N):
            prior[y_train[i]] += 1
            for j in range(0, D):
                condition[y_train[i]][j][X_train[i][j]] += 1

        prior_probability = (prior + self._lambda) / (N + self.K*self._lambda)

        # Too slow
        # for i in range(0, self.K):
        #     for j in range(0, D):
        #         for k in range(0, self.Sj):
        #             conditional_probability[i][j][k] = \
        #                 (condition[i][j][k] + self._lambda) / (sum(condition[i][j]) + self.Sj*self._lambda)

        # Return the raw counts; conditional probabilities are computed on demand in predict
        return prior_probability, condition  # , conditional_probability

    def predict(self, prior_probability, condition, X_test):
        n = len(X_test)
        d = X_test.shape[1]

        predict_label = np.full(n, -1)

        for i in range(0, n):
            predict_probability = np.full(self.K, 1.)
            to_predict = X_test[i]

            for j in range(0, self.K):
                prior_prob = prior_probability[j]

                # If d or self.Sj is large, predict_probability gets close to 0
                for k in range(0, d):
                    conditional_probability = \
                        (condition[j][k][to_predict[k]] + self._lambda) / (sum(condition[j][k]) + self.Sj*self._lambda)
                    predict_probability[j] *= conditional_probability

                predict_probability[j] *= prior_prob

            predict_label[i] = np.argmax(predict_probability)
            print("Sample %d predicted as %d" % (i, predict_label[i]))

        return predict_label


def example_large():
    mnist_data = pd.read_csv("../data/mnist.csv")
    mnist_values = mnist_data.values

    images = mnist_values[::, 1::]
    labels = mnist_values[::, 0]

    X_train, X_test, y_train, y_test = train_test_split(
        images, labels, test_size=100, random_state=42
    )

    # Binarize the images so that predict_probability does not collapse to 0
    binarizer_train = Binarizer(threshold=127).fit(X_train)
    X_train_binary = binarizer_train.transform(X_train)

    binarizer_test = Binarizer(threshold=127).fit(X_test)
    X_test_binary = binarizer_test.transform(X_test)

    # Laplace smoothing
    # After binarization X takes the values 0/1, so Sj = 2 (same range on every axis)
    # Y values 0~9 = 10
    naive_bayes = NaiveBayes(_lambda=1, Sj=2, K=10)

    print("Start naive bayes training...")
    prior, conditional = naive_bayes.train(X_train=X_train_binary, y_train=y_train)

    print("Testing on %d samples..." % len(X_test))
    y_predicted = naive_bayes.predict(prior_probability=prior,
                                      condition=conditional,
                                      X_test=X_test_binary)

    calc_accuracy(y_pred=y_predicted, y_truth=y_test)


def example_small():
    X_train = np.asarray([[0, 0], [0, 1], [0, 1], [0, 0], [0, 0],
                          [1, 0], [1, 1], [1, 1], [1, 2], [1, 2],
                          [2, 2], [2, 1], [2, 1], [2, 2], [2, 2]])

    y_train = np.asarray([0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

    X_test = np.asarray([[1, 0]])

    naive_bayes = NaiveBayes(_lambda=1, Sj=3, K=2)

    print("Start naive bayes training...")
    prior, conditional = naive_bayes.train(X_train=X_train, y_train=y_train)

    print("Testing on %d samples..." % len(X_test))
    naive_bayes.predict(prior_probability=prior,
                        condition=conditional,
                        X_test=X_test)


if __name__ == "__main__":
    # example_small()
    example_large()
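
The small example can also be checked by hand. The sketch below recomputes the smoothed, unnormalized posterior for the test point `[1, 0]` with exact fractions, independently of the `NaiveBayes` class; the helper name `unnormalized_posterior` is an illustrative assumption.

```python
from fractions import Fraction

# Same toy data as example_small (features and labels shifted to start at 0).
X_train = [[0, 0], [0, 1], [0, 1], [0, 0], [0, 0],
           [1, 0], [1, 1], [1, 1], [1, 2], [1, 2],
           [2, 2], [2, 1], [2, 1], [2, 2], [2, 2]]
y_train = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0]
x = [1, 0]
lam, Sj, K = 1, 3, 2


def unnormalized_posterior(label):
    """Smoothed P(Y=label) * prod_j P(X_j = x_j | Y=label), as an exact fraction."""
    rows = [xi for xi, yi in zip(X_train, y_train) if yi == label]
    score = Fraction(len(rows) + lam, len(y_train) + K * lam)
    for j, value in enumerate(x):
        matches = sum(1 for row in rows if row[j] == value)
        score *= Fraction(matches + lam, len(rows) + Sj * lam)
    return score


scores = {label: unnormalized_posterior(label) for label in range(K)}
for label, score in scores.items():
    print(label, score, round(float(score), 4))
# 0 28/459 0.061
# 1 5/153 0.0327
```

Class 0 has the larger value, so `[1, 0]` is assigned to class 0.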
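A few points to note about the implementation:
- In `train`, computing every conditional probability takes far too long, and not all of them are needed anyway, so the training function returns the raw conditional counts and the probabilities are computed on demand at prediction time.
- The code originally used the raw MNIST data (784 dimensions, 256 values each), but the predicted label was always 0. I therefore tested with the example data from the book (class labels and feature values are shifted, which does not affect the result) and found that the computed posterior probabilities were correct, which shows the algorithm itself is fine.
- After switching back to raw MNIST, debugging showed that repeatedly multiplying conditional probabilities and the prior during the posterior computation drives the value down until it reaches 0. I tried lowering the smoothing lambda and multiplying by a constant scale factor inside the loop, but neither helped: with 784 dimensions and 256 values each, the product really does keep shrinking. After binarizing the input images, with the dimensionality left unchanged for now, the code produced reasonably correct output.
- Further debugging showed that the final posterior is still tiny (on the order of e to the minus several hundred), but at least the values can be compared, so I stopped tuning there. If that is still not enough, reducing the dimensionality (equivalent to cropping or resizing the images) is an option, though it would probably lower accuracy further.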
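
A common alternative to multiplying probabilities, which the code above does not use, is to add their logarithms: the argmax is unchanged and the sum does not underflow. The sketch below also replaces the slow triple loop with array operations. The function names `count_table` and `log_scores` are illustrative assumptions, and the demo reuses the toy data from `example_small`.

```python
import numpy as np


def count_table(X, y, K, Sj):
    """Count how often feature j takes value v within class k; shape (K, D, Sj)."""
    N, D = X.shape
    counts = np.zeros((K, D, Sj), dtype=np.int64)
    rows = np.repeat(y, D)             # class of each (sample, feature) pair
    cols = np.tile(np.arange(D), N)    # feature index of each pair
    # np.add.at accumulates correctly even when the same cell is hit many times.
    np.add.at(counts, (rows, cols, X.ravel()), 1)
    return counts


def log_scores(X, class_counts, counts, lam):
    """Return log P(Y=k) + sum_j log P(X_j = x_j | Y=k); shape (K, n_samples)."""
    K, D, Sj = counts.shape
    log_prior = np.log((class_counts + lam) / (class_counts.sum() + K * lam))
    totals = counts.sum(axis=2, keepdims=True)          # samples per class, (K, D, 1)
    log_cond = np.log((counts + lam) / (totals + Sj * lam))
    picked = log_cond[:, np.arange(D), X]               # (K, n_samples, D)
    return log_prior[:, None] + picked.sum(axis=2)


# Multiplying 784 small factors underflows to exactly 0; summing logs stays finite.
tiny = np.full(784, 1e-3)
print(np.prod(tiny), np.log(tiny).sum())

X_train = np.asarray([[0, 0], [0, 1], [0, 1], [0, 0], [0, 0],
                      [1, 0], [1, 1], [1, 1], [1, 2], [1, 2],
                      [2, 2], [2, 1], [2, 1], [2, 2], [2, 2]])
y_train = np.asarray([0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0])
X_test = np.asarray([[1, 0]])

counts = count_table(X_train, y_train, K=2, Sj=3)
scores = log_scores(X_test, np.bincount(y_train, minlength=2), counts, lam=1)
print(scores.argmax(axis=0))  # [0]
```

Exponentiating `scores` for the toy data gives back the same unnormalized posteriors as the product form, so the two approaches agree wherever the product does not underflow.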
Output:
/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/naive_bayes/naive_bayes.py
Start naive bayes training...
Testing on 100 samples...
Sample 0 predicted as 8
Sample 1 predicted as 1
Sample 2 predicted as 9
Sample 3 predicted as 9
Sample 4 predicted as 8
Sample 5 predicted as 5
Sample 6 predicted as 2
Sample 7 predicted as 2
Sample 8 predicted as 7
Sample 9 predicted as 1
Sample 10 predicted as 6
Sample 11 predicted as 3
Sample 12 predicted as 1
Sample 13 predicted as 2
Sample 14 predicted as 7
Sample 15 predicted as 4
Sample 16 predicted as 3
Sample 17 predicted as 3
Sample 18 predicted as 6
Sample 19 predicted as 4
Sample 20 predicted as 0
Sample 21 predicted as 5
Sample 22 predicted as 2
Sample 23 predicted as 6
Sample 24 predicted as 0
Sample 25 predicted as 0
Sample 26 predicted as 0
Sample 27 predicted as 8
Sample 28 predicted as 6
Sample 29 predicted as 3
Sample 30 predicted as 5
Sample 31 predicted as 6
Sample 32 predicted as 1
Sample 33 predicted as 2
Sample 34 predicted as 8
Sample 35 predicted as 6
Sample 36 predicted as 7
Sample 37 predicted as 3
Sample 38 predicted as 6
Sample 39 predicted as 1
Sample 40 predicted as 9
Sample 41 predicted as 7
Sample 42 predicted as 4
Sample 43 predicted as 6
Sample 44 predicted as 8
Sample 45 predicted as 3
Sample 46 predicted as 4
Sample 47 predicted as 2
Sample 48 predicted as 7
Sample 49 predicted as 8
Sample 50 predicted as 4
Sample 51 predicted as 3
Sample 52 predicted as 3
Sample 53 predicted as 7
Sample 54 predicted as 1
Sample 55 predicted as 8
Sample 56 predicted as 6
Sample 57 predicted as 2
Sample 58 predicted as 9
Sample 59 predicted as 6
Sample 60 predicted as 6
Sample 61 predicted as 0
Sample 62 predicted as 9
Sample 63 predicted as 8
Sample 64 predicted as 5
Sample 65 predicted as 5
Sample 66 predicted as 4
Sample 67 predicted as 3
Sample 68 predicted as 9
Sample 69 predicted as 3
Sample 70 predicted as 9
Sample 71 predicted as 4
Sample 72 predicted as 2
Sample 73 predicted as 8
Sample 74 predicted as 1
Sample 75 predicted as 6
Sample 76 predicted as 3
Sample 77 predicted as 7
Sample 78 predicted as 0
Sample 79 predicted as 3
Sample 80 predicted as 1
Sample 81 predicted as 7
Sample 82 predicted as 6
Sample 83 predicted as 7
Sample 84 predicted as 6
Sample 85 predicted as 1
Sample 86 predicted as 9
Sample 87 predicted as 5
Sample 88 predicted as 3
Sample 89 predicted as 6
Sample 90 predicted as 4
Sample 91 predicted as 3
Sample 92 predicted as 7
Sample 93 predicted as 2
Sample 94 predicted as 6
Sample 95 predicted as 5
Sample 96 predicted as 2
Sample 97 predicted as 9
Sample 98 predicted as 3
Sample 99 predicted as 5
Predicting accuracy 0.880000
Process finished with exit code 0
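Of the 100 test samples, 88 were predicted correctly, which is not bad given how much information binarization discards. The naive Bayes assumption that features are conditionally independent probably loses some information as well. Overall the algorithm runs fairly fast.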
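
As a possible cross-check, scikit-learn ships a Bernoulli naive Bayes classifier that binarizes its input and applies additive smoothing. The sketch below is not the code used in this post and runs on synthetic data rather than MNIST (the random "images" and the labeling rule are assumptions for illustration); its class prior handling may also differ slightly from the smoothed prior above.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in for grayscale images (NOT MNIST): 200 samples, 16 pixels in 0..255.
X = rng.integers(0, 256, size=(200, 16))
# Hypothetical labels: class 1 when the first pixel is bright.
y = (X[:, 0] > 127).astype(int)

# binarize=127.0 thresholds the inputs inside the model; alpha=1.0 is Laplace smoothing.
clf = BernoulliNB(alpha=1.0, binarize=127.0)
clf.fit(X[:150], y[:150])

accuracy = clf.score(X[150:], y[150:])
print("accuracy on synthetic data: %.2f" % accuracy)
```

Replacing the synthetic arrays with the MNIST split from `example_large` would give a number to compare against the 0.88 reported here.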
Summary
- Mathematical steps of naive Bayes:
  - First, learn estimates of the conditional probability $P(X|Y)$ and the prior probability $P(Y)$ from the training data, using either maximum likelihood estimation or Bayesian estimation. This gives the joint probability distribution:
    $$P(X,Y) = P(Y)P(X|Y)$$
  - Then classify by applying Bayes' theorem to the joint probability model learned in step 1:
    $$P(Y|X)=\frac{P(X,Y)}{P(X)}=\frac{P(Y)P(X|Y)}{\sum\limits_{Y}P(Y)P(X|Y)}$$
  - Using the conditional independence assumption, and noting that the denominator is the same for every $Y=c_k$, assign the input $x$ to the class $y$ with the largest posterior probability. Maximizing the posterior is equivalent to minimizing the expected risk under 0-1 loss:
    $$y=\mathop{\arg\max}\limits_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
- The basic assumption of naive Bayes is conditional independence, that is:
  $$\begin{aligned} P(X=x|Y=c_k) &= P(X^{(1)} = x^{(1)},...,X^{(n)}=x^{(n)}|Y=c_k) \\ &= \prod_{j=1}^nP(X^{(j)}=x^{(j)}|Y=c_k) \end{aligned}$$
  This is a strong assumption: it ignores the relationships between variables and greatly reduces the number of conditional probabilities that have to be estimated. The algorithm is therefore simple and efficient, at the cost of some classification performance.
- If probabilistic dependencies among the input variables are assumed, the model becomes a Bayesian network.
References
- 《统计学习方法》 (Statistical Learning Methods), Li Hang