Implementing the Naive Bayes Classification Algorithm from Scratch (Discrete Features)

http://blog.csdn.net/u013719780?viewmode=contents

1. Principles of the Naive Bayes Algorithm

1.1 Bayes' Theorem

Bayes' theorem expresses the posterior probability p(y|x) in terms of the prior p(y) and the likelihood p(x|y):

<script type="math/tex; mode=display" id="MathJax-Element-1"> p(y|x) = \frac{p(y)p(x|y)}{p(x)} \qquad (1) </script>

For a sample x = (x_1, x_2, ..., x_n) and class c_k, expand the likelihood using the "naive" conditional-independence assumption (third line) and the evidence p(x) using the law of total probability:

<script type="math/tex; mode=display" id="MathJax-Element-2">\begin{align*} p(y=c_k|x=(x_1, x_2, ..., x_n)) & = \frac{p(y=c_k)p(x|y=c_k)}{p(x)} \\ & = \frac{p(y=c_k)p(x=(x_1, x_2, ..., x_n)|y=c_k)}{p(x)} \\ & = \frac{p(y=c_k)p(x=x_1|y=c_k)p(x=x_2|y=c_k)...p(x=x_n|y=c_k)}{p(x)} \\ & = \frac{p(y=c_k) \prod_{i=1}^{n} p(x=x_i|y=c_k)}{p(x)} \\ & = \frac{p(y=c_k) \prod_{i=1}^{n} p(x=x_i|y=c_k)} { \sum_{k=1}^{n_{labels}} p(x|y=c_k)p(y=c_k)} \\ & = \frac{p(y=c_k) \prod_{i=1}^{n} p(x=x_i|y=c_k)} { \sum_{k=1}^{n_{labels}} p(y=c_k)\prod_{i=1}^{n} p(x=x_i|y=c_k)} \qquad (2) \end{align*}</script>

1.2 Parameter Learning for Naive Bayes

Given a training set of m samples, the prior of each class is estimated by frequency counting, where I(·) is the indicator function:

<script type="math/tex; mode=display" id="MathJax-Element-49">p(y=c_k)=\frac{\sum_{i=1}^m I(y^{(i)}=c_k)}{m} \qquad (3)</script>

Likewise, the probability that feature j takes its l-th value a_{jl} within class c_k is:

<script type="math/tex; mode=display" id="MathJax-Element-57"> p(x_j=a_{jl}|y=c_k)=\frac{\sum_{i=1}^m I(x_{ji}=a_{jl},\, y^{(i)}=c_k)}{ \sum_{i=1}^m I(y^{(i)}=c_k)} \qquad (4)</script>
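Equations (3) and (4) are nothing more than frequency counts. As a minimal sketch (the arrays below are toy values for illustration, not the article's dataset), they can be written directly with NumPy:

```python
import numpy as np

# Toy data: one discrete feature, binary label (illustrative values only).
x = np.array(['a', 'b', 'a', 'a', 'b'])
y = np.array([1, 0, 1, 1, 0])

# Equation (3): class prior p(y = 1) as an empirical frequency.
prior_1 = np.mean(y == 1)                                  # 3/5

# Equation (4): conditional p(x = 'a' | y = 1) as a frequency
# within the subset of samples labelled 1.
cond_a_1 = np.sum((x == 'a') & (y == 1)) / np.sum(y == 1)  # 3/3
```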

1.3 Classification with Naive Bayes

1. Compute the probability that the instance belongs to class <script type="math/tex" id="MathJax-Element-60">y=c_k</script> (the evidence p(x) is the same for every class, so it can be dropped):

<script type="math/tex; mode=display" id="MathJax-Element-61">p(y=c_k|x) \propto p(y=c_k)\prod_{i=1}^{n} p(x=x_i|y=c_k) \qquad (5) </script>

2. Predict the class with the largest posterior:

<script type="math/tex; mode=display" id="MathJax-Element-63">y=\arg \, \underset{c_k}{ \max } \, p(y=c_k|x) \qquad (6) </script>
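In practice the product in equation (5) is evaluated in log space to avoid numerical underflow. A sketch of the scoring step of equations (5)–(6), with made-up probabilities for two classes and two features (the numbers are illustrative, not learned from the article's data):

```python
import numpy as np

# Hypothetical learned quantities for classes [c_0, c_1] (illustrative only).
log_prior = np.log(np.array([0.6, 0.4]))       # log p(y = c_k)
log_cond = np.log(np.array([[0.9, 0.5],        # log p(x_i | y = c_0)
                            [0.2, 0.5]]))      # log p(x_i | y = c_1)

# Equation (5) in log form: log-prior plus the sum of log-conditionals ...
scores = log_prior + log_cond.sum(axis=1)
# ... and equation (6): pick the class with the highest score.
best = int(np.argmax(scores))
```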

2. Applying the Naive Bayes Algorithm

Consider a toy ad-conversion dataset: given a user's gender and city, predict whether the user converts (Converted = 1).

id    Gender    City    Converted
1     M         北京    1
2     F         上海    0
3     M         广州    1
4     M         北京    1
5     F         上海    0


Suppose we must classify a new user (M, 上海). By Bayes' theorem,

P(A|B) = P(B|A) P(A) / P(B)

so, applying the naive independence assumption:

P(converted | M, 上海)
= P(M, 上海 | converted) × P(converted) / P(M, 上海)
= (P(M | converted) × P(上海 | converted) × P(converted)) / (P(M) × P(上海))
= (3/3 × 0/3 × 3/5) / (P(M) × P(上海))
= 0

P(not converted | M, 上海)
= (P(M | not converted) × P(上海 | not converted) × P(not converted)) / (P(M) × P(上海))
= (0/2 × 2/2 × 2/5) / (P(M) × P(上海))
= 0

Both posteriors vanish: in this tiny sample, 上海 never co-occurs with a conversion, and M never co-occurs with a non-conversion, so the raw frequency estimates cannot rank the two classes at all. This zero-frequency problem is what Section 3.1 addresses.
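Computing the unsmoothed scores directly from the five-row table above (the shared P(M) × P(上海) denominator is omitted, since it does not affect the comparison between classes):

```python
import numpy as np

# The five-row table from the article.
gender = np.array(['M', 'F', 'M', 'M', 'F'])
city = np.array(['北京', '上海', '广州', '北京', '上海'])
conv = np.array([1, 0, 1, 1, 0])

def cond(feature, value, label):
    # Raw frequency estimate of p(feature = value | conv = label), no smoothing.
    mask = conv == label
    return np.sum((feature == value) & mask) / np.sum(mask)

score_conv = np.mean(conv == 1) * cond(gender, 'M', 1) * cond(city, '上海', 1)
score_not = np.mean(conv == 0) * cond(gender, 'M', 0) * cond(city, '上海', 0)
# Both scores collapse to 0, so the classes cannot be ranked.
```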


3. Caveats for the Naive Bayes Algorithm

3.1 Laplace Smoothing

To keep zero-frequency feature values from annihilating the whole product, add a smoothing term λ > 0 to every count (λ = 1 gives Laplace smoothing; n_labels is the number of classes):

<script type="math/tex; mode=display" id="MathJax-Element-31">p(y=c_k)=\frac{\sum_{i=1}^m I(y^{(i)}=c_k) + \lambda}{m + \lambda n_{labels}} \qquad (7)</script>

where L_j is the number of distinct values of feature j:

<script type="math/tex; mode=display" id="MathJax-Element-34"> p(x_j=a_{jl}|y=c_k)=\frac{\sum_{i=1}^m I(x_{ji}=a_{jl},\, y^{(i)}=c_k) + \lambda}{ \sum_{i=1}^m I(y^{(i)}=c_k) + \lambda L_j} \qquad (8)</script>

The smoothed estimates still form a valid probability distribution:
<script type="math/tex; mode=display" id="MathJax-Element-41"> p(x_j=a_{jl}|y=c_k)>0, \,\,\,\,\,\, \sum_{l=1}^{L_j} p(x_j=a_{jl}|y=c_k)=1 </script>

Redoing the computation with λ = 1 (the label takes 2 values, gender takes 2, city takes 3):

P(converted | M, 上海)
∝ P(M | converted) × P(上海 | converted) × P(converted)
= (3+1)/(3+2) × (0+1)/(3+3) × (3+1)/(5+2)
= 4/5 × 1/6 × 4/7 ≈ 0.076

P(not converted | M, 上海)
∝ P(M | not converted) × P(上海 | not converted) × P(not converted)
= (0+1)/(2+2) × (2+1)/(2+3) × (2+1)/(5+2)
= 1/4 × 3/5 × 3/7 ≈ 0.064

Both scores are now positive (the shared evidence term is omitted since it does not affect the ranking), and the sample is classified as converted.
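The smoothed arithmetic can be checked with exact fractions (λ = 1, counts taken from the five-row table):

```python
from fractions import Fraction as F

# Converted class: 3 of 5 samples; all 3 are M, none is from 上海.
score_conv = F(3 + 1, 3 + 2) * F(0 + 1, 3 + 3) * F(3 + 1, 5 + 2)
# Not-converted class: 2 of 5 samples; none is M, both are from 上海.
score_not = F(0 + 1, 2 + 2) * F(2 + 1, 2 + 3) * F(2 + 1, 5 + 2)
# score_conv = 8/105 ≈ 0.076 beats score_not = 9/140 ≈ 0.064.
```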


4. Strengths and Weaknesses of the Naive Bayes Algorithm

• Naive Bayes is grounded in classical probability theory and delivers stable classification performance.

• It works well on small datasets, handles multi-class problems, and supports incremental training: when the data exceeds memory, the model can be updated batch by batch.

• It is relatively insensitive to missing data, simple to implement, and a common choice for text classification.

• In theory naive Bayes achieves the minimum error rate among classifiers, but in practice it often does not, because it assumes the features are conditionally independent, an assumption that rarely holds. With many features, or strongly correlated ones, classification quality degrades; with weakly correlated features naive Bayes performs at its best. Semi-naive Bayes variants improve on this by modeling part of the dependence structure.

• A prior is required, and the prior often rests on modeling assumptions; since many prior models are possible, a badly chosen one can hurt prediction accuracy.

• Because the posterior is derived from the prior and the data, the resulting classification decisions carry an inherent error rate.

• It is sensitive to how the input data are represented.

• For continuous features, Gaussian naive Bayes assumes each feature is normally distributed within each class.

The complete implementation for discrete features, including the data-preprocessing helpers:

from __future__ import division, print_function
import numpy as np


# Shuffle X and y with the same permutation
def shuffle_data(X, y, seed=None):
    if seed is not None:
        np.random.seed(seed)
    idx = np.arange(X.shape[0])
    np.random.shuffle(idx)
    return X[idx], y[idx]


# Normalize dataset X (unit L_p norm along the given axis)
def normalize(X, axis=-1, p=2):
    lp_norm = np.atleast_1d(np.linalg.norm(X, p, axis))
    lp_norm[lp_norm == 0] = 1
    return X / np.expand_dims(lp_norm, axis)


# Standardize dataset X (zero mean, unit variance per column)
def standardize(X):
    X_std = np.zeros(X.shape)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # When dividing, always guard against a zero denominator; the one-liner
    # X_std = (X - X.mean(axis=0)) / X.std(axis=0) would fail on constant columns.
    for col in range(np.shape(X)[1]):
        if std[col]:
            X_std[:, col] = (X[:, col] - mean[col]) / std[col]
    return X_std


# Split the dataset into a training set and a test set
def train_test_split(X, y, test_size=0.2, shuffle=True, seed=None):
    if shuffle:
        X, y = shuffle_data(X, y, seed)
    n_train_samples = int(X.shape[0] * (1 - test_size))
    x_train, x_test = X[:n_train_samples], X[n_train_samples:]
    y_train, y_test = y[:n_train_samples], y[n_train_samples:]
    return x_train, x_test, y_train, y_test


def accuracy(y, y_pred):
    y = y.reshape(y.shape[0], -1)
    y_pred = y_pred.reshape(y_pred.shape[0], -1)
    return np.sum(y == y_pred) / len(y)


class NaiveBayes():
    """Naive Bayes classifier for discrete features."""
    def __init__(self):
        self.classes = None
        self.X = None
        self.y = None
        # Per class, the occurrence probability of every value of every feature.
        # These probabilities are needed at prediction time; training consists
        # of nothing more than computing them.
        self.parameters = []

    def fit(self, X, y):
        self.X = X
        self.y = y
        self.classes = np.unique(y)
        # For each class, estimate the probability of every feature value.
        for i in range(len(self.classes)):
            c = self.classes[i]
            # Samples belonging to class c
            x_where_c = X[np.where(y == c)]
            self.parameters.append([])
            for j in range(x_where_c.shape[1]):
                # Iterate over the feature values of the *whole* training set,
                # not just of x_where_c: values that never occur inside class c
                # would otherwise get no probability at all. Their count plays
                # the role of L_j in equation (8).
                feature_values_j = np.unique(X[:, j])
                parameters = {}
                for feature_value in feature_values_j:
                    n_feature_value = np.sum(x_where_c[:, j] == feature_value)
                    # Laplace smoothing (lambda = 1); storing log-probabilities
                    # turns the later product over features into a sum.
                    parameters[feature_value] = np.log(
                        (n_feature_value + 1) /
                        (x_where_c.shape[0] + len(feature_values_j)))
                self.parameters[i].append(parameters)

    # Laplace-smoothed prior probability of class c
    def calculate_priori_probability(self, c):
        x_where_c = self.X[np.where(self.y == c)]
        n_samples_for_c = x_where_c.shape[0]
        n_samples = self.X.shape[0]
        return (n_samples_for_c + 1) / (n_samples + len(self.classes))

    def classify(self, sample):
        posteriors = []
        # Score every class:
        # posterior ∝ P(Y) * P(x1|Y) * P(x2|Y) * ... * P(xN|Y),
        # accumulated as a sum of logs.
        for i in range(len(self.classes)):
            c = self.classes[i]
            prior = self.calculate_priori_probability(c)
            posterior = np.log(prior)
            for j, params in enumerate(self.parameters[i]):
                # j-th feature of the sample being predicted
                sample_feature = sample[j]
                # Probability of this value under class i, feature j; for a
                # value never seen during training, fall back to 1 / n_samples.
                proba = params.get(sample_feature, np.log(1 / self.X.shape[0]))
                # Conditional independence: P(x1,x2|Y) = P(x1|Y) * P(x2|Y)
                posterior += proba
            posteriors.append(posterior)
        # Return the class with the largest posterior score
        return self.classes[np.argmax(posteriors)]

    # Predict class labels for a dataset
    def predict(self, X):
        return np.array([self.classify(sample) for sample in X])


def main():
    X = np.array([['M', '北京'], ['F', '上海'], ['M', '广州'], ['M', '北京'],
                  ['F', '上海'], ['M', '北京'], ['F', '上海'], ['M', '广州'],
                  ['M', '北京'], ['F', '上海']])
    y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6)

    clf = NaiveBayes()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("Accuracy:", accuracy(y_test, y_pred))


if __name__ == "__main__":
    main()

