Preface
Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence among features. Given a training dataset, it first learns the joint probability distribution of the input and output under the conditional-independence assumption; then, for a given input x, it uses Bayes' theorem to output the class with the largest posterior probability.
1. What is naive Bayes?
Given a training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, suppose there are $K$ possible classes $c_1,c_2,\dots,c_K$; each input has $m$ feature dimensions, $x_i=(x_i^1,x_i^2,\dots,x_i^m)$; and the $j$-th feature can take $S_j$ distinct values $a_j^1,a_j^2,\dots,a_j^{S_j}$.
Naive Bayes learns the joint distribution $P(X,Y)$ from the training data; concretely, it learns the following prior and conditional probabilities.
The prior probability:
$$P(Y=c_k),\quad k=1,2,\dots,K$$
The conditional probability:
$$P(X=x \mid Y=c_k)=P(X^1=x^1, X^2=x^2, \dots, X^m=x^m \mid Y=c_k),\quad k=1,2,\dots,K$$
The joint probability is then obtained from:
$$P(X=x, Y=c_k)=P(Y=c_k)\,P(X=x \mid Y=c_k)$$
To reduce model complexity, naive Bayes makes the conditional-independence assumption:
$$P(X=x \mid Y=c_k)=P(X^1=x^1, X^2=x^2, \dots, X^m=x^m \mid Y=c_k)=\prod_{j=1}^{m} P(X^j=x^j \mid Y=c_k)$$
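To see what the assumption buys: without it, the full conditional table $P(X=x \mid Y=c_k)$ has on the order of $S^m$ entries per class, while the factored form needs only about $m \cdot S$. A quick count with made-up sizes for $S$ and $m$:

```python
S, m = 10, 20          # 10 values per feature, 20 features (made-up sizes)
full = S ** m          # joint-table entries per class without independence
factored = m * S       # entries per class under the naive assumption
print(full, factored)  # 100000000000000000000 vs 200
```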
This assumption is what gives the method its name. Since naive Bayes learns the joint probability distribution, it is a generative model.
For the posterior probability $P(Y=c_k \mid X=x)$, Bayes' formula gives:
$$P(Y=c_k \mid X=x)=\frac{P(X=x, Y=c_k)}{P(X=x)}=\frac{P(Y=c_k)\,P(X=x \mid Y=c_k)}{P(X=x)}=\frac{P(Y=c_k)\prod_{j=1}^{m} P(X^j=x^j \mid Y=c_k)}{P(X=x)}$$
The class with the largest posterior probability is taken as the prediction. Since the denominator is the same for every class on a given sample, the predicted class can be written as:
$$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{m} P(X^j=x^j \mid Y=c_k)$$
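The decision rule above can be sketched on a tiny discrete dataset (the feature values and labels below are made up for illustration; no smoothing is applied yet):

```python
import numpy as np

# Toy training data: two binary features, labels in {0, 1} (made-up values)
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

def predict(x):
    """Return argmax_c P(Y=c) * prod_j P(X^j = x^j | Y=c), no smoothing."""
    classes = np.unique(y)
    scores = []
    for c in classes:
        Xc = X[y == c]
        prior = len(Xc) / len(X)                     # P(Y=c)
        likelihood = 1.0
        for j in range(X.shape[1]):                  # product over features
            likelihood *= np.mean(Xc[:, j] == x[j])  # P(X^j = x^j | Y=c)
        scores.append(prior * likelihood)
    return classes[np.argmax(scores)]

print(predict(np.array([1, 0])))  # -> 1
```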
2. Parameter estimation for naive Bayes
- Maximum likelihood estimates of the prior and conditional probabilities
$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
$$P(X^{j}=a_j^l \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^j=a_j^l,\, y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)},\quad l=1,2,\dots,S_j$$
- Adding a smoothing term (to avoid the zero probability estimates that maximum likelihood can produce)
$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+\lambda K},\quad k=1,2,\dots,K$$
$$P(X^{j}=a_j^l \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^j=a_j^l,\, y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+\lambda S_j},\quad l=1,2,\dots,S_j$$
Here $\lambda$ is the smoothing factor, usually set to 1, in which case this is called Laplace smoothing.
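A minimal sketch of the smoothed estimates for discrete features stored in a NumPy array (the toy data and the helper name `smoothed_estimates` are made up for illustration):

```python
import numpy as np

def smoothed_estimates(X, y, lam=1.0):
    """Smoothed prior P(Y=c_k) and conditional P(X^j=a | Y=c_k)
    for discrete features; lam=1 gives Laplace smoothing."""
    N, m = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = {c: (np.sum(y == c) + lam) / (N + lam * K) for c in classes}
    conds = {}
    for c in classes:
        Xc = X[y == c]
        for j in range(m):
            values = np.unique(X[:, j])   # the S_j possible values of feature j
            S_j = len(values)
            for a in values:
                conds[(j, a, c)] = (np.sum(Xc[:, j] == a) + lam) \
                                   / (len(Xc) + lam * S_j)
    return priors, conds

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
priors, conds = smoothed_estimates(X, y)
print(priors[1])         # (2 + 1) / (4 + 1*2) = 0.5
print(conds[(0, 1, 1)])  # (2 + 1) / (2 + 1*2) = 0.75
```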
3. Multinomial and Bernoulli naive Bayes (text classification)
- Multinomial naive Bayes
In the multinomial model, a document is represented as $t_i=(x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(k)})$, where each $x_i^{(k)}$ is a word occurring in the document, repetitions allowed. Then:
Prior: $P(Y=c_k)$ = (total number of words in class $c_k$) / (total number of words in the whole training sample)
Conditional: $P(X^{(i)}=x^{(i)} \mid Y=c_k)$ = (number of occurrences of word $x^{(i)}$ across all documents of class $c_k$ + 1) / (total number of words in class $c_k$ + $|M|$)
where $M$ is the vocabulary of the training sample (each distinct word counted once, no matter how often it occurs).
- Bernoulli naive Bayes
Prior: $P(Y=c_k)$ = (number of instances labeled $c_k$) / (total number of training instances)
Conditional: $P(X^{(i)}=x^{(i)} \mid Y=c_k)$ = (number of class-$c_k$ instances containing word $x^{(i)}$ + 1) / (total number of class-$c_k$ instances + 2)
Summary: the two models count at different granularities, the multinomial model at the word level and the Bernoulli model at the document level, so their priors and class-conditional probabilities are computed differently.
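For comparison, scikit-learn ships both models as `MultinomialNB` and `BernoulliNB`; a quick sketch on a made-up four-document corpus (not the dataset used below):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

texts = ["good great film", "great acting good plot",
         "bad boring film", "boring bad plot"]
labels = [1, 1, 0, 0]

# Multinomial NB: word-count features (word granularity)
counts = CountVectorizer().fit(texts)
mnb = MultinomialNB().fit(counts.transform(texts), labels)

# Bernoulli NB: binary presence features (document granularity)
binary = CountVectorizer(binary=True).fit(texts)
bnb = BernoulliNB().fit(binary.transform(texts), labels)

test = ["good plot", "boring film"]
print(mnb.predict(counts.transform(test)))   # [1 0]
print(bnb.predict(binary.transform(test)))   # [1 0]
```

With the default Laplace smoothing (`alpha=1`), the two models agree on this tiny corpus; on real data they can differ, since the Bernoulli model also penalizes the absence of words.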
4. Code implementation
The code is as follows (example):
"""
朴素贝叶斯的两个常用模型:
伯努利
多项式
"""
import numpy as np
from data_util import load_text_cla_corpus  # loads the text-classification corpus
from sklearn.feature_extraction.text import CountVectorizer  # word-count features
from sklearn.model_selection import train_test_split  # train/test split
class NaiveBayes(object):
def __init__(self, _type, _lambda=1):
assert _type in ['poly', 'bernoulli']
self._type = _type
self._lambda = _lambda
def train(self, train_xs, train_ys):
        '''
        Train: estimate the (log) prior probabilities and the likelihoods.
        '''
m, n = train_xs.shape
unique_ys, count_y = np.unique(train_ys, return_counts=True)
self.unique_ys = unique_ys
if self._type == 'bernoulli':
train_xs = (train_xs > 0).astype(int)
denominator = m
            self.prior_probs = np.log2(count_y / denominator)  # log prior probabilities
            self.likelihood_probs = np.zeros((n, len(unique_ys)))  # conditional distribution
for i, y in enumerate(self.unique_ys):
sub_xs = train_xs[train_ys == y]
self.likelihood_probs[:, i] = (np.sum(sub_xs, axis=0) + self._lambda) \
/ (count_y[i] + 2*self._lambda)
self.negative_likehood_probs = np.log2(1 - self.likelihood_probs)
self.likelihood_probs = np.log2(self.likelihood_probs)
else:
            self.likelihood_probs = np.zeros((n, len(unique_ys)))  # conditional distribution
self.prior_probs = np.zeros((1, len(unique_ys))).flatten()
for i, y in enumerate(self.unique_ys):
sub_xs = train_xs[train_ys == y]
self.prior_probs[i] = np.log2(np.sum(sub_xs) / np.sum(train_xs))
self.likelihood_probs[:, i] = (np.sum(sub_xs, axis=0) + self._lambda) \
/ (np.sum(sub_xs) + n * self._lambda)
self.likelihood_probs = np.log2(self.likelihood_probs)
def predict(self, x):
        '''
        Predict the class of a single sample.
        '''
x = x.reshape((1, -1))
if self._type == 'bernoulli':
log_probs = np.dot(x, self.likelihood_probs) + np.dot(1-x, self.negative_likehood_probs)
else:
log_probs = np.dot(x, self.likelihood_probs)
log_probs = log_probs.reshape((-1,)) + self.prior_probs
return self.unique_ys[np.argmax(log_probs)]
def test(self, test_xs, test_ys):
        '''
        Evaluate accuracy on the test set.
        '''
predict_ys = []
for i in range(len(test_xs)):
predict_y = self.predict(test_xs[i])
predict_ys.append(predict_y)
predict_ys = np.array(predict_ys)
accuracy = (test_ys == predict_ys).mean()
print('Accuracy:%.4f' % accuracy)
if __name__ == "__main__":
    _type = 'poly'  # 'poly' = multinomial NB, 'bernoulli' = Bernoulli NB
texts, labels = load_text_cla_corpus('Data/TextClassification/datasets.tsv')
train_texts, test_texts, train_ys, test_ys = train_test_split(texts, labels,train_size=0.8, random_state=2021)
    # Convert the texts to word-count features; use binary presence
    # indicators only for the Bernoulli model
    vectorizer = CountVectorizer(binary=(_type == 'bernoulli'))
train_xs = vectorizer.fit_transform(train_texts).toarray()
test_xs = vectorizer.transform(test_texts).toarray()
naive_bayes = NaiveBayes(_type)
naive_bayes.train(train_xs, train_ys)
naive_bayes.test(test_xs, test_ys)