Machine Learning Algorithms in Python: Naive Bayes

Article link: https://blog.csdn.net/weixin_38215395/article/details/78702985

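1. Basic principle of the algorithm

My understanding is as follows: given training data with known classes, Bayes' theorem (that is, the conditional probability formula) is used to build a probabilistic model of the problem. The model takes feature data as input and outputs the corresponding class. Feeding the features of an unclassified sample into the model yields the probability of each class, and the class with the highest probability is assigned to the sample.
The word "naive" indicates that the method rests on a strong assumption: the features used for classification, that is, the individual dimensions of the feature vector, are mutually independent once the class is fixed.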

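2. Derivation of the Bayes algorithm

The derivation below follows Li Hang's book Statistical Learning Methods (《统计学习方法》).
The key question is how to build a good probabilistic model of the data. The data available to us are:

The training set $T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$, where $x_i$ is the $i$-th feature vector (assumed to be n-dimensional), $y_i$ is the class of the $i$-th feature vector, and $i=1,2,...,N$;
The set of all classes in the dataset $\beta=\{c_1,c_2,...,c_K\}$, so that $y_i\in\beta$;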

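Recall that the purpose of the model is to obtain the probability of each class given the feature data. Expressed as a conditional probability, we want $P(Y=c_k|X=x)$.
By the conditional probability formula:
$P(Y=c_k|X=x)=\frac{P(X=x,Y=c_k)}{P(X=x)}$
In this formula:

$P(X=x)$ is the probability of observing the feature vector $x$. It is the same for every class $c_k$, so it does not change which class has the largest posterior probability and can be ignored when comparing classes.
$P(Y=c_k)$ is the probability that a sample belongs to class $c_k$. When the dataset is large enough, it can be estimated by the relative frequency of the class, so:
$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}$
For $P(X=x,Y=c_k)$, since the feature vector is n-dimensional,
$P(X=x,Y=c_k)=P(Y=c_k)P(X^{(1)}=x^{(1)},...,X^{(n)}=x^{(n)}|Y=c_k)$

By the independence assumption, this can be written as a product:
$P(X=x,Y=c_k)=P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$

This determines $P(Y=c_k|X=x)$ up to the class-independent factor $P(X=x)$. The class $c_k$ that maximizes it is the output of the Bayes classifier:
$y=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$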
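
The decision rule derived in this section can be tried on a tiny categorical dataset. The following is a minimal sketch and not the code used later in this post: the 15-sample toy dataset, the function names `train` and `predict`, and the use of `Fraction` for exact arithmetic are all assumptions made for illustration. It estimates $P(Y=c_k)$ and each $P(X^{(j)}=x^{(j)}|Y=c_k)$ by counting, then returns the class with the largest product.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical toy data: two categorical features per sample, labels in {1, -1}.
X = [(1, "S"), (1, "M"), (1, "M"), (1, "S"), (1, "S"),
     (2, "S"), (2, "M"), (2, "M"), (2, "L"), (2, "L"),
     (3, "L"), (3, "M"), (3, "M"), (3, "L"), (3, "L")]
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]


def train(X, y):
    """Estimate P(Y=c) and the counts behind P(X^(j)=v | Y=c) by frequency."""
    class_count = Counter(y)
    prior = {c: Fraction(n, len(y)) for c, n in class_count.items()}
    # cond_count[(j, c)][v] = number of class-c samples whose j-th feature is v
    cond_count = defaultdict(Counter)
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond_count[(j, yi)][v] += 1
    return prior, cond_count, class_count


def predict(x, prior, cond_count, class_count):
    """Return the arg-max class and the unnormalized score of every class."""
    scores = {}
    for c in prior:
        score = prior[c]
        for j, v in enumerate(x):
            score *= Fraction(cond_count[(j, c)][v], class_count[c])
        scores[c] = score
    return max(scores, key=scores.get), scores


prior, cond_count, class_count = train(X, y)
label, scores = predict((2, "S"), prior, cond_count, class_count)
print(label, scores)
```

For the query `(2, "S")` the score of class 1 is 1/45 and the score of class -1 is 1/15, so the sketch returns -1. No smoothing is applied here, so a feature value never seen with a class gives that class a score of zero.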

3. Applying the Bayes algorithm

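When applying the Bayes classification model, a class may not appear in the training data at all, which gives $P(Y=c_k)=0$ and distorts the computation of $P(Y=c_k|X=x)$. To avoid this, a positive constant $\lambda$ is introduced when estimating $P(Y=c_k)$ (the choice $\lambda=1$ is known as Laplace smoothing). With $K$ denoting the number of classes: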

$P_\lambda(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$
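
The smoothed estimate is easy to check numerically. The sketch below is only an illustration: the function name `smoothed_prior` and the label list are hypothetical. It compares the plain frequency estimate ($\lambda=0$) with Laplace smoothing ($\lambda=1$) when one class never occurs in the training labels.

```python
from collections import Counter


def smoothed_prior(labels, classes, lam=1.0):
    """P_lambda(Y=c) = (count(c) + lam) / (N + K * lam) for every c in classes."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    return {c: (counts[c] + lam) / (n + k * lam) for c in classes}


# Hypothetical labels: class 2 never occurs in the training data.
labels = [0, 0, 1, 0, 1, 1, 0]
classes = [0, 1, 2]

print(smoothed_prior(labels, classes, lam=0.0))  # plain frequency: class 2 gets 0
print(smoothed_prior(labels, classes, lam=1.0))  # Laplace smoothing: no zeros
```

With $\lambda=1$ the unseen class receives probability 1/10 instead of 0, and the three estimates still sum to 1.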
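In practice, the logarithm of the classification model is usually taken, which turns the product into a sum. This simplifies the computation and avoids floating-point underflow when many small probabilities are multiplied together. Zero probabilities themselves must still be removed by the smoothing described in this section, since $\log 0$ is undefined.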
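
The effect of the log transform can be seen with a small numerical experiment. This is a sketch under assumed values: the two lists of per-word likelihoods are made up, and `math.prod` requires Python 3.8 or later.

```python
import math

# Hypothetical per-word likelihoods for one long document under two classes.
probs_a = [0.01] * 1000
probs_b = [0.02] * 1000

# Multiplying directly underflows: both products collapse to 0.0.
prod_a = math.prod(probs_a)
prod_b = math.prod(probs_b)

# Summing logarithms keeps the two classes distinguishable.
log_a = sum(math.log(p) for p in probs_a)
log_b = sum(math.log(p) for p in probs_b)

print(prod_a, prod_b)
print(log_a, log_b)
print("predicted class:", "a" if log_a > log_b else "b")
```

Both raw products become exactly 0.0 in double precision, so they cannot be compared, while the log sums remain finite and still rank class b above class a.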

3.1 Applying the Bayes algorithm to movie review classification

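The data are movie reviews. The preprocessing is described in detail in the post "Summary of reading .txt files and processing data in Python". The Bayes code is given directly below. Because the dataset is fairly large, only 1/100 of the data is used here.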

All code and data can be downloaded from my Gitee (码云) repository.

The following code is stored in Bayes.py:

#__author__=='qustl_000'
#-*- coding: utf-8 -*-

import os
import re
from numpy import *
import random

'''Load the data, strip unwanted symbols, and return the datasets as lists'''
def loadData(pathDirPos,pathDirNeg):
    # characters replaced by a space: ASCII and full-width punctuation, and digits
    pattern = r"[\n\.\!\/_\-$%^*(+\"\')]+|[+—()?【】“”！:,;.？、~@#￥%…&*（）0123456789]+"
    posAllData = []  # positive reviews
    negAllData = []  # negative reviews
    # positive reviews
    for allDir in pathDirPos:
        lineDataPos = []
        filename = os.path.join("review_polarity/txt_sentoken/pos", allDir)
        with open(filename) as childFile:
            for lines in childFile:
                lineString = re.sub(pattern, " ", lines)
                line = lineString.split(' ')          # split on spaces (the result still contains empty strings)
                for strc in line:
                    if strc != "" and len(strc) > 1:  # drop empty strings and keep only words longer than one character
                        lineDataPos.append(strc)
        posAllData.append(lineDataPos)                # append once per file, not once per line
    # negative reviews
    for allDir in pathDirNeg:
        lineDataNeg = []
        filename = os.path.join("review_polarity/txt_sentoken/neg", allDir)
        with open(filename) as childFile:
            for lines in childFile:
                lineString = re.sub(pattern, " ", lines)
                line = lineString.split(' ')
                for strc in line:
                    if strc != "" and len(strc) > 1:
                        lineDataNeg.append(strc)
        negAllData.append(lineDataNeg)                # append once per file, not once per line
    return posAllData,negAllData

'''Split the dataset into training data and test data; splitPara is the fraction used for training'''
def splitDataSet(pathDirPos,pathDirNeg,splitPara):
    trainingData=[]
    testData=[]
    traingLabel=[]
    testLabel=[]
    posData,negData=loadData(pathDirPos,pathDirNeg)
    pos_len=len(posData)//100    # use only 1/100 of the positive reviews
    neg_len=len(negData)//100    # use only 1/100 of the negative reviews
    # positive reviews (label 1)
    for i in range(pos_len):
        if(random.random()<splitPara):
            trainingData.append(posData[i])
            traingLabel.append(1)
        else:
            testData.append(posData[i])
            testLabel.append(1)
    # negative reviews (label 0)
    for j in range(neg_len):
        if(random.random()<splitPara):
            trainingData.append(negData[j])
            traingLabel.append(0)
        else:
            testData.append(negData[j])
            testLabel.append(0)
    return trainingData,traingLabel,testData,testLabel

'''Collect every distinct word that occurs in the texts'''
def getVocab(dataSet):
    dataVec=[]
    lenData=len(dataSet)
    for i in range(lenData):
        dataVec.extend(dataSet[i])
    vocab=set(dataVec)
    return vocab

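'''Convert a tokenized text into a numeric word-count vector'''
def word2Vec(Vocablist,wordData):
    Vocablist=list(Vocablist)
    wordIndex={word:i for i,word in enumerate(Vocablist)}   # word -> position, avoids a linear search per word
    mathVec=zeros(len(Vocablist))
    for word in wordData:
        if word in wordIndex:
            mathVec[wordIndex[word]]+=1
    return mathVec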

'''Training function for the Bayes classifier'''
def TrainBayes(trainingData,trainingLabel,Vocablist):
    numfile=len(trainingData)
    P1=sum(trainingLabel)/float(numfile)       # prior probability of class 1
    numWords=len(Vocablist)
    # start counts at 1 and denominators at 2.0 so that no word gets probability 0
    # (strict Laplace smoothing with lambda=1 would add numWords to each denominator instead)
    p1num=ones(numWords);p0num=ones(numWords)
    p1Denom=2.0;p0Denom=2.0
    for i in range(numfile):
        wordVec=word2Vec(Vocablist,trainingData[i])   # build the count vector once per document
        if(trainingLabel[i]==1):
            p1num+=wordVec
            p1Denom+=sum(wordVec)
        else:
            p0num+=wordVec
            p0Denom+=sum(wordVec)
    p1A=log(p1num/float(p1Denom))              # log P(word | class 1)
    p0A=log(p0num/float(p0Denom))              # log P(word | class 0)
    return p0A,p1A,P1

'''Bayes classification function'''
def classifyBayes(testVec,p0A,p1A,P1):
    class1=sum(testVec*p1A)+log(P1)
    class0=sum(testVec*p0A)+log(1.0-P1)    # the prior of class 0 is 1-P1, not P1
    if class1>class0:
        return 1
    else:
        return 0
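
The training and classification steps can also be exercised without the review files. The following is a minimal, self-contained sketch of the same idea using only the standard library; it is not a drop-in replacement for Bayes.py. The names `train_nb` and `classify_nb` and the four toy reviews are hypothetical, and the sketch adds the vocabulary size to each denominator (Laplace smoothing with $\lambda=1$), which is a different choice from the constants used in `TrainBayes`.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Return vocabulary, log priors and per-class log word probabilities."""
    vocab = sorted({w for doc in docs for w in doc})
    classes = sorted(set(labels))
    log_prior, log_like = {}, {}
    for c in classes:
        class_docs = [d for d, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + len(vocab)  # Laplace smoothing, lambda = 1
        log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return vocab, log_prior, log_like


def classify_nb(doc, vocab, log_prior, log_like):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    known = set(vocab)  # words outside the training vocabulary are skipped
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in known)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical tokenized reviews: 1 = positive, 0 = negative.
docs = [["great", "fun", "movie"], ["great", "acting"],
        ["boring", "plot"], ["boring", "bad", "movie"]]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(classify_nb(["great", "fun", "story"], *model))
print(classify_nb(["bad", "boring"], *model))
```

Each class uses its own log prior, and words that never appeared in training (such as "story") are ignored, mirroring how `word2Vec` skips words outside the vocabulary.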

The following code is stored in main.py:

#__author__=='qustl_000'
#-*- coding: utf-8 -*-

import Bayes
import os
import re
from numpy import *

pathDirPos=os.listdir("review_polarity/txt_sentoken/pos")
pathDirNeg=os.listdir("review_polarity/txt_sentoken/neg")
trainingData,traingLabel,testData,testLabel=Bayes.splitDataSet(pathDirPos,pathDirNeg,0.67)
#print(trainingData,traingLabel)
vocab=Bayes.getVocab(trainingData)
p0A,p1A,p1=Bayes.TrainBayes(trainingData,traingLabel,vocab)
len_testData=len(testData)
'''Classify the test set and compute the error rate'''
error_count=0
for i in range(len_testData):
    mathVec=Bayes.word2Vec(vocab,testData[i])
    result=Bayes.classifyBayes(mathVec,p0A,p1A,p1)
    if(result!=testLabel[i]):
        error_count+=1
error_rate=error_count/float(len_testData)
print(error_rate)

4. Implementing the Bayes algorithm with sklearn

This section uses the "20 Newsgroups" dataset.
'''Implement a Bayes classifier with the sklearn library'''

def GaussianBayes():
    # note: despite the function name, the classifier used below is MultinomialNB
    from sklearn import datasets
    # path to the training set
    twenty_train=datasets.load_files("F:/Self_Learning/机器学习/python/20news-bydate-train")
    # path to the test set
    twenty_test=datasets.load_files("F:/Self_Learning/机器学习/python/20news-bydate-test")

    '''Compute word counts'''
    from sklearn.feature_extraction.text import CountVectorizer
    count_vect=CountVectorizer(stop_words="english",decode_error='ignore')
    X_train_counts=count_vect.fit_transform(twenty_train.data)

    '''Extract features with TF-IDF'''
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf_transformer=TfidfTransformer()
    X_train_tfidf=tfidf_transformer.fit_transform(X_train_counts)

    '''Train the classifier'''
    from sklearn.naive_bayes import MultinomialNB
    clf=MultinomialNB().fit(X_train_tfidf,twenty_train.target)

    '''Evaluate the classifier'''
    # build a pipeline that chains vectorizer, TF-IDF transform and classifier
    from sklearn.pipeline import Pipeline
    text_clf=Pipeline([('vect',CountVectorizer(stop_words="english",decode_error='ignore')),
                       ('tfidf',TfidfTransformer()),
                       ('clf',MultinomialNB()),
                       ])
    text_clf=text_clf.fit(twenty_train.data,twenty_train.target)
    # classification accuracy on the test set
    import numpy as np
    docs_test=twenty_test.data
    predicted=text_clf.predict(docs_test)
    print("Naive Bayes classification accuracy:",np.mean(predicted==twenty_test.target))

The classification accuracy is 0.8169.