基于朴素贝叶斯分类算法实现垃圾邮箱分类

最新推荐文章于 2024-03-02 14:13:31 发布

yqtaowhu

最新推荐文章于 2024-03-02 14:13:31 发布

阅读量2.9w

点赞数 12

分类专栏： Machine Learn 文章标签：朴素贝叶斯机器学习

本文链接：https://blog.csdn.net/taoyanqi8932/article/details/53083983

版权

Machine Learn 专栏收录该内容

25 篇文章 3 订阅

订阅专栏

贝叶斯决策理论

在机器学习中，朴素贝叶斯是基于贝叶斯决策的一种简单形式,下面给出贝叶斯的基本公式，也是最重要的公式：

这里写图片描述

其中X是一个m*n的矩阵，m为他的样本数，n为特征的个数，即我们要求的是：在已知的样本情况下的条件概率。
$示例$ )表示的是条件概率
$示例$ )为先验概率

为什么需要朴素贝叶斯

这里有两个原因：

由统计学知识，如果每个特征需要N个样本，那么对于10个特征需要 $示例$ 个样本，这个数据无疑是非常大的，会随着特征的增大而迅速的增大。如果特征之间独立，那么样本数就可以减少到10*N，所谓的独立及时每个特征与其他的特征没有关系，当然这个假设很强，因为我们知道实际中是很难的。
因为独立同分布，因此在计算条件概率的时，可以避免了求联合概率，因此可以写成：

下面给算法的实现过程（打公式太难，不喜勿喷）：
这里写图片描述

上述算法的修正

1. 最后概率可能为0

在进行分类是，多个概率乘积得到类别，但是如果有一个概率为0,则最后的结果为0，因此未来避免未出现的属性值，在估计概率时同城要进行“平滑”，常用的是“拉普拉斯修正”（Laplacian correction）,具体来说，在计算 $示例$ )的时候，将分子加1，分母加上类别数N.同样在计算 $示例$ )的时候在分子加1，分母加上Ni,其表示第i个属性的可能的取值数。

2. 值过小可能会溢出

在实际中对概率取对数的形式，可以防止相乘是溢出。

垃圾邮件分类

在email/spam文件夹中有25封垃圾邮件，在email/ham中有25封正常邮件，将其进行垃圾邮件分类。

分词

首先遇到的问题是怎样把一封邮件进行分词，即将其划分成一个个单词的形式。可以想到用正则表达式，关于正则表达式可以参考网上的资料，这里给出python的程序，实现怎样将一个长的字符冲进行分词的操作。

def textParse(bigString):
    import re          #导入正则表达式的库
    listOfTokens=re.split(r'\W*',bigString)   #返回列表
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

示例：

#见最后的总程序
#可见它将一封邮件进行了分词
[yqtao@localhost ml]$ python bayes.py   
['peter', 'with', 'jose', 'out', 'town', 'you', 'want', 'meet', 'once', 'while', 'keep', 'things', 'going', 'and', 'some', 'interesting', 'stuff', 'let', 'know', 'eugene']

这里的re.split(r'\W*',bigString),表示以除了数字，字母和下划线的符合进行划分，return 的是一个列表推到式生成的列表，其中将单词长度小于等于2的过滤掉，并且将其变成小写字母。

生成词汇表

将所有的邮件进行分词后生成一个dataSet，然后生成一个词汇表，这个词汇表是一个集合，即每个单词只出现一次，词汇表是一个列表形式如：
[“cute”,”love”,help”,garbage”,”quit”…]

def createVocabList(dataSet):
    vocabSet=set([])
    for docment in dataSet:
        vocabSet=vocabSet| set(docment) #union of tow sets
    return list(vocabSet) #convet if to list

生成词向量

每一封邮件的词汇都存在了词汇表中，因此可以将每一封邮件生成一个词向量，存在几个则为几，不存在为0，例如：[“love”,”garbage”],则他的词向量为
[0,1,0,1,0,…],其位置是与词汇表所对应的，因此词向量的维度与词汇表相同。

#vocablist为词汇表，inputSet为输入的邮件
def bagOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)    #他的大小与词向量一样
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]+=1 #查找单词的索引
        else: print ("the word is not in my vocabulry")
    return returnVec

训练算法

这一步是算法的核心，要计算：
1. 先验概率
2. 计算 $示例$ ), $示例$ ) 这里0表示正常邮件，1表示垃圾邮件。
这里需要重点的理解是如何计算第二步的，例如， $示例$ )，表示在垃圾邮件的条件下第i个特征的概率，首先先将所有的类别为1的词向量相加，可以得到每个特征的个数，因此在除以在类别1的单词总数就是在垃圾邮件中每个单词的概率了。注意，这里所说的特征即词汇的每一个单词。
python程序如下：

#这里的trainMat是训练样本的词向量，其是一个矩阵，他的每一行为一个邮件的词向量
#trainGategory为与trainMat对应的类别，值为0，1表示正常，垃圾
def train(trainMat,trainGategory):
    numTrain=len(trainMat)
    numWords=len(trainMat[0])  #is vocabulry length
    pAbusive=sum(trainGategory)/float(numTrain)
    p0Num=ones(numWords);p1Num=ones(numWords)
    p0Denom=2.0;p1Denom=2.0
    for i in range(numTrain):
        if trainGategory[i] == 1:
            p1Num += trainMat[i] #统计类1中每个单词的个数
            p1Denom += sum(trainMat[i]) #类1的单词总数
        else:
            p0Num += trainMat[i]
            p0Denom +=sum(trainMat[i])
    p1Vec=log(p1Num/p1Denom) #类1中每个单词的概率
    p0Vec=log(p0Num/p0Denom)
    return p0Vec,p1Vec,pAbusive

处理数据验证过程

这里首先将50封邮件读进docList列表中，然后生成一个词汇表包含所有的单词，接下来使用交叉验证，随机的选择10个样本进行测试，40个样本进行训练。

#spam email classfy
def spamTest():
    fullTest=[];docList=[];classList=[]
    for i in range(1,26): #it only 25 doc in every class
        wordList=textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(1)
        wordList=textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(0)
    vocabList=createVocabList(docList)   # create vocabulry
    trainSet=range(50);testSet=[]
#choose 10 sample to test ,it index of trainMat
    for i in range(10):
        randIndex=int(random.uniform(0,len(trainSet)))#num in 0-49
        testSet.append(trainSet[randIndex])
        del(trainSet[randIndex])
    trainMat=[];trainClass=[]
    for docIndex in trainSet:
        trainMat.append(bagOfWords2Vec(vocabList,docList[docIndex]))
        trainClass.append(classList[docIndex])
    p0,p1,pSpam=train(array(trainMat),array(trainClass))
    errCount=0
    for docIndex in testSet:
        wordVec=bagOfWords2Vec(vocabList,docList[docIndex])
        if classfy(array(wordVec),p0,p1,pSpam) != classList[docIndex]:
            errCount +=1
            print ("classfication error"), docList[docIndex]

    print ("the error rate is ") , float(errCount)/len(testSet)

完整的代码如下：

from numpy import *

#create a vocablist of set ,word can only exit once

def createVocabList(dataSet):
    vocabSet=set([])
    for docment in dataSet:
        vocabSet=vocabSet| set(docment) #union of tow sets
    return list(vocabSet) #convet if to list 


def bagOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]+=1
        else: print ("the word is not in my vocabulry")
    return returnVec

# tranin algorithm
# the p1Num is mean claclualte in 1 class evrey word contain weight
def train(trainMat,trainGategory):
    numTrain=len(trainMat)
    numWords=len(trainMat[0])  #is vocabulry length
    pAbusive=sum(trainGategory)/float(numTrain)
    p0Num=ones(numWords);p1Num=ones(numWords)
    p0Denom=2.0;p1Denom=2.0
    for i in range(numTrain):
        if trainGategory[i] == 1:
            p1Num += trainMat[i] 
            p1Denom += sum(trainMat[i])
        else:
            p0Num += trainMat[i]
            p0Denom +=sum(trainMat[i])
    p1Vec=log(p1Num/p1Denom)
    p0Vec=log(p0Num/p0Denom)
    return p0Vec,p1Vec,pAbusive
# classfy funtion
def classfy(vec2classfy,p0Vec,p1Vec,pClass1):
    p1=sum(vec2classfy*p1Vec)+log(pClass1)
    p0=sum(vec2classfy*p0Vec)+log(1-pClass1)
    if p1 > p0:
        return 1;
    else:
        return 0

# split the big string
def textParse(bigString):
    import re
    listOfTokens=re.split(r'\W*',bigString)
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

#spam email classfy
def spamTest():
    fullTest=[];docList=[];classList=[]
    for i in range(1,26): #it only 25 doc in every class
        wordList=textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(1)
        wordList=textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(0)
    vocabList=createVocabList(docList)   # create vocabulry
    trainSet=range(50);testSet=[]
#choose 10 sample to test ,it index of trainMat
    for i in range(10):
        randIndex=int(random.uniform(0,len(trainSet)))#num in 0-49
        testSet.append(trainSet[randIndex])
        del(trainSet[randIndex])
    trainMat=[];trainClass=[]
    for docIndex in trainSet:
        trainMat.append(bagOfWords2Vec(vocabList,docList[docIndex]))
        trainClass.append(classList[docIndex])
    p0,p1,pSpam=train(array(trainMat),array(trainClass))
    errCount=0
    for docIndex in testSet:
        wordVec=bagOfWords2Vec(vocabList,docList[docIndex])
        if classfy(array(wordVec),p0,p1,pSpam) != classList[docIndex]:
            errCount +=1
            print ("classfication error"), docList[docIndex]

    print ("the error rate is ") , float(errCount)/len(testSet)

if __name__ == '__main__':
    #下面的为了演示分词的，可注释
    #listWord=textParse(open('email/ham/1.txt').read())
    spamTest()

运行与测试：直接python bayes.py运行，结果如下：

[yqtao@localhost ml]$ python bayes.py
classfication error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
[yqtao@localhost ml]$ python bayes.py
the error rate is  0.0
[yqtao@localhost ml]$ python bayes.py
the error rate is  0.0
[yqtao@localhost ml]$ python bayes.py
the error rate is  0.0
[yqtao@localhost ml]$ python bayes.py
classfication error ['scifinance', 'now', 'automatically', 'generates', 'gpu', 'enabled', 'pricing', 'risk', 'model', 'source', 'code', 'that', 'runs', '300x', 'faster', 'than', 'serial', 'code', 'using', 'new', 'nvidia', 'fermi', 'class', 'tesla', 'series', 'gpu', 'scifinance', 'derivatives', 'pricing', 'and', 'risk', 'model', 'development', 'tool', 'that', 'automatically', 'generates', 'and', 'gpu', 'enabled', 'source', 'code', 'from', 'concise', 'high', 'level', 'model', 'specifications', 'parallel', 'computing', 'cuda', 'programming', 'expertise', 'required', 'scifinance', 'automatic', 'gpu', 'enabled', 'monte', 'carlo', 'pricing', 'model', 'source', 'code', 'generation', 'capabilities', 'have', 'been', 'significantly', 'extended', 'the', 'latest', 'release', 'this', 'includes']
the error rate is  0.1

因为是随机选择的样本，可以运行10次取平均值，可以观察到，测试效果还不错。还有一点要注意，这里一直出现的是将垃圾邮件误判为正常邮件，这会比将正常的误判为垃圾邮件要好。

完整的代码和数据在我的github:
https://github.com/yqtaowhu/MachineLearning

参考资料：

《统计学习方法》李航著
《机器学习》周志华著
《机器学习实战》Peter Harrington

yqtaowhu

关注

12
点赞
踩
100

收藏

觉得还不错? 一键收藏
15
评论
基于朴素贝叶斯分类算法实现垃圾邮箱分类

贝叶斯决策理论在机器学习中，朴素贝叶斯是基于贝叶斯决策的一种简单形式,下面给出贝叶斯的基本公式，也是最重要的公式：其中X是一个m*n的矩阵，m为他的样本数，n为特征的个数，即我们要求的是：在已知的样本情况下的条件概率。 )表示的是条件概率 )为先验概率为什么需要朴素贝叶斯这里有两个原因：由统计学知识，如果每个特征需要N个样本，那么对于10个特征需要个样本，这个数
复制链接

扫一扫