Python3：《机器学习实战》之朴素贝叶斯（3）过滤垃圾邮件

最新推荐文章于 2021-11-28 13:57:33 发布

咸鱼半条

最新推荐文章于 2021-11-28 13:57:33 发布

阅读量776

点赞数 1

分类专栏：机器学习

机器学习专栏收录该内容

14 篇文章 1 订阅

订阅专栏

Python3：《机器学习实战》之朴素贝叶斯（3）过滤垃圾邮件

转载请注明作者和出处：http://blog.csdn.net/u011475210
代码地址：https://github.com/WordZzzz/ML/tree/master/Ch04
操作系统：WINDOWS 10
软件版本：python-3.6.2-amd64
编者：WordZzzz

Python3机器学习实战之朴素贝叶斯3过滤垃圾邮件

前言：

使用朴素贝叶斯解决一些现实生活的问题时，需要先从文本内容得到字符串列表，然后生成词向量。下面这个例子中，我们将了解朴素贝叶斯的一个最著名的应用：电子邮件垃圾过滤。

示例：使用朴素贝叶斯对电子邮件进行分类

收集数据：提供文本文件。
准备数据：将文本文件解析成词条向量。
分析数据：检查词条确保解析的正确性。
训练算法：使用我们之前建立的trainNB0()函数。
测试算法：使用calssifyNB()，并且构建一个新的测试函数来计算文档集的错误率。
使用算法：构建一个完整的程序对一组文档进行分类，将错分的文档输出。

准备数据：切分文本

先前的次向量都是我们预先给定的，这次将介绍如何从文本文档中构建自己的词列表。对于一个文本字符串，可以使用Python的string.split()方法将其切分。

string.split()的使用详解，请打开传送门：http://blog.csdn.net/u011475210/article/details/77925994

代码实现：

def textParse(bigString):
    """
    Function：   切分文本

    Args：       bigString：输入字符串

    Returns：    [*]：切分后的字符串列表
    """
    import re
    #利用正则表达式，来切分句子，其中分隔符是除单词、数字外的任意字符串

    listOfTokens = re.split(r'\W*', bigString)
    #返回切分后的字符串列表
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
   
   1
2
3
4
5
6
7
8
9
10
11
12
13
14

Python中有一些内嵌的方法，可以将字符串全部转换成小写（.lower()）或者大写（.upper()），借助这些方法可以达到目的。程序的最后一行就是用的这种方法。同时，如果某些文件包含一些URL（http://docs.google.com/support/bin/answer.py?hl=en&answer=66343），例如ham下的6.txt，那么切分文本时就会出现很多单词，如py、hl，很显然这些都是没用的，所以我们在程序最后一行只输出长度大于2的词条，好机智哦！

测试算法：使用朴素贝叶斯进行交叉验证

下面我们将文本解析器集成到一个完整的分类器中。

代码实现：

def spamTest():
    """
    Function：   贝叶斯垃圾邮件分类器

    Args：       无

    Returns：    float(errorCount)/len(testSet)：错误率
                vocabList：词汇表
                fullText：文档中全部单词
    """
    #初始化数据列表
    docList = []; classList = []; fullText = []
    #导入文本文件
    for i in range(1, 26):
        #切分文本
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        #切分后的文本以原始列表形式加入文档列表
        docList.append(wordList)
        #切分后的文本直接合并到词汇列表
        fullText.extend(wordList)
        #标签列表更新
        classList.append(1)
        #切分文本
        #print('i = :', i)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        #切分后的文本以原始列表形式加入文档列表
        docList.append(wordList)
        #切分后的文本直接合并到词汇列表
        fullText.extend(wordList)
        #标签列表更新
        classList.append(0)
    #创建一个包含所有文档中出现的不重复词的列表
    vocabList = createVocabList(docList)
    #初始化训练集和测试集列表
    trainingSet = list(range(50)); testSet = []
    #随机构建测试集，随机选取十个样本作为测试样本，并从训练样本中剔除
    for i in range(10):
        #随机得到Index
        randIndex = int(random.uniform(0, len(trainingSet)))
        #将该样本加入测试集中
        testSet.append(trainingSet[randIndex])
        #同时将该样本从训练集中剔除
        del(trainingSet[randIndex])
    #初始化训练集数据列表和标签列表
    trainMat = []; trainClasses = []
    #遍历训练集
    for docIndex in trainingSet:
        #词表转换到向量，并加入到训练数据列表中
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        #相应的标签也加入训练标签列表中
        trainClasses.append(classList[docIndex])
    #朴素贝叶斯分类器训练函数
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    #初始化错误计数
    errorCount = 0
    #遍历测试集进行测试
    for docIndex in testSet:
        #词表转换到向量
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        #判断分类结果与原标签是否一致
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            #如果不一致则错误计数加1
            errorCount += 1
            #并且输出出错的文档
            print("classification error",docList[docIndex])
    #打印输出信息
    print('the erroe rate is: ', float(errorCount)/len(testSet))
    #返回词汇表和全部单词列表
    #return vocabList, fullText
   
   1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69

输出结果：

>>> bayes.spamTest()
the erroe rate is:  0.0
>>> bayes.spamTest()
the erroe rate is:  0.0
>>> bayes.spamTest()
classification error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don抰', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
classification error ['scifinance', 'now', 'automatically', 'generates', 'gpu', 'enabled', 'pricing', 'risk', 'model', 'source', 'code', 'that', 'runs', '300x', 'faster', 'than', 'serial', 'code', 'using', 'new', 'nvidia', 'fermi', 'class', 'tesla', 'series', 'gpu', 'scifinance', 'derivatives', 'pricing', 'and', 'risk', 'model', 'development', 'tool', 'that', 'automatically', 'generates', 'and', 'gpu', 'enabled', 'source', 'code', 'from', 'concise', 'high', 'level', 'model', 'specifications', 'parallel', 'computing', 'cuda', 'programming', 'expertise', 'required', 'scifinance', 'automatic', 'gpu', 'enabled', 'monte', 'carlo', 'pricing', 'model', 'source', 'code', 'generation', 'capabilities', 'have', 'been', 'significantly', 'extended', 'the', 'latest', 'release', 'this', 'includes']
the erroe rate is:  0.2
   
   1
2
3
4
5
6
7
8

函数spamTest()会输出在10封随机选择的电子邮件上的分类错误率。因为是随机的，所以每次输出结果可能有些差别。所以如果想要更好的估计错误率，就需要多次重复求平均值。

报错信息汇总

运行报错：

>>> reload(bayes)
<module 'bayes' from 'E:\\机器学习实战\\mycode\\Ch04\\bayes.py'>
>>> bayes.spamTest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\机器学习实战\mycode\Ch04\bayes.py", line 221, in spamTest
    wordList = textParse(open('email/ham/%d.txt' % i).read())
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence
   
   1
2
3
4
5
6
7
8

一看就是编码问题，所以在程序中加入了打印信息，想看看是哪个文档读取出了问题，最后发现数据集ham下第23个文本中有不能识别的字符（®），修改之后程序运转正常。如果从我的github上下载的数据集，那就大可放心，不会出现这种问题的。

报错文档：

SciFinance now automatically generates GPU-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new NVIDIA Fermi-class Tesla 20-Series GPU.

SciFinance® is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required.

SciFinance's automatic, GPU-enabled Monte Carlo pricing model source code generation capabilities have been significantly extended in the latest release. This includes:
   
   1
2
3
4
5

运行报错：

>>> reload(bayes)
<module 'bayes' from 'E:\\机器学习实战\\mycode\\Ch04\\bayes.py'>
>>> bayes.spamTest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\机器学习实战\mycode\Ch04\bayes.py", line 239, in spamTest
    del(trainingSet[randIndex])
TypeError: 'range' object doesn't support item deletion
   
   1
2
3
4
5
6
7
8