Filtering (Chinese and English) Spam Email with the Naive Bayes Algorithm

When applying naive Bayes to real-world problems, the text first has to be turned into a list of token strings, and the list is then converted into word vectors. In this example we look at one of the best-known applications of naive Bayes: filtering spam email.
First, the steps for classifying email with naive Bayes:

(Figure omitted: the workflow for classifying email with naive Bayes.) Below are the raw text of an email, followed by the token list produced by splitting it.
Hello,
Since you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups – mailing lists and forum discussions. Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.
For example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you're just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.
you have received this mandatory email service announcement to update you about important changes to Google Groups.
['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one', 'Google', 'Groups', 'group', 'that', 'uses', 'the', 'customized', 'welcome', 'message', 'pages', 'or', 'files', 'we', 'are', 'writing', 'to', 'inform', 'you', 'that', 'we', 'will', 'no', 'longer', 'be', 'supporting', 'these', 'features', 'starting', 'February', '2011', 'We', 'made', 'this', 'decision', 'so', 'that', 'we', 'can', 'focus', 'on', 'improving', 'the', 'core', 'functionalities', 'of', 'Google', 'Groups', 'mailing', 'lists', 'and', 'forum', 'discussions', 'Instead', 'of', 'these', 'features', 'we', 'encourage', 'you', 'to', 'use', 'products', 'that', 'are', 'designed', 'specifically', 'for', 'file', 'storage', 'and', 'page', 'creation', 'such', 'as', 'Google', 'Docs', 'and', 'Google', 'Sites', 'For', 'example', 'you', 'can', 'easily', 'create', 'your', 'pages', 'on', 'Google', 'Sites', 'and', 'share', 'the', 'site', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '174623', 'with', 'the', 'members', 'of', 'your', 'group', 'You', 'can', 'also', 'store', 'your', 'files', 'on', 'the', 'site', 'by', 'attaching', 'files', 'to', 'pages', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '90563', 'on', 'the', 'site', 'If', 'you', 're', 'just', 'looking', 'for', 'a', 'place', 'to', 'upload', 'your', 'files', 'so', 'that', 'your', 'group', 'members', 'can', 'download', 'them', 'we', 'suggest', 'you', 'try', 'Google', 'Docs', 'You', 'can', 'upload', 'files', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '50092', 'and', 'share', 'access', 'with', 'either', 'a', 'group', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '66343', 'or', 'an', 'individual', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '86152', 'assigning', 'either', 'edit', 'or', 'download', 'only', 'access', 'to', 'the', 'files', 'you', 'have', 'received', 'this', 'mandatory', 'email', 'service', 'announcement', 'to', 'update', 'you', 'about', 'important', 'changes', 'to', 'Google', 'Groups', '']

def textParse(bigString):
    """
    Parse a long string into a list of tokens.
    :param bigString: the raw text of an email
    :return: list of lowercase tokens longer than two characters
    """
    import re
    # \W+ splits on runs of non-alphanumeric characters; see the notes below
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

The three working lines above pack in quite a bit: the input is a single string, and the output is a list of token strings.

  • Since a regular expression is used to split the text, the re module is imported first.

  • re.split is called to split the text. The pattern '\W+' treats any run of one or more characters other than letters, digits and the underscore as a delimiter. The r prefix marks a raw string, declaring that the backslash is not an escape character but is passed, together with W, to the regex engine.

  • listOfTokens is the list of split tokens, but it still contains empty strings and meaningless short fragments produced by parsing URLs in the text. The last line builds and returns a list of the tokens in listOfTokens that are longer than two characters, all converted to lowercase (see the quick check after this list).
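
A quick sanity check of textParse on a made-up sample sentence:

print(textParse('This book is the best book on Python or M.L. I have ever laid eyes upon.'))
# -> ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']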
二、Testing the algorithm: filtering spam
There are 50 emails in total, 25 spam and 25 ham. Ten are selected at random as the test set, and the remaining 40 form the training set. The test procedure breaks down into three parts:

    Read the text files and convert them into a list of word vectors
    Build the training and test sets, and train a classifier on the training set
    Classify the test-set samples, printing the misclassified texts and the error rate
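
The spamTest function below relies on four functions built earlier in this series: createVocabList, setOfWords2Vec, trainNB0 and classifyNB. For reference, minimal versions in the usual Machine Learning in Action style look like this (a sketch; if your earlier definitions differ, use those):

import numpy as np

def createVocabList(dataSet):
    """Build a list of every unique token that appears in any document."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Set-of-words model: mark 1 if a vocabulary word occurs in the document."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    """Train naive Bayes: log word probabilities per class, with Laplace smoothing."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # prior P(class = 1)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)  # start counts at 1 (smoothing)
    p0Denom = 2.0; p1Denom = 2.0                          # and denominators at 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]; p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]; p0Denom += sum(trainMatrix[i])
    return np.log(p0Num / p0Denom), np.log(p1Num / p1Denom), pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Compare the two log posteriors and return the label of the larger one."""
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0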

import random
import numpy as np

def spamTest():
    """
    Function:   naive Bayes spam classifier
    Args:       none
    Returns:    float(errorCount)/len(testSet): the error rate
                vocabList: the vocabulary list
                fullText:  all tokens from all documents
    """
    # initialize the data lists
    docList = []
    classList = []
    fullText = []
    # load the text files
    for i in range(1, 26):
        # parse a spam email
        wordList = textParse(open('email/spam/%d.txt' % i, encoding='ISO-8859-1').read())
        docList.append(wordList)    # add the parsed text to the document list as its own sublist
        fullText.extend(wordList)   # merge the parsed tokens into the flat token list
        classList.append(1)         # update the label list (1 = spam)
        # parse the matching ham email
        wordList = textParse(open('email/ham/%d.txt' % i, encoding='ISO-8859-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)         # 0 = ham
    # build a list of all unique words appearing in any document
    vocabList = createVocabList(docList)
    # initialize the training and test index lists
    trainingSet = list(range(50))
    testSet = []
    # randomly build the test set: pick ten samples and remove them from the training set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))  # a random index
        testSet.append(trainingSet[randIndex])  # add that sample to the test set
        del(trainingSet[randIndex])             # and remove it from the training set
    # initialize the training data and label lists
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:  # iterate over the training set
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))  # convert the tokens to a vector and add it to the training data
        trainClasses.append(classList[docIndex])  # add the matching label to the training labels
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # train the naive Bayes classifier
    errorCount = 0  # initialize the error counter
    for docIndex in testSet:  # run the test set through the classifier
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])  # convert the tokens to a vector
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:  # does the prediction match the label?
            errorCount += 1  # if not, count the error
            print("classification error", docList[docIndex])  # and print the misclassified document
    print('the error rate is: ', float(errorCount) / len(testSet))
    return float(errorCount) / len(testSet), vocabList, fullText  # return the error rate, vocabulary and full token list
    

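Because the ten test emails are drawn at random, the printed error rate fluctuates from run to run. A steadier estimate comes from averaging several runs; a minimal sketch, using the error rate that spamTest returns as its first value:

numRuns = 10
total = 0.0
for _ in range(numRuns):
    errorRate, vocabList, fullText = spamTest()  # each call draws a fresh random test set
    total += errorRate
print('average error rate over %d runs: %.3f' % (numRuns, total / numRuns))
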
三、Encoding errors when reading the files
(Error screenshot omitted: open() raising a UnicodeDecodeError.) In other words, some of the files may not be saved in UTF-8. One fix is to pass an explicit encoding to open(), which is what the encoding='ISO-8859-1' arguments in spamTest already do.
Alternatively, use the following code to locate the problematic files and then fix them:

for i in range(1, 26):
    try:
        wordList = open('./email/ham/%d.txt' % i, encoding='UTF-8').read()
    except UnicodeDecodeError:
        print(i)  # this file is not valid UTF-8 and needs fixing

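Once the offending files are found, one permanent fix is to rewrite them as UTF-8. A minimal sketch, assuming the raw bytes are ISO-8859-1 (the same assumption the open() calls in spamTest make):

for folder in ('email/spam', 'email/ham'):
    for i in range(1, 26):
        path = '%s/%d.txt' % (folder, i)
        try:
            open(path, encoding='UTF-8').read()              # already valid UTF-8: leave it alone
        except UnicodeDecodeError:
            text = open(path, encoding='ISO-8859-1').read()  # assumed source encoding
            open(path, 'w', encoding='UTF-8').write(text)    # re-save as UTF-8
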
(Further screenshots omitted.) Fixes for similar errors are discussed in:
https://blog.csdn.net/lvsehaiyang1993/article/details/80909984
https://zhidao.baidu.com/question/1304187124475688859.html
https://www.cnblogs.com/apple2016/p/6514917.html
四、How to handle Chinese emails
The emails handled above are entirely in English, but what about Chinese ones? Splitting on punctuation as above would treat a whole Chinese sentence as a single token, which clearly will not work. Fortunately Python has a powerful Chinese text-processing module, jieba, which segments Chinese text into words and, when it encounters English words, falls back to splitting them in the usual English way.

A file-parsing function that can handle both Chinese and English:
def textParse1(bigString):
    import re
    import jieba
    listOfTokens = jieba.lcut(bigString)  # segment the text with jieba
    newList = [re.sub(r'\W+', '', s) for s in listOfTokens]  # strip punctuation from each token
    return [tok.lower() for tok in newList if len(tok) > 0]  # drop tokens that became empty

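A quick illustration on a mixed Chinese/English string (the sample text is made up, and jieba's exact segmentation may vary by version and dictionary):

print(textParse1('您好,这是一封测试邮件。Please visit http://www.google.com today!'))
# roughly: ['您好', '这是', '一封', '测试', '邮件', 'please', 'visit', 'http', 'www', 'google', 'com', 'today']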
For an example of using a naive Bayes classifier to discover regional attitudes from personal ads, see: https://blog.csdn.net/crozonkdd/article/details/80785718
