Filtering (Chinese and English) Spam Email with the Naive Bayes Algorithm

When applying naive Bayes to real-world problems, the text first has to be turned into a list of token strings, and the list is then converted into word vectors. In this example we look at one of the best-known applications of naive Bayes: filtering spam email.
First, the steps for classifying email with naive Bayes:

(Figure omitted: the workflow for classifying email with naive Bayes.) Below are the raw text of an email, followed by the token list produced by splitting it.
Hello,
Since you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups – mailing lists and forum discussions. Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.
For example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you're just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.
you have received this mandatory email service announcement to update you about important changes to Google Groups.
['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one', 'Google', 'Groups', 'group', 'that', 'uses', 'the', 'customized', 'welcome', 'message', 'pages', 'or', 'files', 'we', 'are', 'writing', 'to', 'inform', 'you', 'that', 'we', 'will', 'no', 'longer', 'be', 'supporting', 'these', 'features', 'starting', 'February', '2011', 'We', 'made', 'this', 'decision', 'so', 'that', 'we', 'can', 'focus', 'on', 'improving', 'the', 'core', 'functionalities', 'of', 'Google', 'Groups', 'mailing', 'lists', 'and', 'forum', 'discussions', 'Instead', 'of', 'these', 'features', 'we', 'encourage', 'you', 'to', 'use', 'products', 'that', 'are', 'designed', 'specifically', 'for', 'file', 'storage', 'and', 'page', 'creation', 'such', 'as', 'Google', 'Docs', 'and', 'Google', 'Sites', 'For', 'example', 'you', 'can', 'easily', 'create', 'your', 'pages', 'on', 'Google', 'Sites', 'and', 'share', 'the', 'site', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '174623', 'with', 'the', 'members', 'of', 'your', 'group', 'You', 'can', 'also', 'store', 'your', 'files', 'on', 'the', 'site', 'by', 'attaching', 'files', 'to', 'pages', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '90563', 'on', 'the', 'site', 'If', 'you', 're', 'just', 'looking', 'for', 'a', 'place', 'to', 'upload', 'your', 'files', 'so', 'that', 'your', 'group', 'members', 'can', 'download', 'them', 'we', 'suggest', 'you', 'try', 'Google', 'Docs', 'You', 'can', 'upload', 'files', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '50092', 'and', 'share', 'access', 'with', 'either', 'a', 'group', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '66343', 'or', 'an', 'individual', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '86152', 'assigning', 'either', 'edit', 'or', 'download', 'only', 'access', 'to', 'the', 'files', 'you', 'have', 'received', 'this', 'mandatory', 'email', 'service', 'announcement', 'to', 'update', 'you', 'about', 'important', 'changes', 'to', 'Google', 'Groups', '']

def textParse(bigString):
    """
    Parse a long string into a list of tokens.
    :param bigString: the raw text of an email
    :return: list of lowercase tokens longer than two characters
    """
    import re
    # \W+ splits on runs of non-alphanumeric characters; see the notes below
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

The three working lines above pack in quite a bit: the input is a single string, and the output is a list of token strings.

  • Since a regular expression is used to split the text, the re module is imported first.

  • re.split is called to split the text. The pattern '\W+' treats any run of one or more characters other than letters, digits and the underscore as a delimiter. The r prefix marks a raw string, declaring that the backslash is not an escape character but is passed, together with W, to the regex engine.

  • listOfTokens is the list of split tokens, but it still contains empty strings and meaningless short fragments produced by parsing URLs in the text. The last line builds and returns a list of the tokens in listOfTokens that are longer than two characters, all converted to lowercase (see the quick check after this list).
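
A quick sanity check of textParse on a made-up sample sentence:

print(textParse('This book is the best book on Python or M.L. I have ever laid eyes upon.'))
# -> ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']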
二、Testing the algorithm: filtering spam
There are 50 emails in total, 25 spam and 25 ham. Ten are selected at random as the test set, and the remaining 40 form the training set. The test procedure breaks down into three parts:

    Read the text files and convert them into a list of word vectors
    Build the training and test sets, and train a classifier on the training set
    Classify the test-set samples, printing the misclassified texts and the error rate
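
The spamTest function below relies on four functions built earlier in this series: createVocabList, setOfWords2Vec, trainNB0 and classifyNB. For reference, minimal versions in the usual Machine Learning in Action style look like this (a sketch; if your earlier definitions differ, use those):

import numpy as np

def createVocabList(dataSet):
    """Build a list of every unique token that appears in any document."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Set-of-words model: mark 1 if a vocabulary word occurs in the document."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    """Train naive Bayes: log word probabilities per class, with Laplace smoothing."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # prior P(class = 1)
    p0Num = np.ones(numWords); p1Num = np.ones(numWords)  # start counts at 1 (smoothing)
    p0Denom = 2.0; p1Denom = 2.0                          # and denominators at 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]; p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]; p0Denom += sum(trainMatrix[i])
    return np.log(p0Num / p0Denom), np.log(p1Num / p1Denom), pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Compare the two log posteriors and return the label of the larger one."""
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0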

import random
import numpy as np

def spamTest():
    """
    Function:   naive Bayes spam classifier
    Args:       none
    Returns:    float(errorCount)/len(testSet): the error rate
                vocabList: the vocabulary list
                fullText:  all tokens from all documents
    """
    # initialize the data lists
    docList = []
    classList = []
    fullText = []
    # load the text files
    for i in range(1, 26):
        # parse a spam email
        wordList = textParse(open('email/spam/%d.txt' % i, encoding='ISO-8859-1').read())
        docList.append(wordList)    # add the parsed text to the document list as its own sublist
        fullText.extend(wordList)   # merge the parsed tokens into the flat token list
        classList.append(1)         # update the label list (1 = spam)
        # parse the matching ham email
        wordList = textParse(open('email/ham/%d.txt' % i, encoding='ISO-8859-1').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)         # 0 = ham
    # build a list of all unique words appearing in any document
    vocabList = createVocabList(docList)
    # initialize the training and test index lists
    trainingSet = list(range(50))
    testSet = []
    # randomly build the test set: pick ten samples and remove them from the training set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))  # a random index
        testSet.append(trainingSet[randIndex])  # add that sample to the test set
        del(trainingSet[randIndex])             # and remove it from the training set
    # initialize the training data and label lists
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:  # iterate over the training set
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))  # convert the tokens to a vector and add it to the training data
        trainClasses.append(classList[docIndex])  # add the matching label to the training labels
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))  # train the naive Bayes classifier
    errorCount = 0  # initialize the error counter
    for docIndex in testSet:  # run the test set through the classifier
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])  # convert the tokens to a vector
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:  # does the prediction match the label?
            errorCount += 1  # if not, count the error
            print("classification error", docList[docIndex])  # and print the misclassified document
    print('the error rate is: ', float(errorCount) / len(testSet))
    return float(errorCount) / len(testSet), vocabList, fullText  # return the error rate, vocabulary and full token list
    

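Because the ten test emails are drawn at random, the printed error rate fluctuates from run to run. A steadier estimate comes from averaging several runs; a minimal sketch, using the error rate that spamTest returns as its first value:

numRuns = 10
total = 0.0
for _ in range(numRuns):
    errorRate, vocabList, fullText = spamTest()  # each call draws a fresh random test set
    total += errorRate
print('average error rate over %d runs: %.3f' % (numRuns, total / numRuns))
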
三、Encoding errors when reading the files
(Error screenshot omitted: open() raising a UnicodeDecodeError.) In other words, some of the files may not be saved in UTF-8. One fix is to pass an explicit encoding to open(), which is what the encoding='ISO-8859-1' arguments in spamTest already do.
Alternatively, use the following code to locate the problematic files and then fix them:

for i in range(1, 26):
    try:
        wordList = open('./email/ham/%d.txt' % i, encoding='UTF-8').read()
    except UnicodeDecodeError:
        print(i)  # this file is not valid UTF-8 and needs fixing

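Once the offending files are found, one permanent fix is to rewrite them as UTF-8. A minimal sketch, assuming the raw bytes are ISO-8859-1 (the same assumption the open() calls in spamTest make):

for folder in ('email/spam', 'email/ham'):
    for i in range(1, 26):
        path = '%s/%d.txt' % (folder, i)
        try:
            open(path, encoding='UTF-8').read()              # already valid UTF-8: leave it alone
        except UnicodeDecodeError:
            text = open(path, encoding='ISO-8859-1').read()  # assumed source encoding
            open(path, 'w', encoding='UTF-8').write(text)    # re-save as UTF-8
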
(Further screenshots omitted.) Fixes for similar errors are discussed in:
https://blog.csdn.net/lvsehaiyang1993/article/details/80909984
https://zhidao.baidu.com/question/1304187124475688859.html
https://www.cnblogs.com/apple2016/p/6514917.html
四、How to handle Chinese emails
The emails handled above are entirely in English, but what about Chinese ones? Splitting on punctuation as above would treat a whole Chinese sentence as a single token, which clearly will not work. Fortunately Python has a powerful Chinese text-processing module, jieba, which segments Chinese text into words and, when it encounters English words, falls back to splitting them in the usual English way.

A file-parsing function that can handle both Chinese and English:
def textParse1(bigString):
    import re
    import jieba
    listOfTokens = jieba.lcut(bigString)  # segment the text with jieba
    newList = [re.sub(r'\W+', '', s) for s in listOfTokens]  # strip punctuation from each token
    return [tok.lower() for tok in newList if len(tok) > 0]  # drop tokens that became empty

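A quick illustration on a mixed Chinese/English string (the sample text is made up, and jieba's exact segmentation may vary by version and dictionary):

print(textParse1('您好,这是一封测试邮件。Please visit http://www.google.com today!'))
# roughly: ['您好', '这是', '一封', '测试', '邮件', 'please', 'visit', 'http', 'www', 'google', 'com', 'today']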
For an example of using a naive Bayes classifier to discover regional attitudes from personal ads, see: https://blog.csdn.net/crozonkdd/article/details/80785718
