通俗易懂的机器学习——筛选垃圾邮件（贝叶斯分类）

本文链接：https://blog.csdn.net/DuLNode/article/details/120177882

筛选垃圾邮件（贝叶斯分类）

背景及应用

利用贝叶斯公式，我们可以在计算机上验证和测试垃圾邮件的字词统计，计算词语出现的概率。并根据的得出来的概率来判断一封新的邮件是垃圾邮件还是正常邮件。

贝叶斯公式

公式：
$\frac {P (B|A) \cdot P (A)}{P (B)}$
解释：在B情况出现的条件下A情况出现的概率等于在A情况出现的情况下B情况出现的概率乘以A情况出现的概率，除以B情况出现的概率。
应用：看起来公式很麻烦，但在本问题的解决中并不会直接试用，后续代码中的使用是P(A|B)=(A和B同时满足的个数)/B的个数。这样使用是同等效力的，但是会更加简单。

数据集

首先我们需要使用60000条Email的数据集，数据集连接如下：
60000条Email数据集
数据集的结构如下：
数据集结构
full文件夹下的index:包含每一条Email的标签和对应的文件位置。
data文件夹下的每个文件夹都有若干条Email数据。

idex中数据组织形式：
index数据组织形式
Email:
Email

引入依赖包

import re
import jieba
import numpy as np

re：正则表达式用于过滤非中文字符
jieba:用于做中文分词
numpy:用于划分训练集和测试集

数据预处理

在进行统计之前我们要先进行数据预处理以保证统计能够顺利进行。

全局变量

EmailsIndex = "./full/index"

过滤所有非中文词语

def filterEmail(email):
    email = re.sub("[a-zA-Z0-9_]+|\W","",email)
    return email

使用正则表达式过滤掉非中文词
原理：按照字母，数字和下划线划分email中的文本

读取邮件

def readEmail(filename):
    with  open(filename, "r",encoding='GB2312', errors='ignore') as fp:
        content=fp.read()
        content = filterEmail(content) #把一封邮件提取所有中文内容
        words = list(jieba.cut(content))
    return words

读取一封Email：首先按照传进来的filename(文件路径)打开txt文件，然后调用filterEmail函数过滤掉非中文字符，过滤后content中不含非中文字符。最后使用jieba.cut对中文进行分词，分词后words中为中文词组。
注：数据集的编码方式要按照GB2312打开

加载邮件

def loadAllEmails(IndexFile):
    Emails=[]
    with  open(IndexFile, "r") as fp:
        lines=fp.readlines()
        for line in lines:
            spam,filename = line.split()
            Emails.append((spam,readEmail(filename)))
    return Emails

传进来的IndexFile参数是index的路径
打开index文件后读取每一行数据，每一行数据包含邮件类别和邮件路径，使用split将他们分离开，调用readEmail获得邮件路径对应的邮件的中文分词。

获取概率表

划分训练集和测试集

def train_test_split(Emails,testSize):
    arrayEmails = np.array(Emails)
    test_size = int(len(Emails)*testSize)
    shuffle_indexes = np.random.permutation(len(arrayEmails))
    test_indexs,train_indexs= np.split(shuffle_indexes,[test_size])
    return arrayEmails[train_indexs],arrayEmails[test_indexs]

计算分词的概率表

def calWordsFreqTable(Emails):
    wordsSet = []
    spamWords = dict()
    hamWords = dict()
    spamlength = 0
    hamlength = 0
    for elem in Emails:
        if elem[0] == "spam":
            spamlength += 1
            tempset = set()
            for word in elem[1]:
                if word not in tempset:
                    tempset.add(word)
                    if word in wordsSet:
                        spamWords[word] += 1
                    else:
                        wordsSet.append(word)
                        spamWords[word] = 1
                        hamWords[word] = 0
        elif elem[0] == "ham":
            hamlength += 1
            tempset = set()
            for word in elem[1]:
                if word not in tempset:
                    tempset.add(word)
                    if word in wordsSet:
                        hamWords[word] += 1
                    else:
                        wordsSet.append(word)
                        hamWords[word] = 1
                        spamWords[word] = 0

    wordsProbs = []  # 格式（word, spamprob, hamprob）

    for word in wordsSet:
        wordsProbs.append((word, spamWords[word] * 1.0 / (spamWords[word] + hamWords[word]), hamWords[word] * 1.0 / (spamWords[word] + hamWords[word])))

    return wordsProbs

变量含义：
wordsSet：词典，在遍历Email时逐渐将词典中没有的词添加到词典
spamWords：分词与其在垃圾邮件中出现的次数组成的字典
hamWords：分词与其在正常邮件中出现的次数组成的字典
spamlength：总垃圾邮件数
hamlength：总正常邮件数
tempset：用来判断在一封Email内某词是否已经出现过，如果已经出现，则不予统计。

保存概率表

def saveWordsTable(wordsProbs, filepath):
    import os
    with open(os.path.join(filepath, "probs.txt"), "w", encoding="utf-8") as f:
        for prob in wordsProbs:
            f.write(prob[0])
            f.write("\t")
            f.write(str(prob[1]))
            f.write("\t")
            f.write(str(prob[2]))
            f.write('\n')

检查邮件是否为垃圾文件

获取分词对应的概率字典和字典

def getProbDict_WordsSet(filepath):
    ProbsDict = dict()
    wordsSet = list()
    import os
    with open(os.path.join(filepath, "probs.txt"), "r", encoding="utf-8") as f:
        lines = f.readlines()
        for line in lines:
            word, spamprob, hamprob = line.split()
            wordsSet.append(word)
            ProbsDict[word] = (float(spamprob), float(hamprob))

    return ProbsDict, wordsSet

初步检查一封邮件

def checkOneEmail(ProbsDict, wordsSet, email):
    spamprob = 1
    hamprob = 1
    for word in email:
        if word in wordsSet:
            hamprob *= max((1 - ProbsDict[word][0]), 0.01)
            spamprob *= max((1 - ProbsDict[word][1]), 0.01)
        if hamprob == 0 or spamprob == 0:
            break
    if spamprob > hamprob:
        return spamprob, 1
    elif spamprob < hamprob:
        return hamprob, -1
    else:
        return 0, 0

hamprob *= max((1 - ProbsDict[word][0]), 0.01)中1 - ProbsDict[word][0]的含义是该词不是来着垃圾邮件的概率，使用max是为了防止获得的概率过低，同时用1来表示垃圾邮件，-1表示正常邮件。当垃圾邮件和正常邮件的概率相等时用0表示未能判断出是否为垃圾邮件。

检查多封邮件

def checkEmails(ProbsDict, wordsSet, emails):
    rates = []
    predict_flags = []
    for email in emails:
        rate, ans = checkOneEmail(ProbsDict, wordsSet, email)
        rates.append(rate)
        predict_flags.append(ans)
    return rates, predict_flags  # 返回准确率和每一封邮件的预测值

主程序

if __name__ == "__main__":
    opt = int(input("1.通过训练集获取概率并在测试集测试 2.单个邮件预测"))
    if opt == 1:
        Emails = loadAllEmails(EmailsIndex)
        trainEmails, testEmails = train_test_split(Emails, 0.2)
        wordsProbs = calWordsFreqTable(trainEmails)
        saveWordsTable(wordsProbs, filepath='')
        ProbsDict, wordsSet = getProbDict_WordsSet(filepath='')
        rate, ans = checkEmails(ProbsDict, wordsSet, testEmails[:,1])
        t = 0
        for i in range(len(ans)):
            if ans[i] == 1 and testEmails[i][0] == 'spam':
                t += 1
            elif ans[i] == -1 and testEmails[i][0] == 'ham':
                t += 1
        print("精度:", t / len(ans))
    else:
        opt = int(input("1.手动输入 2.打开"))
        if opt == 1:
            text = input("输入邮件:")
            Email = filterEmail(text)
            ProbsDict, wordsSet = getProbDict_WordsSet(filepath='')
            rate, ans = checkOneEmail(ProbsDict, wordsSet, Email)
            if ans == 1:
                print("垃圾邮件")
            elif ans == -1:
                print("有效邮件")
            else:
                print("未知邮件")
        else:
            path = input("请输入邮件的路径:")  # 路径用”/“而不是”\’“
            text = ""
            with open(path, encoding="utf-8") as f:
                for line in f:
                    text += line
            Email = filterEmail(text)
            ProbsDict, wordsSet = getProbDict_WordsSet(filepath='')
            rate, ans = checkOneEmail(ProbsDict, wordsSet, Email)
            if ans == 1:
                print("垃圾邮件")
            elif ans == -1:
                print("有效邮件")
            else:
                print("未知邮件")