Machine Learning Algorithms 3: Probability-Based Classification with Naive Bayes (2) (Example: Filtering Spam with Naive Bayes)


Silvia+


Article link: https://blog.csdn.net/wsf09/article/details/87924889


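Example: Filtering Spam with Naive Bayes

First, the text is parsed into tokens. Then the parser is combined with the earlier classification code into one function that tests the classifier and reports its error rate.

I. Preparing the data: splitting text
This section shows how to build your own word list from a text document.
1. A text string can be split with Python's str.split() method: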

def test():
    mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
    print(mySent.split())

def main():
    test()

if __name__ == '__main__':
    main()

>>>['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']

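As the output shows, punctuation is kept as part of the words when splitting this way.

2. A regular expression can be used to split the sentence instead, with any run of characters other than letters, digits and underscores acting as the delimiter.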

import re

mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
regEx = re.compile(r'\W+') #matches runs of characters that are not letters, digits or underscores
listOfTokens = regEx.split(mySent)
print(listOfTokens)

Output:

>>>['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']

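We now have a list of words, but the empty string still has to be removed.

3. One way is to check the length of each string and keep only those longer than 0.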

print([tok for tok in listOfTokens if len(tok) > 0 ])
>>>['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']

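At this point we notice that the first word of the sentence is capitalized. If the goal were sentence-level search, this feature would be useful.
Here, however, the text is treated only as a bag of words, so we want every word in a uniform form whether it appears in the middle, at the end, or at the beginning of a sentence.

4. Python has built-in string methods that convert a string entirely to lowercase (.lower()) or uppercase (.upper()).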

print([tok.lower() for tok in listOfTokens if len(tok) > 0 ])
>>>['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
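
The four steps above can be collected into a single helper. The following is only a minimal sketch: the names `tokenize` and `_NON_WORD` are hypothetical and do not come from the book's code, and the function does nothing beyond chaining the split, the empty-string filter and the lowercase conversion already shown.

```python
import re

# Hypothetical helper name; it only chains the steps shown above.
_NON_WORD = re.compile(r'\W+')

def tokenize(text):
    """Split on runs of non-word characters, drop empty strings, lowercase."""
    return [tok.lower() for tok in _NON_WORD.split(text) if len(tok) > 0]

mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
print(tokenize(mySent))
```

Compiling the pattern once at module level means repeated calls reuse the same pattern object.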

5. Now look at how a complete email from the data set is actually processed.
The email folder contains two subfolders, spam and ham.

import re

emailText = open('email/ham/6.txt').read()
regEx = re.compile(r'\W+')
listOfTokens = regEx.split(emailText)
print(listOfTokens)
>>>['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one', 'Google', 'Groups', 'group', 'that', 'uses', 'the', 'customized', 'welcome', 'message', 'pages', 'or', 'files', 'we', 'are', 'writing', 'to', 'inform', 'you', 'that', 'we', 'will', 'no', 'longer', 'be', 'supporting', 'these', 'features', 'starting', 'February', '2011', 'We', 'made', 'this', 'decision', 'so', 'that', 'we', 'can', 'focus', 'on', 'improving', 'the', 'core', 'functionalities', 'of', 'Google', 'Groups', 'mailing', 'lists', 'and', 'forum', 'discussions', 'Instead', 'of', 'these', 'features', 'we', 'encourage', 'you', 'to', 'use', 'products', 'that', 'are', 'designed', 'specifically', 'for', 'file', 'storage', 'and', 'page', 'creation', 'such', 'as', 'Google', 'Docs', 'and', 'Google', 'Sites', 'For', 'example', 'you', 'can', 'easily', 'create', 'your', 'pages', 'on', 'Google', 'Sites', 'and', 'share', 'the', 'site', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '174623', 'with', 'the', 'members', 'of', 'your', 'group', 'You', 'can', 'also', 'store', 'your', 'files', 'on', 'the', 'site', 'by', 'attaching', 'files', 'to', 'pages', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '90563', 'on', 'the', 'site', 'If', 'you抮e', 'just', 'looking', 'for', 'a', 'place', 'to', 'upload', 'your', 'files', 'so', 'that', 'your', 'group', 'members', 'can', 'download', 'them', 'we', 'suggest', 'you', 'try', 'Google', 'Docs', 'You', 'can', 'upload', 'files', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '50092', 'and', 'share', 'access', 'with', 'either', 'a', 'group', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '66343', 'or', 'an', 'individual', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '86152', 'assigning', 'either', 'edit', 'or', 'download', 'only', 'access', 
'to', 'the', 'files', 'you', 'have', 'received', 'this', 'mandatory', 'email', 'service', 'announcement', 'to', 'update', 'you', 'about', 'important', 'changes', 'to', 'Google', 'Groups', '']

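Note: words such as en and py appear because they are part of the URL answer.py?hl=en&answer=174623. Splitting a URL produces many such short words, so the implementation filters out strings shorter than 3 characters.
This example uses one general-purpose text parsing rule to do that.
A real parsing program should use more advanced filters to handle objects such as HTML and URIs.
For now, a URI ends up parsed into words in the vocabulary (e.g. http://www.google.com is parsed into four words).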
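
As a rough illustration of what a "more advanced filter" might look like, the sketch below removes URLs before tokenizing. It is an assumption rather than the book's approach: `URL_PATTERN` is a deliberately simplistic pattern that only recognizes addresses starting with http:// or https://, and `parse_without_urls` is a hypothetical name.

```python
import re

# Assumption: a simplistic URL pattern, good enough for illustration only.
URL_PATTERN = re.compile(r'https?://\S+')
TOKEN_SPLIT = re.compile(r'\W+')

def parse_without_urls(text):
    """Replace URLs with a space, then split, lowercase and drop short tokens."""
    no_urls = URL_PATTERN.sub(' ', text)
    return [tok.lower() for tok in TOKEN_SPLIT.split(no_urls) if len(tok) > 2]

sample = ('Create pages on Google Sites '
          '(http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) today')
print(parse_without_urls(sample))
```

With the URL removed first, fragments such as http, www and answer no longer reach the vocabulary.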

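(1) re.compile()
Compiles a regular expression into a regular expression object.
If a pattern is going to be used many times, compile it once so that the pattern does not have to be written out again each time.
Common regular expression symbols and syntax:
'.' matches any single character except \n
'-' denotes a range inside a character set, e.g. [0-9]
'*' matches the preceding subexpression zero or more times. To match a literal * character, use \*. re.findall(r'ab*', 'cabc3abcbbac') returns ['ab', 'ab', 'a']
'+' matches the preceding subexpression one or more times. To match a literal + character, use \+
'^' matches the start of the string
'$' matches the end of the string
'\' escape character; changes the meaning of the character that follows. To match a character that has a special meaning, escape it with \ or put it in a character set []. re.findall(r'3\*', '3*ds') returns ['3*']
'?' matches the preceding character zero or one time. re.findall(r'ab?', 'abcabcabcadf') returns ['ab', 'ab', 'ab', 'a']
'{m}' matches the preceding character exactly m times. re.findall(r'cb{1}', 'bchbchcbfbcbb') returns ['cb', 'cb']
'{n,m}' matches the preceding character n to m times. re.findall(r'cb{2,3}', 'bchbchcbfbcbb') returns ['cbb']
'\d' matches a digit, equivalent to [0-9] for ASCII text. re.findall(r'\d', '电话:10086') returns ['1', '0', '0', '8', '6']
'\D' matches a non-digit, equivalent to [^0-9] for ASCII text. re.findall(r'\D', '电话:10086') returns ['电', '话', ':']
'\w' matches letters, digits and the underscore, equivalent to [A-Za-z0-9_] for ASCII text (with Python 3 str patterns it also matches other Unicode word characters). re.findall(r'\w', 'alex123,./;;;') returns ['a', 'l', 'e', 'x', '1', '2', '3']
'\W' matches anything that \w does not, equivalent to [^A-Za-z0-9_] for ASCII text. re.findall(r'\W', 'alex123,./;;;') returns [',', '.', '/', ';', ';', ';']
'\s' matches a whitespace character. re.findall(r'\s', '3*ds \t\n') returns [' ', '\t', '\n']
'\S' matches a non-whitespace character. re.findall(r'\S', '3*ds \t\n') returns ['3', '*', 'd', 's']
'\A' matches the start of the string
'\Z' matches the end of the string
'\b' matches at the start or end of a word; a word is defined as a sequence of alphanumeric characters, so the end of a word is marked by whitespace or a non-alphanumeric character
'\B' the opposite of \b; matches only where the current position is not a word boundary
For more, see:
https://www.cnblogs.com/wenwei-blog/p/7216102.html
https://blog.csdn.net/wl_ss/article/details/78241782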
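
A few of the entries above can be checked directly in an interpreter. This short sketch compiles two patterns once and reuses them; the variable names and the input string 'alex_123,./' are illustrative choices, picked to show that the underscore counts as a word character.

```python
import re

word_chars = re.compile(r'\w')       # compiled once, reused below
non_word_chars = re.compile(r'\W')

print(word_chars.findall('alex_123,./'))      # the underscore is matched by \w
print(non_word_chars.findall('alex_123,./'))  # only the punctuation is matched by \W
print(re.findall(r'3\*', '3*ds'))             # an escaped * matches a literal asterisk
print(re.findall(r'ab*', 'cabc3abcbbac'))     # a followed by zero or more b
```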

(2) Correcting the code in the book:

regEx = re.compile('\\W*')

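\w matches letters, digits and the underscore
\W matches anything that is not a letter, digit or underscore, i.e. the negation of \w
* means match zero, one or more times
+ requires the preceding character to appear one or more times in a row

Running it produces a warning: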

Warning (from warnings module):
  File "E:\Master\test\action\bayes\bayes.py", line 87
    listOfTokens = regEx.split(mySent)
FutureWarning: split() requires a non-empty pattern match.
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']

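Cause: \W* matches non-word characters any number of times, so zero occurrences also satisfy the pattern, and a match of zero characters is an empty match. (From Python 3.7 onward, re.split() no longer issues this warning and actually splits on such empty matches, so the text is cut between every pair of characters.)
Fix: change * to +.

More on this correction: http://www.cnblogs.com/zhhy236400/p/9860894.html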
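
The difference between the two quantifiers can be seen on a short string. This is a small sketch that assumes Python 3.7 or later, where splitting on a pattern that can match the empty string is supported; the sample string is made up for illustration.

```python
import re

mySent = 'M.L. I have'

# \W+ needs at least one non-word character, so words stay intact.
plus_tokens = [tok for tok in re.split(r'\W+', mySent) if tok]
print(plus_tokens)

# \W* can also match the empty string; on Python 3.7+ the text is
# therefore split between letters as well, leaving single characters.
star_tokens = [tok for tok in re.split(r'\W*', mySent) if tok]
print(star_tokens)
```

Because the parser later discards tokens shorter than three characters, single-character tokens produced by \W* would all be thrown away, which is another reason to prefer \W+.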

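Next we build an extremely simple function, which you can modify as needed.

II. Testing the algorithm: cross validation with naive Bayes
The text parser is now integrated into a complete classifier. The code is as follows: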

#File parsing and the complete spam test function
#createVocabList, setOfWords2Vec, trainNB0 and classifyNB are the functions from the earlier classification code
import re
import random
from numpy import array

def textParse(bigString):#Takes one big string and parses it into a list of strings
    listOfTokens = re.split(r'\W+',bigString)#\W+ rather than \W*, see the correction above
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]#drop strings shorter than three characters and convert the rest to lowercase

def spamTest():#Automates testing of the naive Bayes spam classifier
    docList=[]; classList = []; fullText = []
    for i in range(1,26): #load the text files under spam and ham and parse each into a word list
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)#append adds the word list as one element
        fullText.extend(wordList)#extend adds each word individually
        classList.append(1)#1 means spam

        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)#0 means normal mail (ham)

    vocabList = createVocabList(docList)#build the vocabulary

    errorCount = 0
    for j in range(10):#repeat the whole cross-validation process 10 times to get an average error rate
        trainingSet = list(range(50))#training set: list of integer indices 0-49
        testSet = []#test set
        for i in range(10):#randomly pick 10 documents as the test set
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex]) #remove the chosen index from the training set

        trainMat = []; trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))#build the word vectors that form the training matrix
            trainClasses.append(classList[docIndex]) #training labels
        p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))#compute the probabilities needed for classification

        for docIndex in testSet: #classify the test set
            wordVector = setOfWords2Vec(vocabList, docList[docIndex])
            if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                #print(wordVector)
    errorCount = float(errorCount)/10.0 #average number of errors per run
    print('the error rate is: ', float(errorCount)/len(testSet))

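Randomly selecting part of the data as the training set and using the remainder as the test set is called hold-out cross validation.

The code above repeats the cross-validation process 10 times to obtain an average error rate, which comes out at about 7%.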

>>>the error rate is:  0.06999999999999999.
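
The random split at the heart of hold-out cross validation can also be written with random.sample. The sketch below is an alternative formulation, not the code used above; `holdout_split` and its `seed` parameter are hypothetical names introduced only for this illustration.

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    """Return (train_idx, test_idx): n_test random indices held out of n_docs."""
    rng = random.Random(seed)
    test_idx = rng.sample(range(n_docs), n_test)   # sampled without replacement
    held_out = set(test_idx)
    train_idx = [i for i in range(n_docs) if i not in held_out]
    return train_idx, test_idx

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))
```

Passing a fixed seed makes a run repeatable, which can help when comparing error rates between changes to the parser.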

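Problems encountered along the way:
1. Running the code raised: UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence.
File 23.txt contains a character that shows up as a question mark and cannot be decoded; deleting it resolves the error.

2. Error: TypeError: 'range' object doesn't support item deletion.
Change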

trainingSet = range(50)

to

trainingSet = list(range(50))
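
For the UnicodeDecodeError in problem 1, an alternative to editing the data file is to tell open() how to handle undecodable bytes. The following is a sketch under stated assumptions: `read_email` is a hypothetical helper, the utf-8 default is a guess because the actual encoding of the email files is not known here, and errors='ignore' silently drops the offending bytes. The demonstration uses a temporary file instead of the real data set.

```python
import os
import tempfile

def read_email(path, encoding='utf-8'):
    """Read a text file, skipping any bytes that cannot be decoded."""
    with open(path, encoding=encoding, errors='ignore') as f:
        return f.read()

# Demonstration on a temporary file that contains one stray byte (0xae).
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(b'brand\xae name')
    tmp_path = tmp.name

text = read_email(tmp_path)
os.remove(tmp_path)
print(text)
```

Dropping bytes loses information, so this suits a bag-of-words experiment better than any task where the exact text matters.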

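The error that keeps occurring here is spam being misclassified as normal mail.
That is preferable to normal mail being misclassified as spam. There are several ways to modify the classifier to reduce such errors; they are discussed in later chapters.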
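
To see which of the two kinds of error a classifier makes, the mistakes can be counted separately. The sketch below is illustrative only: `count_errors` is a hypothetical helper and the label lists are made-up values, not results from the email data set.

```python
def count_errors(true_labels, predicted_labels):
    """Return (missed_spam, flagged_ham), with 1 = spam and 0 = normal mail."""
    pairs = list(zip(true_labels, predicted_labels))
    missed_spam = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam judged normal
    flagged_ham = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal judged spam
    return missed_spam, flagged_ham

# Made-up labels for illustration only.
true_labels = [1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 0, 0]
print(count_errors(true_labels, predicted_labels))
```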
