Machine Learning in Action, Chapter 4, Sections 4.6-4.7. Example 1: Spam Filtering. Example 2: Using a Naive Bayes Classifier to Find Regional Tendencies from Personal Ads

Article link: https://blog.csdn.net/csdn_lzw/article/details/58231757

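This series of posts on Machine Learning in Action is mainly about implementing and understanding the code in the book, so it amounts to reading notes. After all, you can't learn a hands-on subject just by reading; once you start coding you run into all sorts of odd problems. The posts are fairly rough and should be read alongside the book. I am looking things up and learning as I go and my knowledge is limited, so please point out any mistakes in the comments. The book's code and data are widely available online; please download them yourself.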

4.6 Spam Filtering

4.6.1 Preparing the Data: Splitting Text

A text string can be split on whitespace with the string method split():

>>> mySent = 'This book is the best book on python or M.L. I have ever laid eyes upon'
>>> mySent.split()
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>

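Punctuation is treated as part of the words (note 'M.L.' above). To avoid that, split with a regular expression whose delimiter is any run of characters other than letters and digits.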

>>> import re 
>>> regEX = re.compile(r'\W+')
>>> listOfTokens = regEX.split(mySent)
>>> listOfTokens 
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>

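Next, drop empty strings and convert everything to lowercase. (The split above produced no empty strings here because this sentence does not begin or end with a delimiter. Text that does, such as a sentence ending in a period, leaves an empty string at that end.)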

>>> [tok for tok in listOfTokens if len(tok)>0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok)>0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>
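
The three steps above (split, drop empty strings, lowercase) can be combined into one helper. The following is only a minimal sketch: the name `tokenize` is introduced here for illustration and does not come from the book. It also shows a case where the length filter matters, since a sentence ending in a period leaves an empty string at the end of the split result.

```python
import re

# Hypothetical helper (not from the book): split on runs of non-word
# characters, drop empty strings, and lowercase what remains.
_DELIMS = re.compile(r'\W+')

def tokenize(text):
    return [tok.lower() for tok in _DELIMS.split(text) if len(tok) > 0]

raw = _DELIMS.split('Laid eyes upon M.L.')
print(raw)                    # the trailing period leaves '' as the last item
tokens = tokenize('Laid eyes upon M.L.')
print(tokens)                 # empty string removed, everything lowercased
```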

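4.6.2 Testing the Algorithm: Cross-Validation with Naive Bayes

File parsing and the complete spam test function

The email folder contains 25 spam and 25 ham messages. Ten of the 50 are chosen at random as the test set and the remaining 40 form the training set. This approach is called hold-out cross-validation.
Because the selection is random, the output differs from run to run. You can repeat the experiment several times and average the error rates.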
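
The hold-out split described above can be sketched in isolation as follows. This is a minimal sketch rather than the book's code: the function name `holdout_split` and its parameters are assumptions made for illustration, and it uses `random.sample` instead of the index-deletion loop used in `spamTest`.

```python
import random

def holdout_split(n_docs, n_test, rng=random):
    """Return (train_indices, test_indices) for a random hold-out split."""
    test = sorted(rng.sample(range(n_docs), n_test))  # distinct random indices
    held_out = set(test)
    train = [i for i in range(n_docs) if i not in held_out]
    return train, test

# 50 emails (25 spam + 25 ham), 10 held out for testing
train_idx, test_idx = holdout_split(50, 10, random.Random(0))
print(len(train_idx), len(test_idx))   # 40 10
```

Passing a seeded `random.Random` instance makes a particular split reproducible, which helps when comparing runs.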

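def textParse(bigString):  # parse one big string into a list of tokens
    import re
    # split on runs of non-word characters; '\W*' can match the empty string,
    # which makes Python 3.7+ split between every character
    listOfTokens = re.split(r'\W+', bigString)
    # drop tokens of two characters or fewer and lowercase the rest
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]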

import random
from numpy import array

# createVocabList, bagOfWords2VecMN, trainNB0 and classifyNB are assumed to be
# defined earlier in bayes.py (they come from the preceding sections of the chapter).
def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)   # list of token lists: [[...], [...], ...]
        fullText.extend(wordList)  # one flat list of all tokens
        classList.append(1)        # class label: 1 = spam
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)        # class label: 0 = ham
    vocabList = createVocabList(docList)         # build the vocabulary
    trainingSet = list(range(50)); testSet = []  # indices of all 50 emails
    for i in range(10):            # move 10 random indices into the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))  # word vector
        trainClasses.append(classList[docIndex])                         # matching label
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))  # train: 3 probabilities
    errorCount = 0
    for docIndex in testSet:       # evaluate on the held-out test set
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])  # word vector
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1        # count a misclassification
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
    #return vocabList,fullText
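
`spamTest` relies on helpers from earlier in the chapter (`createVocabList`, `bagOfWords2VecMN`, `trainNB0`, `classifyNB`) that are not reproduced in this post. To make the "word vector" step concrete, here is a minimal sketch of a bag-of-words encoding. The names `build_vocab` and `bag_of_words` are hypothetical stand-ins written for this note, not the book's implementations.

```python
def build_vocab(docs):
    """Collect every distinct token across all documents, in sorted order."""
    vocab = set()
    for doc in docs:
        vocab.update(doc)
    return sorted(vocab)

def bag_of_words(vocab, doc):
    """Count how often each vocabulary word occurs in one document."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0] * len(vocab)
    for tok in doc:
        if tok in index:          # words outside the vocabulary are ignored
            vec[index[tok]] += 1
    return vec

docs = [['buy', 'today', 'buy'], ['plane', 'tickets', 'today']]
vocab = build_vocab(docs)
print(vocab)                          # ['buy', 'plane', 'tickets', 'today']
print(bag_of_words(vocab, docs[0]))   # [2, 0, 0, 1]
```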

>>> bayes.spamTest()
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is:  0.1
>>> bayes.spamTest()
the error rate is:  0.0
>>> bayes.spamTest()
classification error ['experience', 'with', 'biggerpenis', 'today', 'grow', 'inches', 'more', 'the', 'safest', 'most', 'effective', 'methods', 'of_penisen1argement', 'save', 'your', 'time', 'and', 'money', 'bettererections', 'with', 'effective', 'ma1eenhancement', 'products', 'ma1eenhancement', 'supplement', 'trusted', 'millions', 'buy', 'today']
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is:  0.2
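
Because each run draws a different test set, a single error rate is noisy. Averaging the three runs shown above gives (0.1 + 0.0 + 0.2) / 3 = 0.1. The sketch below automates that averaging. It assumes a variant of `spamTest` that returns the error rate instead of only printing it (the version in this post only prints), and it uses a hypothetical stand-in, `fake_spam_test`, so that the example runs on its own.

```python
def average_error(run_once, n_runs):
    """Call run_once() n_runs times and return the mean of the error rates."""
    rates = [run_once() for _ in range(n_runs)]
    return sum(rates) / len(rates)

# Hypothetical stand-in for a spamTest variant that returns its error rate.
# It replays the three rates printed above so the example is self-contained.
observed = iter([0.1, 0.0, 0.2])

def fake_spam_test():
    return next(observed)

mean_rate = average_error(fake_spam_test, 3)
print(round(mean_rate, 3))   # 0.1
```

With the real data, more repetitions give a steadier estimate, since each run tests on only 10 emails.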