Machine Learning in Action in Python: naive Bayes examples, spam email classification and revealing regional attitudes from personal ads

First, let's look at how to use naive Bayes to classify email.

Prepare the data: splitting text

For a text string, Python's split() method is enough to break it into tokens.

>>> mySent = 'this book is the best book on python or M.L. I have even laid eyes upon.'
>>> mySent.split()
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M.L.', 'I', 'have', 'even', 'laid', 'eyes', 'upon.']
>>> import re
>>> re.split('\W*',mySent)
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> re.split('\\W*',mySent)
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> listOfTokens = re.split('\\W*',mySent)
>>> listOfTokens
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'even', 'laid', 'eyes', 'upon']
>>> 

With this approach we can split a sentence into tokens and also drop tokens shorter than a certain length.

Test the algorithm: cross-validation with naive Bayes

import re
import random
from numpy import array

def textParse(bigString):
    # Input is one big string; output is a list of lowercase tokens
    # longer than two characters (drops 'a', 'to', URL fragments, etc.).
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    # Relies on createVocabList(), bagOfWords2VecMN(), trainNB0() and
    # classifyNB() defined earlier in bayes.py.
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        # Load the 25 spam and 25 ham example emails.
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)             # 1 = spam
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)             # 0 = ham
    vocabList = createVocabList(docList)            # build the vocabulary
    trainingSet = list(range(50)); testSet = []     # hold out 10 of the 50 docs
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:        # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:           # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]
    print 'the error rate is: ', float(errorCount)/len(testSet)


textParse(bigString) is straightforward: it takes one big string and splits it into a list of words.

>>> open('email/spam/%d.txt' % i).read()
'--- Codeine 15mg -- 30 for $203.70 -- VISA Only!!! --\n\n-- Codeine (Methylmorphine) is a narcotic (opioid) pain reliever\n-- We have 15mg & 30mg pills -- 30/15mg for $203.70 - 60/15mg for $385.80 - 90/15mg for $562.50 -- VISA Only!!! ---'
>>> 
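
Running the first line of that message through textParse (defined above) shows the effect of the length filter and the lowercasing; short tokens such as '30' and '70' are dropped:

>>> textParse('--- Codeine 15mg -- 30 for $203.70 -- VISA Only!!! --')
['codeine', '15mg', 'for', '203', 'visa', 'only']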

This makes it easier to see what each line of code does. For instance, to find out what fullText contains, instead of changing the function to return that list, we can write a small standalone script and inspect each step.

import re
docList = []; fullText = []

def textParse(bigString):
    # Same tokenizer as in bayes.py.
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)

>>> fullText
['codeine', '15mg', 'for', '203', 'visa', 'only', 'codeine'...]

A quick look shows that fullText contains duplicates, which makes it different from the vocabulary list: the vocabulary never contains the same word twice.
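
The deduplication happens inside createVocabList, which was written earlier in bayes.py; in case you don't have it handy, a minimal sketch of such a function (a set union over all documents) looks like this:

def createVocabList(dataSet):
    # Union of all the documents' word sets: each word appears exactly once.
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)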

Finally, running the test gives the same result as before.

>>> import bayes
>>> bayes.spamTest()
the error rate is:  0.0
>>>
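
Because the 10 test emails are chosen at random, the error rate fluctuates from run to run (0.0 here, but often higher). For a more stable estimate you can average over several runs; here is a minimal sketch, assuming spamTest() is modified to return float(errorCount)/len(testSet) instead of only printing it:

def averageErrorRate(numRuns=10):
    # Repeat the randomized hold-out test and average the error rates.
    total = 0.0
    for _ in range(numRuns):
        total += spamTest()   # assumes spamTest() returns its error rate
    return total / numRuns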


Example: using a naive Bayes classifier to reveal regional attitudes from personal ads

Collect the data: importing RSS feeds

We now need an RSS reader. Universal Feed Parser is the most commonly used RSS library for Python, so we first need to install it.

I found an installer online and shared it on Baidu Cloud; after downloading and unpacking it, run python setup.py install (installing with pip install feedparser also works).

Then you can simply import it.

import re
import random
import operator
import feedparser
from numpy import array

def calcMostFreq(vocabList, fullText):
    # Count how often each vocabulary word occurs in the full text and
    # return the 30 most frequent (word, count) pairs.
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def localWords(feed1, feed0):
    # Same structure as spamTest(), but the documents come from two RSS
    # feeds instead of files on disk.
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)             # NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)             # SF is class 0
    vocabList = createVocabList(docList)            # build the vocabulary
    top30Words = calcMostFreq(vocabList, fullText)  # remove the 30 most frequent words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = list(range(2*minLen)); testSet = []   # hold out 20 docs at random
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:        # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:           # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is: ', float(errorCount)/len(testSet)
    return vocabList, p0V, p1V, top30Words


The code is very similar to the previous example, and it uses the same randomized hold-out validation.

Analyze the data: displaying region-related words

def getTopWords(ny, sf):
    # Show the most characteristic words for each city.  p0V/p1V hold the
    # log-probabilities from trainNB0, so all values are negative; -6.0 is
    # a cutoff that keeps the words the classifier saw most often.
    vocabList, p0V, p1V, top30Words = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]


>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
[{'dc_source': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'title': u'Looking to give free non sexual massage to inshape bi/str8 guy - m4m (Upper East Side)', 'summary': u"Looking to give free non sexual massage to inshape bi/str8 guy. I'm not a pro but told I'm pretty good. Clean discreet 100% respectful for the same. Prefer inshape guys 40 and under. I can host. Pic for trade.", 'link': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'published': u'2015-11-24T20:38:21-05:00', ...}, {'title': u'Submissive feelings? Need to talk? - m4w (all boros and LI)', ...}, {'title': u'Mutual Massage - m4m (Kensington)', ...}, ...]

This output is messy; what we mainly use from each entry is feed1['entries'][i]['summary'].
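
For example, the first entry's post text, taken from the dump above:

>>> ny['entries'][0]['summary']
u"Looking to give free non sexual massage to inshape bi/str8 guy. I'm not a pro but told I'm pretty good. Clean discreet 100% respectful for the same. Prefer inshape guys 40 and under. I can host. Pic for trade."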

>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList,pSF,pNY,top30Words = bayes.localWords(ny,sf)
the error rate is:  0.4
>>> pSF
array([-5.81413053, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
       -5.12098335, -5.81413053, -5.81413053, -5.81413053, -5.12098335,
       -5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.12098335, -5.81413053,
       -5.12098335, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
       -5.12098335, -5.12098335, -5.81413053, -5.12098335, -5.81413053,
       -5.81413053, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
       -5.81413053, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
       -5.12098335, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
       -5.81413053, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
       -4.71551824, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
       -5.12098335, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.12098335,
       -5.81413053, -5.12098335, -5.81413053, -5.12098335, -4.71551824,
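
These values are log-probabilities (trainNB0 takes the log of each conditional word probability, which is why everything is negative); getTopWords keeps only the entries above the -6.0 cutoff.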

>>> bayes.getTopWords(ny,sf)
the error rate is:  0.4
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
friends
here
they
long
love
learn
married
along
easy
home
but
massage

Because we already removed the most frequent words, they no longer appear in this output. Of course there are still plenty of stop words in there; removing those as well should give even better results, as the sketch below shows.
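
Here is a minimal sketch of that stop-word idea, assuming a plain-text list saved as stopwords.txt with one word per line (the file name and the helper are my additions, not from the book):

def removeStopWords(vocabList, stopWordFile='stopwords.txt'):
    # Drop every vocabulary entry that appears in the stop-word file;
    # apply this to vocabList before building the training matrix.
    stopWords = set(open(stopWordFile).read().split())
    return [tok for tok in vocabList if tok not in stopWords]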

I hope readers will offer plenty of feedback and corrections.
