Machine Learning in Action in Python: naive Bayes examples, spam email classification and revealing regional attitudes from personal ads

First, let's look at how to use naive Bayes to classify email.

Prepare the data: splitting text

For a text string, Python's split() method is enough to break it into tokens.

>>> mySent = 'this book is the best book on python or M.L. I have even laid eyes upon.'
>>> mySent.split()
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M.L.', 'I', 'have', 'even', 'laid', 'eyes', 'upon.']
>>> import re
>>> re.split('\W*',mySent)
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> re.split('\\W*',mySent)
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> listOfTokens = re.split('\\W*',mySent)
>>> listOfTokens
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'even', 'laid', 'eyes', 'upon']
>>> 

With this approach we can split a sentence into tokens and also drop tokens shorter than a certain length.

Test the algorithm: cross-validation with naive Bayes

import re
import random
from numpy import array

def textParse(bigString):
    # Input is one big string; output is a list of lowercase tokens
    # longer than two characters (drops 'a', 'to', URL fragments, etc.).
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    # Relies on createVocabList(), bagOfWords2VecMN(), trainNB0() and
    # classifyNB() defined earlier in bayes.py.
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        # Load the 25 spam and 25 ham example emails.
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)             # 1 = spam
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)             # 0 = ham
    vocabList = createVocabList(docList)            # build the vocabulary
    trainingSet = list(range(50)); testSet = []     # hold out 10 of the 50 docs
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:        # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:           # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]
    print 'the error rate is: ', float(errorCount)/len(testSet)


textParse(bigString) is straightforward: it takes one big string and splits it into a list of words.

>>> open('email/spam/%d.txt' % i).read()
'--- Codeine 15mg -- 30 for $203.70 -- VISA Only!!! --\n\n-- Codeine (Methylmorphine) is a narcotic (opioid) pain reliever\n-- We have 15mg & 30mg pills -- 30/15mg for $203.70 - 60/15mg for $385.80 - 90/15mg for $562.50 -- VISA Only!!! ---'
>>> 
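
Running the first line of that message through textParse (defined above) shows the effect of the length filter and the lowercasing; short tokens such as '30' and '70' are dropped:

>>> textParse('--- Codeine 15mg -- 30 for $203.70 -- VISA Only!!! --')
['codeine', '15mg', 'for', '203', 'visa', 'only']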

This makes it easier to see what each line of code does. For instance, to find out what fullText contains, instead of changing the function to return that list, we can write a small standalone script and inspect each step.

import re
docList = []; fullText = []

def textParse(bigString):
    # Same tokenizer as in bayes.py.
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)

>>> fullText
['codeine', '15mg', 'for', '203', 'visa', 'only', 'codeine'...]

A quick look shows that fullText contains duplicates, which makes it different from the vocabulary list: the vocabulary never contains the same word twice.
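
The deduplication happens inside createVocabList, which was written earlier in bayes.py; in case you don't have it handy, a minimal sketch of such a function (a set union over all documents) looks like this:

def createVocabList(dataSet):
    # Union of all the documents' word sets: each word appears exactly once.
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)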

Finally, running the test gives the same result as before.

>>> import bayes
>>> bayes.spamTest()
the error rate is:  0.0
>>>
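
Because the 10 test emails are chosen at random, the error rate fluctuates from run to run (0.0 here, but often higher). For a more stable estimate you can average over several runs; here is a minimal sketch, assuming spamTest() is modified to return float(errorCount)/len(testSet) instead of only printing it:

def averageErrorRate(numRuns=10):
    # Repeat the randomized hold-out test and average the error rates.
    total = 0.0
    for _ in range(numRuns):
        total += spamTest()   # assumes spamTest() returns its error rate
    return total / numRuns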


Example: using a naive Bayes classifier to reveal regional attitudes from personal ads

Collect the data: importing RSS feeds

We now need an RSS reader. Universal Feed Parser is the most commonly used RSS library for Python, so we first need to install it.

I found an installer online and shared it on Baidu Cloud; after downloading and unpacking it, run python setup.py install (installing with pip install feedparser also works).

Then you can simply import it.

import re
import random
import operator
import feedparser
from numpy import array

def calcMostFreq(vocabList, fullText):
    # Count how often each vocabulary word occurs in the full text and
    # return the 30 most frequent (word, count) pairs.
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def localWords(feed1, feed0):
    # Same structure as spamTest(), but the documents come from two RSS
    # feeds instead of files on disk.
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)             # NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)             # SF is class 0
    vocabList = createVocabList(docList)            # build the vocabulary
    top30Words = calcMostFreq(vocabList, fullText)  # remove the 30 most frequent words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = list(range(2*minLen)); testSet = []   # hold out 20 docs at random
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:        # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:           # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is: ', float(errorCount)/len(testSet)
    return vocabList, p0V, p1V, top30Words


The code is very similar to the previous example, and it uses the same randomized hold-out validation.

Analyze the data: displaying region-related words

def getTopWords(ny, sf):
    # Show the most characteristic words for each city.  p0V/p1V hold the
    # log-probabilities from trainNB0, so all values are negative; -6.0 is
    # a cutoff that keeps the words the classifier saw most often.
    vocabList, p0V, p1V, top30Words = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]


>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
[{'dc_source': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'title': u'Looking to give free non sexual massage to inshape bi/str8 guy - m4m (Upper East Side)', 'summary': u"Looking to give free non sexual massage to inshape bi/str8 guy. I'm not a pro but told I'm pretty good. Clean discreet 100% respectful for the same. Prefer inshape guys 40 and under. I can host. Pic for trade.", 'link': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'published': u'2015-11-24T20:38:21-05:00', ...}, {'title': u'Submissive feelings? Need to talk? - m4w (all boros and LI)', ...}, {'title': u'Mutual Massage - m4m (Kensington)', ...}, ...]

This output is messy; what we mainly use from each entry is feed1['entries'][i]['summary'].
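
For example, the first entry's post text, taken from the dump above:

>>> ny['entries'][0]['summary']
u"Looking to give free non sexual massage to inshape bi/str8 guy. I'm not a pro but told I'm pretty good. Clean discreet 100% respectful for the same. Prefer inshape guys 40 and under. I can host. Pic for trade."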

>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList,pSF,pNY,top30Words = bayes.localWords(ny,sf)
the error rate is:  0.4
>>> pSF
array([-5.81413053, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
       -5.12098335, -5.81413053, -5.81413053, -5.81413053, -5.12098335,
       -5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.12098335, -5.81413053,
       -5.12098335, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
       -5.12098335, -5.12098335, -5.81413053, -5.12098335, -5.81413053,
       -5.81413053, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
       -5.81413053, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
       -5.12098335, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
       -5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
       -5.81413053, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
       -4.71551824, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
       -5.12098335, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
       -5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.12098335,
       -5.81413053, -5.12098335, -5.81413053, -5.12098335, -4.71551824,
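
These values are log-probabilities (trainNB0 takes the log of each conditional word probability, which is why everything is negative); getTopWords keeps only the entries above the -6.0 cutoff.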

>>> bayes.getTopWords(ny,sf)
the error rate is:  0.4
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
friends
here
they
long
love
learn
married
along
easy
home
but
massage

Because we already removed the most frequent words, they no longer appear in this output. Of course there are still plenty of stop words in there; removing those as well should give even better results, as the sketch below shows.
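
Here is a minimal sketch of that stop-word idea, assuming a plain-text list saved as stopwords.txt with one word per line (the file name and the helper are my additions, not from the book):

def removeStopWords(vocabList, stopWordFile='stopwords.txt'):
    # Drop every vocabulary entry that appears in the stop-word file;
    # apply this to vocabList before building the training matrix.
    stopWords = set(open(stopWordFile).read().split())
    return [tok for tok in vocabList if tok not in stopWords]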

I hope readers will offer plenty of feedback and corrections.
