First, let's look at how to use naive Bayes to classify email.
Prepare the data: splitting text
For a text string, Python's split() is enough to break it into tokens.
>>> mySent = 'this book is the best book on python or M.L. I have even laid eyes upon.'
>>> mySent.split()
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M.L.', 'I', 'have', 'even', 'laid', 'eyes', 'upon.']
>>> import re
>>> re.split('\W*',mySent)
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> re.split('\\W*',mySent)
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> listOfTokens = re.split('\\W*',mySent)
>>> listOfTokens
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'even', 'laid', 'eyes', 'upon', '']
>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'even', 'laid', 'eyes', 'upon']
>>>
With this approach we can split a sentence into tokens, lowercase them, and also drop strings below a given length.
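As an aside, splitting on r'\W*' produces empty strings (as the transcript shows) and is rejected outright by newer Python versions; a sketch of an equivalent tokenizer that matches word characters directly (the function name `tokenize` is mine, not the book's):

```python
import re

def tokenize(text, min_len=1):
    # Grab runs of word characters instead of splitting on non-word runs;
    # this avoids the trailing empty string seen in the transcript above.
    return [tok.lower() for tok in re.findall(r'\w+', text) if len(tok) >= min_len]

tokenize('this book is the best book on python or M.L.')
```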
Test the algorithm: cross-validation with naive Bayes
def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    # create vocabulary
    trainingSet = range(50); testSet = []   # create test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:    # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:    # classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]
    print 'the error rate is: ', float(errorCount)/len(testSet)
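spamTest() relies on bagOfWords2VecMN() from earlier in the chapter; as a reminder, here is a minimal sketch of the vector it builds (the snake_case name is mine, not the book's):

```python
def bag_of_words_vec(vocab_list, input_words):
    # One slot per vocabulary word; each slot counts how often that word
    # occurs in the document (the multinomial bag-of-words model).
    vec = [0] * len(vocab_list)
    for word in input_words:
        if word in vocab_list:
            vec[vocab_list.index(word)] += 1
    return vec

bag_of_words_vec(['cat', 'dog', 'fish'], ['dog', 'cat', 'dog'])
```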
textParse(bigString) is straightforward: it takes one long string and splits it into words.
>>> open('email/spam/%d.txt' % i).read()
'--- Codeine 15mg -- 30 for $203.70 -- VISA Only!!! --\n\n-- Codeine (Methylmorphine) is a narcotic (opioid) pain reliever\n-- We have 15mg & 30mg pills -- 30/15mg for $203.70 - 60/15mg for $385.80 - 90/15mg for $562.50 -- VISA Only!!! ---'
>>>
With that, each line of the code is easier to follow. For example, if we want to see what fullText contains, besides changing the function to return that list, we can also write a standalone test snippet to inspect what each line does.
from numpy import *

docList = []; fullText = []

def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

for i in range(1, 26):
    wordList = textParse(open('email/spam/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
    wordList = textParse(open('email/ham/%d.txt' % i).read())
    docList.append(wordList)
    fullText.extend(wordList)
>>> fullText
['codeine', '15mg', 'for', '203', 'visa', 'only', 'codeine'...]
A quick look shows that fullText contains duplicate words, which makes it different from the vocabulary list: a vocabulary list never contains repeated words.
Finally, running the test gives the same result as before.
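The de-duplication that createVocabList() performs is essentially a set union over all documents; a minimal sketch (the book's actual implementation may differ in ordering):

```python
def create_vocab_list(doc_list):
    # Union of every document's word set -- duplicates disappear.
    vocab = set()
    for doc in doc_list:
        vocab |= set(doc)
    return sorted(vocab)

create_vocab_list([['codeine', '15mg', 'for'], ['codeine', 'visa', 'only']])
```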
>>> import bayes
>>> bayes.spamTest()
the error rate is: 0.0
>>>
Example: using the naive Bayes classifier to infer regional word usage from personal ads
Collect the data: importing an RSS feed
Now we need an RSS reader. Universal Feed Parser is the most commonly used RSS library in Python, so we need to install it. I found a copy online and shared it on Baidu Cloud; after downloading, run python setup.py install.
Then it can simply be imported.
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)    # NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)    # create vocabulary
    top30Words = calcMostFreq(vocabList, fullText)    # remove top 30 words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = range(2*minLen); testSet = []    # create test set
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:    # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:    # classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is: ', float(errorCount)/len(testSet)
    return vocabList, p0V, p1V, top30Words
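calcMostFreq() counts and sorts by hand; in modern Python the same ranking can be written with collections.Counter (a sketch, not the book's code):

```python
from collections import Counter

def top_n_words(full_text, n=30):
    # most_common(n) returns the n highest-count (word, count) pairs,
    # already sorted in descending order of frequency.
    return Counter(full_text).most_common(n)

top_n_words(['the', 'cat', 'the', 'dog', 'the', 'cat'], 2)
```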
The code is very similar to spamTest() above, again using random hold-out cross-validation; the main differences are that it reads two RSS feeds and removes the most frequent words before training.
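The random hold-out selection used in both spamTest() and localWords() can be sketched on its own; random.sample has the same effect as the uniform-draw-and-delete loop (the function name here is illustrative):

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    # Randomly reserve n_test document indices for the test set;
    # every other index stays in the training set.
    rng = random.Random(seed)
    test_set = rng.sample(range(n_docs), n_test)
    training_set = [i for i in range(n_docs) if i not in test_set]
    return training_set, test_set

train, test = holdout_split(50, 10)
```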
Analyze the data: displaying region-specific words
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V, top30Words = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]
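The -6.0 cutoff in getTopWords() is applied to log-probabilities (trainNB0() returns logs to avoid numerical underflow), so it keeps words whose conditional probability exceeds e**-6, roughly 0.25%:

```python
import math

# p0V / p1V hold log(P(word | class)), so a cutoff of -6.0 on the log
# scale corresponds to a probability threshold of e**-6 ~ 0.0025.
threshold = -6.0
prob_threshold = math.exp(threshold)
```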
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> ny['entries']
[{'dc_source': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'summary_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/html', 'value': u"Looking to give free non sexual massage to inshape bi/str8 guy. I'm not a pro but told I'm pretty good. Clean discreet 100% respectful for the same. Prefer inshape guys 40 and under. I can host. Pic for trade.", 'language': None}, 'published_parsed': time.struct_time(tm_year=2015, tm_mon=11, tm_mday=25, tm_hour=1, tm_min=38, tm_sec=21, tm_wday=2, tm_yday=329, tm_isdst=0), 'updated_parsed': time.struct_time(tm_year=2015, tm_mon=11, tm_mday=25, tm_hour=1, tm_min=38, tm_sec=21, tm_wday=2, tm_yday=329, tm_isdst=0), 'links': [{'href': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'type': 'text/html', 'rel': 'alternate'}], 'title': u'Looking to give free non sexual massage to inshape bi/str8 guy - m4m (Upper East Side)', 'rights': u'© 2015 <span class="desktop">craigslist</span><span class="mobile">CL</span>', 'updated': u'2015-11-24T20:38:21-05:00', 'summary': u"Looking to give free non sexual massage to inshape bi/str8 guy. I'm not a pro but told I'm pretty good. Clean discreet 100% respectful for the same. Prefer inshape guys 40 and under. I can host. 
Pic for trade.", 'language': u'en-us', 'title_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/plain', 'value': u'Looking to give free non sexual massage to inshape bi/str8 guy - m4m (Upper East Side)', 'language': None}, 'link': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'published': u'2015-11-24T20:38:21-05:00', 'rights_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/plain', 'value': u'© 2015 <span class="desktop">craigslist</span><span class="mobile">CL</span>', 'language': None}, 'id': u'http://newyork.craigslist.org/mnh/stp/5331168420.html', 'dc_type': u'text'}, {'dc_source': u'http://newyork.craigslist.org/que/stp/5331187431.html', 'summary_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/html', 'value': u'You are a women with submissive feeling. You would like to learn more and perhaps explore in a safe manner. \nI am an older White Dom...age 60 with years of experience. I am patient and discrete. \nI can listen to your thoughts and answer your question [...]', 'language': None}, 'published_parsed': time.struct_time(tm_year=2015, tm_mon=11, tm_mday=25, tm_hour=1, tm_min=34, tm_sec=58, tm_wday=2, tm_yday=329, tm_isdst=0), 'updated_parsed': time.struct_time(tm_year=2015, tm_mon=11, tm_mday=25, tm_hour=1, tm_min=34, tm_sec=58, tm_wday=2, tm_yday=329, tm_isdst=0), 'links': [{'href': u'http://newyork.craigslist.org/que/stp/5331187431.html', 'type': 'text/html', 'rel': 'alternate'}], 'title': u'Submissive feelings? Need to talk? - m4w (all boros and LI)', 'rights': u'© 2015 <span class="desktop">craigslist</span><span class="mobile">CL</span>', 'updated': u'2015-11-24T20:34:58-05:00', 'summary': u'You are a women with submissive feeling. You would like to learn more and perhaps explore in a safe manner. \nI am an older White Dom...age 60 with years of experience. I am patient and discrete. 
\nI can listen to your thoughts and answer your question [...]', 'language': u'en-us', 'title_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/plain', 'value': u'Submissive feelings? Need to talk? - m4w (all boros and LI)', 'language': None}, 'link': u'http://newyork.craigslist.org/que/stp/5331187431.html', 'published': u'2015-11-24T20:34:58-05:00', 'rights_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/plain', 'value': u'© 2015 <span class="desktop">craigslist</span><span class="mobile">CL</span>', 'language': None}, 'id': u'http://newyork.craigslist.org/que/stp/5331187431.html', 'dc_type': u'text'}, {'dc_source': u'http://newyork.craigslist.org/brk/stp/5302043769.html', 'summary_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/html', 'value': u"Masculine older guy 61 5ft7 175 \nLooking for a genuine nice guy for \nmutual massage....No agenda here \nTactile sensual guy who primarily is \ninterested in massage...Nothing wrong with \nan erotic component if that's what either of us \nwants....a Go wi [...]", 'language': None}, 'published_parsed': time.struct_time(tm_year=2015, tm_mon=11, tm_mday=25, tm_hour=1, tm_min=32, tm_sec=10, tm_wday=2, tm_yday=329, tm_isdst=0), 'updated_parsed': time.struct_time(tm_year=2015, tm_mon=11, tm_mday=25, tm_hour=1, tm_min=32, tm_sec=10, tm_wday=2, tm_yday=329, tm_isdst=0), 'links': [{'href': u'http://newyork.craigslist.org/brk/stp/5302043769.html', 'type': 'text/html', 'rel': 'alternate'}], 'title': u'Mutual Massage - m4m (Kensington)', 'rights': u'© 2015 <span class="desktop">craigslist</span><span class="mobile">CL</span>', 'updated': u'2015-11-24T20:32:10-05:00', 'summary': u"Masculine older guy 61 5ft7 175 \nLooking for a genuine nice guy for \nmutual massage....No agenda here \nTactile sensual guy who primarily is \ninterested in massage...Nothing wrong with \nan erotic component if that's what either of us 
\nwants....a Go wi [...]", 'language': u'en-us', 'title_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/plain', 'value': u'Mutual Massage - m4m (Kensington)', 'language': None}, 'link': u'http://newyork.craigslist.org/brk/stp/5302043769.html', 'published': u'2015-11-24T20:32:10-05:00', 'rights_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/plain', 'value': u'© 2015 <span class="desktop">craigslist</span><span class="mobile">CL</span>', 'language': None}, 'id': u'http://newyork.craigslist.org/brk/stp/5302043769.html', 'dc_type': u'text'}, {'dc_source': u'http://newyork.craigslist.org/mnh/stp/5313682903.html', 'summary_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss', 'type': 'text/html', 'value': u'I offer in-home yoga therapy and wellness coaching. I COME TO YOU for Maximum convenience. \nModified sequences according your experience/level and goals.
This output is messy; the only part we actually use is feed1['entries'][i]['summary'].
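The feedparser result behaves like a nested dictionary; a hand-built stand-in (illustrative data, not a real feed) shows the access pattern:

```python
# 'entries' is a list of posts; each post's 'summary' holds the text
# we pass to textParse(). This dict mimics feedparser's structure.
feed = {'entries': [
    {'summary': 'Looking to give free non sexual massage'},
    {'summary': 'Submissive feelings? Need to talk?'},
]}

summaries = [entry['summary'] for entry in feed['entries']]
```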
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList,pSF,pNY,top30Words = bayes.localWords(ny,sf)
the error rate is: 0.4
>>> pSF
array([-5.81413053, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
-5.12098335, -5.81413053, -5.81413053, -5.81413053, -5.12098335,
-5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.81413053,
-5.81413053, -5.12098335, -5.81413053, -5.12098335, -5.81413053,
-5.12098335, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
-5.12098335, -5.12098335, -5.81413053, -5.12098335, -5.81413053,
-5.81413053, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
-5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
-5.81413053, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
-5.12098335, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
-5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
-5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.81413053,
-5.81413053, -5.81413053, -5.12098335, -5.81413053, -5.12098335,
-5.81413053, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
-4.71551824, -5.81413053, -5.81413053, -5.81413053, -5.81413053,
-5.12098335, -5.81413053, -5.81413053, -5.12098335, -5.81413053,
-5.81413053, -5.12098335, -5.81413053, -5.81413053, -5.12098335,
-5.81413053, -5.12098335, -5.81413053, -5.12098335, -4.71551824,
>>> bayes.getTopWords(ny,sf)
the error rate is: 0.4
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
friends
here
they
long
love
learn
married
along
easy
home
but
massage
We have already removed the most frequent words, so they do not appear in the output above. There are still many stop words in the results; removing those as well would improve things further.
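Stop-word removal can be sketched with a small hand-picked set (these words are just examples; in practice you would load a fuller list, e.g. from NLTK):

```python
# Tiny illustrative stop-word set -- real lists have hundreds of entries.
STOP_WORDS = {'the', 'and', 'but', 'they', 'here'}

def remove_stop_words(tokens):
    # Drop high-frequency function words that carry no regional signal.
    return [tok for tok in tokens if tok not in STOP_WORDS]

remove_stop_words(['friends', 'here', 'they', 'long', 'love', 'but'])
```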
Comments and corrections are welcome.