Naive Bayes in Practice

Defining the Text Classification Problem

Suppose we already have a labeled text classification dataset, where each text is Ti and its class is ci. Given a new text NT, the question is which class NT most likely belongs to.

1. First we need an abstract representation of each text. Comparing texts requires a common standard, and how do we choose it? We can collect every word that appears and build a vocabulary.

2. Assign each word in the vocabulary an index; the value stored at that index indicates whether the word appears in a text, or how many times it appears.

3. For each text, initialize a word vector and fill in its values according to the text's content; this vector then represents the text.

4. At this point we can introduce Naive Bayes.

4.1 Given the words of a text, find the class the text most likely belongs to.

4.2 This reduces to finding the largest P(ci|Ti), i.e. the class with the highest posterior probability.

4.3 P(ci|Ti) = P(Ti, ci) / P(Ti) = P(ci)P(Ti|ci) / P(Ti)

4.4 Denote the word vector of text Ti as W, and the j-th element of W as wj.

4.5 If we treat each word's occurrence as independent of the others, the formula expands to:
P(ci|Ti) = P(ci)P(Ti|ci) / P(Ti)
         = P(ci)(P(w0|ci)P(w1|ci)P(w2|ci)…) / (P(w0)P(w1)P(w2)…)

4.6 Since we do not need the exact probability, only the class with the largest one, the denominator (identical for every class) can be dropped:
P(ci|Ti) ∝ P(ci)P(Ti|ci) = P(ci)(P(w0|ci)P(w1|ci)P(w2|ci)…)

4.7 If some word never occurs in a class, its factor is 0 and the whole product collapses to 0, so computing P(wj|ci) needs some care. With P(wj|ci) = P(wj, ci) / P(ci), the term that can become 0 is P(wj, ci) = n(wj, ci) / n(w, ci); we therefore initialize the count n(wj, ci) to 1 and the denominator n(w, ci) to 2 (Laplace smoothing), as sketched below.
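To make the scoring rule concrete, here is a minimal sketch of computing P(ci)·∏P(wj|ci) with the smoothed counts, working in log space to avoid underflow. This is not the implementation used below; the two classes, the three-word vocabulary and all the counts are made up purely for illustration.
from math import log

# Hypothetical toy example: 2 classes, 3-word vocabulary
vocab = ['cheap', 'meeting', 'pills']
# per-class word counts, numerators initialized to 1 and denominators to 2 (Laplace smoothing)
counts = {0: [1 + 2, 1 + 5, 1 + 0], 1: [1 + 6, 1 + 1, 1 + 4]}
totals = {0: 2 + 7, 1: 2 + 11}
priors = {0: 0.5, 1: 0.5}

def score(docVector, c):
    # log P(ci) + sum_j docVector[j] * log P(wj|ci)
    s = log(priors[c])
    for j, n in enumerate(docVector):
        s += n * log(counts[c][j] / totals[c])
    return s

doc = [1, 0, 2]                                    # the new text as a word-count vector
print(max((0, 1), key=lambda c: score(doc, c)))    # prints the class with the highest score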

Let's walk through the code for three examples:

1. Suppose we already have a set of posts labeled as abusive or not. We now want to design a program that decides whether a newly given sentence is abusive.
"""
自定义数据集
"""
"""
自定义数据集
"""
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec
listOPosts, listClasses = loadDataSet()
listOPosts, listClasses
Output:
([['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
  ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
  ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
  ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
  ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
  ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']],
 [0, 1, 0, 1, 0, 1])
# Build the vocabulary: the set of all words that appear in the documents
def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)
myVocabList = createVocabList(listOPosts)
print(myVocabList)
Output:
['park', 'licks', 'maybe', 'is', 'ate', 'has', 'to', 'help', 'food', 'my', 'mr', 'posting', 'stop', 'stupid', 'so', 'cute', 'flea', 'steak', 'problems', 'him', 'worthless', 'please', 'quit', 'dalmation', 'buying', 'not', 'love', 'I', 'take', 'how', 'dog', 'garbage']
# Set-of-words model: convert a document into a 0/1 vector over the vocabulary
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print ("the word: %s is not in my Vocabulary!" % word)
    return returnVec
print ( setOfWords2Vec(myVocabList, listOPosts[0]) )
Output:
[0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print ( setOfWords2Vec(myVocabList, listOPosts[1]) )
[1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]
from numpy import *

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    """
    Initialize the word counts to 1 and the denominators to 2.0, so that a word that
    never occurs in a class does not force the whole probability product to 0.
    """
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs so that multiplying many small probabilities does not underflow
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
print(trainMat)
[[0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0]]
p0V, p1V, pAb = trainNB0(trainMat, listClasses)
p0V, p1V, pAb
Output:
(array([-3.25809654, -2.56494936, -3.25809654, -2.56494936, -2.56494936,
        -2.56494936, -2.56494936, -2.56494936, -3.25809654, -1.87180218,
        -2.56494936, -3.25809654, -2.56494936, -3.25809654, -2.56494936,
        -2.56494936, -2.56494936, -2.56494936, -2.56494936, -2.15948425,
        -3.25809654, -2.56494936, -3.25809654, -2.56494936, -3.25809654,
        -3.25809654, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
        -2.56494936, -3.25809654]),
 array([-2.35137526, -3.04452244, -2.35137526, -3.04452244, -3.04452244,
        -3.04452244, -2.35137526, -3.04452244, -2.35137526, -3.04452244,
        -3.04452244, -2.35137526, -2.35137526, -1.65822808, -3.04452244,
        -3.04452244, -3.04452244, -3.04452244, -3.04452244, -2.35137526,
        -1.94591015, -3.04452244, -2.35137526, -3.04452244, -2.35137526,
        -2.35137526, -3.04452244, -3.04452244, -2.35137526, -3.04452244,
        -1.94591015, -2.35137526]),
 0.5)
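As a quick sanity check (the following line is an illustration added here, not part of the book's code): the largest entry of p1V should belong to 'stupid', which occurs in all three abusive posts. With smoothing its estimate is log((1 + 3) / (2 + 19)) = log(4/21) ≈ -1.658, which matches the output above.
print(myVocabList[p1V.argmax()])    # expected output: stupid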
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # adding log-probabilities is equivalent to multiplying the probabilities
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0
def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print (testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print (testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
testingNB()
Output:
['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
2. Filtering spam e-mail with Naive Bayes
# Bag-of-words model: count how many times each word occurs, not just whether it occurs
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
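A small illustration of the difference from setOfWords2Vec (the two-word vocabulary and document here are made up for the example):
print(bagOfWords2VecMN(['dog', 'cat'], ['dog', 'dog', 'cat']))   # [2, 1]
print(setOfWords2Vec(['dog', 'cat'], ['dog', 'dog', 'cat']))     # [1, 1]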
def textParse(bigString):    #input is big string, #output is word list
    import re
    """
    Split on any run of characters that are not letters, digits or underscores
    """
    listOfTokens = re.split(r'\W+', bigString)
    # keep only tokens longer than 2 characters, lower-cased
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 
emailText = open('email/ham/6.txt').read()
listOfTokens = textParse(emailText)
print(listOfTokens)
['hello', 'since', 'you', 'are', 'owner', 'least', 'one', 'google', 'groups', 'group', 'that', 'uses', 'the', 'customized', 'welcome', 'message', 'pages', 'files', 'are', 'writing', 'inform', 'you', 'that', 'will', 'longer', 'supporting', 'these', 'features', 'starting', 'february', '2011', 'made', 'this', 'decision', 'that', 'can', 'focus', 'improving', 'the', 'core', 'functionalities', 'google', 'groups', 'mailing', 'lists', 'and', 'forum', 'discussions', 'instead', 'these', 'features', 'encourage', 'you', 'use', 'products', 'that', 'are', 'designed', 'specifically', 'for', 'file', 'storage', 'and', 'page', 'creation', 'such', 'google', 'docs', 'and', 'google', 'sites', 'for', 'example', 'you', 'can', 'easily', 'create', 'your', 'pages', 'google', 'sites', 'and', 'share', 'the', 'site', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'answer', '174623', 'with', 'the', 'members', 'your', 'group', 'you', 'can', 'also', 'store', 'your', 'files', 'the', 'site', 'attaching', 'files', 'pages', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'answer', '90563', 'the', 'site', 'you抮e', 'just', 'looking', 'for', 'place', 'upload', 'your', 'files', 'that', 'your', 'group', 'members', 'can', 'download', 'them', 'suggest', 'you', 'try', 'google', 'docs', 'you', 'can', 'upload', 'files', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'answer', '50092', 'and', 'share', 'access', 'with', 'either', 'group', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'answer', '66343', 'individual', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'answer', '86152', 'assigning', 'either', 'edit', 'download', 'only', 'access', 'the', 'files', 'you', 'have', 'received', 'this', 'mandatory', 'email', 'service', 'announcement', 'update', 'you', 'about', 'important', 'changes', 'google', 'groups']
def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print ("classification error",docList[docIndex])
    print ('the error rate is: ',float(errorCount)/len(testSet))
    #return vocabList,fullText
"""
这里验证方法有两种
一种是取一部分作为训练数据,另一部分作为测试数据
一种是进行交叉验证,随机选择一部分数据作为测试集合,并将这部分数据从训练集中提出。这种方法叫做交叉验证。

这里使用朴素贝叶斯进行交叉验证
"""
spamTest()
Output:
classification error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don抰', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
the error rate is:  0.1
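Because the ten test documents are drawn at random, the error rate changes from run to run. Below is a minimal sketch of averaging the error over several random hold-outs; it assumes spamTest is modified to return its error rate instead of only printing it, and the helper name spamTestAvg is made up for this sketch.
def spamTestAvg(numRuns=10):
    # assumes spamTest() has been changed to end with: return float(errorCount)/len(testSet)
    total = 0.0
    for _ in range(numRuns):
        total += spamTest()
    print('average error rate over %d runs: %f' % (numRuns, total/numRuns))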
wordList = textParse(open('email/ham/24.txt').read())
wordList
Output:
['will', 'there', 'the', 'latest']
3. Example: revealing regional preferences from personal ads with a Naive Bayes classifier
"""
示例: 使用朴素贝叶斯分类器从个人广告中获取区域倾向
将分别从美国的两个城市中选取一些人,通过分析这些人发布的征婚广告信息,来比较这两个城市的人们在广告用词上是否不同。
如果结论确实是不同,那么他们各自常用的词是哪些?

需要安装一个RSS阅读器,Universal Feed Parser是python中最常用的RSS程序库
可以在http://www.lfd.uci.edu/~gohlke/pythonlibs/#pygtk下载
"""
import feedparser
ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
ny['entries']
Output:
[{'dc_source': 'http://newyork.craigslist.org/brk/stp/5500876512.html',
  'dc_type': 'text',
  'enc_enclosure': {'resource': 'http://images.craigslist.org/00G0G_cbYj8nwnkLR_300x300.jpg',
   'type': 'image/jpeg'},
  'id': 'http://newyork.craigslist.org/brk/stp/5500876512.html',
  'language': 'en-us',
  'link': 'http://newyork.craigslist.org/brk/stp/5500876512.html',
  'links': [{'href': 'http://newyork.craigslist.org/brk/stp/5500876512.html',
    'rel': 'alternate',
    'type': 'text/html'}],
  'published': '2016-03-21T04:05:02-04:00',
  'published_parsed': time.struct_time(tm_year=2016, tm_mon=3, tm_mday=21, tm_hour=8, tm_min=5, tm_sec=2, tm_wday=0, tm_yday=81, tm_isdst=0),
  'rights': '&copy; 2016 <span class="desktop">craigslist</span><span class="mobile">CL</span>',
  'rights_detail': {'base': 'http://newyork.craigslist.org/search/stp?format=rss',
   'language': None,
   'type': 'text/html',
   'value': '&copy; 2016 <span class="desktop">craigslist</span><span class="mobile">CL</span>'},
  'summary': "I have a really weird sleep schedule and find myself often awake when all my friends are asleep. Talking on the phone has always relaxed me and helped me fall asleep so I'm look for a late night conversation buddy. \nIm in school full time, I voluntee [...]",
...
...
...
print(len(ny['entries']))
Output:
25
# Return the 30 most frequent words and their counts
def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]
def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    # fullText is a single list holding the words of all documents
    # docList is a list of lists, one word list per summary
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = list(range(2*minLen)); testSet=[]           #create test set
    for i in range(20):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    # convert each word list into a numeric count vector
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
    print ('the error rate is: ',float(errorCount)/len(testSet))
    return vocabList,p0V,p1V
ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
vocabList, pSF, pNY = localWords(ny, sf)
Output:
the error rate is:  0.45
# Print the words whose log-probability exceeds a threshold, most probable first
def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print ("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**")
    for item in sortedSF:
        print (item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print ("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**")
    for item in sortedNY:
        print (item[0])
getTopWords(ny, sf)
Output:
the error rate is:  0.45
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
life
any
there
your
than
contact
share
want
keep
special
post
seeking
...
...
...
For classification, using probabilities is sometimes more effective than using hard rules. Bayesian probability and Bayes' rule provide an effective way to estimate unknown probabilities from known values.
The amount of data required can be reduced by assuming independence between features: the assumption that the probability of one word appearing does not depend on the other words in the document. We know, of course, that this assumption is too simple, and that is why the method is called "naive" Bayes. Even though the conditional independence assumption is not accurate, Naive Bayes remains an effective classifier.
Implementing Naive Bayes in a modern programming language raises several practical issues. Underflow is one of them; it can be solved by taking the logarithm of the probabilities. The bag-of-words model improves on the set-of-words model for document classification. Other improvements are possible as well, such as removing stop words (see the sketch below), and a lot of time could also be spent tuning the tokenizer.
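As one example of the improvements mentioned above, here is a minimal sketch of filtering stop words out of a parsed document before building the vocabulary. The stop-word set below is a tiny made-up sample, not a real list; in practice a full list would be loaded from a file.
stopWords = set(['the', 'and', 'you', 'are', 'for', 'that', 'this', 'with'])   # illustrative sample only

def removeStopWords(wordList):
    # drop stop words before calling createVocabList / bagOfWords2VecMN
    return [tok for tok in wordList if tok not in stopWords]

print(removeStopWords(textParse('the latest news for you and your group')))
# ['latest', 'news', 'your', 'group']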
Reference: Machine Learning in Action (《机器学习实战》)