4. Naive Bayes: Summary of Statistical Learning Methods

Article link: https://blog.csdn.net/lemonaha/article/details/53535082

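- 4. Naive Bayes
  - 4.1 Learning and Classification with Naive Bayes
    - 4.1.1 Basic Method
    - 4.1.2 The Meaning of Posterior Probability Maximization
  - 4.2 Parameter Estimation for Naive Bayes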

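4. Naive Bayes

Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent. Given a training data set, it first learns the joint probability distribution of input and output under the conditional independence assumption; then, using this model, for a given input $x$ it applies Bayes' theorem to find the output $y$ with the largest posterior probability.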

4.1 Learning and Classification with Naive Bayes

4.1.1 Basic Method

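The input is a feature vector $x$ and the output is a class label $y$.
$P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training data set $T=\{ (x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$ is generated independently and identically distributed according to $P(X,Y)$.
Naive Bayes learns the joint probability distribution $P(X,Y)$ from the training data set. Specifically, it learns the following prior and conditional probability distributions. The prior probability distribution:

$P(Y=c_k),k=1,2,\cdots,K$
The conditional probability distribution:

$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}|Y=c_k),k=1,2,\cdots,K$
Together these give the joint probability distribution $P(X,Y)$.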
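Naive Bayes makes a conditional independence assumption on the conditional probability distribution. Because this is a strong assumption, the method is called "naive". Specifically, the conditional independence assumption is

$\begin {align*} P(X=x|Y=c_k) & =P(X^{(1)}=x^{(1)},\cdots,X^{(n)}=x^{(n)}|Y=c_k)\\ &=\prod_{j=1}^nP(X^{(j)}=x^{(j)}|Y=c_k) \end{align*}$
Naive Bayes in effect learns the mechanism that generates the data, so it is a generative model. The conditional independence assumption says that the features used for classification are conditionally independent given the class. This assumption makes naive Bayes simple, but sometimes at the cost of some classification accuracy.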
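One way to see the simplification is to count table entries. The sketch below is only an illustration: the function names and the sizes (2 classes, 10 binary features) are assumptions, not taken from the original post. If feature $X^{(j)}$ takes $S_j$ values and there are $K$ classes, an unrestricted conditional distribution has $K\prod_j S_j$ entries, while the factorized form needs only $K\sum_j S_j$.

```python
from math import prod  # available in Python 3.8+


def full_table_size(num_classes, feature_sizes):
    """Entries in P(X=x | Y=c_k) without any independence assumption."""
    return num_classes * prod(feature_sizes)


def naive_bayes_table_size(num_classes, feature_sizes):
    """Entries in all per-feature tables P(X^(j)=a | Y=c_k) combined."""
    return num_classes * sum(feature_sizes)


# Hypothetical sizes chosen for illustration: 2 classes, 10 binary features.
feature_sizes = [2] * 10
full = full_table_size(2, feature_sizes)
naive = naive_bayes_table_size(2, feature_sizes)
print(full, naive)
```

The first count grows exponentially with the number of features, the second only linearly.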
Bayes' formula:

$P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}$
Substituting the conditional independence assumption into Bayes' formula gives the basic formula for naive Bayes classification (p. 48). The denominator is the same for every $c_k$ and can be dropped, so the naive Bayes classifier can be expressed as:

$y=\arg\max_{c_k}P(Y=c_k)\prod_jP(X^{(j)}=x^{(j)}|Y=c_k)$

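4.1.2 The Meaning of Posterior Probability Maximization

Naive Bayes assigns an instance to the class with the largest posterior probability. Under the 0-1 loss function this is equivalent to minimizing the expected risk.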
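The equivalence can be sketched as follows, assuming the 0-1 loss $L(Y,f(X))$, which equals 1 when $Y\neq f(X)$ and 0 otherwise. The expected risk is an expectation over $X$ of the conditional expected loss, so it is enough to minimize that conditional loss for each $x$ separately:

```latex
\begin{align*}
f(x) &= \arg\min_{y}\sum_{k=1}^{K} L(c_k,y)\,P(Y=c_k|X=x)\\
     &= \arg\min_{y}\sum_{k:\,c_k\neq y} P(Y=c_k|X=x)\\
     &= \arg\min_{y}\bigl(1-P(Y=y|X=x)\bigr)\\
     &= \arg\max_{c_k} P(Y=c_k|X=x)
\end{align*}
```

The last line is exactly the posterior maximization rule used by naive Bayes.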

4.2 Parameter Estimation for Naive Bayes

4.2.1 Maximum Likelihood Estimation

In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}|Y=c_k)$. Maximum likelihood estimation can be used to estimate these probabilities. The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is

$P(Y=c_k) = \frac{ \sum_{i=1}^N I(y_i=c_k)}{N}$
The maximum likelihood estimate of the conditional probability is

$P(X^{(j)}=a_{jl}|Y=c_k)= \frac{\sum_{i=1}^NI(x_i^{(j)}=a_{jl},y_i=c_k)}{\sum_{i=1}^NI(y_i=c_k)}$
where $I$ is the indicator function.
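Both estimates are just ratios of counts. A minimal sketch, assuming a single discrete feature and a tiny made-up data set (the function names `mle_prior` and `mle_conditional` and the data are hypothetical, introduced only for illustration):

```python
from collections import Counter
from fractions import Fraction


def mle_prior(labels):
    """P(Y=c_k) = (number of samples with label c_k) / N."""
    n = len(labels)
    return {c: Fraction(cnt, n) for c, cnt in Counter(labels).items()}


def mle_conditional(feature_values, labels):
    """P(X=a | Y=c_k) = (samples with value a and label c_k) / (samples with label c_k)."""
    class_counts = Counter(labels)
    pair_counts = Counter(zip(feature_values, labels))
    values = sorted(set(feature_values))
    # A Counter returns 0 for unseen (value, label) pairs, so those estimates are 0.
    return {(a, c): Fraction(pair_counts[(a, c)], class_counts[c])
            for c in class_counts for a in values}


# Made-up toy data: five samples, one feature with values S, M, L.
labels = [1, 1, -1, 1, -1]
feature = ['S', 'M', 'S', 'S', 'L']
prior = mle_prior(labels)
cond = mle_conditional(feature, labels)
print(prior[1], cond[('S', 1)], cond[('L', 1)])
```

In this toy data the value `'L'` never occurs together with label 1, so `cond[('L', 1)]` is exactly 0.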

4.2.2 Learning and Classification Algorithm

Naive Bayes algorithm
Input: training data $T=\{ (x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$ , where $x_i=(x_i^{(1)},x_i^{(2)},\cdots,x_i^{(n)})^T$ , $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)}\in\{a_{j1},a_{j2},\cdots,a_{jS_j} \}$ , and $a_{jl}$ is the $l$-th possible value of the $j$-th feature; an instance $x$.
Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities.
Prior probability:

$P(Y=c_k) = \frac{ \sum_{i=1}^N I(y_i=c_k)}{N}$
Conditional probability:

$P(X^{(j)}=a_{jl}|Y=c_k)= \frac{\sum_{i=1}^NI(x_i^{(j)}=a_{jl},y_i=c_k)}{\sum_{i=1}^NI(y_i=c_k)}$
(2) For the given instance $x=(x^{(1)},x^{(2)},\cdots,x^{(n)})^T$ , compute

$P(Y=c_k)\prod_{j=1}^nP(X^{(j)}=x^{(j)}|Y=c_k)$
(3) Determine the class of instance $x$:

$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^nP(X^{(j)}=x^{(j)}|Y=c_k)$
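The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the names `train` and `classify` are hypothetical, the 15-sample data set with two discrete features is chosen only for the example, and exact fractions are used so the products can be checked by hand.

```python
from collections import Counter
from fractions import Fraction


def train(samples, labels):
    """Step (1): maximum likelihood estimates of the prior and conditionals."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {c: Fraction(cnt, n) for c, cnt in class_counts.items()}
    # pair_counts[(j, a, c)] = number of samples whose j-th feature is a and label is c
    pair_counts = Counter((j, a, c)
                          for x, c in zip(samples, labels)
                          for j, a in enumerate(x))

    def cond(j, a, c):
        return Fraction(pair_counts[(j, a, c)], class_counts[c])

    return prior, cond


def classify(x, prior, cond):
    """Steps (2) and (3): score every class, then take the arg max."""
    scores = {}
    for c, p in prior.items():
        score = p
        for j, a in enumerate(x):
            score *= cond(j, a, c)
        scores[c] = score
    return max(scores, key=scores.get), scores


# Illustrative data: feature 1 in {1, 2, 3}, feature 2 in {S, M, L}, labels in {1, -1}.
x1 = [1] * 5 + [2] * 5 + [3] * 5
x2 = list("SMMSSSMMLLLMMLL")
samples = list(zip(x1, x2))
labels = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]

prior, cond = train(samples, labels)
label, scores = classify((2, 'S'), prior, cond)
print(label, scores)
```

For the instance $(2, S)$ the scores are $1/45$ for class 1 and $1/15$ for class -1, so the sketch predicts -1. In practice the product is usually replaced by a sum of logarithms to avoid floating-point underflow.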

4.2.3 Bayesian Estimation

Maximum likelihood estimation may yield an estimated probability of 0. A single zero factor makes the whole product for that class 0, which distorts the posterior computation and biases the classification. The remedy is Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is

$P_{\lambda}(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^NI(x_i^{(j)}=a_{jl},y_i=c_k)+\lambda}{\sum_{i=1}^NI(y_i=c_k)+S_j\lambda}$
where $\lambda\ge0$ . This is equivalent to adding a positive number $\lambda>0$ to the frequency of each value of the random variable. When $\lambda=0$ it reduces to maximum likelihood estimation. A common choice is $\lambda=1$ , which is called Laplace smoothing.
Clearly, for $\lambda>0$ and any $l=1,2,\cdots,S_j, k=1,2,\cdots,K$ ,

$P_{\lambda}(X^{(j)}=a_{jl}|Y=c_k)>0$

$\sum_{l=1}^{S_j}P_{\lambda}(X^{(j)}=a_{jl}|Y=c_k)=1$
which shows that the Bayesian estimate of the conditional probability is indeed a probability distribution. Similarly, the Bayesian estimate of the prior probability is

$P_{\lambda}(Y=c_k)=\frac{\sum_{i=1}^NI(y_i=c_k)+\lambda}{N+K\lambda}$
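The smoothed estimates only change the counting formulas. A small sketch, assuming the counts are already available as dictionaries (the function names and the counts below are made up for illustration, and `lam` is assumed to be an integer or a `Fraction` so the arithmetic stays exact):

```python
from fractions import Fraction


def smoothed_conditional(value_counts, lam=1):
    """value_counts maps each of the S_j possible values to its count within one class."""
    total = sum(value_counts.values())   # number of samples in the class
    s_j = len(value_counts)              # S_j, number of possible values
    return {a: Fraction(cnt + lam, total + s_j * lam)
            for a, cnt in value_counts.items()}


def smoothed_prior(class_counts, lam=1):
    """class_counts maps each of the K classes to its sample count."""
    n = sum(class_counts.values())       # N, total number of samples
    k = len(class_counts)                # K, number of classes
    return {c: Fraction(cnt + lam, n + k * lam)
            for c, cnt in class_counts.items()}


# Hypothetical counts: within one class, value L was never observed.
counts_in_class = {'S': 2, 'M': 1, 'L': 0}
cond = smoothed_conditional(counts_in_class, lam=1)
prior = smoothed_prior({1: 3, -1: 2}, lam=1)
print(cond, sum(cond.values()), prior)
```

With $\lambda=1$ the unseen value `'L'` gets probability $1/6$ instead of 0, and the smoothed values still sum to 1. Passing `lam=0` recovers the maximum likelihood estimates.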

'''
Created on Oct 19, 2010

@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones() 
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc,p0V,p1V,pAb))

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W+', bigString)  #\W* can match the empty string, which splits between every character on Python 3.7+
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = list(range(50)); testSet=[]     #create test set; list() so entries can be deleted
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount)/len(testSet))
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]       

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
    for pairW in top30Words:
        if pairW[0] in vocabList: vocabList.remove(pairW[0])
    trainingSet = list(range(2*minLen)); testSet=[]   #create test set; list() so entries can be deleted
    for i in range(20):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount)/len(testSet))
    return vocabList,p0V,p1V

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**")
    for item in sortedNY:
        print(item[0])