Basic principle: given a data instance, the classifier estimates which class it most likely belongs to; this is classification based on probability theory. Naive Bayes chooses the decision with the highest probability.
"Naive" carries two assumptions: 1) the features are mutually independent; 2) every feature is equally important.
Conditional probability: to classify with conditional probabilities, count how often a feature takes a particular value in the data set and divide by the total number of instances; that ratio is the probability of the feature taking that value. The class corresponding to the highest of these probabilities is the answer. Bayes' rule lets us swap the condition and the outcome in a conditional probability.
The basic formulas. Bayes' rule:
P(B|A) = P(A|B) * P(B) / P(A)
where B is the class and A is the feature vector. P(B) is the prior probability of the class, i.e. without considering the features; P(A) is the prior probability of the features, also called the normalizing constant; P(A|B) is the likelihood of the features given the class; and P(B|A) is the posterior probability of the class given the features. The ratio P(A|B)/P(A) is sometimes called the standardized likelihood, so Bayes' theorem can be stated as: posterior = standardized likelihood * prior.
In naive Bayes, the independence assumption factorizes the likelihood over the individual features a1, …, an:
P(A|B) = P(a1|B) * P(a2|B) * … * P(an|B)
The naive Bayes problem is therefore: from the training samples we know the correspondence between each class B and the feature words A that occur in it, i.e. P(A|B), along with P(B) and P(A). Given a new sequence of feature words, compute the probability of each class; the class with the largest probability is the output.
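As a quick numeric illustration of Bayes' rule (all values below are invented for this sketch, not taken from the data set):

```python
# Toy check of Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A).
# The probabilities below are invented for illustration only.
p_B = 0.5           # prior: assume half of the documents are abusive
p_A_given_B = 0.2   # likelihood: P(word | abusive), assumed
p_A_given_notB = 0.02
# evidence P(A) by the law of total probability
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)
# posterior probability of the class given the word
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 4))  # → 0.9091
```

Seeing the word raises the class probability from the prior 0.5 to about 0.91, because the word is ten times more likely under the abusive class.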
Keywords: features, highest probability, conditional probability, Bayes' rule
Algorithm steps:
1. Data. Assume the text has M sentences — loadDataSet()
2. Build the vocabulary. Collect every word in the text into a list of unique words, N words in total — createVocabList(dataSet)
3. Build the 0-1 matrix. For each sentence, mark which vocabulary words it contains, producing an M-by-N 0-1 matrix — wordSet2Vec(wordsList, inputSet)
4. Compute the conditional probability of each word. First compute the fraction of abusive sentences among all sentences (the probability of an abusive document). For class 1 (abusive), sum the row vectors of all sentences in that class and count the total number of words appearing in abusive sentences; likewise for class 0, sum its row vectors and count the total words in non-abusive sentences. Dividing each entry of a class's summed vector by that class's total word count gives each word's probability within the abusive and non-abusive documents (i.e. the conditional probability of the word given the class). The word with the highest probability in the abusive vector is the one most characteristic of that class.
Note: to avoid zero probabilities in the product of per-word conditional probabilities, every word count (numerator) is initialized to 1 and each class's total word count (denominator) is initialized to 2.0 — trainNB0(trainMatrix, trainCategory)
5. Return the result of the Bayes classifier. Take the word vector to classify, the probability vectors from training, and the class prior as input; apply Bayes' rule, using a log transform to turn the product of probabilities into a sum of logs; return the class with the larger probability. Done.
— classifyNB(vec2Classify, p0Vec, p1Vec, pClass1)
Algorithm code:
# coding: utf-8
from numpy import *

def loadDataSet():
    ###### Split the sentences into word lists ######
    myList = [['my','dog','has','flea','problems','help','please'],
              ['maybe','not','take','him','to','dog','park','stupid'],
              ['my','dalmation','is','so','cute','I','love','him'],
              ['stop','posting','stupid','worthless','garbage'],
              ['mr','licks','ate','my','steak','how','to','stop','him'],
              ['quit','buying','worthless','dog','food','stupid']]
    labels = [0,1,0,1,0,1]  # 1 = abusive text, 0 = normal text
    return myList, labels

def createVocabList(dataSet):
    ###### Collect all words into a set; return them as a list ######
    vocabSet = set()
    for item in dataSet:
        vocabSet = vocabSet | set(item)
    return list(vocabSet)

def wordSet2Vec(wordsList, inputSet):
    ###### Convert an input word set into a 0-1 vector ######
    newVec = [0] * len(wordsList)
    for word in inputSet:
        if word in wordsList:
            newVec[wordsList.index(word)] = 1
    return newVec
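wordSet2Vec records only presence or absence (a set-of-words model). A common variant, sketched here as an assumption rather than part of the original listing, counts repeated occurrences instead (a bag-of-words model):

```python
def bagOfWords2Vec(wordsList, inputSet):
    # Like wordSet2Vec, but counts how many times each word occurs
    # instead of recording a 0/1 presence flag.
    newVec = [0] * len(wordsList)
    for word in inputSet:
        if word in wordsList:
            newVec[wordsList.index(word)] += 1
    return newVec

vec = bagOfWords2Vec(['dog', 'my', 'stupid'], ['stupid', 'dog', 'stupid'])
print(vec)  # → [1, 0, 2]
```

Either vector form can be fed to trainNB0 unchanged.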
################## Training ##################
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # The naive initialization would be zeros(numWords) and 0.0; instead
    # initialize counts to 1 and denominators to 2.0 so that no single
    # zero probability wipes out the whole product.
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs to avoid underflow and float rounding errors; underflow
    # comes from multiplying many very small numbers.
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
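The log transform matters because the product of many small conditional probabilities underflows to 0.0 in floating point, while the sum of their logs stays representable. A self-contained check (the probability value is arbitrary):

```python
import math

probs = [1e-20] * 20               # 20 tiny conditional probabilities
product = 1.0
for p in probs:
    product *= p                   # 1e-400 is below the float range
log_sum = sum(math.log(p) for p in probs)  # about -921, representable
print(product)   # → 0.0
print(log_sum)
```

The product silently becomes 0.0, so every class would score equally; the log sum keeps the comparison meaningful.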
################## Classification ##################
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # In log space the product P(w1|c)*...*P(wn|c)*P(c) becomes a sum.
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
################## Main routine ##################
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(wordSet2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love','my','dalmation']
    thisDoc = array(wordSet2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid','garbage']
    thisDoc = array(wordSet2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

testingNB()
def spamTest(nWordList, hWordList):
    docList = []; classList = []; fullText = []
    for i in range(25):                    # 25 documents per class
        # Parse abusive text
        docList.append(nWordList[i])
        fullText.extend(nWordList[i])
        classList.append(1)
        # Parse normal text
        docList.append(hWordList[i])
        fullText.extend(hWordList[i])
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50)); testSet = []
    for i in range(10):                    # hold out 10 documents at random
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(wordSet2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = wordSet2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount)/len(testSet))
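The random hold-out split inside spamTest can be isolated into a small self-contained sketch (the counts 50 and 10 are taken from the corpus size above):

```python
import random

trainingSet = list(range(50))   # indices of all 50 documents
testSet = []
for _ in range(10):             # hold out 10 documents for testing
    randIndex = random.randrange(len(trainingSet))
    testSet.append(trainingSet.pop(randIndex))
print(len(trainingSet), len(testSet))  # → 40 10
```

Note that trainingSet must be a list, not a bare range object: in Python 3, `del range(50)[i]` fails, which is why spamTest converts it with `list(range(50))`.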
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    print('frequent words to remove: ', sortedFreq[:10])
    return sortedFreq[:10]
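The same top-frequency count can be written more compactly with the standard library's collections.Counter; this is an equivalent sketch, not part of the original listing (the word list is a toy example):

```python
from collections import Counter

fullText = ['dog', 'stupid', 'dog', 'my', 'dog', 'stupid']
top = Counter(fullText).most_common(2)   # (word, count) pairs, descending
print(top)  # → [('dog', 3), ('stupid', 2)]
```

Counter avoids the O(V*N) cost of calling fullText.count(token) once per vocabulary word, since it tallies the whole text in a single pass.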
def getTopWords(vocabList, p0V, p1V):
    # vocabList comes from createVocabList, and p0V/p1V are the log
    # probability vectors returned by trainNB0. List the words whose log
    # conditional probability exceeds -6.0 for each class, most probable first.
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print('print sorted SF')
    for item in sortedSF:
        print(item[0])
    print('print sorted NY')
    for item in sortedNY:
        print(item[0])
def isStoppingWord(inputWord, path):
    # `path` points at a stop-word file, one word per line. Strip the
    # trailing newlines, otherwise the membership test always fails.
    with open(path, 'r') as fd:
        stoppingList = [line.strip() for line in fd]
    if inputWord in stoppingList:
        return 1
    else:
        return 0
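The file-backed check above can be exercised without a file by substituting an in-memory word list; the words here are a stand-in for whatever the stop-word file at `path` contains:

```python
# In-memory variant of the stop-word check; the set below is a
# hypothetical stand-in for the contents of the stop-word file.
stopWords = {'the', 'a', 'to', 'is'}

def isStopWord(word):
    return 1 if word in stopWords else 0

print(isStopWord('the'), isStopWord('dog'))  # → 1 0
```

Using a set rather than a list also makes each lookup O(1), which matters when every token of a long document is checked.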