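Pseudocode for the function:
Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document: increment the count for that token
        Increment the total token count
For each class:
    For each token:
        Divide the token's count by the total token count to get the conditional probability
Return the conditional probabilities for each class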
The code:
# -*- coding: utf-8 -*-
from numpy import *
def loadDataSet():
    # Six toy posts, already split into words
    postingList = [['my','dog','has','flea',
                    'problems','help','please'],
                   ['maybe','not','take','him',
                    'to','dog','park','stupid'],
                   ['my','dalmation','is','so','cute',
                    'I','love','him'],
                   ['stop','posting','stupid','worthless','garbage'],
                   ['mr','licks','ate','my','steak','how',
                    'to','stop','him'],
                   ['quit','buying','worthless','dog','food','stupid']]
    # Class labels: 1 = abusive, 0 = normal
    classVec = [0,1,0,1,0,1]
    return postingList,classVec
def createVocaList(dataSet):
    # Union of all words that occur in any document
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet|set(document)
    return list(vocabSet)
def setOfWordds2Vec(vocabList,inputSet):
    # Set-of-words model: 1 if the vocabulary word occurs in the document, else 0
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("The word:%s is not in my Vocabulary!" % word)
    return returnVec
def trainNB0(trainMatrix,trainCategory):
    # The matrix has 6 rows, one per document
    numTrainDocs = len(trainMatrix)
    # Each row has 32 elements, one per vocabulary word
    numWords = len(trainMatrix[0])
    # Fraction of all documents that are abusive, i.e. the prior for class 1 (0.5 here)
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    # One-dimensional arrays of 32 zeros to hold the per-word counts
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            # Class 1 (abusive): add this document's word vector to the counts
            p1Num += trainMatrix[i]
            # Add the number of words in this document to the class 1 word total
            p1Denom += sum(trainMatrix[i])
        else:
            # Class 0 (normal): add this document's word vector to the counts
            p0Num += trainMatrix[i]
            # Add the number of words in this document to the class 0 word total
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom
    # Probability of each vocabulary word given the document class, plus the class 1 prior
    return p0Vect,p1Vect,pAbusive
listOPost,listClasses = loadDataSet()
myVocaList = createVocaList(listOPost)
# print(myVocaList)
returnVec = setOfWordds2Vec(myVocaList,listOPost[0])
# print(returnVec)
# Convert every post into a word vector to build the training matrix
trainMat = []
for postinDoc in listOPost:
    trainMat.append(setOfWordds2Vec(myVocaList,postinDoc))
p0Vect,p1Vect,pAbusive = trainNB0(trainMat,listClasses)
print(p0Vect)
print(p1Vect)
Now look at the result: the probability of each vocabulary word given the document class.
p0Vect:
[ 0.04166667 0.04166667 0.04166667 0. 0. 0.04166667
0.04166667 0.04166667 0. 0.04166667 0.04166667 0.04166667
0.04166667 0. 0. 0.08333333 0. 0.
0.04166667 0. 0.04166667 0.04166667 0. 0.04166667
0.04166667 0.04166667 0. 0.04166667 0. 0.04166667
0.04166667 0.125 ]
p1Vect:
[ 0. 0. 0. 0.05263158 0.05263158 0. 0.
0. 0.05263158 0.05263158 0. 0. 0.
0.05263158 0.05263158 0.05263158 0.05263158 0.05263158 0.
0.10526316 0. 0.05263158 0.05263158 0. 0.10526316
0. 0.15789474 0. 0.05263158 0. 0. 0. ]
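In this run the first word in the vocabulary is cute. It appears once in class 0 and never in class 1, so its conditional probabilities are 0.04166667 and 0, respectively.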
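These two values can be checked by hand. The three class 0 posts contain 7 + 8 + 9 = 24 words and the three class 1 posts contain 8 + 5 + 6 = 19 words, so cute gets 1/24 ≈ 0.04166667 in class 0 and 0/19 = 0 in class 1. The sketch below redoes that arithmetic straight from the posts, without NumPy. The helper name word_probability is made up for this illustration and is not part of the code above.

```python
# Posts and labels copied from loadDataSet() (0 = normal, 1 = abusive).
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]

def word_probability(word, label):
    """Hypothetical helper: share of the class's word total taken by `word`."""
    # Set-of-words model: each document counts a word at most once.
    docs = [set(p) for p, c in zip(posts, labels) if c == label]
    total = sum(len(d) for d in docs)          # plays the role of p0Denom / p1Denom
    hits = sum(1 for d in docs if word in d)   # one entry of p0Num / p1Num
    return hits / float(total)

print(word_probability('cute', 0))    # 1/24, about 0.0417
print(word_probability('cute', 1))    # 0/19, i.e. 0.0
print(word_probability('stupid', 1))  # 3/19, about 0.158
```

The same helper gives 3/24 = 0.125 for my in class 0, which matches the largest entry of p0Vect.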