This chapter covers
- Using probability distributions for classification
- Learning the naive Bayes classifier
- Parsing data from RSS feeds
- Using naive Bayes to reveal regional attitudes
1. Classifying with Bayesian Decision Theory
- Pros: works with a small amount of data; handles multiple classes
- Cons: sensitive to how the input data is prepared
- Works with: nominal values
2. Conditional Probability
- $P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
3. Classifying with Conditional Probabilities
- Let $p(c_1|x,y)$ and $p(c_2|x,y)$ denote the probability that a data point described by $x,y$ comes from class $c_1$ or from class $c_2$, respectively. Classify by applying Bayes' rule: $p(c_i|x,y)=\frac{p(x,y|c_i)p(c_i)}{p(x,y)}$
- If $p(c_1|x,y)>p(c_2|x,y)$, the point belongs to class $c_1$; otherwise it belongs to class $c_2$
- With Bayes' rule, an unknown probability can thus be computed from three known probabilities, as the small worked example below shows
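A tiny worked example with made-up numbers (every probability below is an assumption chosen purely for illustration, not a value from any data set):

p_xy_c1, p_c1 = 0.20, 0.5 # assumed p(x,y|c1) and prior p(c1)
p_xy_c2, p_c2 = 0.05, 0.5 # assumed p(x,y|c2) and prior p(c2)
p_xy = p_xy_c1*p_c1 + p_xy_c2*p_c2 # p(x,y) by the law of total probability
print(p_xy_c1*p_c1/p_xy) # p(c1|x,y) = 0.8
print(p_xy_c2*p_c2/p_xy) # p(c2|x,y) = 0.2 -> classify as c1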
4. Document Classification with Naive Bayes
- The assumptions behind naive Bayes:
  - Features are independent of one another given the class (conditional independence)
  - Every feature is equally important
5. Text Classification with Python
- To extract features from text, the text must first be split into tokens
Prepare the data: building word vectors from text
- Treat each text as a vector of words or tokens; in other words, convert each sentence into a vector.
- Consider every word that appears in any document, decide which words to include in the vocabulary (the set of words of interest), and then convert each document into a vector over that vocabulary
def loadDataSet():
    '''
    Create the experimental samples. The first returned value is the collection
    of tokenized documents; the second is a list of class labels.
    There are two classes here: abusive and not abusive.
    :return:
    '''
    postingList=[
        ['my','dog','has','flea','problems','help','please'],
        ['maybe','not','take','him','to','dog','park','stupid'],
        ['my','dalmation','is','so','cute','I','love','him'],
        ['stop','posting','stupid','worthless','garbage'],
        ['mr','licks','ate','my','steak','how','to','stop','him'],
        ['quit','buying','worthless','dog','food','stupid']
    ]
    classVec=[0,1,0,1,0,1] # 0 = normal post, 1 = abusive post
    return postingList,classVec
def createVocabList(dataSet):
    '''
    Create a list of all the unique words that appear in any document
    :param dataSet:
    :return:
    '''
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document) # union of the two sets
    return list(vocabSet)
def setOfWord2Vec(vocabList,inputSet):
    '''
    Takes the vocabulary list and a document as input and outputs the document
    vector, whose elements are 1 or 0 depending on whether the corresponding
    vocabulary word occurs in the input document.
    The function builds one feature per vocabulary word; given a document, it
    thus converts the document into a word vector.
    :param vocabList:
    :param inputSet:
    :return:
    '''
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: {} is not in my Vocabulary!'.format(word))
    return returnVec
if __name__ == '__main__':
listOposts,listClasses=loadDataSet()
myVocabList=createVocabList(listOposts)
print(len(myVocabList),myVocabList)
print(setOfWord2Vec(myVocabList,listOposts[0]))
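One caveat when running this: createVocabList builds the vocabulary from a Python set, so the word order (and therefore the layout of every word vector) can change between interpreter runs. A small optional variant (my own addition, not part of the original code) makes the order deterministic:

def createVocabListSorted(dataSet):
    vocabSet=set()
    for document in dataSet:
        vocabSet=vocabSet|set(document)
    return sorted(vocabSet) # sorted list -> reproducible vector layout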
Train the algorithm: calculating probabilities from word vectors
- The previous section showed how to convert a set of words into a set of numbers; the next step is to compute probabilities from those numbers.
- The number of values equals the number of words in the vocabulary. The Bayes formula used here is $p(c_i|\mathbf{w})=\frac{p(\mathbf{w}|c_i)p(c_i)}{p(\mathbf{w})}$, where $\mathbf{w}$ is a vector of many values. Evaluate this for each class and compare which of the two probabilities is larger.
- First, compute $p(c_i)$ by dividing the number of documents in class $i$ (abusive or non-abusive) by the total number of documents. Next comes $p(\mathbf{w}|c_i)$, and this is where the naive Bayes assumption is used. Expanding $\mathbf{w}$ into individual features, the probability becomes $p(w_0,w_1,w_2,\dots,w_N|c_i)$. Assuming all words are mutually independent (the conditional independence assumption), it can be computed as $p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)\dots p(w_N|c_i)$.
- Pseudocode for the training function:
count the number of documents in each class
for every training document:
    for each class:
        if a token appears in the document -> increment the count for that token
        increment the count of total tokens
for each class:
    for each token:
        divide the token count by the total token count to get the conditional probability
return the conditional probabilities for each class
- The naive Bayes classifier training function:
# naive Bayes classifier training function
import numpy as np

def trainNB0(trainMatrix,trainCategory):
    '''
    :param trainMatrix: document matrix (one word vector per document)
    :param trainCategory: the class label of each document
    :return:
    '''
    numTrainDocs=len(trainMatrix)
    numWords=len(trainMatrix[0])
    pAbusive=np.sum(trainCategory)/float(numTrainDocs)
    p0Num=np.zeros(numWords)
    p1Num=np.zeros(numWords)
    p0Denom=0.0 # initialization
    p1Denom=0.0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:
            p1Num+=trainMatrix[i] # vector addition
            p1Denom+=np.sum(trainMatrix[i])
        else:
            p0Num+=trainMatrix[i]
            p0Denom+=np.sum(trainMatrix[i])
    p1Vect=p1Num/p1Denom # element-wise division
    p0Vect=p0Num/p0Denom
    return p0Vect,p1Vect,pAbusive
def test1():
    listOposts, listClasses = loadDataSet()
    print(listOposts)
    print(listClasses)
    myVocabList = createVocabList(listOposts)
    print(myVocabList)
    trainMat = []
    for postinDoc in listOposts: # fill trainMat with word vectors
        trainMat.append(setOfWord2Vec(myVocabList, postinDoc))
    print(trainMat)
    p0V, p1V, pAb = trainNB0(trainMat, listClasses) # get the probability of an abusive document plus the probability vectors of both classes
    print(p0V, p1V, pAb)
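A quick sanity check on the trained vectors (a sketch of my own, assuming the functions above are in scope): the vocabulary word with the highest probability in p1V should be 'stupid', since it occurs in all three abusive posts, and pAb should be 0.5 because half the posts are labeled abusive.

def sanityCheck():
    listOposts,listClasses=loadDataSet()
    myVocabList=createVocabList(listOposts)
    trainMat=[setOfWord2Vec(myVocabList,doc) for doc in listOposts]
    p0V,p1V,pAb=trainNB0(trainMat,listClasses)
    print(myVocabList[int(np.argmax(p1V))]) # expected: 'stupid'
    print(pAb)                              # expected: 0.5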
Test the algorithm: modifying the classifier for real-world conditions
- When classifying a document with the Bayes classifier, multiple probabilities are multiplied to get the probability that the document belongs to a class, i.e. $p(w_0|1)p(w_1|1)p(w_2|1)\dots$. If any one of those probabilities is 0, the whole product is 0. To lessen this impact, initialize every word's occurrence count to 1 and the denominators to 2 (a form of Laplace smoothing).
def trainNB0(trainMatrix,trainCategory):
    '''
    :param trainMatrix: document matrix (one word vector per document)
    :param trainCategory: the class label of each document
    :return:
    '''
    numTrainDocs=len(trainMatrix)
    numWords=len(trainMatrix[0])
    pAbusive=np.sum(trainCategory)/float(numTrainDocs)
    p0Num=np.ones(numWords)
    p1Num=np.ones(numWords)
    p0Denom=2.0 # counts start at 1 and denominators at 2 to avoid zero probabilities
    p1Denom=2.0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:
            p1Num+=trainMatrix[i] # vector addition
            p1Denom+=np.sum(trainMatrix[i])
        else:
            p0Num+=trainMatrix[i]
            p0Denom+=np.sum(trainMatrix[i])
    p1Vect=p1Num/p1Denom # element-wise division
    p0Vect=p0Num/p0Denom
    return p0Vect,p1Vect,pAbusive
- The other problem is underflow: too many very small numbers are multiplied together. When computing $p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)\dots p(w_N|c_i)$, most factors are tiny, so the program underflows; taking the logarithm avoids this.
p1Vect=np.log(p1Num/p1Denom) # element-wise division, then log
p0Vect=np.log(p0Num/p0Denom)
def trainNB0(trainMatrix,trainCategory):
    '''
    :param trainMatrix: document matrix (one word vector per document)
    :param trainCategory: the class label of each document
    :return:
    '''
    numTrainDocs=len(trainMatrix)
    numWords=len(trainMatrix[0])
    pAbusive=np.sum(trainCategory)/float(numTrainDocs)
    p0Num=np.ones(numWords)
    p1Num=np.ones(numWords)
    p0Denom=2.0 # initialization with smoothing
    p1Denom=2.0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:
            p1Num+=trainMatrix[i] # vector addition
            p1Denom+=np.sum(trainMatrix[i])
        else:
            p0Num+=trainMatrix[i]
            p0Denom+=np.sum(trainMatrix[i])
    p1Vect=np.log(p1Num/p1Denom) # element-wise division, then log to avoid underflow
    p0Vect=np.log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive
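A quick numeric illustration of why the logs matter (a standalone snippet of my own, not from the original): the raw product of many small probabilities underflows to exactly 0.0, while the equivalent sum of logs stays representable.

import numpy as np

probs=np.full(1000,0.01)     # 1000 factors of 0.01, i.e. 1e-2000 in exact arithmetic
print(np.prod(probs))        # 0.0 -- the product underflows
print(np.sum(np.log(probs))) # about -4605.17 -- the log-sum is fine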
- The naive Bayes classify function:
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    p1=np.sum(vec2Classify*p1Vec)+np.log(pClass1) # element-wise multiplication, then sum
    p0=np.sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
    if p1>p0:
        return 1
    else:
        return 0
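Why these sums implement the classification rule: the denominator $p(\mathbf{w})$ from Bayes' rule is identical for both classes and can be dropped, and in log space the remaining product becomes a sum. With the 0/1 word vector $\mathbf{v}$ (vec2Classify) and the log-probability vectors computed by trainNB0, the score above is

$$\sum_j v_j\,\log p(w_j|c_i)+\log p(c_i)=\log\big(p(\mathbf{w}|c_i)\,p(c_i)\big)$$

under the conditional independence assumption, so comparing p1 with p0 compares the unnormalized posteriors.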
def testingNB():
listOposts,listClasses=loadDataSet()
myVocabList=createVocabList(listOposts)
trainMat=[]
for postinDoc in listOposts:
trainMat.append(setOfWord2Vec(myVocabList,postinDoc))
p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))
testEntry=['love','my','dalmation']
thisDoc=np.array(setOfWord2Vec(myVocabList,testEntry))
print(testEntry,' classified as : ',classifyNB(thisDoc,p0V,p1V,pAb))
testEntry=['stupid','garbage']
thisDoc=np.array(setOfWord2Vec(myVocabList,testEntry))
print(testEntry," classified as : ",classifyNB(thisDoc,p0V,p1V,pAb))
Prepare the data: the bag-of-words document model
- So far, each word's presence or absence has been used as a feature; this can be described as the set-of-words model. If a word occurs more than once in a document, that carries information which presence/absence alone cannot express; the approach that keeps the counts is called the bag-of-words model.
- To accommodate the bag-of-words model, the function setOfWord2Vec() needs a small modification; the modified function is bagOfWord2Vec()
def bagOfWord2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]+=1 # count occurrences instead of flagging presence
    return returnVec
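A quick contrast between the two models (a toy example of my own; the vocabulary and document are made up):

vocab=['dog','my','stupid']
doc=['my','dog','my','dog']
print(setOfWord2Vec(vocab,doc)) # [1, 1, 0] -- presence only
print(bagOfWord2Vec(vocab,doc)) # [2, 2, 0] -- occurrence counts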
6. Example: Classifying Spam Email with Naive Bayes
- When using naive Bayes to solve real-world problems, the text content must first be turned into a list of strings, which is then turned into word vectors.
- The sections above covered how to create word vectors and run naive Bayes classification on them; the code below builds a token list from a text document.
import re
def loadDataFile():
    emailText=open("email/ham/6.txt").read()
    regEx=re.compile('\\W+') # split on any run of non-word characters
    mySet="This book is the best book on Python or M.L I have ever laid"
    listOfTokens=regEx.split(emailText)
    # listOfTokens=regEx.split(mySet) # alternatively, split the sample sentence
    print(listOfTokens)
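What the regex does, shown on the sample sentence alone (a standalone check of my own; no email file needed). Punctuation becomes a separator, so 'M.L' splits into two one-letter tokens, which is why textParse below lowercases everything and drops tokens shorter than three characters:

import re
sent="This book is the best book on Python or M.L I have ever laid"
print(re.split(r'\W+',sent))
# ['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or',
#  'M', 'L', 'I', 'have', 'ever', 'laid']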
Test the algorithm: cross-validation with naive Bayes
def textParse(bigString):
import re
listOfTokens=re.split(r'\W+',bigString)
return [tok.lower() for tok in listOfTokens if len(tok)>2]
def spamTest():
docList=[];classList=[];fullText=[]
for i in range(1,26):
wordList=textParse(open('email/spam/{}.txt'.format(i)).read())
docList.append(wordList)
fullText.extend(wordList)
classList.append(1)
wordList=textParse(open("email/ham/{}.txt".format(i)).read())
docList.append(wordList)
fullText.extend(wordList)
classList.append(0)
vocabList=createVocabList(docList)
    trainingSet=list(range(50));testSet=[]
    for i in range(10): # randomly hold out 10 of the 50 emails for testing
        randIndex=int(np.random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[];trainClasses=[]
    for docIndex in trainingSet:
        trainMat.append(setOfWord2Vec(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])
p0V,p1V,pSpam=trainNB0(np.array(trainMat),np.array(trainClasses))
    errorCount=0
    for docIndex in testSet:
        wordVector=setOfWord2Vec(vocabList,docList[docIndex])
        if classifyNB(wordVector,p0V,p1V,pSpam)!=classList[docIndex]:
            errorCount+=1
            print("classification error:",docList[docIndex])
    print("The error rate is: {}".format(float(errorCount)/len(testSet)))
- Because the test emails are held out at random, the printed error rate varies from run to run; see the averaging sketch below
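For a more stable estimate, average over several runs. A minimal sketch, assuming spamTest() is modified to end with return float(errorCount)/len(testSet) instead of only printing the rate:

def averageErrorRate(runs=10):
    total=0.0
    for _ in range(runs):
        total+=spamTest() # assumes spamTest() returns its error rate
    print("average error rate over {} runs: {}".format(runs,total/runs))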
Appendix: complete code
import numpy as np
def loadDataSet():
    '''
    Create the experimental samples. The first returned value is the collection
    of tokenized documents; the second is a list of class labels.
    There are two classes here: abusive and not abusive.
    :return:
    '''
    postingList=[
        ['my','dog','has','flea','problems','help','please'],
        ['maybe','not','take','him','to','dog','park','stupid'],
        ['my','dalmation','is','so','cute','I','love','him'],
        ['stop','posting','stupid','worthless','garbage'],
        ['mr','licks','ate','my','steak','how','to','stop','him'],
        ['quit','buying','worthless','dog','food','stupid']
    ]
    classVec=[0,1,0,1,0,1] # 0 = normal post, 1 = abusive post
    return postingList,classVec
def createVocabList(dataSet):
    '''
    Create a list of all the unique words that appear in any document
    :param dataSet:
    :return:
    '''
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document) # union of the two sets
    return list(vocabSet)
def setOfWord2Vec(vocabList,inputSet):
    '''
    Takes the vocabulary list and a document as input and outputs the document
    vector, whose elements are 1 or 0 depending on whether the corresponding
    vocabulary word occurs in the input document.
    The function builds one feature per vocabulary word; given a document, it
    thus converts the document into a word vector.
    :param vocabList:
    :param inputSet:
    :return:
    '''
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: {} is not in my Vocabulary!'.format(word))
    return returnVec
# naive Bayes classifier training function
def trainNB0(trainMatrix,trainCategory):
    '''
    :param trainMatrix: document matrix (one word vector per document)
    :param trainCategory: the class label of each document
    :return:
    '''
    numTrainDocs=len(trainMatrix)
    numWords=len(trainMatrix[0])
    pAbusive=np.sum(trainCategory)/float(numTrainDocs)
    p0Num=np.ones(numWords)
    p1Num=np.ones(numWords)
    p0Denom=2.0 # counts start at 1 and denominators at 2 to avoid zero probabilities
    p1Denom=2.0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:
            p1Num+=trainMatrix[i] # vector addition
            p1Denom+=np.sum(trainMatrix[i])
        else:
            p0Num+=trainMatrix[i]
            p0Denom+=np.sum(trainMatrix[i])
    p1Vect=np.log(p1Num/p1Denom) # element-wise division, then log to avoid underflow
    p0Vect=np.log(p0Num/p0Denom)
    return p0Vect,p1Vect,pAbusive
def test1():
    listOposts, listClasses = loadDataSet()
    print(listOposts)
    print(listClasses)
    myVocabList = createVocabList(listOposts)
    print(myVocabList)
    trainMat = []
    for postinDoc in listOposts: # fill trainMat with word vectors
        trainMat.append(setOfWord2Vec(myVocabList, postinDoc))
    print(trainMat)
    p0V, p1V, pAb = trainNB0(trainMat, listClasses) # get the probability of an abusive document plus the probability vectors of both classes
    print(p0V, p1V, pAb)
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    p1=np.sum(vec2Classify*p1Vec)+np.log(pClass1) # element-wise multiplication, then sum
    p0=np.sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
    if p1>p0:
        return 1
    else:
        return 0
def testingNB():
listOposts,listClasses=loadDataSet()
myVocabList=createVocabList(listOposts)
trainMat=[]
for postinDoc in listOposts:
trainMat.append(setOfWord2Vec(myVocabList,postinDoc))
p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))
testEntry=['love','my','dalmation']
thisDoc=np.array(setOfWord2Vec(myVocabList,testEntry))
print(testEntry,' classified as : ',classifyNB(thisDoc,p0V,p1V,pAb))
testEntry=['stupid','garbage']
thisDoc=np.array(setOfWord2Vec(myVocabList,testEntry))
print(testEntry," classified as : ",classifyNB(thisDoc,p0V,p1V,pAb))
def bagOfWord2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]+=1 # count occurrences instead of flagging presence
    return returnVec
# the following demonstrates real email classification
import re
def loadDataFile():
    emailText=open("email/ham/6.txt").read()
    regEx=re.compile('\\W+') # split on any run of non-word characters
    mySet="This book is the best book on Python or M.L I have ever laid"
    listOfTokens=regEx.split(emailText)
    # listOfTokens=regEx.split(mySet) # alternatively, split the sample sentence
    print(listOfTokens)
def textParse(bigString):
import re
listOfTokens=re.split(r'\W+',bigString)
return [tok.lower() for tok in listOfTokens if len(tok)>2]
def spamTest():
docList=[];classList=[];fullText=[]
for i in range(1,26):
wordList=textParse(open('email/spam/{}.txt'.format(i)).read())
docList.append(wordList)
fullText.extend(wordList)
classList.append(1)
wordList=textParse(open("email/ham/{}.txt".format(i)).read())
docList.append(wordList)
fullText.extend(wordList)
classList.append(0)
vocabList=createVocabList(docList)
    trainingSet=list(range(50));testSet=[]
    for i in range(10): # randomly hold out 10 of the 50 emails for testing
        randIndex=int(np.random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat=[];trainClasses=[]
    for docIndex in trainingSet:
        trainMat.append(setOfWord2Vec(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])
p0V,p1V,pSpam=trainNB0(np.array(trainMat),np.array(trainClasses))
    errorCount=0
    for docIndex in testSet:
        wordVector=setOfWord2Vec(vocabList,docList[docIndex])
        if classifyNB(wordVector,p0V,p1V,pSpam)!=classList[docIndex]:
            errorCount+=1
            print("classification error:",docList[docIndex])
    print("The error rate is: {}".format(float(errorCount)/len(testSet)))
if __name__ == '__main__':
    # test the word-vector code
    # listOposts,listClasses=loadDataSet()
    # myVocabList=createVocabList(listOposts)
    # print(len(myVocabList),myVocabList)
    # print(setOfWord2Vec(myVocabList,listOposts[0]))
    # training test
    # test1()
    # naive Bayes classification test
    # testingNB()
    # spam filter test
    # loadDataFile()
    spamTest()