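Core idea: choose the decision with the highest probability. Let $p_1$ be the probability that the point $(x, y)$ belongs to class 1 and $p_2$ the probability that it belongs to class 2. If $p_1 > p_2$, predict class 1; otherwise predict class 2.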
Naive: the features are assumed to be mutually independent, and every feature is treated as equally important.
2. Conditional probability
The probability that A occurs given that B has occurred:

$$p(A \mid B) = \frac{p(AB)}{p(B)}$$
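As a quick check of the definition, the sketch below counts outcomes directly. The stones-in-buckets data is a made-up illustration, not something taken from the text above, and `prob` is a hypothetical helper.

```python
from fractions import Fraction

# Hypothetical data: each stone is (color, bucket).
stones = [
    ("gray", "A"), ("gray", "A"), ("black", "A"), ("black", "A"),
    ("gray", "B"), ("black", "B"), ("black", "B"),
]


def prob(event, outcomes):
    """Fraction of outcomes for which event(outcome) is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


p_b = prob(lambda s: s[1] == "B", stones)                  # p(B) = 3/7
p_gray_and_b = prob(lambda s: s == ("gray", "B"), stones)  # p(gray, B) = 1/7
p_gray_given_b = p_gray_and_b / p_b                        # p(gray | B)

print(p_gray_given_b)  # prints 1/3
```

Counting only the stones in bucket B gives the same answer (1 gray out of 3), which is what the ratio $p(AB)/p(B)$ expresses.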
Bayes' rule:

$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$$
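Bayes' rule is easy to verify numerically. In the sketch below the three input probabilities are invented for illustration (a word that appears in 60% of spam, a 50% spam prior, and a 40% overall chance of seeing the word), and `bayes` is a hypothetical helper name.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Return p(A | B) = p(B | A) * p(A) / p(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical numbers, chosen only to illustrate the formula.
p_word_given_spam = Fraction(3, 5)
p_spam = Fraction(1, 2)
p_word = Fraction(2, 5)

p_spam_given_word = bayes(p_word_given_spam, p_spam, p_word)
print(p_spam_given_word)  # prints 3/4
```

Exact fractions are used here so the result is not blurred by floating-point rounding.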
3. Classifying with conditional probability
For a vector $\mathbf{w}$, the probability that it belongs to class $c_i$ is:

$$p(c_i \mid \mathbf{w}) = \frac{p(\mathbf{w} \mid c_i)\,p(c_i)}{p(\mathbf{w})}$$
If $p(c_1 \mid \mathbf{w}) > p(c_2 \mid \mathbf{w})$, the vector is assigned to class $c_1$; if $p(c_1 \mid \mathbf{w}) < p(c_2 \mid \mathbf{w})$, it is assigned to class $c_2$.
For naive Bayes, the features are assumed to be mutually independent given the class, so:

$$p(\mathbf{w} \mid c_i) = p(w_1 \mid c_i)\,p(w_2 \mid c_i) \cdots p(w_n \mid c_i)$$
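The decision rule and the factorization can be combined in a few lines. Because $p(\mathbf{w})$ is the same for every class, it can be dropped when comparing classes, and summing logarithms avoids underflow when many small probabilities are multiplied. The likelihoods and priors below are invented for illustration, and `log_posterior_score` is a hypothetical helper.

```python
import math


def log_posterior_score(feature_probs, prior):
    """log p(c) + sum_k log p(w_k | c).

    This equals log p(c | w) up to the term -log p(w), which is shared by all classes.
    """
    return math.log(prior) + sum(math.log(p) for p in feature_probs)


# Hypothetical per-feature likelihoods p(w_k | c_i) for one 3-feature vector w.
likelihoods = {"c1": [0.8, 0.5, 0.1], "c2": [0.2, 0.4, 0.6]}
priors = {"c1": 0.5, "c2": 0.5}

scores = {c: log_posterior_score(likelihoods[c], priors[c]) for c in priors}
predicted = max(scores, key=scores.get)  # class with the highest score

print(predicted)  # prints c2
```

With equal priors the comparison reduces to the products 0.8 × 0.5 × 0.1 = 0.04 for `c1` and 0.2 × 0.4 × 0.6 = 0.048 for `c2`, so `c2` wins.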
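4. Text classification with Python
The example is the message board of an online community, where posts are labeled as abusive or not abusive, represented by 1 and 0 respectively.
Preparing the data: building word vectors from text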
```python
def loadDataSet():
    """Create the sample data.

    Returns the tokenized documents and the list of class labels.
    """
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec


def createVocabList(dataSet):
    """Build a list of the unique words that appear across all documents."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # set union
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Return a 0/1 vector marking which vocabulary words occur in the document."""
    returnVec = [0] * len(vocabList)  # initialize the output
    for word in inputSet:  # walk through the document
        if word in vocabList:  # if the word is in the vocabulary, set its entry to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my Vocabulary!' % word)
    return returnVec
```
```
[IN]: listOPosts, listClasses = loadDataSet()
[IN]: myVocabList = createVocabList(listOPosts)
[IN]: print(myVocabList)
[OUT]: ['garbage', 'not', 'steak', 'is', 'dog', 'how', 'my', 'food', 'to', 'licks', 'mr', 'buying', 'so', 'problems', 'park', 'stop', 'ate', 'help', 'stupid', 'love', 'flea', 'worthless', 'take', 'posting', 'has', 'cute', 'dalmation', 'quit', 'please', 'him', 'maybe', 'I']
[IN]: print(setOfWords2Vec(myVocabList, listOPosts[0]))
[OUT]: [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
[IN]: print(setOfWords2Vec(myVocabList, listOPosts[3]))
[OUT]: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```
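```python
def bagOfWords2VecMN(vocabList, inputSet):
    """Bag-of-words variant: count how many times each vocabulary word occurs."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
```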
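The spam exercise in section 5 calls `trainNB0` and `classifyNB`, which these notes do not show. The following is a minimal sketch that matches how they are called there; it is not necessarily the author's implementation. The smoothing (word counts start at 1, denominators at 2) and the use of log probabilities are assumptions, and the tiny training matrix is made up. It is written in plain Python so that it accepts either lists or NumPy arrays.

```python
import math


def trainNB0(trainMatrix, trainCategory):
    """Estimate log p(w_k | c) for classes 0 and 1, plus the prior p(c = 1).

    Assumptions: labels are 0/1, both classes occur in the training data,
    counts start at 1 and denominators at 2 so no probability is exactly zero.
    """
    numWords = len(trainMatrix[0])
    pClass1 = sum(trainCategory) / float(len(trainCategory))
    counts = {0: [1.0] * numWords, 1: [1.0] * numWords}
    totals = {0: 2.0, 1: 2.0}
    for row, label in zip(trainMatrix, trainCategory):
        label = int(label)
        for k, value in enumerate(row):
            counts[label][k] += value  # per-word count within this class
        totals[label] += sum(row)      # total words seen in this class
    p0Vec = [math.log(c / totals[0]) for c in counts[0]]
    p1Vec = [math.log(c / totals[1]) for c in counts[1]]
    return p0Vec, p1Vec, pClass1


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Return 1 if the log score of class 1 is higher, else 0."""
    p1 = sum(x * lp for x, lp in zip(vec2Classify, p1Vec)) + math.log(pClass1)
    p0 = sum(x * lp for x, lp in zip(vec2Classify, p0Vec)) + math.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0


# Hypothetical 4-word vocabulary; rows are documents, labels mark class 1 or 0.
demoMat = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 0], [0, 1, 1, 1]]
demoLabels = [0, 1, 0, 1]
p0V, p1V, pAb = trainNB0(demoMat, demoLabels)
print(classifyNB([0, 1, 0, 0], p0V, p1V, pAb))  # prints 1
print(classifyNB([1, 0, 0, 0], p0V, p1V, pAb))  # prints 0
```

Multiplying the input vector by the log-likelihood vector and summing is the log form of the product $p(w_1 \mid c_i)\,p(w_2 \mid c_i) \cdots p(w_n \mid c_i)$, restricted to the words that actually occur in the document.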
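5. Exercise: filtering spam email with naive Bayes
Parse the text and extract the words:
```python
import re


def textParse(bigString):
    # Split on runs of non-word characters. r'\W+' is used instead of r'\W*',
    # because on Python 3.7+ a pattern that can match the empty string
    # splits the text between every character.
    listOfTokens = re.split(r'\W+', bigString)
    # Keep tokens longer than 2 characters, lowercased.
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
```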
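To see what the parsing step produces, here is a small self-contained sketch. `text_parse_demo` is a hypothetical stand-in for `textParse` and the sentence is made up. It splits on `\W+` (one or more non-word characters); on Python 3.7 and later a pattern such as `\W*` can match the empty string, so `re.split` would break the text apart between every character and the length filter would then discard everything.

```python
import re


def text_parse_demo(big_string):
    """Hypothetical stand-in for textParse: tokenize, lowercase, drop short tokens."""
    tokens = re.split(r'\W+', big_string)  # split on runs of non-word characters
    return [tok.lower() for tok in tokens if len(tok) > 2]


sample = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
parsed = text_parse_demo(sample)
print(parsed)
# prints ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']
```

The length filter removes short fragments such as `is`, `on`, `M`, `L` and the empty string left after the final period.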
Cross-validation with naive Bayes (10 of the 50 emails are randomly held out for testing):
```python
import random

import numpy as np


def spamTest():
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        # Load and parse one spam file and one ham file per iteration.
        with open('Ch04/email/spam/%d.txt' % i, encoding='ISO-8859-1') as f:
            wordList = textParse(f.read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        with open('Ch04/email/ham/%d.txt' % i, encoding='ISO-8859-1') as f:
            wordList = textParse(f.read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # list of all unique words
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):  # randomly pick 10 documents (by index) for the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []  # training matrix
    trainClasses = []  # training labels
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # Compute the three probability values.
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0  # number of misclassified test documents
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        # Count an error when the prediction differs from the true label.
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount) / len(testSet))
```
```
[IN]: for i in range(10):
          spamTest()
[OUT]: the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.0
the error rate is:  0.0
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.1
the error rate is:  0.2
```
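Because the test set is drawn at random, the error rate changes from run to run, so averaging over several runs gives a steadier estimate. The sketch below isolates the random hold-out step and averages the ten rates listed above. `holdout_split` is a hypothetical helper, and `random.sample` is swapped in for the index-deletion loop used in `spamTest`.

```python
import random


def holdout_split(n_items, n_test, rng=random):
    """Return (train_indices, test_indices); test indices are drawn without replacement."""
    test = rng.sample(range(n_items), n_test)
    held_out = set(test)
    train = [i for i in range(n_items) if i not in held_out]
    return train, test


train_idx, test_idx = holdout_split(50, 10, random.Random(0))
print(len(train_idx), len(test_idx))  # prints 40 10

# The ten error rates reported in the run above.
rates = [0.1, 0.0, 0.1, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1, 0.2]
mean_rate = sum(rates) / len(rates)
print(round(mean_rate, 2))  # prints 0.08
```

Passing a seeded `random.Random` instance makes a particular split reproducible, which helps when comparing feature choices such as set-of-words versus bag-of-words vectors.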