4.1 Classification Methods Based on Bayesian Decision Theory
The rough idea behind naive Bayes: a classifier sometimes produces wrong answers, so we can ask it to give its best guess for the class along with a probability estimate for that guess.
Pros and cons of the algorithm
Pros: still effective when the amount of data is small
Cons: sensitive to how the input data is prepared
Works with: nominal and numeric values
Nominal values: drawn from a finite set of categories, for example 'yes'/'no' style outcomes (typically used for classification)
Numeric values: drawn from an infinite range and more concrete, for example 4.02 or 6.23 (typically used for regression)
Understanding Bayesian decision theory
Suppose we have data from two classes, where each point is described by two features x and y. Let p1(x,y) and p2(x,y) be the probabilities that a point belongs to class 1 and class 2, respectively. For a new data point with feature values x' and y', the Bayesian decision rule is:
If p1(x',y') > p2(x',y'), the point is more likely to belong to class 1, so assign it to class 1.
If p1(x',y') < p2(x',y'), the point is more likely to belong to class 2, so assign it to class 2.
In other words, choose the class with the higher probability.
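As a minimal sketch (not book code), this rule can be written directly in Python, where p1 and p2 are hypothetical functions that return the probability of each class for a point:

def classifyPoint(x, y, p1, p2):
    # assign the point to whichever class has the higher probability
    return 1 if p1(x, y) > p2(x, y) else 2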
Why Bayes? When the number of features is large, kNN requires too much computation and a decision tree may not work very well, so as long as the probabilities can be computed efficiently, a Bayesian classifier is a good choice.
4.2 Conditional Probability
Conditional probability
Example: there are seven stones in total, three gray and four black.
The probability of drawing a gray stone is 3/7.
The probability of drawing a black stone is 4/7.
Now put the seven stones into two buckets: bucket A holds 4 stones (2 gray, 2 black) and bucket B holds 3 stones (1 gray, 2 black).
Does knowing in advance which bucket a stone is in change the probability calculation?
This calls for conditional probability: P(A|B) is the probability that A occurs given that B has occurred.
Computing it directly, the probability of drawing a gray stone from bucket B is 1/3, since bucket B holds 3 stones and only 1 of them is gray.
Using the formula for conditional probability:
P(gray|BucketB) is the probability of drawing a gray stone from bucket B, i.e. the probability that a stone is gray given that it is in bucket B.
P(gray|BucketB) = P(gray and BucketB) / P(BucketB), i.e. the probability of a stone being gray and in bucket B, divided by the probability of a stone being in bucket B.
P(gray|BucketB) = (1/7) / (3/7) = 1/3 (only 1 of the 7 stones is gray and in bucket B; 3 of the 7 stones are in bucket B).
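A quick numeric check of the bucket example in plain Python (not book code):

p_gray_and_B = 1/7   # only 1 of the 7 stones is both gray and in bucket B
p_B = 3/7            # 3 of the 7 stones are in bucket B
print(p_gray_and_B / p_B)   # 0.3333..., i.e. 1/3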
Bayes' rule:
P(c|x) = (P(x|c) * P(c)) / P(x)
4.3 Classifying with Conditional Probabilities
The quantity P(x,y) mentioned in 4.1 should really be written, following Bayesian theory, as P(c|x,y): the probability that the data point (x,y) belongs to class c.
Applying Bayes' rule from 4.2 we get:
p(ci|x,y) = (p(x,y|ci) * p(ci)) / p(x,y)
Following the choose-the-higher-probability rule, the classification becomes:
If p(c1|x,y) > p(c2|x,y), assign the point to class c1.
If p(c1|x,y) < p(c2|x,y), assign the point to class c2.
4.4 Document Classification with Naive Bayes
Background: document classification is a common application of naive Bayes. For example, to filter spam email we look at the words in the message (e.g. whether each one is abusive) and treat the presence or absence of every word as a feature, so we end up with as many features as there are words in the vocabulary.
General approach to naive Bayes
1-Collect the data.
2-Prepare the data: numeric or Boolean values are needed.
3-Analyze the data: with a large number of features, plotting them individually is not very helpful; histograms work better.
4-Train the algorithm: compute the conditional probabilities of the independent features.
5-Test the algorithm: compute the error rate.
6-Use the algorithm: naive Bayes is commonly used for document classification, but it can be applied in any classification setting, not just text.
The two assumptions behind naive Bayes
Assumption 1: the features are independent of one another
Getting a good probability estimate requires a sufficient number of samples.
If every feature requires N samples and the vocabulary contains 1000 words, then N^1000 samples are needed in total.
If the features are independent in the statistical sense, i.e. the probability of one feature or word appearing has nothing to do with which other words appear next to it, the sample requirement drops dramatically (to roughly 1000*N). In practice this independence assumption does not really hold, which is why the method is called "naive" Bayes.
Assumption 2: every feature is equally important
This assumption is also problematic: to decide whether an email is spam, it may be enough to look at a handful of words rather than all of them, so not all features are equally important.
Even though both assumptions are flawed, naive Bayes works well in practice.
4.5 Text Classification with Python
Background: a document is split into tokens, which can be thought of as words, characters, and so on. Each piece of text is then represented as a token vector, where 1 means the token appears in the document and 0 means it does not.
4.5.1 Prepare the data: building word vectors from text
loadDataSet() returns a list of tokenized posts and a class label vector, where 0 marks non-abusive text and 1 marks abusive text. The posts are labeled by hand and are used to train a program to detect abusive language automatically.
def loadDataSet():
postingList = [['my','dog','has','flea','problems','help','please'],
['maybe','not','take','him','to','dog','park','stupid'],
['my','dalmation','is','so','cute','I','love','him'],
['stop','posting','stupid','worthless','garbage'],
['mr','licks','ate','my','steak','how','to','stop','him'],
['quit','buying','worthless','dog','food','stupid']]
classVec = [0,1,0,1,0,1]
return postingList, classVec
createVocabList(dataSet) takes all the documents as dataSet and returns every token that appears in them, with no duplicates.
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # | merges two sets; a set contains no duplicates
    return list(vocabSet)                     # return a list so that every token has a fixed index
setOfWords2Vec(vocabList,inputSet) returns a document vector: for every vocabulary token that appears in the document, the corresponding position is set to 1, otherwise 0.
vocabList: the vocabulary
inputSet: the new input document
def setOfWords2Vec(vocabList,inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word : %s is not in my Vocabulary!" %word)
    return returnVec
Running the functions
listOPosts, listClasses = loadDataSet()
listOPosts
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
listClasses
[0, 1, 0, 1, 0, 1]
myVocabList = createVocabList(listOPosts)
returnVec1 = setOfWords2Vec(myVocabList,listOPosts[0])
returnVec2 = setOfWords2Vec(myVocabList,listOPosts[1])
4.5.2 Train the algorithm: calculating probabilities from word vectors
Using Bayes' rule from section 4.2:
P(c|x,y) = (P(x,y|c) * P(c)) / P(x,y)
Replacing the data point x, y with the word vector w gives:
P(ci|w) = (P(w|ci) * P(ci)) / P(w)
Expanding P(w|ci) gives P(w0,w1,w2,...,wN|ci).
Under the naive independence assumption this factorizes into P(w0|ci) * P(w1|ci) * ... * P(wN|ci).
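A toy illustration of this factorization with made-up numbers (not book code): the joint conditional probability is approximated by multiplying the individual word probabilities.

condProbs = [0.05, 0.125, 0.04]   # hypothetical values for P(w0|ci), P(w1|ci), P(w2|ci)
pW_given_ci = 1.0
for p in condProbs:
    pW_given_ci *= p
print(pW_given_ci)   # 0.00025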
Pseudocode for the function
Count the number of documents in each class
For every training document:
    For each class:
        If a token appears in the document -> increment the count for that token
        Increment the total token count
For each class:
    For each token:
        Divide the count for the token by the total token count to get the conditional probability
Return the conditional probabilities for each class
from numpy import *
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                    # number of documents
    numWords = len(trainMatrix[0])                     # length of each word vector
    pAbusive = sum(trainCategory)/float(numTrainDocs)  # fraction of documents that are abusive
    ###### initialize the probabilities ######
    # zeros(x): build an array of length x
    p0Num = zeros(numWords)   # per-word counts for the non-abusive class
    p1Num = zeros(numWords)   # per-word counts for the abusive class
    p0Denom = 0.0             # total token count in non-abusive documents
    p1Denom = 0.0             # total token count in abusive documents
    ######################
    for i in range(numTrainDocs):
        # if the document is abusive (class 1), add its word vector to p1Num and
        # its token count to p1Denom; otherwise update the class-0 counters
        if trainCategory[i] == 1:
            ####### vector addition #######
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
            ######################
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom   # per-word counts in abusive docs / total tokens in abusive docs
    p0Vect = p0Num/p0Denom   # per-word counts in non-abusive docs / total tokens in non-abusive docs
    return p0Vect, p1Vect, pAbusive
Testing
loadDataSet()
([['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']],
[0, 1, 0, 1, 0, 1])
myVocabList = createVocabList(listOPosts)
trainMat = []
for postinDoc in listOPosts:
trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
p0V, p1V, pAb = trainNB0(trainMat,listClasses)
print("正常词条件概率")
print(p0V)
print()
print("侮辱性词条条件概率")
print(p1V)
print()
print("侮辱性文档占总文档比重")
print(pAb)
conditional probabilities for class 0 (non-abusive) words
[0.08333333 0. 0.04166667 0.04166667 0.04166667 0.04166667
0. 0.04166667 0. 0.125 0.04166667 0.04166667
0. 0.04166667 0. 0.04166667 0.04166667 0.04166667
0. 0. 0.04166667 0.04166667 0. 0.04166667
0.04166667 0. 0. 0. 0.04166667 0.04166667
0.04166667 0.04166667]
conditional probabilities for class 1 (abusive) words
[0.05263158 0.05263158 0. 0.05263158 0. 0.
0.10526316 0. 0.05263158 0. 0. 0.
0.05263158 0.05263158 0.05263158 0. 0. 0.
0.05263158 0.05263158 0.10526316 0. 0.05263158 0.
0. 0.05263158 0.05263158 0.15789474 0. 0.
0. 0. ]
fraction of documents that are abusive
0.5
Interpreting the result: in the abusive-class conditional probabilities, the largest value (0.15789474) corresponds to the word stupid, so stupid is the strongest single indicator that a post is abusive.
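A quick way to confirm this, assuming myVocabList and p1V from the run above are still in scope (argmax comes from the earlier from numpy import *):

print(myVocabList[argmax(p1V)])   # expected output: stupid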
4.5.3 Test the algorithm: modifying the classifier for real-world conditions
The probability computation written in 4.5.2 has two problems:
1. Later we multiply these conditional probabilities together; since many of them are 0, the whole product becomes 0 and the result is meaningless.
2. Even with the first problem solved, the probabilities are all very small, so the product can underflow and produce an incorrect result or no usable result at all.
Solutions
1. Initialize every word count to 1 and the denominators to 2.
2. Use logarithms: ln(a*b) = ln(a) + ln(b), which prevents underflow. Plotting f(x) and ln(f(x)) shows that the two curves increase and decrease together over the same regions and reach their extrema at the same points, so even though the values differ, the final comparison is unaffected.
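A minimal illustration of the underflow problem with made-up numbers (not book code), using prod and log from the earlier numpy import: multiplying many small probabilities underflows to 0.0, while the sum of their logarithms remains perfectly usable for comparison.

probs = [0.01] * 200        # 200 hypothetical conditional probabilities
print(prod(probs))          # 0.0 -- the product underflows
print(sum(log(probs)))      # about -921.03 -- the log-sum is still representable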
def trainNB(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                    # number of documents
    numWords = len(trainMatrix[0])                     # length of each word vector
    pAbusive = sum(trainCategory)/float(numTrainDocs)  # fraction of documents that are abusive
    ###### initialize the probabilities ######
    # ones(x): build an array of length x filled with 1s (counts start at 1)
    p0Num = ones(numWords)    # per-word counts for the non-abusive class
    p1Num = ones(numWords)    # per-word counts for the abusive class
    p0Denom = 2.0             # total token count in non-abusive documents
    p1Denom = 2.0             # total token count in abusive documents
    ######################
    for i in range(numTrainDocs):
        # if the document is abusive (class 1), update p1Num and p1Denom;
        # otherwise update the class-0 counters
        if trainCategory[i] == 1:
            ####### vector addition #######
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
            ######################
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)   # log of the per-word probabilities in abusive docs
    p0Vect = log(p0Num/p0Denom)   # log of the per-word probabilities in non-abusive docs
    return p0Vect, p1Vect, pAbusive
p0V, p1V, pAb = trainNB(trainMat,listClasses)
classifyNB decides whether the input vec2Classify belongs to the abusive class or the non-abusive class.
vec2Classify: the document vector to classify
p0Vec: log probabilities of the words in the non-abusive class
p1Vec: log probabilities of the words in the abusive class
pClass1: fraction of documents that are abusive
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # p(ci|w) = (p(w|ci) * p(ci)) / p(w)
    # the denominator p(w) is the same for both classes, so it is omitted from the comparison
    # p0Vec and p1Vec already hold log probabilities, so by log(a*b) = log(a) + log(b)
    # the product of probabilities becomes the sum sum(vec2Classify * p1Vec)
    # pClass1 is p(ci): pAbusive for the abusive class and 1 - pAbusive for the other class
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0-pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
testingNB() is a convenience function that wraps all of the steps above.
File "<ipython-input-19-8203a1d065c4>", line 1
testingNB() 便捷函数
^
SyntaxError: invalid syntax
def testingNB():
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = []
for postinDoc in listOPosts:
trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
p0V,p1V,pAb = trainNB(trainMat,listClasses)
testEntry = ['love','my','dalmation']
thisDoc = array(setOfWords2Vec(myVocabList,testEntry))
print("classification of love my dalmation is : " + str(classifyNB(thisDoc,p0V,p1V,pAb)))
testEntry = ['stupid','garbage']
thisDoc = array(setOfWords2Vec(myVocabList,testEntry))
print("classification of stupid garbage is : " + str(classifyNB(thisDoc,p0V,p1V,pAb)))
testingNB()
classification of love my dalmation is : 0
classification of stupid garbage is : 1
4.5.4 Prepare the data: the bag-of-words document model
Set-of-words model: records whether a word occurs.
Bag-of-words model: records how many times each word occurs.
In other words, a word can be counted only once in the set-of-words model, but multiple times in the bag-of-words model.
def bagOfWordsVecMN(vocabList,inputSet):
returnVec = [0] * len(vocabList)
for word in inputSet:
if word in vocabList:
returnVec[vocabList.index(word)] += 1
return returnVec
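A small comparison of the two encodings on a document that repeats a word (hypothetical input, reusing myVocabList from section 4.5.1):

doc = ['stupid', 'dog', 'stupid']
print(setOfWords2Vec(myVocabList, doc))    # the position for 'stupid' is set to 1
print(bagOfWordsVecMN(myVocabList, doc))   # the position for 'stupid' becomes 2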
4.6 Example: Filtering Spam Email with Naive Bayes
Approach
1-Collect the data: text files are provided.
2-Prepare the data: parse the text into token vectors.
3-Analyze the data: inspect the tokens to make sure they were parsed correctly.
4-Train the algorithm: use the trainNB() function built earlier.
5-Test the algorithm: use classifyNB() and build a test function that computes the error rate.
6-Use the algorithm: build a complete program that classifies a set of documents and prints the results.
Data source:
https://github.com/pbharrin/machinelearninginaction/blob/master/Ch04
4.6.1 Prepare the data: tokenizing text
Ways to split text:
1-string.split()
2-Regular expressions: re.compile()
mySent = "This book is the best book on Python or M.L., I have ever laid eyes upon."
# Method 1
mySent.split()
# or
# mySent.split(' ')
['This',
'book',
'is',
'the',
'best',
'book',
'on',
'Python',
'or',
'M.L.,',
'I',
'have',
'ever',
'laid',
'eyes',
'upon.']
But this leaves the punctuation attached to the tokens. We could strip the punctuation from the sentence before splitting, or use a regular expression instead.
# Method 2
import re
regEx = re.compile(r'\W+')            # \W matches any character that is not a letter, digit, or underscore; + avoids empty matches
listOfTokens = regEx.split(mySent)    # split on runs of non-word characters, leaving only the words
listOfTokens
['This',
'book',
'is',
'the',
'best',
'book',
'on',
'Python',
'or',
'M',
'L',
'I',
'have',
'ever',
'laid',
'eyes',
'upon',
'']
This still leaves an empty string at the end (the sentence ends in a non-word character), so we keep only the tokens whose length is greater than 0.
listOfTokens = [tok for tok in listOfTokens if len(tok) > 0]
Upper and lower case
Letter case affects the result. Keeping capitalization is useful for sentence-level search, but since we are only building a bag of words here, we normalize the case with .lower() or .upper().
listOfTokens = [tok.lower() for tok in listOfTokens]
listOfTokens
['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']
import codecs
# 6.txt is encoded in GB18030
emailText = open('email/ham/6.txt',mode='r', encoding='GB18030').read()
emailText
'Hello,\n\nSince you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions. Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.\n\nFor example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you抮e just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.\n\nyou have received this mandatory email service announcement to update you about important changes to Google Groups.'
listOfTokens = regEx.split(emailText)
A side note on text encodings: GB2312, GB18030, UTF-8, ASCII, and so on; the encoding passed to open() must match the file's actual encoding.
4.6.2 Test the algorithm: cross-validation with naive Bayes
textParse(bigString) splits a document into tokens, keeping only lowercase tokens longer than two characters.
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)   # split on runs of non-word characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
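For example, running textParse on the sentence from section 4.6.1 drops the punctuation, the short tokens, and the capitalization (a quick check, not from the book):

print(textParse("This book is the best book on Python or M.L., I have ever laid eyes upon."))
# ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']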
spamTest() uses the functions written above: it computes the conditional probabilities of the tokens, assigns each document to the class with the higher probability, and then uses hold-out validation to estimate the error rate of the method.
def spamTest():
    docList = []    # token list for each document
    classList = []  # class labels (1 = spam, 0 = ham)
    fullText = []   # all tokens from all documents
    for i in range(1,26):
        ##### read one document from the spam class #####
        wordList = textParse(open('email/spam/%d.txt'%i,mode='r', encoding='GB18030').read())
        docList.append(wordList)
        fullText.extend(wordList)   # extend concatenates the lists, so tokens may repeat
        classList.append(1)
        #############################
        ##### read one document from the ham class #####
        wordList = textParse(open('email/ham/%d.txt'%i,mode='r', encoding='GB18030').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
        #############################
    vocabList = createVocabList(docList)   # every token that appears in any document, no duplicates
    trainingSet = list(range(50))          # 50 documents in total; 10 will be drawn at random as the test set
    testSet = []
    # draw ten random indices, move those documents to the test set, and remove them from the training set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []
    trainClasses = []
    # build the training matrix and the corresponding class labels
    for docIndex in trainingSet:
        # setOfWords2Vec: build the word vector, 1 if the token appears in the document, 0 otherwise
        trainMat.append(setOfWords2Vec(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList,docList[docIndex])
        if classifyNB(wordVector,p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
    print("the error rate is %f" % (float(errorCount)/len(testSet)))
spamTest()
the error rate is 0.000000
Repeating the random train/test split several times and averaging the error rates gives a more accurate estimate of the average error rate.
This procedure of using part of the data for training and holding the rest out for testing is called hold-out cross-validation.
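A rough sketch of that idea (not book code): if spamTest() were modified to return its error rate instead of only printing it, the hold-out split could be repeated and the rates averaged.

numRuns = 10
errorRates = [spamTest() for _ in range(numRuns)]   # assumes spamTest() returns the error rate as a float
print("average error rate over %d runs: %f" % (numRuns, sum(errorRates) / numRuns))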
4.7 Example: Using Naive Bayes to Reveal Local Attitudes from Personal Ads
The original data source could not be found, so this part is left incomplete for now; related material may be added later.
import feedparser
ny = feedparser.parse('http://www.craigslist.org/about/best/all/index.rss')
ny['entries'][0]
{'author': 'robot@craigslist.org',
'author_detail': {'email': 'robot@craigslist.org'},
'authors': [{'email': 'robot@craigslist.org'}],
'dc_source': 'https://www.craigslist.org/about/best/van/6901570431.html',
'dc_type': 'text',
'id': 'https://www.craigslist.org/about/best/van/6901570431.html',
'link': 'https://www.craigslist.org/about/best/van/6901570431.html',
'links': [{'href': 'https://www.craigslist.org/about/best/van/6901570431.html',
'rel': 'alternate',
'type': 'text/html'}],
'rights': 'copyright 2019 craigslist',
'rights_detail': {'base': 'https://www.craigslist.org/about/best/all/index.rss',
'language': None,
'type': 'text/plain',
'value': 'copyright 2019 craigslist'},
'summary': "Where do I begin? First off thank you. Thanks for being in right place at the right time and doing the right thing. <br>\n<br>\nOn a cold January night you saw my husband swerving all over the road. You followed him while dialing 911. I hadn't been able to reach him for a few hours and was frantic by the time this happened. He was out driving under the influence. I appreciate you and everything you did for us. <br>\n<br>\nYes he was drunk, hammered even, and because of your actions he was arrested and charged. (His first ever offense) <br>\n<br>\nAlso because of your actions, he didnt cause an accident, he didnt hurt anyone else and he was able to come home to us. <br>\n<br>\nHe is now sober. He has been for a few months now. This was a huge wake up call for him, he is deeply remorseful and is still dealing with his actions. It hasn't been easy but I really believe that your phone call saved his life and possibly the lives of others. I thank God for you every day and that the outcome wasn't worse. <br>\n<br>\nMany thanks,<br>\nHis wife.<br>\n<br>",
'summary_detail': {'base': 'https://www.craigslist.org/about/best/all/index.rss',
'language': None,
'type': 'text/html',
'value': "Where do I begin? First off thank you. Thanks for being in right place at the right time and doing the right thing. <br>\n<br>\nOn a cold January night you saw my husband swerving all over the road. You followed him while dialing 911. I hadn't been able to reach him for a few hours and was frantic by the time this happened. He was out driving under the influence. I appreciate you and everything you did for us. <br>\n<br>\nYes he was drunk, hammered even, and because of your actions he was arrested and charged. (His first ever offense) <br>\n<br>\nAlso because of your actions, he didnt cause an accident, he didnt hurt anyone else and he was able to come home to us. <br>\n<br>\nHe is now sober. He has been for a few months now. This was a huge wake up call for him, he is deeply remorseful and is still dealing with his actions. It hasn't been easy but I really believe that your phone call saved his life and possibly the lives of others. I thank God for you every day and that the outcome wasn't worse. <br>\n<br>\nMany thanks,<br>\nHis wife.<br>\n<br>"},
'title': 'An open letter to the person who called the police!',
'title_detail': {'base': 'https://www.craigslist.org/about/best/all/index.rss',
'language': None,
'type': 'text/plain',
'value': 'An open letter to the person who called the police!'},
'updated': '2019-05-31T09:38:58-07:00',
'updated_parsed': time.struct_time(tm_year=2019, tm_mon=5, tm_mday=31, tm_hour=16, tm_min=38, tm_sec=58, tm_wday=4, tm_yday=151, tm_isdst=0)}
for i in range(len(ny['entries'])):
print(ny['entries'][i])
len(ny['entries'])
25
4.8 Summary
1-The conditional independence assumption between features reduces the amount of data required. The independence assumption means that the probability of one word appearing does not depend on the other words in the document; it is flawed in real life, but naive Bayes is still an effective classifier.
2-Practical issues have to be handled when implementing naive Bayes, for example removing stop words and dealing with underflow.