Naive Bayes 1.0
- The book is written by Peter Harrington.
Preface
Since the edition I am reading is in English, I mostly think in English while reading; for convenience, these notes are written in English as well.
From Machine Learning in Action, Chapter 4.
This chapter shows some ways to use probability theory to help us classify things. We start with the simplest probabilistic classifier, make some independence assumptions, and finally arrive at the naive Bayes classifier.
First, remember that probability theory is the basis for many ML algorithms.
Pros and Cons
- Key idea: choose the class with the highest probability.
- Popular in: document-classification problems.
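The "choose the class with the highest probability" idea can be sketched as follows. This is a minimal illustration, not the book's code: the function name `classify` and the dictionary-based inputs are assumptions made here, and the rule shown is the standard Bayes decision rule, picking the class c that maximizes p(x | c) * p(c).

```python
def classify(cond_probs, priors):
    """Pick the class with the highest (unnormalized) posterior.

    cond_probs: {class: p(x | class)} -- likelihood of the observation per class
    priors:     {class: p(class)}     -- prior probability of each class
    (Both argument names and shapes are illustrative, not from the book.)
    """
    scores = {c: cond_probs[c] * priors[c] for c in priors}
    return max(scores, key=scores.get)

# With equal priors, the class with the larger likelihood wins:
print(classify({0: 0.2, 1: 0.05}, {0: 0.5, 1: 0.5}))  # -> 0
print(classify({0: 0.01, 1: 0.3}, {0: 0.5, 1: 0.5}))  # -> 1
```

Note that the denominator p(x) in Bayes' rule is the same for every class, so it can be dropped when we only compare classes.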
General Approaches
- Prepare: making word vectors from text
```python
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage']]
    classVec = [0, 1, 0, 1]
    return postingList, classVec

def createVocabList(dataset):
    vocabSet = set([])  # create an empty set
    for document in dataset:
        vocabSet = vocabSet | set(document)  # | on sets takes the union
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
```
Python concepts involved: the `|` (set union) operator, lists.
Verified: the code runs correctly in Spyder (Python 3.7).
In summary: the first function builds sample data for the experiment; the second function builds the vocabulary, i.e. the list of all unique words appearing across the documents; the third function checks which of those vocabulary words appear in the current document and returns a vector of 0s and 1s, as shown in the figure below.
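A quick end-to-end run of the three functions, as a self-contained sketch (the function bodies below mirror the listing in these notes; the variable names `posts`, `labels`, `vocab`, and `vec` are mine):

```python
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage']]
    classVec = [0, 1, 0, 1]
    return postingList, classVec

def createVocabList(dataset):
    vocabSet = set([])
    for document in dataset:
        vocabSet = vocabSet | set(document)  # union: every unique word once
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

posts, labels = loadDataSet()
vocab = createVocabList(posts)
vec = setOfWords2Vec(vocab, posts[0])
print(len(vocab))  # 24 unique words across the four posts
print(sum(vec))    # 7: every word of the first post is in the vocabulary once
```

Note that `set` has no guaranteed order, so the position of each word in `vocab` (and hence in `vec`) can differ between runs; only the 0/1 membership pattern is meaningful.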