Naive Bayes Summary

For a better reading experience, you are welcome to read the original post on my GitHub site!

This notebook is inspired by, but not limited to, Machine Learning in Action; e.g., the algorithm here is implemented at a higher level.

All rights reserved by Diane (Qingyun Hu).

1. About Naive Bayes

1.1 Mechanism of Naive Bayes

Naive Bayes is a classifier built on Bayes' Rule. Let's recap Bayes' Rule a bit.

\[ P(c_i | w_1, w_2, w_3, ..., w_m) = \frac{P(w_1, w_2, w_3, ..., w_m | c_i)*P(c_i)}{P(w_1, w_2, w_3, ..., w_m)} \]

where \(w_1, w_2, w_3, ..., w_m\) is a vector of the words that appear in the document and are also included in the existing vocabulary list, and \(c_i\) stands for class \(i\).

Naive Bayes asks us to assume that the presences of \(w_1, w_2, w_3, ..., w_m\) are mutually independent. This is not realistic, since there are always some connections between one word and another; however, the assumption greatly simplifies the calculation and has worked quite well in practice. By assuming that the presence of each word is independent, we have:

\[ P(c_i | w_1, w_2, w_3, ..., w_m) = \frac{(\ P(w_1 | c_i) * P(w_2 | c_i) * P(w_3 | c_i) * ... * P(w_m | c_i)\ ) * P(c_i)}{P(w_1) * P(w_2) * P(w_3) * ... * P(w_m)} \]
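Note that the denominator is the same for every class, so if all we need is the most likely class, we can compare numerators alone. The resulting decision rule (a standard formulation, added here for completeness) is:

\[ \hat{c} = \arg\max_{c_i}\ P(c_i) * \prod_{k=1}^{m} P(w_k | c_i) \]

The implementation below keeps the denominator anyway, so its output can be read as an estimate of the posterior probability itself.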

1.2 Pros and Cons

1.2.1 Pros

  1. Handles multiple classes.
  2. Works well on small datasets.

1.2.2 Cons

  1. Sensitive to how the input data is prepared.
  2. The sparse bag-of-words vectors can consume a lot of memory if not handled properly, since each vector is as long as the vocabulary list.

1.2.3 Works with

Nominal Values

2. Naive Bayes Implementation

# Create a demo dataset
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import math
def createDataSet():
    postingList=[['my', 'dog', 'has', 'flea', \
                  'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', \
                  'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', \
                   'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how',\
                   'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]
    return postingList,classVec
dataSet, labels = createDataSet()
dataSet
labels
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

[0, 1, 0, 1, 0, 1]
# Tool Function 1: Create a vocabulary list from the dataSet

def createVocabList(dataSet):
    vocabList = set([])
    for docum in dataSet:
        vocabList = vocabList | set(docum)  # union of the tokens in every document
    return list(vocabList)
vocabList = createVocabList(dataSet)
# Tool Function 2: Get a bag-of-words vector for each document
import numpy as np
def bagOfWordsVec(vocabList, document):
    # Start every count at 1 instead of 0 to avoid zero probabilities (see Trick 1 in Misc.)
    returnVec = np.ones(len(vocabList))
    for token in document:
        if token in vocabList:
            returnVec[vocabList.index(token)] += 1
    return returnVec
bagOfWordsVec(vocabList, dataSet[3])
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  2.,
        1.,  1.,  2.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  2.,  1.,  1.,  2.,  1.])
# Tool Function 3: Get BagOfWordsTable for Training Dataset

def getBagOfWordsTable(dataSet, vocabList, label):
    bagOfWordsTable = []
    for document in dataSet:
        bagOfWordsTable.append(bagOfWordsVec(vocabList, document))
    bagOfWordsTable = pd.DataFrame(bagOfWordsTable, columns=vocabList)
    bagOfWordsTable['label']= label
    return bagOfWordsTable
getBagOfWordsTable(dataSet, vocabList, labels)
(Output: a 6 rows × 33 columns DataFrame, one row per training document and one column per vocabulary word such as park, food, licks, him, ..., worthless, my, each cell holding that word's count + 1, plus a final `label` column.)

# Calculate Probabilities

bagOfWordsTable = getBagOfWordsTable(dataSet, vocabList, labels)
def getProb(c_i, bagOfWordsTable, testDataset):
    # Prior: fraction of training documents labeled c_i
    P_ci = bagOfWordsTable['label'][bagOfWordsTable.label==c_i].count() / bagOfWordsTable.shape[0]
    # Word counts only -- exclude the 'label' column from the probability sums
    wordCounts = bagOfWordsTable.drop(columns='label')
    wordCounts_ci = wordCounts[bagOfWordsTable.label==c_i]
    P_Xi_ci = wordCounts_ci.sum() / wordCounts_ci.sum().sum()
    P_Xi = wordCounts.sum() / wordCounts.sum().sum()

    predVec = []
    for document in testDataset:
        # Work in log space and exponentiate at the end (see Trick 2 in Misc.)
        predVec.append(np.exp(np.log(P_Xi_ci[document]).sum() + np.log(P_ci) - np.log(P_Xi[document]).sum()))
    return predVec

print("Predictions on Traing DataSet (The propability of each document being Class 1) :")
getProb(1, bagOfWordsTable,dataSet)

print("Real Classes of Traing DataSet")
labels

print("Not Bad!")
Predictions on Training DataSet (the probability of each document being Class 1):

[0.18178454867713456,
 1.2017140246697071,
 0.12570863705130642,
 1.1353438671320581,
 0.14790295856460187,
 1.4539243496229779]

Real Classes of Training DataSet

[0, 1, 0, 1, 0, 1]

Not Bad!
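As a usage sketch (my own addition, not part of the original notebook): to classify an unseen document, call getProb once per class and pick the class with the higher score. The tokenized test message below is hypothetical; its tokens must already exist in vocabList.

# Hedged usage sketch: classify a new (hypothetical) message by comparing class scores
testDoc = [['stupid', 'garbage', 'dog']]                       # one tokenized test document
scores = [getProb(c, bagOfWordsTable, testDoc)[0] for c in (0, 1)]
print("Predicted class:", int(np.argmax(scores)))              # likely 1 for these class-1-flavored tokens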

3. Misc.

Trick 1: Initialize the bag-of-words vector with 1s instead of 0s to prevent something like \(P(w_i|c_1) = 0\) from happening, which would drive the whole product, and hence the prediction, to 0.
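A quick illustration with made-up counts (my own sketch): a single zero count collapses the whole product of likelihoods, while 1-initialized counts keep it positive.

# Hypothetical per-word counts for one class; the third word was never seen in it
import numpy as np
counts_zero_init = np.array([3., 5., 0.])
counts_one_init = counts_zero_init + 1                         # Trick 1: start every count at 1
print(np.prod(counts_zero_init / counts_zero_init.sum()))      # 0.0 -- one zero kills the product
print(np.prod(counts_one_init / counts_one_init.sum()))        # ~0.018, a usable likelihood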

Trick 2: Probabilities range from 0 to 1, so when multiplying many of them together, like \(P(w_1) * P(w_2) * P(w_3) * ... * P(w_m)\), underflow tends to happen. To prevent this, apply log() to the right side of the equation \( P(c_i | w_1, w_2, w_3, ..., w_m) = \frac{(\ P(w_1 | c_i) * P(w_2 | c_i) * ... * P(w_m | c_i)\ ) * P(c_i)}{P(w_1) * P(w_2) * ... * P(w_m)} \), turning the products into sums of logs, and exp() the result only at the end.
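A small demonstration of the underflow (my own sketch, not from the original):

# 500 tiny probabilities: their raw product underflows, their log-sum does not
import numpy as np
probs = np.full(500, 1e-5)
print(np.prod(probs))                                          # 0.0 -- underflows float64
print(np.log(probs).sum())                                     # about -5756.5, easily representable
# Compare classes in log space and exponentiate (if at all) only at the very end.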

Reposted from: https://www.cnblogs.com/DianeSoHungry/p/11357240.html
