ML Intro 3.0: Hand-Written Naive Bayes (Naïve Bayes)
Introduction to Naive Bayes
Bayesian classification is the general term for classification algorithms built on a probabilistic framework; at their core they apply Bayes' theorem to the classification problem. Naive Bayes is the simplest member of this family.
Naive Bayes has been studied extensively since the 1950s. In the early 1960s it was introduced, under a different name, into the text information retrieval community, and it remains a popular (baseline) method for text categorization: the problem of judging which category a document belongs to (spam vs. legitimate, sports vs. politics, and so on) using word frequencies as features. With appropriate preprocessing it is competitive with more advanced methods in this domain, including support vector machines. It also finds application in automatic medical diagnosis. (Baidu Baike)
How Naive Bayes Works
Algorithm Idea
The basic idea of naive Bayes is: first assume that the components of the feature vector are mutually independent (this is where "naive" comes from); then, for an item to be classified, compute the probability of each class conditioned on the item's feature vector and output the class with the larger probability.
Bayes' Theorem
$$P(AB) = P(A)P(B|A) = P(B)P(A|B)$$
$$P(B|A) = \frac{P(B)}{P(A)}P(A|B)$$
$$P(B|A_1 A_2 \dots A_n) = \frac{P(B)}{P(A_1 A_2 A_3 \dots A_n)} P(A_1 A_2 A_3 \dots A_n | B)$$
$$P(\overline{B}|A_1 A_2 \dots A_n) = \frac{P(\overline{B})}{P(A_1 A_2 A_3 \dots A_n)} P(A_1 A_2 A_3 \dots A_n | \overline{B})$$
Assuming the features are mutually independent:
$$P'(B|A_1 A_2 \dots A_n) = P(B)P(A_1|B)P(A_2|B)\dots P(A_n|B) = P(B)\prod_{i=1}^{n}P(A_i|B)$$
$$P'(\overline{B}|A_1 A_2 \dots A_n) = P(\overline{B})P(A_1|\overline{B})P(A_2|\overline{B})\dots P(A_n|\overline{B}) = P(\overline{B})\prod_{i=1}^{n}P(A_i|\overline{B})$$
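With the independence assumption in place, classification reduces to computing $P(B)\prod_i P(A_i|B)$ for each class and picking the larger score. A minimal sketch of this decision rule, using made-up illustrative probabilities (not taken from any dataset):

```python
# Toy naive Bayes decision: two classes, three observed feature values.
# All probabilities below are made-up illustrative numbers.
prior = {"B": 0.6, "notB": 0.4}
# cond[c][i] = P(A_i = observed value | class c)
cond = {"B": [0.8, 0.5, 0.9], "notB": [0.3, 0.7, 0.2]}

def nb_score(c):
    # P(c) * product of the conditional probabilities under class c.
    score = prior[c]
    for p in cond[c]:
        score *= p
    return score

scores = {c: nb_score(c) for c in prior}
best = max(scores, key=scores.get)
print(best, scores)  # "B" wins: 0.216 vs 0.0168
```

Note that the shared denominator $P(A_1 A_2 \dots A_n)$ is never computed: it is the same for both classes, so comparing the numerators is enough.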
Laplace Smoothing
In the naive Bayes computation above, a situation can arise where some $p(A_i|B)$ equals 0, which drives the probability of the whole item to 0. To avoid this, a simple technique is introduced: Laplacian smoothing. The idea is to add 1 to the numerator and add the number of decision classes to the denominator.
$$P(A_i|B) = \frac{|\{x \mid a(x) = v_i,\ d(x) = l\}|}{|\{x \mid d(x) = l\}|}$$
where a(x) is the value of object x on attribute a, and d(x) is the class of x.
Laplacian smoothing:
$$P'(A_i|B) = \frac{|\{x \mid a(x) = v_i,\ l(x) = l\}| + 1}{|\{x \mid l(x) = l\}| + |V_d|}$$
where $|V_d|$ is the number of decision classes, and:
$$P'(B) = \frac{|\{x \mid d(x) = l\}| + 1}{m + |V_d|}$$
where m is the size of the training set.
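A quick numeric check of the smoothed estimate, following this post's convention of adding $|V_d|$ (the number of decision classes) to the denominator. Suppose class l has 10 training objects, none of which takes value $v_i$ on attribute a, and there are 2 classes:

```python
def laplace_cond(count_value_in_class, count_class, num_classes):
    # P'(A_i | B) = (|{x : a(x)=v_i, l(x)=l}| + 1) / (|{x : l(x)=l}| + |V_d|)
    return (count_value_in_class + 1) / (count_class + num_classes)

# Without smoothing this would be 0/10 = 0 and zero out the whole product;
# with smoothing it stays strictly positive.
p = laplace_cond(0, 10, 2)
print(p)  # (0 + 1) / (10 + 2) = 1/12
```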
Worked example:
A very detailed and clear example is given here:
the "should she marry" decision (带你理解朴素贝叶斯分类算法 - 忆臻的文章 - 知乎)
Hand-Written Naive Bayes Classifier:
Dataset
The dataset used here is mushroom.csv; see my personal GitHub for details.
Functions
Func1: readNominalData(paraFilename) - load the dataset
import numpy as np

def readNominalData(paraFilename):
    '''
    Read nominal data from paraFilename.
    :param paraFilename: the data file
    :return: resultNames, resultData
    '''
    resultData = []
    tempFile = open(paraFilename)
    # The first line holds the feature names.
    tempLine = tempFile.readline().replace('\n', '')
    tempNames = np.array(tempLine.split(','))
    resultNames = [tempValue for tempValue in tempNames]
    # Each remaining line is one instance.
    tempLine = tempFile.readline().replace('\n', '')
    while tempLine != '':
        tempValues = np.array(tempLine.split(','))
        tempArray = [tempValue for tempValue in tempValues]
        resultData.append(tempArray)
        tempLine = tempFile.readline().replace('\n', '')
    tempFile.close()
    return resultNames, resultData
Func2: obtainFeaturesValues(paraDataset) - build the matrix of feature values
def obtainFeaturesValues(paraDataset):
    '''
    Collect the distinct values of every feature of the dataset into a matrix.
    :param paraDataset: the current dataset
    :return: the resulting matrix
    '''
    resultMatrix = []
    for i in range(len(paraDataset[0])):
        featureValues = [example[i] for example in paraDataset]  # all values of this feature
        uniqueValues = set(featureValues)
        currentValues = [tempValue for tempValue in uniqueValues]
        resultMatrix.append(currentValues)
    return resultMatrix
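For instance, on a tiny hypothetical dataset (not the mushroom data), obtainFeaturesValues collects the distinct values of each column; since it goes through set(), the order within each row is arbitrary. A self-contained sketch repeating the same logic:

```python
def obtainFeaturesValues(paraDataset):
    # Same logic as above: the distinct values of each column, one row per feature.
    resultMatrix = []
    for i in range(len(paraDataset[0])):
        featureValues = [example[i] for example in paraDataset]
        resultMatrix.append(list(set(featureValues)))
    return resultMatrix

# Hypothetical 3-instance dataset: two features plus the class column.
toyData = [["sunny", "hot", "no"],
           ["rainy", "hot", "yes"],
           ["sunny", "cool", "yes"]]
matrix = obtainFeaturesValues(toyData)
print([sorted(row) for row in matrix])
# [['rainy', 'sunny'], ['cool', 'hot'], ['no', 'yes']]
```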
Func3: calculateClassCounts(paraData, paraValuesMatrix) - count the instances of each class
def calculateClassCounts(paraData, paraValuesMatrix):
    '''
    Count how many instances each class has.
    :param paraData: the dataset
    :param paraValuesMatrix: the matrix of feature values
    :return: the counts, keyed by class value
    '''
    classCount = {}
    tempNumInstances = len(paraData)
    for i in range(tempNumInstances):
        tempClass = paraData[i][-1]
        if tempClass not in classCount.keys():
            classCount[tempClass] = 0
        classCount[tempClass] += 1
    return classCount
Func4: calculateClassDistributionLaplacian(paraData, paraValuesMatrix) - compute the class priors with Laplacian smoothing
def calculateClassDistributionLaplacian(paraData, paraValuesMatrix):
    '''
    Compute the probability of each class, with Laplacian smoothing.
    :param paraData: the dataset
    :param paraValuesMatrix: the matrix of feature values
    :return: the probability of each class
    '''
    classCount = {}
    tempNumInstances = len(paraData)
    tempNumClasses = len(paraValuesMatrix[-1])
    for i in range(tempNumInstances):
        tempClass = paraData[i][-1]
        if tempClass not in classCount.keys():
            classCount[tempClass] = 0
        classCount[tempClass] += 1
    resultClassDistribution = []
    for tempValue in paraValuesMatrix[-1]:
        # P'(B) = (count + 1) / (m + |V_d|)
        resultClassDistribution.append((classCount[tempValue] + 1.0) / (tempNumInstances + tempNumClasses))
    return resultClassDistribution
Func5: calculateConditionalDistributionLaplacian(paraData, paraValuesMatrix, paraMappings) - compute the conditional probabilities with Laplacian smoothing
def calculateConditionalDistributionLaplacian(paraData, paraValuesMatrix, paraMappings):
    '''
    Compute the Laplacian-smoothed conditional probabilities.
    :param paraData: the dataset
    :param paraValuesMatrix: the matrix of attribute values
    :param paraMappings: the value-to-index mappings
    :return: the conditional probability of every attribute value
    '''
    tempNumInstances = len(paraData)
    tempNumConditions = len(paraData[0]) - 1
    tempNumClasses = len(paraValuesMatrix[-1])

    # Step 1. Allocate space.
    tempCountCubic = []
    resultDistributionsLaplacianCubic = []
    for i in range(tempNumClasses):
        tempMatrix = []
        tempMatrix2 = []
        # Over all conditions.
        for j in range(tempNumConditions):
            # Over all values.
            tempNumValues = len(paraValuesMatrix[j])
            tempArray = [0.0] * tempNumValues
            tempArray2 = [0.0] * tempNumValues
            tempMatrix.append(tempArray)
            tempMatrix2.append(tempArray2)
        tempCountCubic.append(tempMatrix)
        resultDistributionsLaplacianCubic.append(tempMatrix2)

    # Step 2. Scan the dataset: within each class, count how many instances
    # take each value of each feature (e.g. given class p, x instances take
    # value a, x1 instances take value b, ...).
    for i in range(tempNumInstances):
        tempClass = paraData[i][-1]
        tempIntClass = paraMappings[tempNumConditions][tempClass]  # the class index
        for j in range(tempNumConditions):
            tempValue = paraData[i][j]
            tempIntValue = paraMappings[j][tempValue]  # the index of this feature value
            tempCountCubic[tempIntClass][j][tempIntValue] += 1

    # Step 3. Turn the counts into probabilities with Laplacian smoothing.
    tempClassCounts = [0] * tempNumClasses
    for i in range(tempNumInstances):
        tempValue = paraData[i][-1]
        tempIntValue = paraMappings[tempNumConditions][tempValue]
        tempClassCounts[tempIntValue] += 1
    for i in range(tempNumClasses):
        for j in range(tempNumConditions):
            for k in range(len(tempCountCubic[i][j])):
                resultDistributionsLaplacianCubic[i][j][k] = (tempCountCubic[i][j][k] + 1) / (tempClassCounts[i] + tempNumClasses)
    return resultDistributionsLaplacianCubic
Func6: nbClassify(paraTestData, paraValueMatrix, paraClassValues, paraMappings, paraClassDistribution, paraDistributionCubic) - classify
def nbClassify(paraTestData, paraValueMatrix, paraClassValues, paraMappings, paraClassDistribution, paraDistributionCubic):
    '''
    Classify the test data and return the accuracy.
    :param paraTestData: the test set
    :param paraValueMatrix: the matrix of feature values
    :param paraClassValues: the class counts
    :param paraMappings: the value-to-index mappings
    :param paraClassDistribution: the smoothed class priors
    :param paraDistributionCubic: the smoothed conditional probabilities
    :return: the accuracy
    '''
    tempCorrect = 0.0
    tempNumInstances = len(paraTestData)
    tempNumConditions = len(paraTestData[0]) - 1
    tempNumClasses = len(paraValueMatrix[-1])
    for featureVector in paraTestData:
        tempActualLabel = paraMappings[tempNumConditions][featureVector[-1]]
        tempBiggest = -1000
        tempBest = -1
        for i in range(tempNumClasses):
            # Work in log space: sum the logs instead of multiplying probabilities.
            tempPro = np.log(paraClassDistribution[i])
            for j in range(tempNumConditions):
                tempValue = featureVector[j]
                tempIntValue = paraMappings[j][tempValue]
                tempPro += np.log(paraDistributionCubic[i][j][tempIntValue])
            if tempBiggest < tempPro:
                tempBiggest = tempPro
                tempBest = i
        if tempBest == tempActualLabel:
            tempCorrect += 1
    return tempCorrect / tempNumInstances
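nbClassify sums np.log values rather than multiplying the raw probabilities. This is not cosmetic: with many features, a product of small probabilities underflows to 0.0 in double precision, while the equivalent log-sum stays finite. A quick illustration:

```python
import math

# 100 features, each contributing a conditional probability of 1e-5.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # (1e-5)**100 = 1e-500, below the double-precision minimum: underflows to 0.0

log_sum = sum(math.log(p) for p in probs)  # stays finite (about -1151.3)

print(product, log_sum)
```

Since log is monotonic, comparing log-scores picks the same winning class as comparing the products would (if they were representable).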
Func7: STNBTest(paraFileName) - test driver
def STNBTest(paraFileName):
    featureNames, dataSet = readNominalData(paraFileName)
    print("Feature Names = ", featureNames)
    valuesMatrix = obtainFeaturesValues(dataSet)
    tempMappings = calculateMappings(valuesMatrix)
    classSumValues = calculateClassCounts(dataSet, valuesMatrix)
    classDistribution = calculateClassDistributionLaplacian(dataSet, valuesMatrix)
    print("classDistribution = ", classDistribution)
    conditionalDistributions = calculateConditionalDistributionLaplacian(dataSet, valuesMatrix, tempMappings)
    tempAccuracy = nbClassify(dataSet, valuesMatrix, classSumValues, tempMappings, classDistribution, conditionalDistributions)
    print("The accuracy of NB classifier is {}".format(tempAccuracy))
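STNBTest calls calculateMappings, which is not listed in this post (see the full code on GitHub). From how paraMappings is indexed elsewhere (paraMappings[j][value] yields an integer index into row j of the values matrix), a hypothetical sketch consistent with that usage would be:

```python
def calculateMappings(paraValuesMatrix):
    # Hypothetical reconstruction: for each feature, map every distinct value
    # to its position in the corresponding row of the values matrix, matching
    # how paraMappings[j][tempValue] is used in the functions above.
    resultMappings = []
    for row in paraValuesMatrix:
        resultMappings.append({value: index for index, value in enumerate(row)})
    return resultMappings

# e.g. a class column with values ['p', 'e'] maps to {'p': 0, 'e': 1}
print(calculateMappings([['p', 'e']]))
```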
Run results
Full code + dataset: see my personal GitHub.
Strengths and Weaknesses
Strengths:
(1) The algorithm is logically simple and easy to implement.
(2) Classification has low time and space overhead.
Weaknesses:
The naive Bayes model assumes that the attributes are mutually independent, which rarely holds in practice; when there are many attributes or the attributes are strongly correlated, classification quality suffers.