Getting Started with Machine Learning: the Naive Bayes Classifier (NBC)
Theoretical Foundations of the NBC Model
Theoretical basis: Bayes' theorem
$$P(C\vert F)=\frac{P(F\vert C)\times P(C)}{P(F)}$$
1. What the parameters mean
In practical terms, C is the first letter of "class", the category we want the model to predict; F is the first letter of "feature", the attributes we observe about that class (for example, for the class "handsome or not", we might have four features: height, looks, education, and build). A precondition for applying Bayes' formula this way is that the features $F_i$ are mutually independent.
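As a quick numeric illustration of the formula (all probabilities below are invented for the example, not taken from any dataset):

```python
# Toy check of Bayes' rule with made-up numbers
p_c = 0.3          # P(C): prior probability of the class
p_f_given_c = 0.8  # P(F|C): likelihood of the feature given the class
p_f = 0.5          # P(F): marginal probability of the feature

# Posterior via Bayes' rule: P(C|F) = P(F|C) * P(C) / P(F)
p_c_given_f = p_f_given_c * p_c / p_f
print(p_c_given_f)  # 0.48
```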
2. Parameters the model needs in the training phase
As a beginner in both machine learning and Python, it took me a long time to work out how the training dataset and testing dataset act on the formula above (at first I computed $P(C\vert F)$ directly from the training dataset, which was painful).
For $P(C\vert F)$, what we do with the training dataset is estimate the parameters of the probability distributions that $P(C\vert F)$ depends on. Here each feature is modeled with a normal distribution, which has two parameters, $\sigma$ and $\mu$, the standard deviation and the mean; since NumPy provides a variance function, we can compute the variance $\sigma^2$ directly. One more probability is needed: $P(C)$. It is comparatively simple: count how often each class appears in the training dataset and divide by the total number of training samples. In summary, the parameters needed in the training phase are $\sigma^2$, $\mu$, and $P(C)$.
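These three quantities can be estimated directly with NumPy; here is a minimal sketch on a made-up 2-feature, 2-class dataset (all numbers invented):

```python
import numpy as np

# Made-up training data: 4 samples, 2 features, labels in {0, 1}
Xtrain = np.array([[1.0, 2.0],
                   [2.0, 3.0],
                   [8.0, 9.0],
                   [9.0, 8.0]])
ytrain = np.array([0, 0, 1, 1])
num_classes = 2

# Per-class feature means (mu) and variances (sigma^2)
avg = np.array([Xtrain[ytrain == c].mean(axis=0) for c in range(num_classes)])
var = np.array([Xtrain[ytrain == c].var(axis=0) for c in range(num_classes)])
# Prior P(C): class counts divided by the number of training samples
prior = np.bincount(ytrain, minlength=num_classes) / len(ytrain)

print(avg)    # per-class means: [[1.5 2.5], [8.5 8.5]]
print(prior)  # [0.5 0.5]
```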
3. Prediction phase
Because the features are assumed independent, the joint likelihood factorizes:
$$P(F\vert C)=P(F_1\vert C)\times P(F_2\vert C)\times P(F_3\vert C)\times\cdots\times P(F_n\vert C)=\prod_{i=1}^{n}P(F_i\vert C)$$
Using the properties of logarithms, the product becomes a sum:
$$\log P(F\vert C)=\log\bigl[P(F_1\vert C)\times P(F_2\vert C)\times P(F_3\vert C)\times\cdots\times P(F_n\vert C)\bigr]=\sum_{i=1}^{n}\log P(F_i\vert C)$$
Then $\sum_{i=1}^{n}\log P(F_i\vert C)+\log P(C)$ gives the score we need ($P(F)$ is a constant with respect to the class, so there is no need to compute it). The class C with the largest score is the class the NBC predicts.
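In code, the per-class score is the log-prior plus the summed log-likelihoods. A small sketch with invented numbers (two classes, three features; the log-likelihood values are made up for illustration):

```python
import numpy as np

# Invented numbers: 2 classes, 3 features
prior = np.array([0.6, 0.4])              # P(C)
log_lik = np.array([[-1.0, -2.0, -0.5],   # log P(F_i | C=0)
                    [-0.2, -0.3, -0.4]])  # log P(F_i | C=1)

# Score per class: log P(C) + sum_i log P(F_i | C)
scores = np.log(prior) + log_lik.sum(axis=1)
predicted_class = np.argmax(scores)
print(predicted_class)  # 1 (class 1 wins despite its smaller prior)
```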
4. Code walkthrough
Since I have only just picked up Python, some of this code could probably be written more concisely; corrections from the experts are welcome.
The full NBC class:
import numpy as np
from numpy import sqrt, pi, exp, log

class NBC:
    def __init__(self, feature_types, num_classes, landa=1e-6):
        """
        Args:
            feature_types:
            num_classes:
            landa: avoid the scenario of log(0), default 1e-6
        """
        self.feature_types = feature_types
        self.num_classes = num_classes
        self.landa = landa
        self.avg = None
        self.var = None
        self.prior = None

    def fit(self, Xtrain, ytrain):
        """
        Xtrain holds the feature columns, ytrain is the label of every row;
        we use the training data to estimate the constants
        (average, variance, prior probability) needed to predict test datasets
        Args:
            Xtrain:
            ytrain:
        """
        self.prior = self.get_y_pri(ytrain)
        # per-class average of every feature
        self.avg = self.get_x_avg(Xtrain, ytrain)
        # per-class variance of every feature (var = std ** 2)
        self.var = self.get_x_var(Xtrain, ytrain)

    def predict_prob(self, Xtest):
        """
        calculate the per-class score of every row in the test dataset
        in order to choose the most likely label for that row
        Args:
            Xtest:
        Returns:
            array
        """
        # apply_along_axis feeds Xtest to get_likelihood one row at a time
        likelihood = np.apply_along_axis(self.get_likelihood, axis=1, arr=Xtest)
        # log-posterior up to a constant: log P(C) + sum_i log P(F_i | C)
        return np.log(self.prior) + likelihood

    def predict(self, Xtest):
        """
        choose the label with the largest score for each row, return the label array
        Args:
            Xtest:
        Returns:
            array
        """
        return np.apply_along_axis(self.get_prediction_label, axis=1, arr=self.predict_prob(Xtest))

    def get_prediction_label(self, prob_row):
        """
        get the label corresponding to the largest score in a row
        Args:
            prob_row:
        Returns:
            int
        """
        return np.argmax(prob_row)

    def get_count(self, ytrain, c):
        """
        count how often label c occurs in the train dataset
        Args:
            ytrain:
            c: class label
        Returns:
            int count
        """
        count = 0
        for y in ytrain:
            if y == c:
                count += 1
        return count

    def get_y_pri(self, ytrain):
        """
        get the prior probability of every label
        Args:
            ytrain:
        Returns:
            array
        """
        ytrain_len = len(ytrain)
        res = []
        for y in range(self.num_classes):
            pri_p = self.get_count(ytrain, y) / ytrain_len
            res.append(pri_p)
        return np.array(res)

    def get_x_var(self, Xtrain, ytrain):
        """
        get the per-class variance of every feature in the train dataset,
        needed for predicting the test dataset
        Args:
            Xtrain:
            ytrain:
        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].var(axis=0))
        return np.array(res)

    def get_likelihood(self, label_row):
        """
        get the log-likelihood of one row of the test dataset;
        the landa term keeps the Gaussian density away from zero before taking logs
        Args:
            label_row:
        Returns:
            array
        """
        # landa is important: without it the density can underflow to 0
        gauss_dis = (1 / sqrt(2 * pi * self.var) * exp(-1 * (label_row - self.avg) ** 2 / (2 * self.var))) + self.landa
        # log(abc) = log(a) + log(b) + log(c)
        return (log(gauss_dis)).sum(axis=1)

    def get_x_avg(self, Xtrain, ytrain):
        """
        get the per-class average of every feature in the train dataset,
        needed for predicting the test dataset
        Args:
            Xtrain:
            ytrain:
        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].mean(axis=0))
        return np.array(res)
First, the NBC class constructor:
def __init__(self, feature_types, num_classes, landa=1e-6):
    """
    Args:
        feature_types:
        num_classes:
        landa: avoid the scenario of log(0), default 1e-6
    """
    self.feature_types = feature_types
    self.num_classes = num_classes
    self.landa = landa
    self.avg = None
    self.var = None
    self.prior = None
Matching the parameters summarized in Section 2: avg is $\mu$, var is $\sigma^2$, and prior is $P(C)$. There is one extra parameter with a default value, landa, which prevents $\log 0$ from ever appearing.
Training phase
def fit(self, Xtrain, ytrain):
    """
    Xtrain holds the feature columns, ytrain is the label of every row;
    we use the training data to estimate the constants
    (average, variance, prior probability) needed to predict test datasets
    Args:
        Xtrain:
        ytrain:
    """
    self.prior = self.get_y_pri(ytrain)
    # per-class average of every feature
    self.avg = self.get_x_avg(Xtrain, ytrain)
    # per-class variance of every feature (var = std ** 2)
    self.var = self.get_x_var(Xtrain, ytrain)
Getting the prior: the get_y_pri() method takes the training labels; get_count tallies how many samples carry a given class, and dividing by the total number of samples gives that class's prior probability.
def get_y_pri(self, ytrain):
    """
    get the prior probability of every label
    Args:
        ytrain:
    Returns:
        array
    """
    ytrain_len = len(ytrain)
    res = []
    for y in range(self.num_classes):
        pri_p = self.get_count(ytrain, y) / ytrain_len
        res.append(pri_p)
    return np.array(res)

def get_count(self, ytrain, c):
    """
    count how often label c occurs in the train dataset
    Args:
        ytrain:
        c: class label
    Returns:
        int count
    """
    count = 0
    for y in ytrain:
        if y == c:
            count += 1
    return count
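As an aside, the get_count loop can be replaced with a single call to np.bincount, which counts every label at once; a minimal sketch with an invented label array:

```python
import numpy as np

ytrain = np.array([0, 1, 1, 2, 2, 2])  # made-up example labels
num_classes = 3

# One vectorized call instead of a Python loop per class;
# minlength guarantees a slot even for classes that never occur
prior = np.bincount(ytrain, minlength=num_classes) / len(ytrain)
print(prior)  # [0.16666667 0.33333333 0.5]
```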
Getting the mean $\mu$:
def get_x_avg(self, Xtrain, ytrain):
    """
    get the per-class average of every feature in the train dataset,
    needed for predicting the test dataset
    Args:
        Xtrain:
        ytrain:
    Returns:
        array
    """
    res = []
    for i in range(self.num_classes):
        res.append(Xtrain[ytrain == i].mean(axis=0))
    return np.array(res)
Getting the variance $\sigma^2$:
def get_x_var(self, Xtrain, ytrain):
    """
    get the per-class variance of every feature in the train dataset,
    needed for predicting the test dataset
    Args:
        Xtrain:
        ytrain:
    Returns:
        array
    """
    res = []
    for i in range(self.num_classes):
        res.append(Xtrain[ytrain == i].var(axis=0))
    return np.array(res)
That completes the training of the NBC; now for the exciting part, prediction.
Prediction phase
def predict(self, Xtest):
    """
    choose the label with the largest score for each row, return the label array
    Args:
        Xtest:
    Returns:
        array
    """
    return np.apply_along_axis(self.get_prediction_label, axis=1, arr=self.predict_prob(Xtest))

def predict_prob(self, Xtest):
    """
    calculate the per-class score of every row in the test dataset
    in order to choose the most likely label for that row
    Args:
        Xtest:
    Returns:
        array
    """
    # apply_along_axis feeds Xtest to get_likelihood one row at a time
    likelihood = np.apply_along_axis(self.get_likelihood, axis=1, arr=Xtest)
    # log-posterior up to a constant: log P(C) + sum_i log P(F_i | C)
    return np.log(self.prior) + likelihood
The predict_prob method computes the score from Section 3, i.e. the log-prior plus the summed log-likelihoods $\log P(F_i\vert C)$. The testing dataset is sliced row by row, and each row is passed to get_likelihood, whose code follows:
def get_likelihood(self, label_row):
    """
    get the log-likelihood of one row of the test dataset;
    the landa term keeps the Gaussian density away from zero before taking logs
    Args:
        label_row:
    Returns:
        array
    """
    # landa is important: without it the density can underflow to 0
    gauss_dis = (1 / sqrt(2 * pi * self.var) * exp(-1 * (label_row - self.avg) ** 2 / (2 * self.var))) + self.landa
    # log(abc) = log(a) + log(b) + log(c)
    return (log(gauss_dis)).sum(axis=1)
The normal (Gaussian) density:
$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
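A quick sanity check of the density formula: at $x=\mu$ the exponent vanishes, so the density should equal $1/\sqrt{2\pi\sigma^2}$. A small sketch (function name is mine, for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var (sigma^2)."""
    return 1.0 / np.sqrt(2 * np.pi * var) * np.exp(-(x - mu) ** 2 / (2 * var))

# At the mean of a standard normal, the density is 1/sqrt(2*pi)
val = gaussian_pdf(0.0, 0.0, 1.0)
print(val)  # ~0.3989, the standard normal peak
```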
After taking logs and summing, the argmax function picks the class with the largest score:
def get_prediction_label(self, prob_row):
    """
    get the label corresponding to the largest score in a row
    Args:
        prob_row:
    Returns:
        int
    """
    return np.argmax(prob_row)
Tested on real data, the accuracy consistently stays above 90%.
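For readers who want to try the idea end to end, here is a condensed, self-contained sketch of the same training-then-predict pipeline on a tiny made-up dataset (all data invented; this is not the class above, just the same math inlined):

```python
import numpy as np

# Made-up training data: 2 well-separated classes, 2 features
Xtrain = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
ytrain = np.array([0, 0, 1, 1])
num_classes, landa = 2, 1e-6

# Training: per-class mean, variance, and prior
avg = np.array([Xtrain[ytrain == c].mean(axis=0) for c in range(num_classes)])
var = np.array([Xtrain[ytrain == c].var(axis=0) for c in range(num_classes)])
prior = np.bincount(ytrain, minlength=num_classes) / len(ytrain)

def predict(row):
    # Gaussian likelihood per class/feature, smoothed to avoid log(0)
    gauss = 1 / np.sqrt(2 * np.pi * var) * np.exp(-(row - avg) ** 2 / (2 * var)) + landa
    # Score: log-prior + summed log-likelihoods; argmax picks the class
    return np.argmax(np.log(prior) + np.log(gauss).sum(axis=1))

print(predict(np.array([1.5, 1.5])))  # 0
print(predict(np.array([8.5, 8.5])))  # 1
```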
Code: https://github.com/hz920120/python_polyu/blob/master/polyu/NBC.py
If you spot any mistakes, corrections are welcome! 😊