First Steps in Machine Learning: The Naive Bayes Classifier (NBC), with a Python Code Walkthrough

Theoretical Foundations of the NBC Model

Theoretical basis: Bayes' theorem

$$P(C \mid F) = \frac{P(F \mid C) \times P(C)}{P(F)}$$

1. Meaning of the terms

In practical terms, C is the first letter of "class", the category we want the model to predict; F is the first letter of "feature", the attributes we observe about each class (for example, for the class "handsome or not", we might have four features: height, looks, education, and build). A precondition for applying Bayes' theorem this way is that the features F are mutually independent.
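As a minimal illustration with entirely made-up numbers: suppose $P(\text{handsome}) = 0.3$, $P(\text{tall} \mid \text{handsome}) = 0.8$, and $P(\text{tall}) = 0.5$. Then

$$P(\text{handsome} \mid \text{tall}) = \frac{P(\text{tall} \mid \text{handsome}) \times P(\text{handsome})}{P(\text{tall})} = \frac{0.8 \times 0.3}{0.5} = 0.48$$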

2. Parameters the model needs during the training phase

As a beginner in machine learning and Python, it took me a long time to understand how the training dataset and the testing dataset relate to the formula above (at first I computed $P(C \mid F)$ directly from the training dataset, which was a painful mistake).
For $P(C \mid F)$, what we actually do is use the training dataset to estimate the parameters its probability distribution depends on. Here each feature likelihood is modeled as a normal distribution, which has two parameters:

$$\sigma, \mu$$

the standard deviation and the mean. Since NumPy has a built-in variance function, we can compute the variance $\sigma^2$ directly. One more probability is needed: $P(C)$. This one is straightforward: count how many times each class appears in the training dataset and divide by the total number of training samples. To sum up, the parameters we need from the training phase are $\sigma^2$, $\mu$, and $P(C)$.
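A minimal NumPy sketch of what the training phase has to produce (the toy data below is made up purely for illustration):

import numpy as np

# toy training set: 5 samples, 2 features, 2 classes
Xtrain = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.1], [2.9, 4.0], [3.1, 3.8]])
ytrain = np.array([0, 0, 1, 1, 1])

for c in range(2):
    Xc = Xtrain[ytrain == c]
    # per-class mu, sigma^2, and prior P(C)
    print(c, Xc.mean(axis=0), Xc.var(axis=0), len(Xc) / len(ytrain))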

3. Prediction phase

Because the features are assumed independent given the class, the joint likelihood factorizes:

$$P(F \mid C) = P(F_1 \mid C) \times P(F_2 \mid C) \times P(F_3 \mid C) \times \cdots \times P(F_n \mid C) = \prod_{i=1}^{n} P(F_i \mid C)$$

Using the properties of logarithms, the product becomes a sum:
$$\log P(F \mid C) = \log\big[P(F_1 \mid C) \times P(F_2 \mid C) \times P(F_3 \mid C) \times \cdots \times P(F_n \mid C)\big] = \sum_{i=1}^{n} \log P(F_i \mid C)$$

Next, combining $\sum_{i=1}^{n} \log P(F_i \mid C)$ with the prior (in log space this means adding $\log P(C)$ rather than multiplying by $P(C)$) gives the score we need ($P(F)$ is a constant relative to the numerator, so it can be skipped). The C with the largest score is the class the NBC predicts, as the decision rule below summarizes.
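Putting the pieces together, the decision rule is:

$$\hat{C} = \underset{C}{\arg\max}\left[\log P(C) + \sum_{i=1}^{n} \log P(F_i \mid C)\right]$$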

4. Code walkthrough

Since I've only just picked up Python, some of this code could probably be written more concisely; experienced readers, feel free to point things out.

The full NBC class:

import numpy as np
from numpy import sqrt, exp, log, pi


class NBC:
    def __init__(self, feature_types, num_classes, landa=1e-6):
        """

        Args:
            feature_types: type tag for each feature (stored, not used below)
            num_classes: number of distinct class labels
            landa: smoothing constant to avoid log(0), default 1e-6
        """
        self.feature_types = feature_types
        self.num_classes = num_classes
        self.landa = landa
        self.avg = None
        self.var = None
        self.prior = None

    def fit(self, Xtrain, ytrain):
        """
        Xtrain holds the feature columns, ytrain is the label of every row;
        we use the training data to estimate the CONSTANTS (average, variance, prior probability)
        needed to predict on test datasets
        Args:
            Xtrain:
            ytrain:

        Returns:

        """
        self.prior = self.get_y_pri(ytrain)
        # per-class average of every feature
        self.avg = self.get_x_avg(Xtrain, ytrain)
        # per-class variance of every feature
        # var = power(std, 2)
        self.var = self.get_x_var(Xtrain, ytrain)

    def predict_prob(self, Xtest):
        """
        calculate the (log-space) posterior score of every row in the test dataset
        in order to choose the closest label for that row
        Args:
            Xtest:

        Returns:
            array
        """
        # apply_along_axis slices Xtest into rows so the likelihood is easier to compute
        likelihood = np.apply_along_axis(self.get_likelihood, axis=1, arr=Xtest)
        # stay in log space: log P(C) + sum_i log P(F_i|C)
        return np.log(self.prior) + likelihood

    def predict(self, Xtest):
        """
        choose the largest score as the label of each row, return the label array
        Args:
            Xtest:

        Returns:
            array
        """
        return np.apply_along_axis(self.get_prediction_label, axis=1, arr=self.predict_prob(Xtest))

    def get_prediction_label(self, prob_row):
        """
        get the corresponding label of the largest score in each row
        Args:
            prob_row:

        Returns:
            int index of the most probable class
        """
        return np.argmax(prob_row)

    def get_count(self, ytrain, c):
        """
        get the total count of a given label in the train dataset
        Args:
            ytrain:
            c: class label

        Returns:
            int count
        """
        count = 0
        for y in ytrain:
            if y == c:
                count += 1
        return count

    def get_y_pri(self, ytrain):
        """
        get prior probability of all labels
        Args:
            ytrain:

        Returns:
            array
        """
        ytrain_len = len(ytrain)
        res = []
        for y in range(self.num_classes):
            pri_p = self.get_count(ytrain, y) / ytrain_len
            res.append(pri_p)
        return np.array(res)

    def get_x_var(self, Xtrain, ytrain):
        """
        get variance of every feature in the train dataset,
        the result is necessary for predicting test dataset
        Args:
            Xtrain:
            ytrain:

        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].var(axis=0))
        return np.array(res)

    def get_likelihood(self, label_row):
        """
        get the log-likelihood of one row of the test dataset

        we add the landa parameter manually because the Gaussian density
        can underflow to zero, which would make log() blow up
        Args:
            label_row: one row of features

        Returns:
            array of per-class log-likelihoods
        """

        # landa keeps the density strictly positive before taking the log
        gauss_dis = (1 / sqrt(2 * pi * self.var) * exp(-1 * (label_row - self.avg) ** 2 / (2 * self.var))) + self.landa
        # log(abc) = log a + log b + log c
        return (log(gauss_dis)).sum(axis=1)

    def get_x_avg(self, Xtrain, ytrain):
        """
        get average of every feature in the train dataset,
        the result is necessary for predicting test dataset
        Args:
            Xtrain:
            ytrain:

        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].mean(axis=0))
        return np.array(res)
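A quick usage sketch (assuming scikit-learn is installed for the Iris data; the feature_types argument is a placeholder, since the class stores it without using it):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

nbc = NBC(feature_types=['real'] * 4, num_classes=3)
nbc.fit(Xtr, ytr)
yhat = nbc.predict(Xte)
print('accuracy:', (yhat == yte).mean())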

First, the constructor of the NBC class:

def __init__(self, feature_types, num_classes, landa=1e-6):
        """

        Args:
            feature_types: type tag for each feature (stored, not used below)
            num_classes: number of distinct class labels
            landa: smoothing constant to avoid log(0), default 1e-6
        """
        self.feature_types = feature_types
        self.num_classes = num_classes
        self.landa = landa
        self.avg = None
        self.var = None
        self.prior = None

Matching the parameters summarized in section 2: avg is $\mu$, var is $\sigma^2$, and prior is $P(C)$. There is one extra parameter with a default value, landa, which prevents $\log 0$ from ever occurring.
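For instance (hypothetical arguments, since feature_types is only stored):

nbc = NBC(feature_types=['real'] * 4, num_classes=3)  # landa keeps its 1e-6 default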


Training phase

def fit(self, Xtrain, ytrain):
        """
        Xtrain holds the feature columns, ytrain is the label of every row;
        we use the training data to estimate the CONSTANTS (average, variance, prior probability)
        needed to predict on test datasets
        Args:
            Xtrain:
            ytrain:

        Returns:

        """
        self.prior = self.get_y_pri(ytrain)
        # per-class average of every feature
        self.avg = self.get_x_avg(Xtrain, ytrain)
        # per-class variance of every feature
        # var = power(std, 2)
        self.var = self.get_x_var(Xtrain, ytrain)

Getting the prior probabilities: the get_y_pri() method takes the training labels as its argument; get_count tallies the number of samples belonging to each class, and dividing by the total number of samples gives that class's prior probability. (A loop-free alternative is sketched after the two methods below.)

    def get_y_pri(self, ytrain):
        """
        get prior probability of all labels
        Args:
            ytrain:

        Returns:
            array
        """
        ytrain_len = len(ytrain)
        res = []
        for y in range(self.num_classes):
            pri_p = self.get_count(ytrain, y) / ytrain_len
            res.append(pri_p)
        return np.array(res)


    def get_count(self, ytrain, c):
        """
        get the total count of a given label in the train dataset
        Args:
            ytrain:
            c: class label

        Returns:
            int count
        """
        count = 0
        for y in ytrain:
            if y == c:
                count += 1
        return count
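As an aside, the loop over ytrain can be replaced by a single NumPy call; a sketch, assuming the labels are integers 0..num_classes-1 (the toy labels here are made up):

import numpy as np

ytrain = np.array([0, 0, 1, 2, 2, 2])
num_classes = 3
# counts each label, then normalizes: equivalent to get_y_pri + get_count
prior = np.bincount(ytrain, minlength=num_classes) / len(ytrain)
print(prior)  # [0.3333... 0.1666... 0.5]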

Getting the mean $\mu$:

    def get_x_avg(self, Xtrain, ytrain):
        """
        get average of every feature in the train dataset,
        the result is necessary for predicting test dataset
        Args:
            Xtrain:
            ytrain:

        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].mean(axis=0))
        return np.array(res)

Getting the variance $\sigma^2$:

    def get_x_var(self, Xtrain, ytrain):
        """
        get variance of every feature in the train dataset,
        the result is necessary for predicting test dataset
        Args:
            Xtrain:
            ytrain:

        Returns:
            array
        """
        res = []
        for i in range(self.num_classes):
            res.append(Xtrain[ytrain == i].var(axis=0))
        return np.array(res)
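One caveat: if a feature is constant within a class, its variance is 0 and the Gaussian density divides by zero. scikit-learn's GaussianNB guards against this with a var_smoothing term; a minimal sketch of the same guard (the epsilon and the per-class variances below are made up):

import numpy as np

var = np.array([[0.0, 0.25], [0.09, 0.16]])  # hypothetical per-class variances
var = var + 1e-9 * var.max()  # tiny epsilon, same idea as sklearn's var_smoothing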

That completes training the NBC; now on to the exciting part: prediction.

Prediction phase

    def predict(self, Xtest):
        """
        choose the largest score as the label of each row, return the label array
        Args:
            Xtest:

        Returns:
            array
        """
        return np.apply_along_axis(self.get_prediction_label, axis=1, arr=self.predict_prob(Xtest))

    def predict_prob(self, Xtest):
        """
        calculate the (log-space) posterior score of every row in the test dataset
        in order to choose the closest label for that row
        Args:
            Xtest:

        Returns:
            array
        """
        # apply_along_axis slices Xtest into rows so the likelihood is easier to compute
        likelihood = np.apply_along_axis(self.get_likelihood, axis=1, arr=Xtest)
        # stay in log space: log P(C) + sum_i log P(F_i|C)
        return np.log(self.prior) + likelihood

The predict_prob function computes, for each row, the $P(F_i \mid C)$ terms combined with the prior probability. Because get_likelihood works in log space, the code adds $\log P(C)$ to $\sum_i \log P(F_i \mid C)$ rather than multiplying by the prior. The testing dataset is sliced into rows, and get_likelihood is called on each; the get_likelihood code is as follows:

    def get_likelihood(self, label_row):
        """
        get the log-likelihood of one row of the test dataset

        we add the landa parameter manually because the Gaussian density
        can underflow to zero, which would make log() blow up
        Args:
            label_row: one row of features

        Returns:
            array of per-class log-likelihoods
        """

        # landa keeps the density strictly positive before taking the log
        gauss_dis = (1 / sqrt(2 * pi * self.var) * exp(-1 * (label_row - self.avg) ** 2 / (2 * self.var))) + self.landa
        # log(abc) = log a + log b + log c
        return (log(gauss_dis)).sum(axis=1)

The normal density formula (note the negative exponent, which the code implements):

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

We take its log and sum over the features; after the return, argmax picks the class with the largest score:

    def get_prediction_label(self, prob_row):
        """
        get the corresponding label of the largest score in each row
        Args:
            prob_row:

        Returns:
            int index of the most probable class
        """
        return np.argmax(prob_row)
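To sanity-check get_likelihood, the per-class log-likelihoods can be compared against scipy.stats.norm (a sketch, assuming SciPy is installed and reusing the fitted nbc and Xte from the usage example above):

import numpy as np
from scipy.stats import norm

row = Xte[0]
print(nbc.get_likelihood(row))
print(norm.logpdf(row, loc=nbc.avg, scale=np.sqrt(nbc.var)).sum(axis=1))
# entries match closely wherever the density is not so small that
# the landa floor dominates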

Tested on real data, the accuracy consistently stays above 90%.
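For reference, the result can be compared with scikit-learn's GaussianNB (a sketch reusing Xtr/ytr/Xte/yte from the usage example above):

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB().fit(Xtr, ytr)
print('sklearn accuracy:', clf.score(Xte, yte))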

Code: https://github.com/hz920120/python_polyu/blob/master/polyu/NBC.py

Corrections from experienced readers are welcome if you spot any mistakes! 😊
