Naive Bayes: Principles and a Python Implementation

Latest recommended article published on 2024-05-09 13:36:50

十里清风

Category: Machine Learning. Tags: machine learning, naive Bayes, common models, multinomial model and Bernoulli model

Article link: https://blog.csdn.net/sinat_34072381/article/details/84571451


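Contents

1. Model Description and Assumptions
2. Model Learning and Parameter Estimation
3. Python Implementation
4. Text Classification with the Multinomial Model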

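1. Model Description and Assumptions

Model description

Let the input $X$ be a random vector taking values in the input space $\mathcal X \subseteq \R^n$, and let the output $Y$ be a random variable taking values in the label set $\mathcal Y=\{c_1,c_2,\cdots,c_K\}$. $P (X, Y)$ is the joint probability distribution of $X$ and $Y$, and the training set $T=\{(\bm x_1,y_1),\cdots,(\bm x_N,y_N)\}$ is drawn independently and identically distributed from $P (X, Y)$.

Given an input $\bm x=(x^1, x^2, \cdots, x^n)$, the naive Bayes classifier uses the joint distribution $P (X, Y)$ to compute the posterior probability $P(c_k|\bm x)$ of each class $c_k$, and then decides the output class by minimizing the expected risk, which under 0/1 loss means choosing the class with the largest posterior probability. This defines the decision function $f:\mathcal X \to \mathcal Y$.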

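Expected risk minimization

Let $f (X)$ be the class that model $f$ predicts for input $X$, let $Y$ be the true class, and let $L (Y, f (X))$ be the loss function. For a single sample $\bm x$, if the model predicts class $c$, the expected loss under the posterior $P(c_k|\bm x)$, called the conditional risk, is
$R(c|\bm x) = \sum_{k=1}^K L(c_k, c)P(c_k|\bm x)$

The prediction task should minimize the overall risk. If the conditional risk is minimized for every individual sample $\bm x$, the overall risk is minimized as well. The decision rule of naive Bayes is therefore to choose, for each sample, the class label that minimizes the conditional risk $R(c|\bm x)$:
$f(\bm x)=\arg\min_{c\in\mathcal Y}R(c|\bm x)$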

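If $L (Y, f (X))$ is the 0/1 loss, i.e. the loss is 0 when the predicted class $c$ equals the true class and 1 otherwise, then, writing $I$ for the indicator function,
$\begin{aligned}f(\bm x) &=\arg\min_{c\in\mathcal Y}\sum_{k=1}^K I(c_k \neq c)P(c_k|\bm x) \\ &= \arg\min_{c\in\mathcal Y}(1-P(c|\bm x)) \\ &= \arg\max_{c\in\mathcal Y}P(c|\bm x) \end{aligned}$

Hence, to minimize the overall expected risk, naive Bayes outputs for an input $\bm x$ the class with the largest posterior probability.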
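The equivalence between minimizing the 0/1 conditional risk and maximizing the posterior can be checked numerically. The following is a minimal sketch; the posterior values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical posterior P(c_k | x) over K = 3 classes for one sample.
posterior = np.array([0.2, 0.5, 0.3])

# 0/1 loss matrix: loss[k, c] = 0 if the prediction c equals the true class k, else 1.
K = len(posterior)
loss = 1.0 - np.eye(K)

# Conditional risk R(c | x) = sum_k L(c_k, c) * P(c_k | x), one entry per candidate c.
risk = posterior @ loss

best_by_risk = int(np.argmin(risk))
best_by_posterior = int(np.argmax(posterior))
print(risk, best_by_risk, best_by_posterior)
```

Each entry of `risk` equals one minus the corresponding posterior, so both selection rules pick the same class.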

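Generative model

Given a finite training set, there are two strategies for estimating the posterior $P(c_k|\bm x)$ of a given $\bm x$:

Model $P(c_k| \bm x)$ directly; such models are called discriminative models.
First model the joint distribution $P (Y, X)$, then derive $P(c_k|\bm x)$ from it; such models are called generative models.

Naive Bayes is a generative model. Given a sample $\bm x$, the probability that it belongs to class $c_k$ is
$P(c_k|\bm x) =\frac{P(c_k, \bm x)}{P(\bm x)} =\frac{P(\bm x|c_k)P(c_k)}{\sum_{k'=1}^KP(\bm x|c_{k'})P(c_{k'})}$

In this formula the prior $P(c_k)$ and the class-conditional probability $P(\bm x|c_k)$ are unknown. Estimating them from training-set statistics is the core of the naive Bayes algorithm.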
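As a small illustration of Bayes' rule in this form, the sketch below turns a prior and a class-conditional likelihood into a normalized posterior. The helper name `posterior` and the numbers are assumptions made for this example only.

```python
import numpy as np


def posterior(prior, likelihood):
    """Return P(c_k | x) given P(c_k) and P(x | c_k) for every class k."""
    joint = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    # The denominator P(x) is the sum of the joint probabilities over all classes.
    return joint / joint.sum()


prior = [0.75, 0.25]        # hypothetical P(c_1), P(c_2)
likelihood = [0.02, 0.01]   # hypothetical P(x | c_1), P(x | c_2)
post = posterior(prior, likelihood)
print(post)
```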

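Naive assumption

Estimating the posterior $P(c_k|\bm x)$ with Bayes' formula requires the class-conditional probability $P(\bm x|c_k)$. Because $P(x^1, \cdots, x^n|c_k)$ is a joint probability over all attributes, it is hard to estimate directly from a finite training set.

When each feature can take many values, the computation also runs into combinatorial explosion and sample sparsity. For example, for an $n$-dimensional input $\bm x$ in which feature $x^j$ takes $S_j$ possible values and $Y$ takes $K$ possible values, the number of model parameters (distinct probabilities) is $K\prod_{j=1}^nS_j$.

To simplify the class-conditional probability, naive Bayes makes the naive assumption that the features used for classification are conditionally independent given the class:
$\begin{aligned}P(\bm x|c_k) & = P(x^1,x^2, \cdots,x^n|c_k) \\ & = \prod_{j=1}^n P(x^j |c_k) \end{aligned}$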
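To get a feel for how much the independence assumption saves, the sketch below counts the conditional probabilities needed for the full joint table, $K\prod_j S_j$, and for the factorized form, $K\sum_j S_j$. The function names and the example sizes are hypothetical.

```python
from math import prod


def n_params_joint(K, S):
    """Number of probabilities for the full joint table: K * prod_j S_j."""
    return K * prod(S)


def n_params_naive(K, S):
    """Number of probabilities under conditional independence: K * sum_j S_j."""
    return K * sum(S)


K = 2           # hypothetical number of classes
S = [3] * 10    # hypothetical: 10 features with 3 possible values each
print(n_params_joint(K, S), n_params_naive(K, S))
```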

Under the naive assumption the number of model parameters drops to $K\sum_{j=1}^nS_j$, and the model becomes
$\begin{aligned}f(\bm x) &=\arg\max_{c_k\in\mathcal Y}P(c_k|\bm x) =\arg\max_{c_k\in\mathcal Y}\frac{P(\bm x|c_k) \cdot P(c_k)}{P(\bm x)} \\[1ex] & = \arg\max_{c_k\in\mathcal Y}\frac{P(\bm x|c_k) \cdot P(c_k)}{\displaystyle\sum_{k'=1}^KP(\bm x |c_{k'})\cdot P(c_{k'})}\\[1ex] & = \arg\max_{c_k\in\mathcal Y}\frac{P(c_k)\displaystyle\prod_{j=1}^n P(x^j|c_k)}{\displaystyle\sum_{k'=1}^K P(c_{k'})\displaystyle\prod_{j=1}^n P(x^j|c_{k'})}\end{aligned}$
The denominator has the same value for every $c_k$ and can be dropped. Because the product of many conditional probabilities can underflow in floating point, the probabilities are usually converted to logarithms, which gives the final decision rule
$f(\bm x)=\arg\max_{c_k\in\mathcal Y}[\log P(c_k)+\sum_{j=1}^n\log P(x^j|c_k)]$
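The underflow problem is easy to reproduce. In this minimal sketch, 1000 hypothetical conditional probabilities of $10^{-4}$ each are multiplied directly and then summed in log space.

```python
import numpy as np

# 1000 hypothetical conditional probabilities, each equal to 1e-4.
probs = np.full(1000, 1e-4)

# The direct product is 1e-4000, far below the smallest positive float64,
# so it underflows to exactly 0.0.
product = np.prod(probs)

# The sum of logarithms stays finite and preserves the ordering between classes.
log_sum = np.sum(np.log(probs))
print(product, log_sum)
```

Since the logarithm is monotonic, comparing `log_sum` values across classes selects the same class as comparing the products would, without losing the scores to underflow.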

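2. Model Learning and Parameter Estimation

Let the sample set be $X=\{\bm x_1, \cdots, \bm x_N\}$ with samples $\bm x_i=(x_i^1, \cdots, x_i^n)$ and feature values $x_i^j \in \{a_j^1, \cdots, a_j^{S_j}\}$, and let the label set be $Y=\{y_1, y_2, \cdots, y_N\}$ with labels $y_i\in\{c_1, \cdots, c_K\}$.

Learning a naive Bayes model means estimating the prior $P(c_k)$ and the class-conditional probabilities $P(a_j^l|c_k)$ from the given samples.

Naive Bayes comes in several variants that differ in how the prior and the class-conditional probabilities are computed, such as the Bernoulli model (document-based) and the multinomial model (term-frequency-based). The "document-based" and "term-frequency-based" labels exist because naive Bayes is mostly used for document classification, but these models are not limited to text. When features are continuous, the Gaussian model can be used.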

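Bernoulli model (document-based)

In the Bernoulli model every feature takes values in $\{0,1\}$, i.e. $S_j=2$, so the number of times a feature occurs is ignored. The Bernoulli model counts at the granularity of instances (documents), and features that do not occur (value 0) still take part in the computation of the class-conditional probability.

A document corresponds to one training example. Maximum likelihood estimation gives
$P(c_k)=\frac{\text{number of samples in class } c_k}{\text{number of samples in the training set}} \\ \,\\ P(a_j^l |c_k) = \frac{\text{number of samples in class } c_k \text{ whose } j\text{-th feature equals } a_j^l}{\text{number of samples in class } c_k}$

Because the training set is finite, these estimates can be exactly 0. To avoid that, a smoothing factor $\lambda$ is introduced:
$P(a_j^l |c_k) = \frac{\text{number of samples in class } c_k \text{ whose } j\text{-th feature equals } a_j^l+\lambda}{\text{number of samples in class } c_k+2\lambda}$

In particular, $\lambda=1$ gives Laplace smoothing. Adding 2 to the denominator can be read as adding two extra samples to every class, one that contains all features and one that contains none. Note: in the sklearn source code, no smoothing factor is applied to the prior.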
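The effect of the smoothing factor can be seen on a single count. The sketch below uses exact fractions and a hypothetical case: a feature that never occurs among the 3 samples of a class. The helper name `smoothed_bernoulli` is an assumption of this example, and `lam` is expected to be an integer here so that `Fraction` stays exact.

```python
from fractions import Fraction


def smoothed_bernoulli(count, class_count, lam=1):
    """Smoothed estimate (count + lam) / (class_count + 2 * lam)."""
    return Fraction(count + lam, class_count + 2 * lam)


# Hypothetical counts: the feature occurs in 0 of the 3 samples of the class.
unsmoothed = Fraction(0, 3)
smoothed = smoothed_bernoulli(0, 3)
print(unsmoothed, smoothed)
```

Without smoothing, the zero estimate would force the whole product of conditional probabilities to zero for that class; with $\lambda=1$ it becomes a small positive value.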

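Lemma: maximum likelihood estimate of the prior $P(c_k)$
Let $\theta=P(y =c_k)$, so that $1-\theta=P(y\neq c_k)$. The event $y=c_k$ can then be treated as following a 0-1 (Bernoulli) distribution:
$P(y)=\theta^{I(y=c_k)}(1-\theta)^{1-I(y=c_k)}$
Here $I$ is the indicator function: it equals 1 when $y=c_k$ and 0 otherwise.

The log-likelihood of $\theta$ is therefore
$\log P(Y|\theta) = \log\prod_{i=1}^N \theta^{I(y_i=c_k)}(1-\theta)^{1-I(y_i=c_k)} = N_k\log\theta+(N-N_k)\log(1-\theta),\quad N_k=\sum_{i=1}^N I(y_i=c_k)$
Setting the derivative with respect to $\theta$ to zero, $\frac{N_k}{\theta}-\frac{N-N_k}{1-\theta}=0$, gives $\theta=N_k/N=\displaystyle\sum_{i=1}^N I(y_i=c_k)/N$.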
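The closed-form result can be sanity-checked by maximizing the log-likelihood numerically. This is only a sketch: the label vector is hypothetical and the maximization is a plain grid search over $\theta$.

```python
import numpy as np

# Hypothetical labels; we estimate theta = P(y = 'fruit').
y = np.array(['fruit', 'fruit', 'fruit', 'vegetable'])
indicator = (y == 'fruit').astype(float)


def log_likelihood(theta):
    """Bernoulli log-likelihood of the indicator variables I(y_i = c_k)."""
    return np.sum(indicator * np.log(theta) + (1 - indicator) * np.log(1 - theta))


# Grid search over theta in (0, 1) with step 0.01.
grid = np.linspace(0.01, 0.99, 99)
theta_grid = grid[np.argmax([log_likelihood(t) for t in grid])]

# Closed-form maximum likelihood estimate: the class frequency.
theta_closed = indicator.mean()
print(theta_grid, theta_closed)
```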

Example 1

No.	doc	label
1	apple orange apple	fruit
2	apple orange potato	fruit
3	orange orange grape	fruit
4	potato cabbage	vegetable

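Vocabulary: [apple, cabbage, grape, orange, potato]; label set: [fruit, vegetable]. Which class does [apple apple grape] belong to?

Solution: vectorize each document (feature values 0 or 1). The feature vectors of the four documents are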

No.	apple	cabbage	grape	orange	potato	label
1	1	0	0	1	0	fruit
2	1	0	0	1	1	fruit
3	0	0	1	1	0	fruit
4	0	1	0	0	1	vegetable

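The feature vector of the instance to classify, [apple apple grape], is [1, 0, 1, 0, 0]. With $\lambda=1$, the smoothed probabilities that each word occurs in a fruit document are
$\begin{aligned} &P({\sf {apple}}|{\sf fruit}) =\frac{2+1}{3+2}=\frac{3}{5},\quad P({\sf cabbage}|{\sf fruit}) =\frac{0+1}{3+2}=\frac{1}{5},\quad P({\sf {grape}}|{\sf fruit}) =\frac{1+1}{3+2}=\frac{2}{5}\\\,\\ &P({\sf orange}|{\sf fruit}) =\frac{3+1}{3+2}=\frac{4}{5},\quad P({\sf potato}|{\sf fruit}) =\frac{1+1}{3+2}=\frac{2}{5},\quad P({\sf fruit})=\frac{3}{4} \end{aligned}$

A word that is absent from the instance (cabbage, orange, potato) contributes the complementary probability, written with a cancelled word, e.g. $P({\sf \xcancel{cabbage}}|{\sf fruit})=1-P({\sf cabbage}|{\sf fruit})$. Therefore
$\begin{aligned}P({\sf apple, apple, grape|fruit})P({\sf fruit}) &=P({\sf {apple}}|{\sf fruit})P({\sf \xcancel{cabbage}}|{\sf fruit})P({\sf grape}|{\sf fruit})P({\sf \xcancel{orange}}|{\sf fruit})P({\sf \xcancel{potato}}|{\sf fruit})P({\sf fruit})\\ &=\frac{3}{5}\cdot\frac{4}{5}\cdot\frac{2}{5}\cdot\frac{1}{5}\cdot\frac{3}{5}\cdot\frac{3}{4}=0.01728 \end{aligned}$

Similarly, the single vegetable document gives $P({\sf apple}|{\sf vegetable})=P({\sf grape}|{\sf vegetable})=P({\sf orange}|{\sf vegetable})=\frac{0+1}{1+2}=\frac{1}{3}$ and $P({\sf cabbage}|{\sf vegetable})=P({\sf potato}|{\sf vegetable})=\frac{1+1}{1+2}=\frac{2}{3}$, so
$\begin{aligned}P({\sf apple, apple, grape|vegetable})P({\sf vegetable}) &=P({\sf {apple}}|{\sf vegetable})P({\sf \xcancel{cabbage}}|{\sf vegetable})P({\sf grape}|{\sf vegetable})P({\sf \xcancel{orange}}|{\sf vegetable})P({\sf \xcancel{potato}}|{\sf vegetable})P({\sf vegetable})\\ &=\frac{1}{3}\cdot\frac{1}{3}\cdot\frac{1}{3}\cdot\frac{2}{3}\cdot\frac{1}{3}\cdot\frac{1}{4}=0.00206 \end{aligned}$

The sample [apple apple grape] is therefore more likely to belong to fruit.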
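The Bernoulli computation for Example 1 can be reproduced with exact fractions. The sketch below assumes $\lambda=1$, the unsmoothed prior described above, and the complement $1-P$ for every word absent from the instance; the function name `bernoulli_score` is hypothetical.

```python
from fractions import Fraction

import numpy as np

# Binary document vectors from Example 1.
# Columns: apple, cabbage, grape, orange, potato.
X = np.array([[1, 0, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1]])
y = np.array(['fruit', 'fruit', 'fruit', 'vegetable'])
x_new = [1, 0, 1, 0, 0]  # [apple apple grape]


def bernoulli_score(label, lam=1):
    """Return P(x_new | label) * P(label) under the Bernoulli model."""
    rows = X[y == label]
    n_c = len(rows)
    score = Fraction(n_c, len(y))  # prior, not smoothed
    for j, present in enumerate(x_new):
        p = Fraction(int(rows[:, j].sum()) + lam, n_c + 2 * lam)
        # Present words contribute p, absent words contribute 1 - p.
        score *= p if present else 1 - p
    return score


scores = {label: bernoulli_score(label) for label in ('fruit', 'vegetable')}
for label, s in scores.items():
    print(label, s, float(s))
```

The class with the larger score is the prediction; for this instance that is fruit.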

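Multinomial model (term-frequency-based)

In the multinomial model the value of a feature is the number of times it occurs. The model counts at the granularity of feature occurrences (term frequency), and features that do not occur (value 0) do not take part in the computation of the class-conditional probability.

Term frequency is the number of times a feature occurs. The probabilities are computed as follows:
$P(c_k)=\frac{\text{total feature count in class } c_k}{\text{total feature count in the training set}} \\\,\\ P(a_j |c_k) = \frac{\text{total occurrences of feature } a_j \text{ in class } c_k}{\text{total feature count in class } c_k}$

For an input vector $\bm x \in \R^n$, adding the smoothing factor $\lambda$ gives
$P(a_j |c_k) = \frac{\text{total occurrences of feature } a_j \text{ in class } c_k + \lambda}{\text{total feature count in class } c_k + \lambda n}$

Vectorizing the documents of Example 1 under the multinomial model gives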

No.	apple	cabbage	grape	orange	potato	label
1	2	0	0	1	0	fruit
2	1	0	0	1	1	fruit
3	0	0	1	2	0	fruit
4	0	1	0	0	1	vegetable

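The feature vector of the instance to classify, [apple apple grape], is [2, 0, 1, 0, 0]. Class fruit contains 9 word occurrences in total, of which 3 are apple and 1 is grape, and the vocabulary size is $n=5$. With $\lambda=1$,
$\begin{aligned} &P({\sf {apple}}|{\sf fruit}) =\frac{3+1}{9+5}=\frac{4}{14},\quad P({\sf {grape}}|{\sf fruit}) =\frac{1+1}{9+5}=\frac{2}{14},\quad P({\sf fruit})=\frac{9}{11} \end{aligned}$

Therefore
$\begin{aligned}P({\sf apple, apple, grape|fruit})P({\sf fruit}) &=P({\sf {apple}}|{\sf fruit})^2P({\sf grape|fruit})P({\sf fruit})\\ &=(\frac{4}{14})^2\cdot\frac{2}{14}\cdot\frac{9}{11}=0.00954 \end{aligned}$

Similarly, class vegetable contains 2 word occurrences, none of which is apple or grape, so $P({\sf apple}|{\sf vegetable})=P({\sf grape}|{\sf vegetable})=\frac{0+1}{2+5}=\frac{1}{7}$ and
$\begin{aligned}P({\sf apple, apple, grape|vegetable})P({\sf vegetable}) &=P({\sf {apple}}|{\sf vegetable})^2P({\sf grape|vegetable})P({\sf vegetable})\\ &=(\frac{1}{7})^2\cdot\frac{1}{7}\cdot\frac{2}{11}=0.00053 \end{aligned}$

The sample [apple apple grape] is therefore more likely to belong to fruit.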
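The multinomial computation for Example 1 can likewise be reproduced with exact fractions. The sketch assumes $\lambda=1$, $n=5$ vocabulary words, and the prior defined above as the share of word occurrences per class; the function name `multinomial_score` is hypothetical.

```python
from fractions import Fraction

import numpy as np

# Term-count document vectors from Example 1.
# Columns: apple, cabbage, grape, orange, potato.
X = np.array([[2, 0, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 0, 1, 2, 0],
              [0, 1, 0, 0, 1]])
y = np.array(['fruit', 'fruit', 'fruit', 'vegetable'])
x_new = [2, 0, 1, 0, 0]  # [apple apple grape]


def multinomial_score(label, lam=1):
    """Return P(x_new | label) * P(label) under the multinomial model."""
    counts = X[y == label].sum(axis=0)   # occurrences of each word in the class
    total = int(counts.sum())            # all word occurrences in the class
    n = X.shape[1]                       # vocabulary size
    score = Fraction(total, int(X.sum()))  # prior from word occurrences
    for j, times in enumerate(x_new):
        p = Fraction(int(counts[j]) + lam, total + lam * n)
        # A word occurring `times` times contributes p ** times (1 when absent).
        score *= p ** times
    return score


scores = {label: multinomial_score(label) for label in ('fruit', 'vegetable')}
for label, s in scores.items():
    print(label, s, float(s))
```

As in the Bernoulli case, the larger score belongs to fruit.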

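Gaussian model

For continuous attributes a probability density function can be used. Assume $P(a_j|c)\sim \mathcal N(\mu_{c,j},\sigma_{c,j}^2)$, where $\mu_{c,j}$ and $\sigma_{c,j}^2$ are the mean and variance of the $j$-th attribute over the samples of class $c$. Then
$P(a_j|c) = \frac{1}{\sqrt{2\pi}\sigma_{c,j}}\exp\left(-\frac{(a_j-\mu_{c,j})^2}{2\sigma_{c,j}^2}\right)$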
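Since the decision rule works in log space, it is the logarithm of this density that is needed in practice. The following is a minimal sketch; the function name `gaussian_log_pdf` and the sample values are assumptions of this example.

```python
import numpy as np


def gaussian_log_pdf(x, mu, var):
    """Log of the normal density N(mu, var) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)


# Hypothetical attribute value, class mean and class variance.
x, mu, var = 1.5, 1.0, 0.25
log_p = gaussian_log_pdf(x, mu, var)

# Direct evaluation of the density formula, for comparison.
p = np.exp(-(x - mu) ** 2 / (2 * var)) / (np.sqrt(2 * np.pi) * np.sqrt(var))
print(log_p, np.log(p))
```

Summing `gaussian_log_pdf` over all attributes and adding the log prior gives the per-class score used in the decision rule.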

3. Python Implementation

Bernoulli model

# -*- coding: utf-8 -*-
import numpy as np


class BernoulliNB(object):

    def fit(self, X, y, alpha=1.0):
        """Fit the model; alpha is the smoothing factor."""
        X = self._check_array(X)
        self.classes_, Y = self.labelbin(y)
        # Reject any entry that is neither 0 nor 1.
        if np.any((X != 0) & (X != 1)):
            raise ValueError("Input X must be 0 or 1.")

        # feature_count_[k, j]: number of class-k samples with feature j set.
        self.feature_count_ = np.dot(Y.T, X)
        self.class_count_ = Y.sum(axis=0)

        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = self.class_count_ + alpha * 2

        # Log-probabilities of "feature present" and "feature absent",
        # followed by the (unsmoothed) log prior.
        self.p_feature = (np.log(smoothed_fc) -
                          np.log(smoothed_cc.reshape(-1, 1)))
        self.p_feature_neg = np.log(1 - np.exp(self.p_feature))
        self.p_class = (np.log(self.class_count_) -
                        np.log(self.class_count_.sum()))

    def predict(self, X):
        """Predict class labels for the rows of X."""
        if not hasattr(self, 'p_feature'):
            raise ValueError('Instance must be fitted.')

        X = self._check_array(X)
        n_classes, n_features = self.p_feature.shape
        n_samples, n_features_X = X.shape
        if n_features_X != n_features:
            raise ValueError('Expect input with %d features.' % n_features)

        # sum_j [x_j * log(p_j) + (1 - x_j) * log(1 - p_j)] rewritten as
        # x . (log p - log(1 - p)) + sum_j log(1 - p_j), plus the log prior.
        p = np.dot(X, (self.p_feature - self.p_feature_neg).T)
        p += self.p_class + self.p_feature_neg.sum(axis=1)
        return self.classes_[np.argmax(p, axis=1)]

    def _check_array(self, array):
        """Validate that the input is a 2D array."""
        array = np.asarray(array)
        if array.ndim != 2:
            raise ValueError('Expected 2D array.')
        return array

    @staticmethod
    def labelbin(y):
        """One-hot encode the labels; returns (classes, Y)."""
        Y = np.ravel(y)
        classes = np.array(sorted(set(Y)))
        indices = np.searchsorted(classes, Y)

        n_samples = Y.shape[0]
        n_classes = len(classes)

        Y = np.zeros((n_samples, n_classes), dtype=int)
        rows = np.arange(n_samples)
        Y[rows, indices] = 1
        return classes, Y


if __name__ == '__main__':
    X = np.random.randint(2, size=(6, 100))
    y = np.array([1, 2, 3, 4, 4, 5])
    clf = BernoulliNB()
    clf.fit(X, y)
    print(clf.predict(X[2:3]))

Multinomial model

# -*- coding: utf-8 -*-
import numpy as np


class MultinomialNB(object):

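    def fit(self, X, y, alpha=1.0):
        """Fit the model; alpha is the smoothing factor."""
        X = self._check_array(X)
        if np.any(X < 0):
            raise ValueError("Input X must be non-negative")

        self.classes_, Y = self.labelbin(y)
        # feature_count_[k, j]: total count of feature j over class-k samples.
        self.feature_count_ = np.dot(Y.T, X)
        self.class_count_ = Y.sum(axis=0)

        # Summing the smoothed counts adds alpha * n_features to each class total.
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = smoothed_fc.sum(axis=1)

        self.p_feature = (np.log(smoothed_fc) -
                          np.log(smoothed_cc.reshape(-1, 1)))
        # Log prior from the per-class total feature counts.
        cc = self.feature_count_.sum(axis=1)
        self.p_class = np.log(cc) - np.log(cc.sum())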

    def predict(self, X):
        """Predict class labels for the rows of X."""
        X = self._check_array(X)
        if not hasattr(self, 'p_feature'):
            raise ValueError('Instance must be fitted.')

        n_classes, n_features = self.p_feature.shape
        n_samples, n_features_X = X.shape
        if n_features_X != n_features:
            raise ValueError('Expect input with %d features.' % n_features)

        # Each feature contributes count * log P(feature | class).
        p = np.dot(X, self.p_feature.T) + self.p_class
        return self.classes_[np.argmax(p, axis=1)]

    def _check_array(self, array):
        """Validate that the input is a 2D array."""
        array = np.asarray(array)
        if array.ndim != 2:
            raise ValueError('Expected 2D array.')
        return array

    @staticmethod
    def labelbin(y):
        """One-hot encode the labels; returns (classes, Y)."""
        Y = np.ravel(y)
        classes = np.array(sorted(set(Y)))
        indices = np.searchsorted(classes, Y)

        n_samples = Y.shape[0]
        n_classes = len(classes)

        Y = np.zeros((n_samples, n_classes), dtype=int)
        rows = np.arange(n_samples)
        Y[rows, indices] = 1
        return classes, Y
  
  
if __name__ == '__main__':
    X = np.random.randint(5, size=(6, 100))
    y = np.array([1, 1, 3, 3, 5, 6])
    clf = MultinomialNB()
    clf.fit(X, y)
    print(clf.predict(X[2:3]))

Gaussian model

# -*- coding: utf-8 -*-
import numpy as np


class GaussianNB(object):

    def fit(self, X, y):
        """Fit per-class feature means, variances and priors."""
        y = np.ravel(y)
        X = self._check_array(X)

        self.classes_ = np.array(sorted(set(y)))
        n_features = X.shape[1]
        n_classes = len(self.classes_)
        self.mu_ = np.zeros((n_classes, n_features))
        self.var_ = np.zeros((n_classes, n_features))

        self.class_count_ = np.zeros(n_classes, dtype=np.float64)

        # Estimate mean and variance of every feature within each class.
        for i, y_i in enumerate(self.classes_):
            X_i = X[y == y_i]
            self.class_count_[i] = np.shape(X_i)[0]
            self.mu_[i] = np.mean(X_i, axis=0)
            self.var_[i] = np.var(X_i, axis=0)

        # Add a small fraction of the largest feature variance so that
        # no variance is exactly zero.
        epsilon = 1e-9 * np.var(X, axis=0).max()
        self.var_ += epsilon
        self.p_class = self.class_count_ / self.class_count_.sum()

    def predict(self, X):
        """Predict class labels for the rows of X."""
        if not hasattr(self, 'mu_'):
            raise ValueError('Instance must be fitted.')
        X = self._check_array(X)
        n_samples = X.shape[0]
        n_classes = self.classes_.shape[0]

        p_log = np.zeros((n_classes, n_samples))

        for i in range(n_classes):
            # Log prior plus the sum of Gaussian log-densities over features.
            cc_log = np.log(self.p_class[i])
            fc_log = -0.5 * np.sum(np.log(2 * np.pi * self.var_[i]))
            fc_log -= 0.5 * np.sum((X - self.mu_[i]) ** 2 / self.var_[i], axis=1)
            p_log[i] = cc_log + fc_log

        return self.classes_[np.argmax(p_log, axis=0)]

    def _check_array(self, array):
        """Validate that the input is a 2D array."""
        array = np.asarray(array)
        if array.ndim != 2:
            raise ValueError('Expected 2D array.')
        return array


if __name__ == '__main__':
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    Y = np.array([1, 1, 1, 2, 2, 2])
    clf = GaussianNB()
    clf.fit(X, Y)
    print(clf.predict(X))

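4. Text Classification with the Multinomial Model

The features of a text are its list of words, so the text must first be vectorized. The length of a text's feature vector equals the size of the vocabulary (the distinct words across all texts).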

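import numpy as np  # the example below also uses the MultinomialNB class defined above


class Words2Vec(object):

    def fit(self, X):
        # Build a sorted vocabulary and map each word to a column index.
        vob = sorted(set(w for ws in X for w in ws))
        self.vec_length = len(vob)
        self.vob_dict = dict(zip(vob, range(self.vec_length)))

    def words2vec(self, n_words):
        """Convert lists of words into term-count vectors."""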
        if not hasattr(self, 'vob_dict'):
            raise ValueError('Instance must be fitted.')
        n_samples = len(n_words)
        vectors = np.zeros((n_samples, self.vec_length), dtype=int)
        for i, words in enumerate(n_words):
            vec = vectors[i]
            for w in words:
                index = self.vob_dict.get(w, None)
                if index is not None:
                    vec[index] += 1
        return vectors

X = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
     ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
     ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
     ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
y = ['0', '1', '0', '1', '0', '1']

wv = Words2Vec()
wv.fit(X)
X = wv.words2vec(X)

clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict(X))

X = wv.words2vec([['dog', 'dog', 'ate']])
print(clf.predict(X))
