Naïve Bayes Text Classification Algorithm: Classification Algorithms - Naïve Bayes

Latest recommended article published on 2021-05-28 20:36:20

cunzai1985

Tags: algorithms, python, machine learning, artificial intelligence, deep learning

Original link: https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_naive_bayes.htm

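The Naïve Bayes algorithm is a classification technique based on Bayes' theorem that assumes all predictors are independent. In Python, the Scikit-learn library can be used to implement Gaussian, Multinomial, and Bernoulli Naïve Bayes models. Although the algorithm is limited by its feature-independence assumption, its simplicity and speed make it common in scenarios such as text classification, multi-class prediction, and real-time prediction.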

Classification Algorithms - Naïve Bayes

Introduction to Naïve Bayes Algorithm

The Naïve Bayes algorithm is a classification technique based on applying Bayes' theorem with the strong assumption that all predictors are independent of each other. In simple words, the assumption is that the presence of a feature in a class is independent of the presence of any other feature in the same class. For example, a phone may be considered smart if it has a touch screen, internet access, a good camera, and so on. Even though these features depend on each other, each one is treated as contributing independently to the probability that the phone is a smartphone.

In Bayesian classification, the main interest is in finding the posterior probabilities, i.e. the probability of a label given some observed features, 𝑃(𝐿 | 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠). With the help of Bayes' theorem, we can express this in quantitative form as follows:

$$P(L \mid \text{features}) = \frac{P(L)\,P(\text{features} \mid L)}{P(\text{features})}$$

Here, 𝑃(𝐿 | 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the posterior probability of the class.

𝑃(𝐿) is the prior probability of the class.

𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠 | 𝐿) is the likelihood, which is the probability of the predictors given the class.

𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the prior probability of the predictors, also called the evidence.
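To make the formula concrete, here is a minimal sketch that evaluates Bayes' theorem by hand for two labels. The labels, the feature names, and all probabilities are made-up values chosen only for illustration, and the helper `posterior` is hypothetical. Under the naïve assumption, the likelihood of several features is taken as the product of the per-feature likelihoods.

```python
# Hypothetical numbers, chosen only for illustration.
prior = {"spam": 0.4, "ham": 0.6}            # P(L)
likelihood = {                                # P(feature | L)
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.5},
}


def posterior(features):
    """Return P(L | features) for every label L under the naive assumption."""
    joint = {}
    for label, p in prior.items():
        # Numerator of Bayes' theorem: P(L) times the product of P(f | L).
        for f in features:
            p *= likelihood[label][f]
        joint[label] = p
    # Denominator: P(features), obtained by summing the numerator over labels.
    evidence = sum(joint.values())
    return {label: p / evidence for label, p in joint.items()}


result = posterior(["offer"])
print(result)  # posterior for "spam" is 0.28 / 0.40 = 0.7
```

Because the denominator is the same for every label, a classifier only needs to compare the numerators to pick the most probable label; the division matters when calibrated probabilities are required.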

Building model using Naïve Bayes in Python

Scikit-learn is the most useful Python library for building a Naïve Bayes model. Among the Naïve Bayes models that Scikit-learn provides are the following three:

Gaussian Naïve Bayes

It is the simplest Naïve Bayes classifier. It assumes that the data for each label is drawn from a simple Gaussian distribution.
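For reference, the Gaussian assumption is usually written as the per-feature likelihood below. The symbols $x_i$ (the value of feature $i$), $\mu_{L,i}$ and $\sigma_{L,i}^2$ (the mean and variance of feature $i$ within class $L$) are notation introduced here, not taken from the original text.

$$P(x_i \mid L) = \frac{1}{\sqrt{2\pi\sigma_{L,i}^2}} \exp\left(-\frac{(x_i - \mu_{L,i})^2}{2\sigma_{L,i}^2}\right)$$

Combined with the independence assumption, the full likelihood $P(\text{features} \mid L)$ is the product of these terms over all features.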

Multinomial Naïve Bayes

Another useful Naïve Bayes classifier is Multinomial Naïve Bayes, in which the features are assumed to be drawn from a simple multinomial distribution. This kind of Naïve Bayes is most appropriate for features that represent discrete counts.
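As a rough illustration of count features, the sketch below feeds word counts from a tiny made-up corpus into Scikit-learn's `MultinomialNB`. The documents, labels, and variable names are assumptions made for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, for illustration only.
docs = [
    "cheap offer buy now",
    "limited offer cheap price",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each document into a vector of word counts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs)

model_MNB = MultinomialNB()
model_MNB.fit(X_counts, labels)

# Words not seen during fitting (such as "today") are ignored by transform().
X_test = vectorizer.transform(["cheap offer today", "monday meeting"])
pred = model_MNB.predict(X_test)
print(pred)
```

A corpus this small only demonstrates the mechanics; it says nothing about accuracy on real text.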

Bernoulli Naïve Bayes

Another important model is Bernoulli Naïve Bayes, in which the features are assumed to be binary (0s and 1s). Text classification with a 'bag of words' model can be an application of Bernoulli Naïve Bayes.
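The binary bag-of-words idea can be sketched as follows, assuming a tiny made-up corpus. Passing `binary=True` to `CountVectorizer` records only whether a word occurs, which matches the 0/1 features that `BernoulliNB` expects. All documents and names here are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up corpus, for illustration only.
docs = [
    "cheap offer buy now",
    "limited offer cheap price",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# binary=True stores 1 if a word is present and 0 otherwise.
vectorizer = CountVectorizer(binary=True)
X_bin = vectorizer.fit_transform(docs)

model_BNB = BernoulliNB()
model_BNB.fit(X_bin, labels)

# "cheap" appears twice below but is still encoded as a single 1.
X_test = vectorizer.transform(["cheap cheap offer", "project meeting"])
pred = model_BNB.predict(X_test)
print(X_test.toarray().max(), pred)
```

Unlike the multinomial model, the Bernoulli model also penalizes a class for vocabulary words that are absent from the document.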

Example

Depending on our data set, we can choose any of the Naïve Bayes models explained above. Here, we implement a Gaussian Naïve Bayes model in Python.

We will start with the required imports as follows:


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Now, using the make_blobs() function of Scikit-learn, we can generate blobs of points with a Gaussian distribution as follows:


from sklearn.datasets import make_blobs
X, y = make_blobs(300, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');

Next, to use the GaussianNB model, we need to import it, create an instance, and fit it to the data as follows:


from sklearn.naive_bayes import GaussianNB
# Create the classifier and fit it to the generated blobs
model_GNB = GaussianNB()
model_GNB.fit(X, y);
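After fitting, the learned parameters can be inspected. The sketch below repeats the data generation so that it runs on its own, then prints the class priors (`class_prior_`) and the per-class feature means (`theta_`) that `GaussianNB` estimated. The comparison against means computed directly with NumPy is an added sanity check, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Same synthetic data as in the example above.
X, y = make_blobs(300, 2, centers=2, random_state=2, cluster_std=1.5)
model_GNB = GaussianNB().fit(X, y)

# Prior probability of each class, estimated from class frequencies.
print(model_GNB.class_prior_)

# Mean of each feature within each class, shape (n_classes, n_features).
print(model_GNB.theta_)

# The learned means should match the per-class sample means.
manual_means = np.array([X[y == c].mean(axis=0) for c in model_GNB.classes_])
print(np.allclose(model_GNB.theta_, manual_means))
```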

Now we can make predictions. To do so, we first generate some new data as follows:


rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model_GNB.predict(Xnew)

Next, we plot the new data to see where the decision boundary lies:


plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='summer', alpha=0.1)
plt.axis(lim);

Now, with the help of the following lines of code, we can find the posterior probabilities of the first and second labels for the last ten new points:


yprob = model_GNB.predict_proba(Xnew)
yprob[-10:].round(3)

Output


array([[0.998, 0.002],
       [1.   , 0.   ],
       [0.987, 0.013],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [0.   , 1.   ],
       [0.986, 0.014]])
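As a quick sanity check on these probabilities, the self-contained sketch below rebuilds the model and verifies two properties: each row of `predict_proba` sums to 1, and taking the most probable column reproduces the labels returned by `predict`. The variable names `row_sums`, `labels_from_proba`, and `agreement` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Rebuild the data and model from the example above.
X, y = make_blobs(300, 2, centers=2, random_state=2, cluster_std=1.5)
model_GNB = GaussianNB().fit(X, y)

rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)

yprob = model_GNB.predict_proba(Xnew)
ynew = model_GNB.predict(Xnew)

# Each row holds the posterior for every class, so it should sum to 1.
row_sums = yprob.sum(axis=1)

# Picking the column with the largest posterior should give the predicted label.
labels_from_proba = model_GNB.classes_[yprob.argmax(axis=1)]
agreement = (labels_from_proba == ynew).mean()
print(np.allclose(row_sums, 1.0), agreement)
```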

Pros & Cons

Pros

The following are some pros of using Naïve Bayes classifiers:

Naïve Bayes classification is easy to implement and fast.
It tends to converge faster than discriminative models like logistic regression.
It requires less training data.
It is highly scalable: it scales linearly with the number of predictors and data points.
It can make probabilistic predictions and can handle continuous as well as discrete data.
The Naïve Bayes classification algorithm can be used for both binary and multi-class classification problems.

Cons

The following are some cons of using Naïve Bayes classifiers:

One of the most important cons of Naïve Bayes classification is its strong feature-independence assumption, because in real life it is almost impossible to have a set of features that are completely independent of each other.
Another issue with Naïve Bayes classification is the 'zero frequency' problem: if a categorical variable has a category that was not observed in the training data set, the Naïve Bayes model will assign it a zero probability and will be unable to make a prediction.
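The zero-frequency problem, and one common remedy known as additive (Laplace) smoothing, can be sketched in plain Python. The word list, the vocabulary, and the helper `word_prob` are hypothetical and exist only for this illustration; Scikit-learn's `MultinomialNB` exposes a comparable smoothing parameter named `alpha`.

```python
from collections import Counter

# Made-up words observed in the "spam" class during training.
spam_words = "cheap offer buy now limited offer cheap price".split()

# Assume the vocabulary also contains "meeting", which never occurs in spam.
vocabulary = sorted(set(spam_words) | {"meeting"})

counts = Counter(spam_words)
total = len(spam_words)


def word_prob(word, alpha):
    """Estimate P(word | spam) with additive smoothing; alpha=0 disables it."""
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))


unsmoothed = word_prob("meeting", alpha=0)  # 0 / 8
smoothed = word_prob("meeting", alpha=1)    # 1 / (8 + 7)
print(unsmoothed, smoothed)
```

Without smoothing, a single unseen word drives the whole product of likelihoods to zero; with `alpha=1` every word keeps a small non-zero probability.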

Applications of Naïve Bayes classification

The following are some common applications of Naïve Bayes classification:

Real-time prediction − Due to its ease of implementation and fast computation, it can be used to make predictions in real time.

Multi-class prediction − The Naïve Bayes classification algorithm can be used to predict the posterior probability of each of multiple classes of the target variable.

Text classification − Due to its multi-class prediction capability, Naïve Bayes classification algorithms are well suited to text classification. That is why they are also used to solve problems like spam filtering and sentiment analysis.

Recommendation system − Combined with algorithms like collaborative filtering, Naïve Bayes can be used to build a recommendation system that filters unseen information and predicts whether a user would like a given resource or not.
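To illustrate the multi-class case from the list above, here is a minimal sketch that fits `GaussianNB` on the three-class Iris data set bundled with Scikit-learn. The choice of data set, the 70/30 split, and the random seed are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = GaussianNB().fit(X_train, y_train)

# One posterior probability per class for every test sample.
proba = model.predict_proba(X_test)
accuracy = model.score(X_test, y_test)
print(proba.shape, round(accuracy, 3))
```

The same `fit` and `predict_proba` calls are used as in the two-class example; only the number of columns in the output changes.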