机器学习朴素贝叶斯_机器学习基础朴素贝叶斯分类

最新推荐文章于 2022-07-02 14:52:37 发布

杨_明

最新推荐文章于 2022-07-02 14:52:37 发布

阅读量1k

点赞数 2

文章标签：机器学习人工智能 python java

原文链接：https://towardsdatascience.com/machine-learning-basics-naive-bayes-classification-964af6f2a965

版权

机器学习朴素贝叶斯

In the previous stories, I had given an explanation of the program for implementation of various Regression models. Also, I had described the implementation of the Logistic Regression, KNN and SVM Classification model. In this article, we shall go through the algorithm of the famous Naive Bayes Classification model with an example.

在先前的故事中，我已经解释了用于实现各种回归模型的程序。另外，我已经描述了Logistic回归，KNN和SVM分类模型的实现。在本文中，我们将通过一个例子详细介绍著名的朴素贝叶斯分类模型的算法。

朴素贝叶斯分类概述 (Overview of Naive Bayes Classification)

Naive Bayes is one such algorithm in classification that can never be overlooked upon due to its special characteristic of being “naive”. It makes the assumption that features of a measurement are independent of each other.

朴素贝叶斯就是这样一种分类算法，由于其“朴素”的特殊特性，它永远不会被忽视。它假设测量的特征彼此独立。

For example, an animal may be considered as a cat if it has cat eyes, whiskers and a long tail. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this animal is a cat and that is why it is known as ‘Naive’.

例如，如果动物具有猫眼，胡须和长尾巴，则可以视为猫。即使这些特征相互依赖或依赖于其他特征的存在，所有这些特征也独立地影响了该动物是猫的可能性，这就是为什么它被称为“天真”的原因。

According to Bayes Theorem, the various features are mutually independent. For two independent events, P(A,B) = P(A)P(B). This assumption of Bayes Theorem is probably never encountered in practice, hence it accounts for the “naive” part in Naive Bayes. Bayes’ Theorem is stated as: P(a|b) = (P(b|a) * P(a)) / P(b). Where P(a|b) is the probability of a given b.

根据贝叶斯定理，各种特征是相互独立的。对于两个独立事件， P(A,B) = P(A)P(B) 。在实践中可能从未遇到过贝叶斯定理的这种假设，因此它解释了朴素贝叶斯的“幼稚”部分。贝叶斯定理表示为： P(a | b)=(P(b | a)* P(a))/ P(b)。 其中P(a | b)是给定b的概率。

Let us understand this algorithm with a simple example. The Student will be a pass if he wears a “red” color dress on the exam day. We can solve it using above discussed method of posterior probability.

让我们用一个简单的例子来了解这种算法。如果学生在考试当天穿着“红色”礼服，则将是合格的学生。我们可以使用上面讨论的后验概率方法来解决它。

By Bayes Theorem, P(Pass| Red) = P( Red| Pass) * P(Pass) / P (Red).

根据贝叶斯定理， P(通过|红色)= P(红色|通过)* P(通过)/ P(红色)。

From the values, let us assume P (Red|Pass) = 3/9 = 0.33, P(Red) = 5/14 = 0.36, P( Pass)= 9/14 = 0.64. Now, P (Pass| Red) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.

根据这些值，我们假设P(Red | Pass)= 3/9 = 0.33，P(Red)= 5/14 = 0.36，P(Pass)= 9/14 = 0.64。现在，P(通过|红色)= 0.33 * 0.64 / 0.36 = 0.60，这更有可能。

In this way, Naive Bayes uses a similar method to predict the probability of different class based on various attributes.

这样，朴素贝叶斯使用类似的方法基于各种属性来预测不同类别的概率。

问题分析 (Problem Analysis)

To implement the Naive Bayes Classification, we shall use a very famous Iris Flower Dataset that consists of 3 classes of flowers. In this, there are 4 independent variables namely the, sepal_length, sepal_width, petal_length and petal_width. The dependent variable is the species which we will predict using the four independent features of the flowers.

为了实现朴素贝叶斯分类，我们将使用非常著名的鸢尾花数据集，该数据集包含3类花。在此，有4个独立变量，即sepal_length ， sepal_width ，花瓣长度和花瓣宽度。因变量是我们将使用花朵的四个独立特征预测的物种。

There are 3 classes of species namely setosa, versicolor and the virginica. This dataset was originally introduced in 1936 by Ronald Fisher. Using the various features of the flower (independent variables), we have to classify a given flower using Naive Bayes Classification model.

共有三类物种，即setosa，versicolor和virginica 。该数据集最初由Ronald Fisher于1936年引入。使用花朵的各种特征(独立变量)，我们必须使用朴素贝叶斯分类模型对给定的花朵进行分类。

步骤1：导入库 (Step 1: Importing the Libraries)

As always, the first step will always include importing the libraries which are the NumPy, Pandas and the Matplotlib.

与往常一样，第一步将始终包括导入NumPy，Pandas和Matplotlib库。

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

步骤2：导入数据集 (Step 2: Importing the dataset)

In this step, we shall import the Iris Flower dataset which is stored in my github repository as IrisDataset.csv and save it to the variable dataset. After this, we assign the 4 independent variables to X and the dependent variable ‘species’ to Y. The first 5 rows of the dataset are displayed.

在此步骤中，我们将将存储在我的github存储库中的鸢尾花数据集导入为IrisDataset.csv并将其保存到变量dataset. 此后，我们将4个独立变量分配给X ，并将因变量'species'分配给Y。显示数据集的前5行。

dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Classification/master/IrisDataset.csv')X = dataset.iloc[:,:4].values
y = dataset['species'].valuesdataset.head(5)>>
sepal_length  sepal_width  petal_length  petal_width   species
5.1           3.5          1.4           0.2           setosa
4.9           3.0          1.4           0.2           setosa
4.7           3.2          1.3           0.2           setosa
4.6           3.1          1.5           0.2           setosa
5.0           3.6          1.4           0.2           setosa

步骤3：将资料集分为训练集和测试集 (Step 3: Splitting the dataset into the Training set and Test set)

Once we have obtained our data set, we have to split the data into the training set and the test set. In this data set, there are 150 rows with 50 rows of each of the 3 classes. As each class is given in a continuous order, we need to randomly split the dataset. Here, we have the test_size=0.2, which means that 20% of the dataset will be used for testing purpose as the test set and the remaining 80% will be used as the training set for training the Naive Bayes classification model.

获得数据集后，我们必须将数据分为训练集和测试集。在此数据集中，有150行，其中3个类别分别有50行。由于每个类都是以连续顺序给出的，因此我们需要随机分割数据集。在这里，我们有test_size=0.2 ，这意味着将数据集的20％用于测试目的作为测试集，其余80％将用作训练Naive Bayes分类模型的训练集。

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

步骤4：功能缩放 (Step 4: Feature Scaling)

The dataset is scaled down to a smaller range using the Feature Scaling option. In this, both the X_train and X_test values are scaled down to smaller values to improve the speed of the program.

使用“功能缩放”选项将数据集缩小到较小的范围。在这种情况下， X_train和X_test值都按比例缩小为较小的值，以提高程序的速度。

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

步骤5：在训练集上训练朴素贝叶斯分类模型 (Step 5: Training the Naive Bayes Classification model on the Training Set)

In this step, we introduce the class GaussianNB that is used from the sklearn.naive_bayes library. Here, we have used a Gaussian model, there are several other models such as Bernoulli, Categorical and Multinomial. Here, we assign the GaussianNB class to the variable classifier and fit the X_train and y_train values to it for training purpose.

在此步骤中，我们介绍了sklearn.naive_bayes库中使用的类GaussianNB 。在这里，我们使用了高斯模型，还有其他一些模型，例如伯努利模型，分类模型和多项式模型。在这里，我们将GaussianNB类分配给变量classifier ，并为其训练X_train和y_train值。

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

步骤6：预测测试集结果 (Step 6: Predicting the Test set results)

Once the model is trained, we use the the classifier.predict() to predict the values for the Test set and the values predicted are stored to the variable y_pred.

训练模型后，我们将使用classifier.predict()预测测试集的值，并将预测的值存储到变量y_pred.

y_pred = classifier.predict(X_test) 
y_pred

步骤7：混淆矩阵和准确性 (Step 7: Confusion Matrix and Accuracy)

This is a step that is mostly used in classification techniques. In this, we see the Accuracy of the trained model and plot the confusion matrix.

这是分类技术中最常用的步骤。在此，我们看到了训练模型的准确性，并绘制了混淆矩阵。

The confusion matrix is a table that is used to show the number of correct and incorrect predictions on a classification problem when the real values of the Test Set are known. It is of the format

混淆矩阵是一个表，用于在已知测试集的实际值时显示有关分类问题的正确和不正确预测的数量。它的格式

The True values are the number of correct predictions made.

True值是做出正确预测的次数。

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)from sklearn.metrics import accuracy_score 
print ("Accuracy : ", accuracy_score(y_test, y_pred))
cm>>Accuracy :  0.9666666666666667>>array([[14,  0,  0],
       [ 0,  7,  0],
       [ 0,  1,  8]])

From the above confusion matrix, we infer that, out of 30 test set data, 29 were correctly classified and only 1 was incorrectly classified. This gives us a high accuracy of 96.67%.

从上面的混淆矩阵中，我们推断出，在30个测试集数据中，有29个被正确分类，只有1个被错误分类。这为我们提供了96.67％的高精度。

步骤8：将实际值与预测值进行比较 (Step 8: Comparing the Real Values with Predicted Values)

In this step, a Pandas DataFrame is created to compare the classified values of both the original Test set (y_test) and the predicted results (y_pred).

在此步骤中，将创建一个Pandas DataFrame来比较原始测试集( y_test )和预测结果( y_pred )的分类值。

df = pd.DataFrame({'Real Values':y_test, 'Predicted Values':y_pred})
df>> 
Real Values   Predicted Values
setosa        setosa
setosa        setosa
virginica     virginica
versicolor    versicolor
setosa        setosa
setosa        setosa
...  ...   ...  ...  ...
virginica     versicolor
virginica     virginica
setosa        setosa
setosa        setosa
versicolor    versicolor
versicolor    versicolor

This step is an additional step which is not much informative as the Confusion matrix and is mainly used in regression to check the accuracy of the predicted value.

此步骤是一个附加步骤，与Confusion矩阵相比，信息量不多，主要用于回归以检查预测值的准确性。

As you can see, there is one incorrect prediction that has predicted versicolor instead of virginica.

如您所见，有一种不正确的预测已预测为杂色而不是弗吉尼亚州。

结论— (Conclusion —)

Thus in this story, we have successfully been able to build a Naive Bayes Classification Model that is able to classify a flower depending upon 4 characteristic features. This model can be implemented and tested with several other classification datasets that are available on the net.

因此，在这个故事中，我们已经成功地建立了朴素贝叶斯分类模型，该模型能够根据4个特征对花进行分类。可以使用网络上可用的其他几个分类数据集来实施和测试此模型。

I am also attaching the link to my GitHub repository where you can download this Google Colab notebook and the data files for your reference.

我还将链接附加到我的GitHub存储库中，您可以在其中下载此Google Colab笔记本和数据文件以供参考。

You can also find the explanation of the program for other Classification models below:

您还可以在下面找到其他分类模型的程序说明：