python scikit
In this article, I will show how to use scikit-learn to classify the handwritten digits in the Digits dataset. Any handwriting-recognition dataset would work, but I use Digits here. There is no need to download the dataset separately, because it ships with scikit-learn and can be loaded directly from the library.
Let’s start by loading the Digits dataset into memory.
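A minimal sketch of the loading step, using scikit-learn's `load_digits` function. The variable name `digits` is my own choice for illustration.

```python
from sklearn.datasets import load_digits

# load_digits reads the copy of the dataset bundled with scikit-learn,
# so nothing is fetched from the network
digits = load_digits()

# The returned object exposes, among others, .data, .images and .target
print(digits.keys())
```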
Now that the dataset is loaded, let’s see how many images and how many labels it contains.
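One way to check this is to print the shapes of the data and target arrays. The sketch below reloads the dataset so that it runs on its own.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each image is 8x8 pixels, flattened into a row of 64 features
print("Image data shape:", digits.data.shape)    # (1797, 64)
print("Label data shape:", digits.target.shape)  # (1797,)
```

So there are 1797 images and 1797 labels, one label per image.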
Showing the Images and Labels
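A possible way to display a few digits with Matplotlib is sketched below. The number of images shown (five), the figure size, and the title text are arbitrary choices of mine, not something the dataset requires.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# Plot the first five digits side by side, with the label as the title.
# digits.images holds the same pixels as digits.data, shaped as 8x8 arrays.
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap="gray")
    ax.set_title(f"Label: {label}", fontsize=20)
    ax.axis("off")
plt.show()
```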
Now let’s split the data into a training set and a test set. The model is trained on the training set only, and the held-out test set lets us check that the trained model generalizes well to new data.
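The split can be sketched with `train_test_split`. The 25% test size matches the split discussed later in this article, while `random_state=0` is an assumption I add only to make the split reproducible.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold out 25% of the samples for testing (random_state is an assumed seed)
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

print(x_train.shape, x_test.shape)  # (1347, 64) (450, 64)
```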
We will use logistic regression to train our model, so first we have to import the model class.
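In scikit-learn the logistic regression classifier lives in the `linear_model` module, so the import looks like this:

```python
# Import the classifier class from scikit-learn
from sklearn.linear_model import LogisticRegression
```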
Let’s make an instance of the model.
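A sketch of this step follows. The name `logisticRegr` is illustrative, and `max_iter=10000` is my own assumption: with the default iteration limit the solver may emit a convergence warning on the unscaled pixel values.

```python
from sklearn.linear_model import LogisticRegression

# Parameters not given here keep their defaults.
# max_iter is raised (assumed value) to give the solver room to converge.
logisticRegr = LogisticRegression(max_iter=10000)
```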
Now let’s train the model on the training data. The fitted model stores what it has learned from the data.
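Training is a single call to `fit`. The sketch below repeats the earlier steps so that it runs on its own; the seed and `max_iter` value are assumptions, as before.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)

logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit

# fit learns one weight per pixel and per class from the training data
logisticRegr.fit(x_train, y_train)

print(logisticRegr.coef_.shape)  # (10, 64): 10 classes, 64 pixel features
```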
Now let’s predict the labels of new data, using what the model learned during training.
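Prediction can be sketched for a single image and for the whole test set. Note that `predict` expects a 2D array, so a single image has to be reshaped into one row. The seed and `max_iter` value remain my assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)

# Predict the label of one test image (reshaped to a single row of 64 values)
single_prediction = logisticRegr.predict(x_test[0].reshape(1, -1))
print(single_prediction)

# Predict the labels of the entire test set at once
predictions = logisticRegr.predict(x_test)
print(predictions[:10])
```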
It’s time to measure the performance of the model. There are various ways to do this, but I will keep it simple and use accuracy as the metric. Let’s first understand what accuracy is.
Accuracy is defined as the fraction of correct predictions:
accuracy = correct predictions / total number of data points
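The definition can be checked directly in code. The sketch below computes the accuracy by hand and compares it with the model's own `score` method. Because the random seed and `max_iter` are my assumptions, the number printed may differ slightly from the figure quoted in this article.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)
predictions = logisticRegr.predict(x_test)

# Accuracy by hand: correct predictions / total number of data points
manual_accuracy = np.sum(predictions == y_test) / len(y_test)

# The same quantity as reported by scikit-learn
score = logisticRegr.score(x_test, y_test)

print(manual_accuracy, score)
```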
Let’s compute the confusion matrix as well. A confusion matrix is a table that describes the performance of a classifier on a set of test data for which the true labels are known: each row corresponds to an actual class and each column to a predicted class. I will display it in two ways, using two Python packages, Seaborn and Matplotlib.
Before building the confusion matrix, let’s import the necessary Python packages.
Let’s first plot the confusion matrix using Seaborn.
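A sketch of the Seaborn version follows. It computes the matrix with `metrics.confusion_matrix` and draws it as an annotated heatmap. The colormap, figure size, seed, and `max_iter` are all choices I made for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)
predictions = logisticRegr.predict(x_test)
score = logisticRegr.score(x_test, y_test)

# Rows are actual labels, columns are predicted labels
cm = metrics.confusion_matrix(y_test, predictions)

plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=0.5, square=True, cmap="Blues_r")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
plt.title(f"Accuracy score: {score:.4f}", size=15)
plt.show()
```

The diagonal holds the correctly classified digits, so the larger the diagonal counts relative to the rest, the better the model.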
Now let’s plot the confusion matrix using Matplotlib.
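With plain Matplotlib the cell counts have to be written onto the image by hand. The following is one possible sketch, again with an assumed seed, iteration limit, and colormap.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)
cm = metrics.confusion_matrix(y_test, logisticRegr.predict(x_test))

plt.figure(figsize=(9, 9))
plt.imshow(cm, interpolation="nearest", cmap="Blues")
plt.title("Confusion matrix", size=15)
plt.colorbar()
tick_marks = np.arange(10)
plt.xticks(tick_marks, tick_marks)
plt.yticks(tick_marks, tick_marks)
plt.ylabel("Actual label")
plt.xlabel("Predicted label")

# Write each count into its cell (x is the column, y is the row)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, str(cm[i, j]), ha="center", va="center")
plt.show()
```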
So far we have used 75% of the data for training and 25% for testing, and for that split we got an accuracy of around 96.44%.
Let’s now find the accuracy for two other splits: 70% training with 30% testing, and 80% training with 20% testing.
We start with the 80% training and 20% testing split.
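The only change from the earlier split is the `test_size` argument. As before, `random_state=0` is an assumed seed added for reproducibility.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# 80% of the samples for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=0  # assumed seed
)

print(x_train.shape, x_test.shape)  # (1437, 64) (360, 64)
```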
We have already imported the required modules and created the LogisticRegression instance, so there is no need to do that again. It’s time to fit the model to the new training set.
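In a notebook the existing instance can simply be refitted, since calling `fit` again discards the previous weights. The sketch below recreates everything so that it runs standalone; the seed and `max_iter` are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=0  # assumed seed
)

logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit

# Fit on the 80% training portion
logisticRegr.fit(x_train, y_train)
```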
It’s time to predict.
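A sketch of the prediction step for the 80/20 split, with the same assumed seed and iteration limit as before:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)

# One predicted label per test image
predictions = logisticRegr.predict(x_test)
print(predictions[:10])
```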
Now let’s look at the accuracy we get with the 80% training and 20% testing split.
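The accuracy comes from the `score` method, as in the first experiment. The exact value depends on which samples land in the test set, so with my assumed seed it will not necessarily match the author's result.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)

# Fraction of the 360 test images classified correctly
score = logisticRegr.score(x_test, y_test)
print(score)
```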
This time I will plot the confusion matrix using Seaborn only, although Matplotlib would work just as well, as discussed earlier.
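The Seaborn heatmap for the 80/20 split can be sketched as follows. The styling, seed, and iteration limit are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)
score = logisticRegr.score(x_test, y_test)

# Confusion matrix of actual labels (rows) against predictions (columns)
cm = metrics.confusion_matrix(y_test, logisticRegr.predict(x_test))

plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=0.5, square=True, cmap="Blues_r")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
plt.title(f"Accuracy score (80/20 split): {score:.4f}", size=15)
plt.show()
```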
So far we have found the accuracy for the 75%/25% split and for the 80%/20% split. Let’s now take the case of a 70% training set and a 30% testing set.
We begin by splitting the data into 70% for training and 30% for testing.
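Again only `test_size` changes. The seed is an assumption, as in the earlier sketches.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# 70% of the samples for training, 30% for testing
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.30, random_state=0  # assumed seed
)

print(x_train.shape, x_test.shape)  # (1257, 64) (540, 64)
```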
After this, we follow the same steps as before, starting with fitting the model.
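For completeness, here is the fitting step on the 70% training portion, written as a standalone sketch with the same assumed seed and iteration limit.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.30, random_state=0  # assumed seed
)

logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit

# Fit on the 70% training portion
logisticRegr.fit(x_train, y_train)
```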
It’s time to predict.
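Prediction on the 30% test portion looks the same as before. Seed and iteration limit are assumed values.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.30, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)

# One predicted label per test image
predictions = logisticRegr.predict(x_test)
print(predictions[:10])
```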
Now let’s look at the accuracy we get with the 70% training and 30% testing split.
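To put the three experiments side by side, the sketch below wraps the steps in a small helper. The function name `accuracy_for_split` is hypothetical, and the seed and `max_iter` are assumptions, so the printed values may differ from the ones quoted in this article.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def accuracy_for_split(test_size, seed=0):
    """Train on (1 - test_size) of the data and return the test accuracy."""
    digits = load_digits()
    x_train, x_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=test_size, random_state=seed
    )
    model = LogisticRegression(max_iter=10000)  # assumed iteration limit
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)


# Accuracy for the 70/30 split, then for all three splits used in the article
score = accuracy_for_split(0.30)
print("70/30:", score)
for size in (0.25, 0.20, 0.30):
    print(f"test_size={size}: {accuracy_for_split(size):.4f}")
```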
Now I will again plot the confusion matrix using Seaborn. I load the libraries again here, although that is not strictly necessary: once imported, they remain available for as long as the kernel keeps running.
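A sketch of the final heatmap for the 70/30 split, with the imports repeated as described above. Styling, seed, and iteration limit are my assumptions.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.30, random_state=0  # assumed seed
)
logisticRegr = LogisticRegression(max_iter=10000)  # assumed iteration limit
logisticRegr.fit(x_train, y_train)
score = logisticRegr.score(x_test, y_test)

# Confusion matrix of actual labels (rows) against predictions (columns)
cm = metrics.confusion_matrix(y_test, logisticRegr.predict(x_test))

plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=0.5, square=True, cmap="Blues_r")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
plt.title(f"Accuracy score (70/30 split): {score:.4f}", size=15)
plt.show()
```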
Closing Thoughts
In this article we used scikit-learn for machine learning classification. There is not much to memorize, and if you use it regularly you will grow fond of it. Please let me know if you get stuck along the way. I will definitely look into your problem.
Thank you so much for reading this article.
The source code for this article is available on GitHub.