Machine Learning with Python - Preparing Data

Introduction

Machine Learning algorithms are completely dependent on data, because data is the most crucial aspect that makes model training possible. On the other hand, if we cannot make sense of that data before feeding it to ML algorithms, a machine will be useless. In simple words, we always need to feed the right data, i.e. data in the correct scale and format that contains meaningful features, for the problem we want the machine to solve.

This makes data preparation the most important step in the ML process. Data preparation may be defined as the procedure that makes our dataset more appropriate for the ML process.

Why Data Pre-processing?

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data preprocessing converts the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it meets the expectations of the machine learning algorithm.

Data Pre-processing Techniques

The following data preprocessing techniques can be applied to a dataset to produce data for ML algorithms −

Scaling

Most probably our dataset comprises attributes with varying scale, but we cannot provide such data to an ML algorithm; hence it requires rescaling. Data rescaling makes sure that attributes are at the same scale. Generally, attributes are rescaled into the range of 0 and 1. ML algorithms like gradient descent and k-Nearest Neighbors require scaled data. We can rescale the data with the help of the MinMaxScaler class of the scikit-learn Python library.
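
For intuition, the arithmetic behind min-max rescaling is simple; here is a minimal NumPy sketch on a single made-up column of values (an illustration, not part of the dataset script below):


import numpy as np
x = np.array([2.0, 5.0, 9.0])                     # one attribute (column) with made-up values
x_scaled = (x - x.min()) / (x.max() - x.min())    # the minimum maps to 0, the maximum to 1
print(x_scaled)                                   # ≈ [0.   0.43 1.  ]

This is essentially what MinMaxScaler computes for each column when feature_range is (0, 1).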

Example

In this example, we will rescale the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in the previous chapters) and then, with the help of the MinMaxScaler class, it will be rescaled into the range of 0 and 1.

The first few lines of the following script are the same as the ones we wrote in previous chapters while loading the CSV data.


from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing
# Load the CSV into a pandas DataFrame, then pull out the raw values as a NumPy array
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the MinMaxScaler class to rescale the data into the range of 0 and 1.


data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)

We can also summarize the output as per our choice. Here, we set the precision to 1 and show the first 10 rows of the output.


set_printoptions(precision=1)
print("\nScaled data:\n", data_rescaled[0:10])

Output


Scaled data:
[
   [0.4 0.7 0.6 0.4 0.  0.5 0.2 0.5 1. ]
   [0.1 0.4 0.5 0.3 0.  0.4 0.1 0.2 0. ]
   [0.5 0.9 0.5 0.  0.  0.3 0.3 0.2 1. ]
   [0.1 0.4 0.5 0.2 0.1 0.4 0.  0.  0. ]
   [0.  0.7 0.3 0.4 0.2 0.6 0.9 0.2 1. ]
   [0.3 0.6 0.6 0.  0.  0.4 0.1 0.2 0. ]
   [0.2 0.4 0.4 0.3 0.1 0.5 0.1 0.1 1. ]
   [0.6 0.6 0.  0.  0.  0.5 0.  0.1 0. ]
   [0.1 1.  0.6 0.5 0.6 0.5 0.  0.5 1. ]
   [0.5 0.6 0.8 0.  0.  0.  0.1 0.6 1. ]
]

From the above output, we can see that all the data has been rescaled into the range of 0 and 1.

Normalization

Another useful data preprocessing technique is normalization. It is used to rescale each row of data to have a length of 1. It is mainly useful in sparse datasets where we have lots of zeros. We can rescale the data with the help of the Normalizer class of the scikit-learn Python library.

Types of Normalization

In machine learning, there are two types of normalization preprocessing techniques, as follows −

L1 Normalization

It may be defined as the normalization technique that modifies the dataset values in such a way that in each row the sum of the absolute values is always 1. It is also called Least Absolute Deviations.
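
As a quick check on this definition, here is a hand-rolled L1 normalization of a single made-up row (independent of the dataset script below):


import numpy as np
row = np.array([1.0, -2.0, 3.0])       # made-up row
row_l1 = row / np.abs(row).sum()       # divide by the sum of absolute values (here 6)
print(row_l1)                          # ≈ [ 0.17 -0.33  0.5 ]; the absolute values now sum to 1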

Example

In this example, we use the L1 normalization technique to normalize the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then, with the help of the Normalizer class, it will be normalized.

The first few lines of the following script are the same as the ones we wrote in previous chapters while loading the CSV data.


from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the Normalizer class with the L1 norm to normalize the data.


data_normalizer = Normalizer(norm='l1').fit(array)
data_normalized = data_normalizer.transform(array)

We can also summarize the output as per our choice. Here, we set the precision to 2 and show the first 3 rows of the output.


set_printoptions(precision=2)
print("\nNormalized data:\n", data_normalized[0:3])

Output


Normalized data:
[
   [0.02 0.43 0.21 0.1  0. 0.1  0. 0.14 0. ]
   [0.   0.36 0.28 0.12 0. 0.11 0. 0.13 0. ]
   [0.03 0.59 0.21 0.   0. 0.07 0. 0.1  0. ]
]

L2 Normalization

It may be defined as the normalization technique that modifies the dataset values in such a way that in each row the sum of the squares is always 1. It is also called Least Squares.
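
Again, as a quick check on the definition, here is a hand-rolled L2 normalization of a single made-up row (independent of the dataset script below):


import numpy as np
row = np.array([3.0, 4.0])                  # made-up row
row_l2 = row / np.sqrt((row ** 2).sum())    # divide by the Euclidean (L2) norm, here 5
print(row_l2)                               # [0.6 0.8]; the squares now sum to 0.36 + 0.64 = 1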

Example

In this example, we use the L2 normalization technique to normalize the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded (as done in previous chapters) and then, with the help of the Normalizer class, it will be normalized.

The first few lines of the following script are the same as the ones we wrote in previous chapters while loading the CSV data.


from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the Normalizer class with the L2 norm to normalize the data.


data_normalizer = Normalizer(norm='l2').fit(array)
data_normalized = data_normalizer.transform(array)

We can also summarize the output as per our choice. Here, we set the precision to 2 and show the first 3 rows of the output.


set_printoptions(precision=2)
print("\nNormalized data:\n", data_normalized[0:3])

Output


Normalized data:
[
   [0.03 0.83 0.4  0.2  0. 0.19 0. 0.28 0.01]
   [0.01 0.72 0.56 0.24 0. 0.22 0. 0.26 0.  ]
   [0.04 0.92 0.32 0.   0. 0.12 0. 0.16 0.01]
]

Binarization

As the name suggests, this is the technique with the help of which we can make our data binary. We use a binary threshold: values above the threshold are converted to 1 and values below it are converted to 0. For example, if we choose a threshold value of 0.5, the dataset values above it become 1 and those below it become 0. That is why we can call it binarizing the data or thresholding the data. This technique is useful when we have probabilities in our dataset and want to convert them into crisp values.

We can binarize the data with the help of the Binarizer class of the scikit-learn Python library.
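
Conceptually, binarization is just a comparison against the threshold; here is a minimal NumPy sketch on made-up values (an illustration, not part of the dataset script below):


import numpy as np
x = np.array([0.1, 0.5, 0.9])        # made-up values
x_binary = (x > 0.5).astype(float)   # strictly above the threshold becomes 1, the rest 0
print(x_binary)                      # [0. 0. 1.]

Note that Binarizer maps values strictly greater than the threshold to 1, so a value exactly equal to the threshold becomes 0.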

Example

In this example, we will rescale the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then, with the help of the Binarizer class, it will be converted into binary values, i.e. 0 and 1, depending upon the threshold value. We are taking 0.5 as the threshold value.

The first few lines of the following script are the same as the ones we wrote in previous chapters while loading the CSV data.


from pandas import read_csv
from sklearn.preprocessing import Binarizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the Binarizer class to convert the data into binary values.


binarizer = Binarizer(threshold=0.5).fit(array)
data_binarized = binarizer.transform(array)

Here, we show the first 5 rows of the output.


print ("\nBinary data:\n", Data_binarized [0:5])

Output


Binary data:
[
   [1. 1. 1. 1. 0. 1. 1. 1. 1.]
   [1. 1. 1. 1. 0. 1. 0. 1. 0.]
   [1. 1. 1. 0. 0. 1. 1. 1. 1.]
   [1. 1. 1. 1. 1. 1. 0. 1. 0.]
   [0. 1. 1. 1. 1. 1. 1. 1. 1.]
]

Standardization

Another useful data preprocessing technique is standardization, which is basically used to transform data attributes with a Gaussian distribution, shifting the mean and SD (Standard Deviation) to a standard Gaussian distribution with a mean of 0 and an SD of 1. This technique is useful in ML algorithms like linear regression and logistic regression that assume a Gaussian distribution in the input dataset and produce better results with rescaled data. We can standardize the data (mean = 0 and SD = 1) with the help of the StandardScaler class of the scikit-learn Python library.
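
For reference, standardization applies the z-score transform to each attribute; here is a minimal NumPy sketch on a made-up column (an illustration, not part of the dataset script below):


import numpy as np
x = np.array([2.0, 4.0, 6.0])       # one attribute (column) with made-up values
x_std = (x - x.mean()) / x.std()    # subtract the mean, divide by the standard deviation
print(x_std)                        # ≈ [-1.22  0.    1.22]; mean is now 0 and SD is 1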

Example

In this example, we will rescale the data of the Pima Indians Diabetes dataset which we used earlier. First, the CSV data will be loaded and then, with the help of the StandardScaler class, it will be converted into a Gaussian distribution with mean = 0 and SD = 1.

The first few lines of the following script are the same as the ones we wrote in previous chapters while loading the CSV data.


from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the StandardScaler class to rescale the data.


data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)

We can also summarize the output as per our choice. Here, we set the precision to 2 and show the first 5 rows of the output.


set_printoptions(precision=2)
print("\nRescaled data:\n", data_rescaled[0:5])

Output


Rescaled data:
[
   [ 0.64  0.85  0.15  0.91 -0.69  0.2   0.47  1.43  1.37]
   [-0.84 -1.12 -0.16  0.53 -0.69 -0.68 -0.37 -0.19 -0.73]
   [ 1.23  1.94 -0.26 -1.29 -0.69 -1.1   0.6  -0.11  1.37]
   [-0.84 -1.   -0.16  0.15  0.12 -0.49 -0.92 -1.04 -0.73]
   [-1.14  0.5  -1.5   0.91  0.77  1.41  5.48 -0.02  1.37]
]

Data Labeling

We discussed the importance of good data for ML algorithms, as well as some techniques to pre-process the data before sending it to ML algorithms. One more aspect in this regard is data labeling. It is also very important that the data sent to ML algorithms has proper labeling. For example, in the case of classification problems, the data carries lots of labels in the form of words, numbers and so on.

What is Label Encoding?

Most of the sklearn functions expect data with number labels rather than word labels. Hence, we need to convert such labels into number labels. This process is called label encoding. We can perform label encoding of data with the help of the LabelEncoder class of the scikit-learn Python library.
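
Under the hood, LabelEncoder simply sorts the unique labels and maps each label to its index in that sorted order; here is a hand-rolled sketch of the same idea on made-up labels:


labels = ['red', 'black', 'red', 'green']                        # made-up word labels
classes = sorted(set(labels))                                    # ['black', 'green', 'red']
mapping = {label: index for index, label in enumerate(classes)}  # label -> number
print(mapping)                                                   # {'black': 0, 'green': 1, 'red': 2}

This also explains the output of the example below: 'black' comes first alphabetically, so it encodes to 0.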

Example

In the following example, the Python script will perform the label encoding.

First, import the required Python libraries as follows −


from sklearn import preprocessing

Now, we need to provide the input labels as follows −


input_labels = ['red','black','red','green','black','yellow','white']

The next lines of code will create the label encoder and train it.


encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

The next lines of the script will check the performance by encoding a randomly ordered list −


test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)        # encode words as numbers
print("\nLabels =", test_labels)
print("Encoded values =", list(encoded_values))
encoded_values = [3,0,4,1]                             # numeric values to decode back into words
decoded_list = encoder.inverse_transform(encoded_values)

We can print the encoded values and the decoded labels with the help of the following Python script −


print("\nEncoded values =", encoded_values)
print("\nDecoded labels =", list(decoded_list))

Output


Labels = ['green', 'red', 'black']
Encoded values = [1, 2, 0]
Encoded values = [3, 0, 4, 1]
Decoded labels = ['white', 'black', 'yellow', 'green']

Translated from: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_preparing_data.htm
