One Hot Encoding in Machine Learning

Recently, while working on a project based on Machine Learning, I came across something called "One Hot Encoding". When working with any dataset, the foremost requirement is pre-processing the data, and encoding is a big part of pre-processing: it puts the data into a form the computer can understand.

The two popular techniques for this are label encoding and one-hot encoding.

Label encoding refers to converting the labels into numeric form so as to make them machine-readable. Machine learning algorithms can then decide in a better way how those labels should be handled.

For example, suppose we have a height column in some random dataset.

The figure above shows the label encoding for the given height column.

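Since the figure itself isn't reproduced here, here is a minimal sketch of what that label encoding might look like, using scikit-learn's LabelEncoder on a hypothetical height column (the column name and values are assumptions for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataset with a categorical height column
df = pd.DataFrame({'height': ['Tall', 'Medium', 'Short', 'Tall', 'Short']})

# LabelEncoder assigns one integer per distinct category
encoder = LabelEncoder()
df['height_encoded'] = encoder.fit_transform(df['height'])

print(df)
# The categories are numbered alphabetically: Medium -> 0, Short -> 1, Tall -> 2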

One-hot encoding is a process that converts categorical variables into a form that ML algorithms can use to make better predictions. In simple terms, each label is converted into a list with one element per class (ten elements for a ten-class problem, for example); the element at the index of the corresponding class is set to 1 and the rest are set to 0.

For example, consider data containing fruits, their corresponding categorical values, and the resulting one-hot encoded columns.

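The image isn't reproduced here, but a minimal sketch of such a table can be built with pandas' get_dummies; the fruit names and categorical values below are assumptions for illustration:

import pandas as pd

# Hypothetical data: fruits with their label-encoded categorical value
df = pd.DataFrame({
    'fruit': ['apple', 'mango', 'apple', 'orange'],
    'categorical_value': [0, 1, 0, 2],
})

# One-hot encoding: one 0/1 column per fruit
one_hot = pd.get_dummies(df['fruit'], dtype=int)
print(pd.concat([df, one_hot], axis=1))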

Why is Label Encoding not Enough?

Comparing the two encodings, label encoding keeps the categories in a single column of integers, while one-hot encoding spreads each category into its own column; the numerical variable, price, stays the same in both cases. Label encoding only partially fixes the problem of working with categorical data, so we won't use it in every situation.

Moreover, the RMSE obtained with one-hot encoding is lower than with the label encoder, which means better accuracy.

The problem is that, because label encoding puts different numbers in the same column, the model may misinterpret the data as having some kind of order: 0 < 1 < 2.

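As a rough illustration of that comparison, here is a minimal sketch with made-up data (so the exact numbers mean nothing) that fits the same regression model on label-encoded and one-hot encoded versions of a categorical feature and reports the RMSE of each:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: a categorical fruit column and a numeric price target
df = pd.DataFrame({
    'fruit': ['apple', 'mango', 'orange'] * 40,
    'price': np.tile([3.0, 5.0, 4.0], 40) + np.random.default_rng(0).normal(0, 0.2, 120),
})

# Label encoding: a single integer column (implies an order: 0 < 1 < 2)
X_label = df['fruit'].astype('category').cat.codes.to_frame()
# One-hot encoding: one 0/1 column per category (no implied order)
X_onehot = pd.get_dummies(df['fruit'], dtype=int)
y = df['price']

for name, X in [('label encoding', X_label), ('one-hot encoding', X_onehot)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f'{name}: RMSE = {rmse:.3f}')

Here the category-to-price relationship is deliberately non-monotonic, which is exactly the situation where the artificial 0 < 1 < 2 ordering imposed by label encoding hurts a linear model, while the one-hot columns let it fit each category independently.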

Code for One-Hot Encoding

Let's see how we can code one-hot encoding and use it, starting with importing the libraries needed for pre-processing.

import numpy as np
import pandas as pd

Now, with the help of TensorFlow, this becomes very easy thanks to a helper function in Keras, to_categorical, which we'll apply to the labels of both the training set and the test set of any given dataset. Suppose we have some random dataset that we load with pandas' read_csv:

from tensorflow.keras.utils import to_categorical

dataset = pd.read_csv('random.csv')
# y_train and y_test are assumed to hold integer class labels
# taken from a train/test split of this dataset
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test)

To validate it, we can print the given labels as 10-dimensional vectors, for example:

1 — [0,1,0,0,0,0,0,0,0,0]

2 — [0,0,1,0,0,0,0,0,0,0]

print('y_train_encoded shape:', y_train_encoded.shape)
print('y_test_encoded shape:', y_test_encoded.shape)

To check an encoded label:

y_train_encoded[0]  # suppose this label happens to be the digit 5

will give the encoded label array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32).

This is how you can perform one-hot encoding.

Translated from: https://medium.com/nerd-for-tech/one-hot-encoding-in-machine-learning-2d14c22f7e26
