Label Encoder and OneHot Encoder in Python

Most machine learning algorithms work with numbers, not text. Hence, all text columns must be converted into numerical columns before an algorithm can use them.

This is the story of transforming labels, categories, or text values into numbers. In simple words,

Encoding is the process of transforming words into numbers

In Python, OneHot Encoding and Label Encoding are two methods for encoding categorical columns into numerical columns. Both are part of one of the most commonly used Python libraries: Scikit-Learn.

But wait, you don’t want to import Scikit-Learn in your notebook?

No problem at all, ⚡️ Pandas comes to your help.

Let us dive into this story of converting categorical variables into numerical ones so that an ML algorithm can understand them.

Categorical Data

A dataset typically contains multiple columns, some holding numerical values and some holding categorical values.

Image by Author: Datatypes of categorical column

Categorical variables represent types of data which may be divided into groups. A categorical variable has a limited and usually fixed number of possible values, called categories. Variables like gender, social class, blood type, and country codes are examples of categorical data.

However, this data can be processed by a machine learning algorithm only after it has been encoded into numerical values.

Let us consider the example below to understand encoding in a simple way.

import pandas as pd

# Example data: five countries, their continents, and a one-letter code
countries = ["Germany", "India", "UK", "Egypt", "Iran"]
continents = ["Europe", "Asia", "Europe", "Africa", "Asia"]
code = ["G", "I", "U", "E", "N"]
d = {"country": countries, "continent": continents, "code": code}
df = pd.DataFrame(d)
Image by Author: Example DataFrame

Convert the data type of the column “code” from object to category:

df['code'] = df.code.astype('category')
Image by Author: Datatypes of all columns
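
The data types shown in the figure can also be checked directly in code. The snippet below is a minimal sketch that rebuilds the example DataFrame and prints the data type of the column “code” before and after the conversion. The variable names before and after are only illustrative, and the exact name printed for a text column may differ between pandas versions.

```python
import pandas as pd

# Rebuild the example DataFrame from above
df = pd.DataFrame({
    "country": ["Germany", "India", "UK", "Egypt", "Iran"],
    "continent": ["Europe", "Asia", "Europe", "Africa", "Asia"],
    "code": ["G", "I", "U", "E", "N"],
})

# Data type of the "code" column before the conversion
before = df["code"].dtype

# Convert the column to the pandas "category" dtype
df["code"] = df["code"].astype("category")
after = df["code"].dtype

print(before, "->", after)
print(df.dtypes)
```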

With this example let us understand the encoding process.

Label Encoding in Python

Label encoding is a simple and straightforward approach. It converts each value in a categorical column into a numerical value. Each value in a categorical column is called a label.

Label encoding: Assign a unique integer to each label based on alphabetical order

Let me show you how label encoding works in Python with the same example as above:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Encode each continent label as an integer and store it in a new column
df["labeled_continent"] = le.fit_transform(df["continent"])
df

The labels in the column continent are converted into numbers and stored in the new column labeled_continent.

The output will be,

Image by Author: Label Encoding in Python

In simpler words, the labels are arranged in alphabetical order and a unique index, starting from 0, is assigned to each label.

Image by Author: Understand Label Encoding in Python
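
The alphabetical mapping described above can be inspected on the fitted encoder itself. The following is a minimal sketch using the same continent values. The names mapping and restored are only illustrative and not part of the original example.

```python
from sklearn.preprocessing import LabelEncoder

continents = ["Europe", "Asia", "Europe", "Africa", "Asia"]

le = LabelEncoder()
codes = le.fit_transform(continents)

# classes_ holds the sorted unique labels; the position of a label is its code
mapping = {str(label): index for index, label in enumerate(le.classes_)}
print(mapping)  # {'Africa': 0, 'Asia': 1, 'Europe': 2}

# inverse_transform converts the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(codes.tolist())  # [2, 1, 2, 0, 1]
```

Keeping the fitted encoder around is what makes it possible to decode the integers back into labels later.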

All looks good? Worked well ⁉️

Here comes the problem with label encoding. It uses a sequence of numbers, which introduces a comparison between the labels. In the above example, the labels in the column continent do not have an order or rank. But after label encoding, these labels are ordered alphabetically, and because of the numbers assigned to them a machine learning model can interpret this ordering as Europe > Asia > Africa.
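
To see why this implied order is misleading, consider what happens when arithmetic is applied to the codes, as some models effectively do with numerical features. The sketch below is illustrative only, and the variable names are hypothetical: the midpoint between the codes for Africa and Europe decodes to Asia, a statement that carries no real meaning.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Europe", "Asia", "Europe", "Africa", "Asia"])

# Codes follow alphabetical order: Africa -> 0, Asia -> 1, Europe -> 2
africa, europe = le.transform(["Africa", "Europe"]).tolist()

# Arithmetic on the codes is possible, but the result is meaningless
midpoint = (africa + europe) // 2
print(le.inverse_transform([midpoint])[0])  # Asia
```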

To overcome this ordering problem with Label Encoding, OneHot Encoding comes into the picture.

OneHot Encoding in Python

In OneHot encoding, a binary column is created for each label in a column. Each label is transformed into a new column, or new feature, and assigned a value of 1 (Hot) or 0 (Cold).

Let me show you an example first to understand the above statement:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() converts it to a dense array
df3 = pd.DataFrame(ohe.fit_transform(df[["continent"]]).toarray())
df_new = pd.concat([df, df3], axis=1)
df_new
Image by Author: OneHot Encoding in Python

In this scenario, the last three columns are the result of OneHot Encoding. The columns named 0, 1, and 2 correspond to the labels Africa, Asia, and Europe respectively. OneHot encoding transforms these labels into columns. Hence, looking at the last 3 columns, we have 3 labels → 3 columns

OneHot Encoding: In a single row only one Label is Hot

In a particular row, only one label has a value of 1 and all other labels have a value of 0. Before feeding such an encoded dataset into a machine learning model, a few more transformations can be done, as described in the OneHot Encoding documentation.

Have a quick look at this article to know more options for merging 2 DataFrames

pandas.get_dummies() in Python

OneHot encoding can be implemented in a simpler way and without importing Scikit-Learn.

⚡️ Yes! Pandas is your friend here. The simple function pandas.get_dummies() quickly transforms all the labels from a specified column into individual binary columns.

df2=pd.get_dummies(df[["continent"]])
df_new=pd.concat([df,df2],axis=1)
df_new
Image by Author: Pandas dummy variables

The last 3 columns of the above DataFrame are the same as those observed in OneHot Encoding.

pandas.get_dummies() generates a dummy variable for each label in the column continent. Hence, continent_Africa, continent_Asia, and continent_Europe are the dummy binary variables for the labels Africa, Asia, and Europe respectively.
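
As a further sketch, pandas.get_dummies() can also encode and merge in one step through its columns parameter, which replaces the named column with its dummy columns so that no separate pd.concat() call is needed. Recent pandas versions return True/False values by default; passing dtype=int, as assumed here, gives 0/1 values instead.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "India", "UK", "Egypt", "Iran"],
    "continent": ["Europe", "Asia", "Europe", "Africa", "Asia"],
})

# Encode "continent" in place: the original column is dropped and
# one 0/1 column per label is appended to the DataFrame
df_new = pd.get_dummies(df, columns=["continent"], dtype=int)
print(df_new)
```

Note that, unlike the pd.concat() version above, the original continent column is no longer present in the result.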

Through my story,

I walked you through the methods of converting categorical variables into numerical variables. Each method has its own strengths and limitations, hence it is important to understand all of them. Depending on your dataset and the machine learning model you want to implement, you can choose any of the above three encoding methods in Python.

Here are a few resources which can help you with this topic:

  1. Label Encoding in Python

  2. OneHot Encoding in Python

  3. Get Dummy variables using Pandas

Liked my way of storytelling?

Here is an interesting fun & learn activity for you to create your own dataset. Have a look.

Thank you for your time!

I am always open to suggestions and new opportunities. Feel free to add your feedback and connect with me on LinkedIn.

Translated from: https://towardsdatascience.com/label-encoder-and-onehot-encoder-in-python-83d32288b592
