Handling Discrete Variables in Python: Dummy-Variable Encoding and One-Hot Encoding

要早睡的码农

Tags: machine learning, data analysis

Article link: https://blog.csdn.net/dengfenglaipppp/article/details/89400351

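Categorical variables cannot be fed directly into most models during modelling and analysis, so they have to be encoded first. The two most common approaches are dummy-variable encoding and one-hot encoding, and I tried both on some recent business data. A fairly detailed introduction to the two encodings and the difference between them can be found on ML小菜鸟's blog (blog link).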

Library for dummy-variable encoding: pandas
Libraries for one-hot encoding: sklearn, keras

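Note: by default, pandas.get_dummies() only encodes the string-typed (object or category) columns of a DataFrame. keras's to_categorical() only accepts integer class labels, so strings must go through LabelEncoder first. The same was true of sklearn's OneHotEncoder before version 0.20; later versions also accept string categories directly.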
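The default behaviour of get_dummies() described in the note can be checked with a small sketch. The DataFrame below is made up for illustration: the `grade` column and its values are hypothetical and do not come from the original data.

```python
import pandas as pd

# Hypothetical toy data: one string column and one integer-coded categorical column
df = pd.DataFrame({
    "factory_id": ["C10235", "C10135", "**"],
    "grade": [1, 2, 1],
})

# Default behaviour: only the string column is expanded; `grade` stays as it is
default_encoded = pd.get_dummies(df)
print(default_encoded.columns.tolist())

# Listing the columns explicitly forces the numeric column to be encoded as well
all_encoded = pd.get_dummies(df, columns=["factory_id", "grade"])
print(all_encoded.columns.tolist())
```

If a numeric column is really a category code, passing it through `columns=` (or casting it to `str` or `category` beforehand) is one way to get it encoded.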

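1. Dummy-variable encoding with pandas

pandas provides the get_dummies() function:

pandas.get_dummies(data, prefix=None): the prefix parameter sets the prefix of the encoded column names; it can also be left at its default.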

# Load the data
import pandas as pd
# Read the CSV and keep the factory_id column as a Series
factory = pd.read_csv('C:/Users/Dell/Documents/LTDSJ/example1data/data.csv', encoding='utf-8', engine='python')['factory_id']

The data looks like this:

Out[1]: 
0    C10235
1    C10135
2    C10035
3    C10135
4    C10385
5    C14534
6        **
7        **
8    C10129
9    C10150
Name: factory_id, dtype: object

Apply the dummy-variable encoding:

encode = pd.get_dummies(factory)

Encoding result (partial):

Out[2]: 
   **  C10035  C10129  C10135  C10150  C10235  C10385  C14534
0   0       0       0       0       0       1       0       0
1   0       0       0       1       0       0       0       0
2   0       1       0       0       0       0       0       0
3   0       0       0       1       0       0       0       0
4   0       0       0       0       0       0       1       0
5   0       0       0       0       0       0       0       1
6   1       0       0       0       0       0       0       0
7   1       0       0       0       0       0       0       0
8   0       0       1       0       0       0       0       0
9   0       0       0       0       1       0       0       0
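To make the role of `prefix` concrete, here is a self-contained sketch that retypes the ten sample values shown above instead of reading the CSV. It also uses `drop_first=True`, which follows the common convention that dummy-variable coding keeps k-1 columns for k levels while one-hot coding keeps all k; that convention is an assumption about the intended distinction, not something stated in this post. `dtype=int` is passed because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# The ten sample values shown above, typed in by hand
factory = pd.Series(
    ["C10235", "C10135", "C10035", "C10135", "C10385",
     "C14534", "**", "**", "C10129", "C10150"],
    name="factory_id",
)

# prefix= prepends a label to every generated column name; dtype=int keeps 0/1 values
one_hot = pd.get_dummies(factory, prefix="factory", dtype=int)

# drop_first=True drops the first level, leaving k-1 columns
dummy = pd.get_dummies(factory, prefix="factory", drop_first=True, dtype=int)

print(one_hot.shape, dummy.shape)  # (10, 8) (10, 7)
print(dummy.columns.tolist())
```

With `drop_first=True` the dropped level (here `**`) becomes the reference category: a row of all zeros means that level.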

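2. One-hot encoding with sklearn

sklearn provides OneHotEncoder() for one-hot encoding. Because older versions of sklearn only handle numeric categorical variables, the strings are first converted to integer labels with LabelEncoder: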

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Map each distinct string to an integer label (assigned in sorted order)
label = LabelEncoder()
factory_label = label.fit_transform(factory)
print(factory_label)
# Output
[5 3 1 3 6 7 0 0 2 4]

Then apply the one-hot encoding:

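# Reshape the 1-D label array into a single column
factory_label = factory_label.reshape(len(factory_label), 1)
# Encode (scikit-learn 1.2+ names this parameter sparse_output; older versions use sparse=False)
factory_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = factory_encoder.fit_transform(factory_label)
# Inspect the encoded result
onehot_encoded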

Out[53]: 
array([[0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.]])
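On scikit-learn 0.20 or later the LabelEncoder step can be skipped, because OneHotEncoder accepts string categories directly. The following is a minimal sketch under that version assumption, again using the ten sample values typed in by hand. It calls `toarray()` on the default sparse result so it does not depend on the `sparse` / `sparse_output` parameter name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D array of shape (n_samples, n_features)
factory = np.array(
    ["C10235", "C10135", "C10035", "C10135", "C10385",
     "C14534", "**", "**", "C10129", "C10150"]
).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense
onehot_encoded = encoder.fit_transform(factory).toarray()

# Categories learned for the single feature, in sorted order
print(encoder.categories_[0])

# inverse_transform maps one-hot rows back to the original strings
recovered = encoder.inverse_transform(onehot_encoded)
print(recovered.ravel())
```

The column order follows `encoder.categories_[0]`, which should match the integer labels LabelEncoder assigned above since both sort the distinct values.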

3. One-hot encoding with keras

As with OneHotEncoder(), the strings must first be converted to integer labels with LabelEncoder() before one-hot encoding.

# One-hot encoding with keras
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
# Convert the string categories to integer labels
label = LabelEncoder()
factory_label = label.fit_transform(factory)

# Encode
categorical_encoded = to_categorical(factory_label)
categorical_encoded

Out[85]: 
array([[0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.]], dtype=float32)
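For readers without keras installed, the same matrix can be reproduced with NumPy alone. This is only a sketch of the idea behind to_categorical() (index an identity matrix by the integer labels), not keras's actual implementation; the labels are copied from the LabelEncoder output above.

```python
import numpy as np

# Integer labels as produced by LabelEncoder in the example above
factory_label = np.array([5, 3, 1, 3, 6, 7, 0, 0, 2, 4])

# Infer the number of classes as max label + 1
num_classes = int(factory_label.max()) + 1

# Row i of the identity matrix is the one-hot vector for class i
categorical_encoded = np.eye(num_classes, dtype="float32")[factory_label]
print(categorical_encoded.shape)  # (10, 8)
```

Because the number of classes is inferred from the largest label, a batch that happens to lack the highest class would produce fewer columns; passing the class count explicitly avoids that.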

The one-hot encoded data can also be decoded back to the integer labels:

# Decode
import numpy as np
# The position of the 1 in each row is the integer label
decode = np.argmax(categorical_encoded, axis=1).tolist()
print(decode)
# Result
[5, 3, 1, 3, 6, 7, 0, 0, 2, 4]

The corresponding original values and their labels:

index	factory	label
0	C10235	5
1	C10135	3
2	C10035	1
3	C10135	3
4	C10385	6
5	C14534	7
6	**	0
7	**	0
8	C10129	2
9	C10150	4
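The mapping in the table does not have to be looked up by hand: a fitted LabelEncoder can turn integer labels back into the original strings with inverse_transform(). The sketch below is self-contained, so it rebuilds the labels from the ten sample values and uses a NumPy identity matrix as a stand-in for the one-hot matrix produced earlier.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

factory = ["C10235", "C10135", "C10035", "C10135", "C10385",
           "C14534", "**", "**", "C10129", "C10150"]

label = LabelEncoder()
factory_label = label.fit_transform(factory)

# Stand-in for the one-hot matrix produced in the sections above
categorical_encoded = np.eye(len(label.classes_))[factory_label]

# Step 1: one-hot rows -> integer labels
decoded_labels = np.argmax(categorical_encoded, axis=1)
# Step 2: integer labels -> original strings
decoded_names = label.inverse_transform(decoded_labels)

print(decoded_labels.tolist())
print(list(decoded_names))
```

This only works with the same fitted encoder object, so in practice it needs to be kept (or saved) alongside the model.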
