特征编码one-hot与dummy的区别与联系

最新推荐文章于 2023-01-19 20:18:34 发布

Rachel_nana

最新推荐文章于 2023-01-19 20:18:34 发布

阅读量2.8k

点赞数 1

分类专栏： python 文章标签： Python编码

本文链接：https://blog.csdn.net/abcdrachel/article/details/88085240

版权

python 专栏收录该内容

32 篇文章 6 订阅

订阅专栏

在模型的训练过程中，我们会对数据集的连续特征进行离散化操作，如使用简单的LR模型，然后对离散化后的特征进行one-hot编码或哑变量编码。这样通常会使得我们模型具有较强的非线性能力。

one-hot编码

思想：将离散化特征的每一种取值都成是一种状态，若你的这一特征中有N个不同的取值，那么我们就可以将这些特征抽象成N种不同的状态，one-hot编码保证了每一个取值只有一种状态处于“激活态”，也就是说N种状态中只有一个状态值为1，其他状态为0。假设我们以学历为例，现有小雪、中学、大学、硕士、博士五种类别，使用one-hot编码就会得到：

from numpy import argmax
# define input string
data = ["小学","初中","大学","硕士","博士"]
print(data)
# define universe of possible input values
alphabet = ["幼儿园","小学","初中","高中","大学","硕士","博士","博士后"]
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
#%%
# integer encode input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
#%%
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
       letter = [0 for _ in range(len(alphabet))]
       letter[value] = 1
       onehot_encoded.append(letter)
print(onehot_encoded)
#%%
# invert encoding
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)

python中one-hot的用法

from sklearn import preprocessing
array = np.array([[0,0,3],[1,1,0],[0,2,1],[1,0,2]])
print(array)
enc = preprocessing.OneHotEncoder()
enc.fit(array)
print(enc.n_values_) #每个特征对应的最大位数
print(enc.transform([[0,1,3]]).toarray())

array([[0, 0, 3],
       [1, 1, 0],
       [0, 2, 1],
       [1, 0, 2]])
array是一个4行3列的矩阵，每一列对应于一个样本的特征序列，即一个样本有三个特征，4行表示传入了4个样本
[2 3 4]
每个特征对应的位数：观察第一列可知，第一个特征有两个取值0,1；第二个特征有三个取值0,1,2；第三个特征有四个取值0,1,2,3
[[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
故第一个特征的onehot编码是一个两位的0、1串，第二个特征是一个三位0、1串，第三个特征是一个四位的0,1串

from sklearn import preprocessing
data = ["小学","初中","大学","硕士","博士"]
dataf = ["幼儿园","小学","初中","高中","大学","硕士","博士","博士后"]
le = preprocessing.LabelEncoder()
datafm = le.fit_transform(dataf)
datam = le.fit_transform(data)
#%%
onehot = preprocessing.OneHotEncoder()
onehot.fit(datafm.reshape(-1,1))
print(onehot.n_values_)
[8]

datafm
Out[78]: array([5, 4, 0, 7, 3, 6, 1, 2], dtype=int64)

datam
Out[79]: array([3, 0, 2, 4, 1], dtype=int64)
#%%
print(onehot.transform(datam.reshape(-1,1)).toarray())

[[0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]]

print(onehot.transform(datafm.reshape(-1,1)).toarray())
[[0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0.]]
print(onehot.transform(datafm.reshape(-1,1)))
  (0, 5)        1.0
  (1, 4)        1.0
  (2, 0)        1.0
  (3, 7)        1.0
  (4, 3)        1.0
  (5, 6)        1.0
  (6, 1)        1.0
  (7, 2)        1.0

在对非数值型特征进行onehot编码时需要先通过LabelEncoder()将分类变量转换成整数形式，然后通过OneHotEncoder()进行编码，否则会报错误如下ValueError: could not convert string to float: '小学'。另外再用OneHotEncoder()进行编码时需为2D array，否则会有如下错误：ValueError: Expected 2D array, got 1D array instead:
array=[3. 0. 2. 4. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

dummy encoding

哑变量编码直观的解释就是任意的将一个状态去除，如学位信息，使用哑变量可以用4个状态就足以反映上述5个类别的信息，也就是我们仅使用前四个状态位就可以表示博士了。因为对于我们的一个研究样本，他如果不是小学生，也不是中学生，大学生、研究生，那他便默认为博士了。所以，用哑变量可以将上述5类表示成：

data = ["小学","初中","大学","硕士","博士"]
pd.get_dummies(data,prefix="data")

   data_初中  data_博士  data_大学  data_小学  data_硕士
0        0        0        0        1        0
1        1        0        0        0        0
2        0        0        1        0        0
3        0        0        0        0        1
4        0        1        0        0        0