Machine Learning: Label Encoding and One-Hot Encoding

Latest recommended article published 2023-05-28 18:50:27

Michael_Flemming

Category: scikit-learn personal study notes
Tags: machine learning, python, sklearn

Article link: https://blog.csdn.net/weixin_44360866/article/details/126546285

I finally found a reliable article on this topic: link

Label encoding

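The idea is dictionary-like: each distinct value of a categorical variable is mapped to an integer.
In Python there are two main ways to do this:

The LabelEncoder class in sklearn's preprocessing module.
The factorize() function/method in pandas.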

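1. The LabelEncoder class

sklearn.preprocessing.LabelEncoder
Common methods:
fit(y): think of the encoder as an empty dictionary and y as the words to put into it; fit records the sorted unique values of y as classes_.
fit_transform(y): equivalent to fit followed by transform, i.e. put y into the dictionary and then return the index of each value.
inverse_transform(y): recover the original values from the index values y.
transform(y): convert y to index values using the dictionary built by fit.

Tested on the Kaggle diamonds dataset: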

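import pandas as pd
from sklearn.preprocessing import LabelEncoder

# diamonds_data is assumed to be the Kaggle diamonds CSV already loaded as a pandas DataFrame
le = LabelEncoder()
le_cut = le.fit_transform(diamonds_data['cut'])  # fit the mapping and encode in one step
print(le.classes_)  # the learned classes, in sorted order
print(le_cut)       # the integer code of each row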

['Fair' 'Good' 'Ideal' 'Premium' 'Very Good']
[2 3 1 ... 4 3 2]
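
The four methods listed earlier can be sketched on a few toy labels. This is only an illustration: the list `cuts` is made-up data standing in for the 'cut' column, not the real dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'cut' column (illustrative values only).
cuts = ["Ideal", "Premium", "Good", "Ideal", "Fair"]

le = LabelEncoder()
codes = le.fit_transform(cuts)   # learn the sorted classes, then encode
print(le.classes_)               # the learned classes, in sorted order
print(codes)                     # one integer code per label

# transform() reuses the fitted mapping on new data.
new_codes = le.transform(["Good", "Fair"])

# inverse_transform() maps codes back to the original labels.
restored = le.inverse_transform(codes)
print(new_codes, restored)
```

Note that transform() raises a ValueError for a label that was not seen during fit.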

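Encoding several columns:

le = LabelEncoder()
list_to_encode = ['cut', 'color', 'clarity']

# The same encoder is refit on each column, so after the loop le only holds the last column's classes
for col in list_to_encode:
    diamonds_data[col] = le.fit_transform(diamonds_data[col])  # overwrite the column with its integer codes
    print(col + ":")
    print(le.classes_)
print(diamonds_data)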

cut:
['Fair' 'Good' 'Ideal' 'Premium' 'Very Good']
color:
['D' 'E' 'F' 'G' 'H' 'I' 'J']
clarity:
['I1' 'IF' 'SI1' 'SI2' 'VS1' 'VS2' 'VVS1' 'VVS2']
       carat  cut  color  clarity  depth  table  price     x     y     z
1       0.23    2      1        3   61.5   55.0    326  3.95  3.98  2.43
2       0.21    3      1        2   59.8   61.0    326  3.89  3.84  2.31
3       0.23    1      1        4   56.9   65.0    327  4.05  4.07  2.31
4       0.29    3      5        5   62.4   58.0    334  4.20  4.23  2.63
5       0.31    1      6        3   63.3   58.0    335  4.34  4.35  2.75
...      ...  ...    ...      ...    ...    ...    ...   ...   ...   ...
53936   0.72    2      0        2   60.8   57.0   2757  5.75  5.76  3.50
53937   0.72    1      0        2   63.1   55.0   2757  5.69  5.75  3.61
53938   0.70    4      0        2   62.8   60.0   2757  5.66  5.68  3.56
53939   0.86    3      4        3   61.0   58.0   2757  6.15  6.12  3.74
53940   0.75    2      0        3   62.2   55.0   2757  5.83  5.87  3.64
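
Because the loop refits a single LabelEncoder, only the mapping of the last column survives. If the codes need to be decoded later, one option is to keep one encoder per column. The sketch below assumes a small made-up DataFrame `df`, and the dict `encoders` is a hypothetical helper, not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for diamonds_data (illustrative values only).
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "J"],
    "carat": [0.23, 0.21, 0.23],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in ["cut", "color"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Each column can now be decoded with its own encoder.
decoded_cut = encoders["cut"].inverse_transform(df["cut"])
print(df)
print(decoded_cut)
```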

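2. The factorize() function/method in pandas

pandas.factorize(values, sort=False, na_sentinel=-1, size_hint=None)

Encodes the object as an enumerated type or categorical variable. (na_sentinel was deprecated in pandas 1.5 and removed in 2.0 in favor of use_na_sentinel; missing values are still coded as -1 by default.)
Tested on the Kaggle diamonds dataset: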

codes, unique = pd.factorize(diamonds_data['cut'])
print(codes)
print(unique)

[0 1 2 ... 3 1 0]
Index(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype='object')

You can also set sort=True, which sorts unique first and then encodes:

codes, unique = pd.factorize(diamonds_data['cut'], sort=True)
print(codes)
print(unique)

[2 3 1 ... 4 3 2]
Index(['Fair', 'Good', 'Ideal', 'Premium', 'Very Good'], dtype='object')
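
One behavior worth checking is how factorize() treats missing values. The sketch below uses a made-up Series `s` (not the diamonds data) and relies on the default sentinel, under which a missing value is coded as -1 and left out of the uniques.

```python
import pandas as pd

# Toy series with one missing value (illustrative values only).
s = pd.Series(["Ideal", "Premium", None, "Ideal", "Good"])

# Default: codes follow the order of first appearance.
codes, uniques = pd.factorize(s)

# sort=True: uniques are sorted first, then the codes are assigned.
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)

print(codes, list(uniques))
print(codes_sorted, list(uniques_sorted))
```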

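Summary:
Label encoding only converts text to numbers; it does not solve the underlying problem with categorical text features. Once every label is a number, a model treats numerically close codes as similar, regardless of what the labels actually mean. Data encoded this way is best suited to models that support categorical features natively, such as LightGBM.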

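One-Hot Encoding

Pros and cons of one-hot encoding:
Advantages: one-hot encoding solves the problem that classifiers handle categorical attributes poorly, and to some extent it also expands the feature set. Its values are only 0 and 1, and each category is stored in its own orthogonal dimension.
Disadvantages: when the number of categories is large, the feature space becomes very large. In that case PCA (principal component analysis) can usually be used to reduce the dimensionality, and the combination One-Hot Encoding + PCA is also very useful in practice.

See the article linked at the top for more details.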
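
The One-Hot Encoding + PCA combination mentioned above can be sketched roughly as follows. The data is synthetic (20 made-up category names), and n_components=5 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Synthetic categorical feature with 20 distinct values (illustrative only).
X = np.array([f"cat_{i % 20}" for i in range(200)]).reshape(-1, 1)

# One-hot encode: one 0/1 column per distinct category.
# The default result is sparse, so convert it to a dense array for PCA.
onehot = OneHotEncoder().fit_transform(X).toarray()

# Project the wide 0/1 matrix onto a few principal components.
pca = PCA(n_components=5)
reduced = pca.fit_transform(onehot)

print(onehot.shape, "->", reduced.shape)
```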

One-hot encoding with scikit-learn:
There are two ways:

LabelBinarizer
OneHotEncoder

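1. LabelBinarizer:

Converts the data to binary indicator form, similar to one-hot encoding, with a few differences:

It can handle both numeric and categorical data.
The input must be a 1D array.
You can choose how the positive and negative classes are represented (the pos_label and neg_label parameters).

Example:

from sklearn.preprocessing import LabelBinarizer

list_to_encode = ['cut', 'color', 'clarity']
lb = LabelBinarizer()
for col in list_to_encode:
    print(col + ":")
    print(lb.fit_transform(diamonds_data[col]))  # one indicator column per class
    print(lb.classes_)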

cut:
[[0 0 1 0 0]
 [0 0 0 1 0]
 [0 1 0 0 0]
 ...
 [0 0 0 0 1]
 [0 0 0 1 0]
 [0 0 1 0 0]]
['Fair' 'Good' 'Ideal' 'Premium' 'Very Good']
color:
[[0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [1 0 0 ... 0 0 0]]
['D' 'E' 'F' 'G' 'H' 'I' 'J']
clarity:
[[0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
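
The point about choosing the positive and negative representation can be sketched with the neg_label and pos_label parameters of LabelBinarizer. The labels below are made-up values; the last call also shows that a two-class input yields a single column rather than two.

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["Good", "Ideal", "Fair", "Good"]  # illustrative values only

# Default: 0 marks the negative class, 1 the positive class.
default_out = LabelBinarizer().fit_transform(labels)

# Custom representation: -1 / +1 instead of 0 / 1.
pm_out = LabelBinarizer(neg_label=-1, pos_label=1).fit_transform(labels)

# With only two classes the result is a single column.
binary_out = LabelBinarizer().fit_transform(["yes", "no", "yes"])

print(default_out)
print(pm_out)
print(binary_out)
```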

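2. OneHotEncoder:

In scikit-learn versions before 0.20, OneHotEncoder could only process integer data, so text had to be converted to numbers (label encoding) first; from 0.20 on it accepts string categories directly.
It only accepts 2D arrays.

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

Parameters:
sparse: bool, default=True. If True, a sparse matrix is returned; otherwise a dense array is returned. (scikit-learn 1.2 renamed this parameter to sparse_output, and sparse was removed in 1.4.)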
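
The difference between the sparse and dense return types can be seen without touching the parameter at all: with default settings the encoder returns a SciPy sparse matrix, and toarray() converts it to a NumPy array. A small sketch with made-up values:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# 2D input with a single column (illustrative values only).
X = np.array([["Ideal"], ["Good"], ["Ideal"]])

encoded = OneHotEncoder().fit_transform(X)  # default: SciPy sparse matrix
dense = encoded.toarray()                   # dense NumPy array

print(sparse.issparse(encoded))
print(dense)
```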

Example:
(A function that one-hot encodes the categorical variables in the data with OneHotEncoder and returns the whole dataset.)

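import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

######################################## OneHotEncoder ##########################
# One-hot encode the categorical variables of the diamonds dataset and return the whole dataset
# Notes: 1. LabelEncoder is applied first (required by scikit-learn < 0.20; it also prints the class order).
#        2. OneHotEncoder only accepts 2D arrays.
def LabelOneHotEncoder(data, list_to_encode):
    data_num = np.array([])
    for col in data.columns:
        if col in list_to_encode:
            le = LabelEncoder()
            data[col] = le.fit_transform(data[col])  # convert the categorical column to integer labels (modifies data in place)
            print(col, ':', le.classes_)
            # Encode the current column; the original passed diamonds_data['cut'] here for every column.
            # sparse_output was named sparse before scikit-learn 1.2.
            encoder_col = OneHotEncoder(sparse_output=False).fit_transform(np.array(data[col])[:, np.newaxis])
            if len(data_num) == 0:
                data_num = encoder_col
            else:
                data_num = np.hstack([data_num, encoder_col])
        else:
            # Numeric column: append it unchanged as a single 2D column
            if len(data_num) == 0:
                data_num = np.array(data[col])[:, np.newaxis]
            else:
                data_num = np.hstack([data_num, np.array(data[col])[:, np.newaxis]])
        print(data_num.shape)
    return data_num


# Test on the diamonds dataset
data_transf = LabelOneHotEncoder(diamonds_data, ['cut', 'color', 'clarity'])
print(data_transf)
print(data_transf.shape)

Output recorded with the original version of the function, which one-hot encoded diamonds_data['cut'] for every categorical column:

(53940, 1)
cut : ['Fair' 'Good' 'Ideal' 'Premium' 'Very Good']
(53940, 6)
color : ['D' 'E' 'F' 'G' 'H' 'I' 'J']
(53940, 11)
clarity : ['I1' 'IF' 'SI1' 'SI2' 'VS1' 'VS2' 'VVS1' 'VVS2']
(53940, 16)
(53940, 17)
(53940, 18)
(53940, 19)
(53940, 20)
(53940, 21)
(53940, 22)
[[0.23 0.   0.   ... 3.95 3.98 2.43]
 [0.21 0.   0.   ... 3.89 3.84 2.31]
 [0.23 0.   1.   ... 4.05 4.07 2.31]
 ...
 [0.7  0.   0.   ... 5.66 5.68 3.56]
 [0.86 0.   0.   ... 6.15 6.12 3.74]
 [0.75 0.   0.   ... 5.83 5.87 3.64]]
(53940, 22)

Note: this result looks a little odd. For example, color has 7 classes, yet data_num gained only 5 columns. That is the effect of the hard-coded 'cut' column: every categorical column contributed the 5 cut indicators. With data[col], as in the listing above, the shape grows to (53940, 6), (53940, 13), and (53940, 21) after cut, color, and clarity, and the final shape is (53940, 27).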
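
As an alternative to stacking the columns by hand, scikit-learn's ColumnTransformer can one-hot encode selected columns and pass the rest through. This is a sketch under assumptions, not the author's method: the frame `df` is made-up data, and sparse_threshold=0 is set only to force a dense result. Note that the encoded columns come first in the output, followed by the passed-through columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for diamonds_data (illustrative values only).
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "I"],
})

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["cut", "color"])],
    remainder="passthrough",  # keep the other columns unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(df)

# 3 cut indicators + 2 color indicators + the carat column
print(encoded.shape)
print(encoded)
```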

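OrdinalEncoder

This is the simplest approach: a feature with m categories is mapped to the integers in [0, m-1].
Ordinal encoding is best suited to ordinal features, i.e. features whose values have an inherent order. For a category such as "education", the values "bachelor", "master", and "doctorate" can naturally be encoded as 0, 1, 2, because they carry that logical order. For a category such as "color", however, encoding "blue", "green", and "red" as 0, 1, 2 is not reasonable: the encoding puts "blue" and "red" twice as far apart as "blue" and "green", and there is no reason to believe those two gaps affect the feature differently.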

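from sklearn.preprocessing import OrdinalEncoder

orec = OrdinalEncoder()  # by default, the categories of each column are sorted
orec.fit(diamonds_data[['cut', 'color', 'clarity']])
print(orec.transform(diamonds_data[['cut', 'color', 'clarity']]))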

[[2. 1. 3.]
 [3. 1. 2.]
 [1. 1. 4.]
 ...
 [4. 0. 2.]
 [3. 4. 3.]
 [2. 0. 3.]]
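
By default the codes follow the sorted order of the category names, which for 'cut' is alphabetical rather than a quality order. If the real order matters, it can be passed through the categories parameter. The sketch below uses made-up rows and assumes the quality order Fair < Good < Very Good < Premium < Ideal; that ordering is an assumption about the dataset, not something stated in this post.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column standing in for diamonds_data[['cut']] (illustrative values only).
df = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Very Good", "Fair"]})

# Default: categories are sorted alphabetically.
default_codes = OrdinalEncoder().fit_transform(df)

# Explicit order (assumed quality ranking, worst to best).
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
ordered_codes = OrdinalEncoder(categories=[cut_order]).fit_transform(df)

print(default_codes.ravel())
print(ordered_codes.ravel())
```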