Why use one-hot encoding after LabelEncoder?

fire2fire2

Category column: python理解 (Understanding Python). Article tags: python, big data

Article link: https://blog.csdn.net/qq_41973062/article/details/116330846

1. Explanation from the official documentation

2. A more intuitive explanation in terms of distance

References:

6.3. Preprocessing data — scikit-learn 0.24.2 documentation

为什么要用one-hot编码 (Why use one-hot encoding) - 简书 (jianshu.com)

1. Explanation from the official documentation

6.3. Preprocessing data — scikit-learn 0.24.2 documentation

Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).

Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.
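The two encodings described in the quoted passage can be compared side by side. The following is a minimal sketch, not taken from the original post: the `browsers` list is made-up sample data, and it assumes a scikit-learn version (0.20 or later) in which `OneHotEncoder` accepts string categories directly.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample data: one categorical feature with three possible values.
browsers = ["firefox", "chrome", "safari", "chrome"]

# LabelEncoder maps each category to an integer.
# Classes are sorted alphabetically: chrome=0, firefox=1, safari=2.
label_encoder = LabelEncoder()
int_codes = label_encoder.fit_transform(browsers)
print(label_encoder.classes_)
print(int_codes)

# OneHotEncoder expects a 2-D array of shape (n_samples, n_features),
# so the single feature is reshaped into one column.
onehot_encoder = OneHotEncoder()
column = np.array(browsers).reshape(-1, 1)
onehot = onehot_encoder.fit_transform(column).toarray()  # sparse matrix -> dense array
print(onehot)
```

The integer codes imply an order (chrome < firefox < safari) that has no meaning for browsers, while each one-hot row contains exactly one 1 and carries no such ordering.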

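2. A more intuitive explanation in terms of distance

One-hot encoding a discrete feature makes distance calculations between its values more reasonable.

For example, suppose a discrete feature represents job type and takes three possible values. Without one-hot encoding, the three values are represented as x_1 = (1), x_2 = (2), x_3 = (3).

The distances between pairs of jobs are then d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean jobs x_1 and x_3 are less similar to each other than the other pairs are? Clearly not: the distances computed from this representation are an artifact of the arbitrary numbering and are not reasonable.

With one-hot encoding we instead get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), so the Euclidean distance between any two jobs is sqrt(2). Every pair of jobs is equally far apart, which is more reasonable.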
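The distance argument can be checked numerically. This is a small sketch using only the Python standard library (`math.dist` requires Python 3.8 or later); the helper name `pairwise_distances` and the dictionary names are illustrative and not from the original post.

```python
import math
from itertools import combinations

# The three job types under each representation.
integer_codes = {"x_1": (1,), "x_2": (2,), "x_3": (3,)}
one_hot_codes = {"x_1": (1, 0, 0), "x_2": (0, 1, 0), "x_3": (0, 0, 1)}

def pairwise_distances(points):
    """Return the Euclidean distance for every unordered pair of points."""
    return {
        (a, b): math.dist(points[a], points[b])
        for a, b in combinations(points, 2)
    }

integer_distances = pairwise_distances(integer_codes)
one_hot_distances = pairwise_distances(one_hot_codes)

for pair, dist in integer_distances.items():
    print("integer", pair, dist)
for pair, dist in one_hot_distances.items():
    print("one-hot", pair, round(dist, 4))
```

With integer codes the pair (x_1, x_3) is twice as far apart as the other two pairs, whereas with one-hot codes all three pairs are at the same distance.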
