How should we handle discrete data as input to a neural network?

PilviMannis

Categories: Data Processing, Notes. Tags: machine learning, deep learning, neural networks

Article link: https://blog.csdn.net/circleyuanquan/article/details/111469615

Normalizing discrete data

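Discrete features are usually handled with one-hot encoding. The reasons are summarized below:

One-hot encoding extends the values of a discrete feature into Euclidean space, so that each value of the feature corresponds to a point in that space.
Discrete features are mapped into Euclidean space because computing distances or similarities between features is central to regression, clustering, classification, and other machine learning algorithms, and the distance and similarity measures we commonly use, cosine similarity included, are defined in Euclidean space.
One-hot encoding makes the distances between feature values more reasonable. For example, take a discrete feature representing job type with three possible values. Without one-hot encoding, the values are represented as x_1 = (1), x_2 = (2), x_3 = (3), and the distances between jobs are d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean jobs x_1 and x_3 are less similar than the other pairs? Clearly, distances computed from this representation are not reasonable. With one-hot encoding we get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2). Every pair of jobs is equally far apart, which is more reasonable.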
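
The remark about cosine similarity can be checked in a few lines. This is only an illustrative sketch: the helper cosine_similarity and the variable names are assumptions made for the example, not part of the original post. It suggests that integer codes, treated as one-dimensional vectors, all look identical under cosine similarity, while one-hot vectors of different categories are orthogonal.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Integer codes for the three job types, as 1-dimensional vectors.
int_codes = [[1.0], [2.0], [3.0]]
# One-hot codes for the same three job types.
one_hot_codes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Job type 1 vs job type 3 under each encoding.
int_sim = cosine_similarity(int_codes[0], int_codes[2])
one_hot_sim = cosine_similarity(one_hot_codes[0], one_hot_codes[2])
# A category compared with itself under one-hot encoding.
same_sim = cosine_similarity(one_hot_codes[0], one_hot_codes[0])

print(int_sim, one_hot_sim, same_sim)
```

With integer codes every pair of positive values points in the same direction, so the similarity carries no information about the category; with one-hot codes, equal categories and different categories are cleanly separated.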

=========================================================

a) Binarize categorical/discrete features: Represent every categorical feature as multiple boolean features. For example, instead of having one feature called marriage_status, have 3 boolean features, marriage_status_single, marriage_status_married, and marriage_status_divorced, and set each of them to 1 or 0 as appropriate (1 or -1 also works). As you can see, for every categorical feature you add k binary features, where k is the number of values that the categorical feature takes.
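
As a rough sketch of the binarization step, the function below turns one categorical value into k 0/1 features. The function name binarize, the output column naming scheme, and the category list are assumptions chosen for illustration.

```python
def binarize(feature_name, value, categories):
    """Return k boolean features (as 0/1), one per possible category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # Exactly one feature is 1: the one matching the observed value.
    return {f"{feature_name}_{c}": int(c == value) for c in categories}

# Hypothetical category list with k = 3 values.
categories = ["single", "married", "divorced"]
encoded = binarize("marriage_status", "married", categories)
print(encoded)
```

The dictionary always has k entries, and exactly one of them is set, regardless of which value was observed.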

1、Why do we binarize categorical features?

We binarize categorical inputs so that they can be thought of as vectors in Euclidean space (we call this embedding the vector in Euclidean space).

2、Why do we embed the feature vectors in the Euclidean space?

Because many algorithms for classification, regression, clustering, etc. require computing distances or similarities between feature vectors, and many definitions of distance and similarity are defined over features in Euclidean space. So we would like our features to lie in Euclidean space as well.

3、Why does embedding the feature vector in Euclidean space require us to binarize categorical features?

Let us take an example of a dataset with just one feature (say job_type as per your example) and let us say it takes three values 1,2,3.

Now, let us take three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What is the Euclidean distance between x_1 and x_2, x_2 and x_3, and x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This says that the distance between job type 1 and job type 2 is smaller than the distance between job type 1 and job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? In many cases of categorical features, we cannot properly define a distance between the different values that the feature takes. In such cases, isn't it fair to assume that all values of the categorical feature are equally far away from each other?

Now, let us see what happens when we binarize the same feature vectors. Then x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). What are the distances between them now? They are all sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical feature are equally far away from each other.
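
The distances in this example can be reproduced with a short sketch. The helper euclidean and the dictionary names are assumptions made for the illustration; the vectors are the ones given in the text.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The three job types as integer codes and as one-hot vectors.
int_codes = {"x_1": [1], "x_2": [2], "x_3": [3]}
one_hot_codes = {"x_1": [1, 0, 0], "x_2": [0, 1, 0], "x_3": [0, 0, 1]}

pairs = [("x_1", "x_2"), ("x_2", "x_3"), ("x_1", "x_3")]
int_dist = {p: euclidean(int_codes[p[0]], int_codes[p[1]]) for p in pairs}
one_hot_dist = {p: euclidean(one_hot_codes[p[0]], one_hot_codes[p[1]]) for p in pairs}

for p in pairs:
    print(p, int_dist[p], one_hot_dist[p])
```

Under the integer codes the pair (x_1, x_3) is twice as far apart as the other two pairs; under the one-hot codes all three distances coincide.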

4、What about the original question?

Note that our reason for binarizing categorical features is independent of the number of values the feature takes. So yes, even if the categorical feature takes 1000 values, we would still prefer to binarize it.
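
To illustrate that the argument does not depend on the number of values, the sketch below spot-checks a feature with k = 1000 categories. The helpers one_hot and euclidean, the fixed random seed, and the choice of 200 sampled pairs are assumptions made to keep the example small.

```python
import math
import random

def one_hot(index, k):
    """One-hot vector of length k with a 1 at position index."""
    vec = [0.0] * k
    vec[index] = 1.0
    return vec

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

k = 1000
rng = random.Random(0)  # fixed seed so the sampled pairs are reproducible
# Sample 200 pairs of distinct category indices.
pairs = [tuple(rng.sample(range(k), 2)) for _ in range(200)]
distances = [euclidean(one_hot(i, k), one_hot(j, k)) for i, j in pairs]

# Every sampled pair of distinct categories is sqrt(2) apart.
all_equal = all(math.isclose(d, math.sqrt(2)) for d in distances)
print(all_equal)
```

Two distinct one-hot vectors differ in exactly two coordinates whatever k is, which is why the distance stays at sqrt(2).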

5、Are there cases when we can avoid doing binarization?

Yes. As we figured out earlier, we binarize because we want a meaningful distance relationship between the different values. As long as the raw values already have a meaningful distance relationship, we can avoid binarizing the categorical feature. For example, if you are building a classifier to decide whether a webpage is an important entity page (a page important to a particular entity) or not, and you use the rank of the webpage in the search results for that entity as a feature, then

1] note that the rank feature is categorical (more precisely, ordinal),

2] rank 1 and rank 2 are clearly closer to each other than rank 1 and rank 3, so the rank feature defines a meaningful distance relationship. In this case, we don't have to binarize the categorical rank feature.
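
The rank example can be sketched as follows. The helper euclidean and the variable names are assumptions; the point is only to compare what each encoding does to the ordering of ranks 1, 2, and 3.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ranks = [1, 2, 3]

# Option A: keep the rank as a single numeric feature.
as_number = [[float(r)] for r in ranks]
# Option B: binarize the rank as if it were an unordered category.
as_one_hot = [[1.0 if r == c else 0.0 for c in ranks] for r in ranks]

d12_num = euclidean(as_number[0], as_number[1])
d13_num = euclidean(as_number[0], as_number[2])
d12_hot = euclidean(as_one_hot[0], as_one_hot[1])
d13_hot = euclidean(as_one_hot[0], as_one_hot[2])

print(d12_num, d13_num)  # numeric encoding keeps rank 1 closer to rank 2
print(d12_hot, d13_hot)  # one-hot encoding makes both pairs equally far apart
```

Keeping the rank as a number preserves the ordering information that binarization would throw away.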

More generally, if you can cluster the categorical values into disjoint subsets such that the values within each subset have a meaningful distance relationship, then you don't have to binarize fully. Instead, you can split the feature only over these clusters. For example, if there is a categorical feature with 1000 values, but you can split these 1000 values into 2 groups of (say) 400 and 600, and within each group the values have a meaningful distance relationship, then instead of fully binarizing you can just add 2 features, one for each cluster, and that should be fine.
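
The text does not spell out what each per-cluster feature should contain, so the following is only one possible reading, stated as an assumption: each cluster gets one numeric feature holding the value's 1-based position inside that cluster, and 0 when the value belongs to another cluster. The function name encode_grouped and the toy groups (standing in for the 400/600 split) are hypothetical.

```python
def encode_grouped(value, groups):
    """One numeric feature per group (assumed scheme, see lead-in).

    Each feature is the 1-based position of the value inside its group,
    or 0.0 if the value does not belong to that group.
    """
    if not any(value in group for group in groups):
        raise ValueError(f"value {value!r} is not in any group")
    features = []
    for group in groups:
        if value in group:
            features.append(float(group.index(value) + 1))
        else:
            features.append(0.0)
    return features

# Two small, internally ordered groups as a toy stand-in for 400 + 600 values.
groups = [
    ["xs", "s", "m", "l"],         # hypothetical ordered cluster 1
    ["bronze", "silver", "gold"],  # hypothetical ordered cluster 2
]

print(encode_grouped("m", groups))
print(encode_grouped("gold", groups))
```

This yields 2 features instead of 7 for the toy data, at the cost of assuming that the order inside each group is meaningful.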

It depends on your ML algorithm. Some methods require almost no effort to normalize features and can handle both continuous and discrete features, such as tree-based methods: C4.5, CART, random forest, bagging, or boosting. But most parametric models (generalized linear models, neural networks, SVMs, etc.) and methods that use distance metrics (KNN, kernels, etc.) require careful preprocessing to achieve good results. Standard approaches include binarizing all categorical features, standardizing all continuous features to zero mean and unit variance, etc.
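
The two standard approaches named above can be combined in a small preprocessing sketch. The function names standardize and one_hot, and the toy columns ages and job_types, are assumptions made for illustration; the population standard deviation is used here as one common convention.

```python
import statistics

def standardize(values):
    """Rescale a continuous column to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("constant column cannot be standardized")
    return [(v - mean) / std for v in values]

def one_hot(value, categories):
    """Binarize one categorical value against a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

# Hypothetical toy data: one continuous and one categorical column.
ages = [20.0, 30.0, 40.0]
job_types = ["a", "b", "a"]

categories = sorted(set(job_types))
scaled = standardize(ages)
# Each row: standardized continuous value followed by the one-hot features.
rows = [[s] + one_hot(j, categories) for s, j in zip(scaled, job_types)]

for row in rows:
    print(row)
```

In practice the mean, standard deviation, and category list would be computed on the training data only and then reused for new inputs.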
