What are good ways to handle discrete and continuous inputs together?

asasasaababab

Category: Study Notes; Tags: ML

Rescale bounded continuous features: Rescale every continuous input that is bounded to [-1, 1] through x = (2x - max - min)/(max - min).
Standardize all continuous features: Every continuous input should be standardized. By this I mean: for every continuous feature, compute its mean (u) and standard deviation (s) and set x = (x - u)/s.
Binarize categorical/discrete features: Represent every categorical feature as multiple boolean features. For example, instead of having one feature called marriage_status, have 3 boolean features (married_status_single, married_status_married, married_status_divorced) and set each of them to 1 or -1 as appropriate. In other words, for every categorical feature you add k binary features, where k is the number of values that the categorical feature takes.

Now, you can represent all the features in a single vector which we can assume to be embedded in R^n and start using off-the-shelf packages for classification/regression etc.
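The three steps above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the helper names (rescale_bounded, standardize, binarize) and the toy data are hypothetical, and the bounded feature is simply rescaled here while the unbounded one is standardized.

```python
import numpy as np

def rescale_bounded(x, lo, hi):
    """Map values known to lie in [lo, hi] onto [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return (2.0 * x - hi - lo) / (hi - lo)

def standardize(x):
    """Subtract the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def binarize(values, categories):
    """One column per category: +1 where the value matches, -1 elsewhere."""
    values = np.asarray(values)
    categories = np.asarray(categories)
    return np.where(values[:, None] == categories[None, :], 1.0, -1.0)

# Hypothetical toy data: a bounded percentage, an unbounded income,
# and a categorical marriage status.
percent = [0.0, 50.0, 100.0, 25.0]
income = [30_000.0, 52_000.0, 75_000.0, 41_000.0]
status = ["single", "married", "divorced", "married"]

# Stack everything into one feature matrix: 1 + 1 + 3 = 5 columns.
X = np.column_stack([
    rescale_bounded(percent, lo=0.0, hi=100.0),
    standardize(income),
    binarize(status, ["single", "married", "divorced"]),
])
print(X.shape)  # (4, 5)
```

Each row of X is now a plain vector in R^5 that can be handed to any standard classifier or regressor.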

Addendum:

If you use kernel-based methods, you can avoid this explicit embedding in R^n and focus on designing custom kernels for your feature vectors. You can even split your kernel into multiple kernels and use multiple kernel learning (MKL) models to learn weights over them. In that case, make sure your kernel is positive semi-definite so that the solver doesn't run into problems. If you are unsure whether you can design custom kernels, just follow the earlier embedding approach.
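As a rough illustration of the kernel route, the sketch below combines a Gaussian kernel on the continuous part with a simple "same value or not" kernel on one categorical feature, then checks positive semi-definiteness numerically. The function names, the fixed weights and the toy data are all assumptions made for this example; an MKL method would learn the weights rather than fix them.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel on the continuous features."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def match_kernel(a, b):
    """Kernel on one categorical feature: 1 if the values are equal, else 0."""
    return (a[:, None] == b[None, :]).astype(float)

def combined_kernel(X_cont, x_cat, w_cont=0.5, w_cat=0.5):
    """Weighted sum of the two kernels; non-negative weights keep the sum PSD."""
    return w_cont * rbf_kernel(X_cont, X_cont) + w_cat * match_kernel(x_cat, x_cat)

# Hypothetical toy data: two continuous features and one job_type feature.
X_cont = np.array([[0.1, -0.4], [0.3, 0.9], [-0.7, 0.2], [0.0, 0.0]])
x_cat = np.array(["engineer", "teacher", "engineer", "nurse"])

K = combined_kernel(X_cont, x_cat)

# A symmetric matrix is PSD when its smallest eigenvalue is non-negative
# (up to floating-point rounding).
min_eig = np.linalg.eigvalsh(K).min()
print(min_eig >= -1e-10)
```

A Gram matrix like K can then be passed to a solver that accepts precomputed kernels, for example scikit-learn's SVC with kernel="precomputed".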

User-12798660346732021937 asked the following question in the comments:
I am interested in knowing .. would the answer still be same if our discrete input had a whole lot more possible values.
For example, instead of marriage_status, if we had an input variable called job_type. If job_type can take 100 values, it essentially means we create 100 variables out of those?
What if we had 1000 values?

I thought the answer to this question might benefit a larger audience, so I am adding it here.

Before I answer the above question, let us go through some basic ideas.

Why do we binarize categorical features?
We binarize categorical inputs so that they can be treated as vectors in Euclidean space (we call this embedding the vector in Euclidean space).

Why do we embed the feature vectors in the Euclidean space?
Because many algorithms for classification, regression, clustering etc. require computing distances or similarities between feature vectors, and many definitions of distance and similarity are given for vectors in Euclidean space. So we would like our features to lie in Euclidean space as well.

Why does embedding the feature vector in Euclidean space require us to binarize categorical features?
Let us take an example of a dataset with just one feature (say job_type as per your example) and let us say it takes three values 1,2,3.

Now, let us take three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What is the Euclidean distance between x_1 and x_2, x_2 and x_3, and x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This says that job type 1 is closer to job type 2 than it is to job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? For many categorical features, we cannot properly define a distance between the different values that the feature takes. In such cases, isn't it fair to assume that all values of the categorical feature are equally far away from each other?

Now, let us see what happens when we binarize the same feature vectors. Then x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). What are the distances between them now? They are all sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical feature are equally far away from each other.
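The two sets of distances can be checked numerically. The snippet below is a small sketch: pairwise_distances is a hypothetical helper written for this example, and the last matrix uses the 1/-1 coding mentioned earlier to show that it is also equidistant, just with a different constant.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# job_type stored as a single integer-valued feature.
integer_coded = np.array([[1.0], [2.0], [3.0]])
# The same three values after binarization with 0/1 ...
one_hot = np.eye(3)
# ... and with the 1/-1 convention.
plus_minus = 2.0 * np.eye(3) - 1.0

D_int = pairwise_distances(integer_coded)
D_hot = pairwise_distances(one_hot)
D_pm = pairwise_distances(plus_minus)

print(D_int)  # distances 1, 1 and 2: the values are not equidistant
print(D_hot)  # every off-diagonal entry is sqrt(2)
print(D_pm)   # every off-diagonal entry is 2 * sqrt(2)
```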

What about the original question?
Note that our reason for binarizing categorical features does not depend on the number of values the feature takes. So yes, even if the categorical feature takes 1000 values, we would still prefer to binarize it.

Are there cases when we can avoid doing binarization?
Yes. As we figured out earlier, we binarize because we want a meaningful distance relationship between the different values. As long as the raw values already have a meaningful distance relationship, we can avoid binarizing the categorical feature. For example, suppose you are building a classifier that decides whether a webpage is an important entity page (a page important to a particular entity), and one of your features is the rank of the webpage in the search results for that entity. Then (1) the rank feature is categorical, and (2) rank 1 and rank 2 are clearly closer to each other than rank 1 and rank 3. The rank feature therefore defines a meaningful distance relationship, and in this case we don't have to binarize it.

More generally, if you can cluster the categorical values into disjoint subsets such that the values within each subset have a meaningful distance relationship amongst them, then you don't have to binarize fully; you can split the feature only over these clusters. For example, if a categorical feature has 1000 values, but you can split them into 2 groups of (say) 400 and 600 values, and within each group the values have a meaningful distance relationship, then instead of fully binarizing you can just add 2 features, one for each cluster, and that should be fine.
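One possible reading of this cluster-based encoding is sketched below. It is an assumption rather than something the answer spells out: each group gets one feature that holds the value's 1-based position inside that group and is 0 when the value belongs to another group. The group contents and the encode_grouped helper are hypothetical.

```python
import numpy as np

# Hypothetical grouping of job_type values: within each group,
# the listed order is assumed to be meaningful.
groups = [
    ["eng_junior", "eng_senior", "eng_principal"],
    ["nurse_assistant", "nurse_registered"],
]

def encode_grouped(value, groups):
    """One feature per group: 1-based position inside the group, 0 otherwise."""
    features = np.zeros(len(groups))
    for g, members in enumerate(groups):
        if value in members:
            features[g] = members.index(value) + 1
            return features
    raise ValueError(f"unknown category: {value!r}")

samples = ["eng_junior", "eng_principal", "nurse_registered"]
encoded = np.array([encode_grouped(v, groups) for v in samples])
print(encoded)
# [[1. 0.]
#  [3. 0.]
#  [0. 2.]]
```

With this layout, five categorical values need two columns instead of five, at the cost of trusting the ordering inside each group.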
