Chapter 5: Categorical Variables: Counting Eggs in the Age of Robotic Chickens

Encoding categorical variables

Three encoding schemes, illustrated with examples: one-hot encoding, dummy coding, and effect coding.

  • one-hot encoding
    With k categories, each category's feature vector is k-dimensional, with a single 1 in the position of that category.

    In the example above, e1 + e2 + e3 = 1, so the three features are linearly dependent. Consequently, a linear model fit on one-hot features is not unique, and the meaning of each fitted coefficient is hard to interpret.
  • dummy coding
    With k categories, each feature vector is (k-1)-dimensional. One category is assigned the all-zeros vector (0, 0, …) and serves as the reference category.

    Compared with one-hot encoding, the linear model fit with dummy coding is unique and easier to interpret.
  • effect coding
    The only difference from dummy coding is that the reference category is encoded as (-1, -1, …) instead of all zeros. For example:

  • Pros and cons of the three coding schemes
    1) What the fitted linear model means under each coding scheme:
    Given rent data for three cities, we want to estimate the rent in each city. The data is as follows:

    Encode the city variable with each of the three schemes, then fit a linear regression on the engineered data to predict Rent (a code sketch follows at the end of this subsection):
    one-hot:
    With one-hot encoded data, the fitted model's bias (intercept) represents the global mean Rent, and the coefficient on each feature represents the difference between that city's mean Rent and the global mean.
    The linear model obtained with one-hot encoding is not unique, which makes it hard to interpret.
    dummy:
    With dummy-coded data, the fitted model's bias represents the mean Rent of the reference category, and the coefficient on each feature represents the difference between that city's mean Rent and the reference category's mean.
    effect code:
    In effect coding, no single feature represents the reference category, so the effect of the reference category needs to be separately computed as the negative sum of the coefficients of all other categories.
    2) Pros and cons of the three coding schemes
    one-hot encoding:
    pros: handles missing values gracefully: if the category of a data point is missing, the corresponding feature values can simply be set to 0;
    cons: the features are linearly dependent, so the data admits multiple valid models and the resulting model is hard to interpret;
    dummy and effect coding:
    pros: both yield a unique model with good interpretability;
    cons: dummy coding cannot represent missing data cleanly, because the all-zeros vector (0, 0, …) already denotes the reference category; effect coding uses (-1, -1, …) for the reference category, which produces a dense matrix that is costlier to store and compute with.

All three coding schemes share the same downside: when the number of categories is very large, they all break down.
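As a rough illustration of the three schemes, here is a minimal sketch using pandas and scikit-learn. The city names and rent values are made up for demonstration; they are not the figures from the original table.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical rent data for three cities (illustrative values only).
    df = pd.DataFrame({
        "City": ["SF", "SF", "NYC", "NYC", "Seattle", "Seattle"],
        "Rent": [3999, 4001, 3499, 3501, 2499, 2501],
    })

    # One-hot encoding: k columns, one per category.
    one_hot = pd.get_dummies(df["City"])

    # Dummy coding: k-1 columns; the dropped category (all zeros) is the reference.
    dummy = pd.get_dummies(df["City"], drop_first=True)

    # Effect coding: like dummy coding, but the reference category's rows are all -1.
    effect = dummy.astype(int).copy()
    effect.loc[dummy.sum(axis=1) == 0, :] = -1

    for name, X in [("one-hot", one_hot), ("dummy", dummy), ("effect", effect)]:
        model = LinearRegression().fit(X, df["Rent"])
        coefs = dict(zip(X.columns, model.coef_.round(2)))
        print(name, "intercept:", round(model.intercept_, 2), "coefs:", coefs)

In this balanced toy example the one-hot fit lands on scikit-learn's minimum-norm least-squares solution, so its intercept should come out at the global mean Rent, while the dummy-coded intercept equals the reference city's mean, matching the interpretation above.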

Dealing with large categorical variables

  • feature hashing
    1) How feature hashing works
    The essence of feature hashing: each category is hashed into one of m bins, so the feature dimension is reduced from the number of categories to m (a code sketch appears at the end of this section).

    2) Two kinds of feature hashing
    type 1: ordinary feature hashing

    type 2: signed feature hashing

    Adding a sign component ensures that the inner products between hashed features are equal in expectation to those of the original features: the random ±1 signs make the contributions of colliding features cancel out in expectation, so the hashed inner product is an unbiased estimate of the original one.
    The value of the inner product after hashing is within O(1/√m) of the original inner product, so the size of the hash table m can be selected based on the acceptable error. In practice, picking the right m can take some trial and error.
    Feature hashing can be used for models that involve the inner product of feature vectors and coefficients, such as linear models and kernel methods.
    It has been demonstrated to be successful in the task of spam filtering (Weinberger et al., 2009).
    Note that hashed features are aggregates of the original features and are therefore no longer interpretable, but they are much cheaper to store and compute with.
  • bin counting
    1) Definition
    The idea of bin counting is deviously simple: rather than using the value of the categorical variable as the feature, instead use the conditional probability of the target under that value. In other words, instead of encoding the identity of the categorical value, we compute the association statistics between that value and the target that we wish to predict. For those familiar with naive Bayes classifiers, this statistic should ring a bell, because it is the conditional probability of the class under the assumption that all features are independent.
    An example follows (a code sketch also appears at the end of this section):

    Note that one-hot encoding produces a sparse matrix, whereas bin counting produces a dense one, as illustrated below:

    2) Handling rare categories
    Just like the rare words mentioned in the previous chapter, rare categories need special treatment. Two approaches are listed here:
    way 1: set a threshold for what counts as rare, put all rare categories into a single bin, and compute the statistics for that bin as if it were one category. This adds an extra feature, is_rare_category, indicating whether a data point's category is rare.

    way 2: count-min sketch
    In this approach, every category is mapped through multiple hash functions. Suppose there are k hash functions, each with a table of size m; then k*m should be smaller than the total number of categories, so the representation stays compact.
    A category is represented through all of these hash functions. Compared with using a single hash function, this lowers the probability that collisions corrupt a category's count, while still allowing categories to share bins.
    As shown in the figure below: h_i is a hash function, i_t is an item, and c_t is a cell in a hash table. In a count-min sketch, when an item's count increases by 1, the value in each of its k hashed cells is incremented by 1; to estimate an item's count, take the minimum over those cells (a code sketch appears at the end of this section);

    3) Guarding against data leakage
    Because bin counting builds features from the target, it can cause data leakage. Two ways to handle this are described below:
    way 1: separate the data in time, as shown in the figure:
    use an earlier batch of data points for counting, use the current data points for training (mapping categorical variables to the historical bin-counting statistics we just collected), and use future data points for testing. This fixes the problem of leakage but introduces the aforementioned delay (the input statistics, and therefore the model, will trail behind current data).

    way 2: add random noise
    A statistic is approximately leakage-proof if its distribution stays roughly the same with or without any one data point. In practice, adding small random noise with distribution Laplace(0, 1) is sufficient to cover up any potential leakage from a single data point. This idea can be combined with leave-one-out counting, in which the statistic for a data point is computed with that point's own target excluded from the counts (analogous in spirit to leave-one-out cross-validation), to form statistics on the current data (Zhang, 2015). A sketch of this combination appears at the end of this section.
    4) Handling counts that grow without bounds
    Take ad click prediction as an example: if the model keeps training on accumulated historical data, the raw counts grow without bound over time, the count features drift to ever larger values, and the decision boundary learned earlier may no longer be valid, so the model has to be retrained again and again.
    To deal with counts that increase without bounds, the following approaches can be used (see the last sketch at the end of this section):
    way 1: use normalized counts that are guaranteed to be bounded in a known interval. For instance, the estimated click-through probability is bounded between [0, 1].
    way 2: take the log transform; this does not impose a strict bound, but the rate of increase becomes very slow once the count is large.
    Note that neither approach prevents the input distribution itself from changing. In the ad click case, the counts reflect user preferences, which shift over time, so the distribution drifts; to keep up, the model must be retrained periodically, or trained with online learning so that it continuously tracks user preferences.
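As a rough illustration of signed feature hashing, here is a minimal sketch. Deriving the bin index and the sign from md5 digests is an assumption made for demonstration; scikit-learn's FeatureHasher (with alternate_sign=True) offers a production implementation of the same idea.

    import hashlib
    import numpy as np

    def signed_hash_features(categories, m=8):
        """Hash each category string into an m-dimensional vector with a +/-1 sign.

        The sign makes colliding categories cancel out in expectation, so inner
        products between hashed vectors approximate those of the originals.
        """
        X = np.zeros((len(categories), m))
        for row, cat in enumerate(categories):
            bin_digest = hashlib.md5(cat.encode("utf-8")).hexdigest()
            sign_digest = hashlib.md5((cat + "#sign").encode("utf-8")).hexdigest()
            idx = int(bin_digest, 16) % m                      # which of the m bins
            sign = 1 if int(sign_digest, 16) % 2 == 0 else -1  # pseudo-random +/-1
            X[row, idx] += sign
        return X

    # Example: 20 distinct device IDs compressed into 8 hashed features.
    devices = ["device_%d" % i for i in range(20)]
    print(signed_hash_features(devices, m=8))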
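Bin counting itself can be sketched with a groupby; the click log below is made up, and the column names (device_id, clicked) are assumptions for illustration.

    import pandas as pd

    # Hypothetical click log: categorical variable device_id, binary target clicked.
    log = pd.DataFrame({
        "device_id": ["a", "a", "a", "b", "b", "c"],
        "clicked":   [1,   0,   1,   0,   0,   1],
    })

    # Bin counting: replace the category with statistics of the target under it.
    stats = log.groupby("device_id")["clicked"].agg(n_clicks="sum", n_total="count")
    stats["p_click"] = stats["n_clicks"] / stats["n_total"]

    # The categorical column becomes a single dense numeric feature.
    log["device_p_click"] = log["device_id"].map(stats["p_click"])
    print(log)

The p_click column is also an example of the normalized, bounded count mentioned in item 4: it stays inside [0, 1] no matter how large the raw counts grow.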
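A minimal count-min sketch, assuming k hash functions derived from md5 with per-row salts; this is a generic textbook implementation of the structure described in item 2, not code from the chapter.

    import hashlib
    import numpy as np

    class CountMinSketch:
        """Approximate counter: k rows of m cells; a query returns the minimum cell."""

        def __init__(self, k=4, m=50):
            self.k, self.m = k, m
            self.table = np.zeros((k, m), dtype=np.int64)

        def _positions(self, item):
            # One cell per hash function, derived from md5 with a per-row salt.
            for row in range(self.k):
                digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
                yield row, int(digest, 16) % self.m

        def add(self, item):
            for row, col in self._positions(item):
                self.table[row, col] += 1

        def count(self, item):
            # Collisions can only inflate a cell, so the minimum is the tightest estimate.
            return min(self.table[row, col] for row, col in self._positions(item))

    cms = CountMinSketch(k=4, m=50)
    for cat in ["a", "a", "b", "a", "c"]:
        cms.add(cat)
    print(cms.count("a"), cms.count("b"), cms.count("z"))  # expected roughly 3, 1, 0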
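The leave-one-out counts plus Laplace noise from item 3 can be sketched as follows; the helper name and the click-log columns are hypothetical, and this is only an illustration of the scheme, not Zhang's exact formulation.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    def leave_one_out_rate(df, cat_col, target_col, noise_scale=1.0):
        """Per-row target rate of the row's category, excluding the row itself,
        plus Laplace(0, noise_scale) noise to mask any single data point."""
        grp = df.groupby(cat_col)[target_col]
        totals = grp.transform("sum")
        counts = grp.transform("count")
        loo = (totals - df[target_col]) / (counts - 1).clip(lower=1)
        return loo + rng.laplace(0.0, noise_scale, size=len(df))

    df = pd.DataFrame({"city": ["SF", "SF", "SF", "NYC", "NYC"],
                       "clicked": [1, 0, 1, 0, 1]})
    df["city_rate_loo"] = leave_one_out_rate(df, "city", "clicked")
    print(df)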
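Finally, a short sketch of the two bounding tricks from item 4, using made-up raw counts:

    import numpy as np

    clicks = np.array([3, 120, 4500])
    impressions = np.array([10, 1000, 90000])

    ctr = clicks / impressions     # normalized count, always bounded in [0, 1]
    log_counts = np.log1p(clicks)  # log transform: not strictly bounded, but grows very slowly
    print(ctr, log_counts)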

Summary: pros and cons of the coding schemes discussed above




