Study notes for Feature Engineering

1. Introduction

  • A feature is an individual measurable property or attribute of a phenomenon being observed. Choosing discriminating and independent features is key to the success of any pattern recognition algorithm at classification. Note that the concept of a "feature" is essentially the same as the concept of an explanatory variable used in statistics.
  • Feature selection is the process of selecting a subset of relevant features for use in model construction, and of throwing away irrelevant features.
  • Feature extraction transforms (or combines) the original input data (or features) into a new (reduced) feature space in which the pattern recognition problem will be easier to solve. For example, time-series inputs are usually transformed from the time domain to the frequency domain by applying a Fourier transform. Hence, feature extraction is often domain-specific.
  • Shallow data refers to data where the number of features (n) is much greater than the number of examples (m), i.e., n >> m; skinny data refers to data where m >> n; big data refers to data where both m and n are large.
  • A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.
  • Implementations are available in the scikit-learn Python package.
  • Study notes: the main objective of feature engineering (both selection and extraction) is to select or combine relevant features that better represent the raw input data, in order to reduce the dimensionality of the input (i.e., the original feature space) and hence let machine learning algorithms learn the data pattern more easily. Do remember that relevance does not imply usefulness, and usefulness does not imply relevance: a relevant feature may not be useful in practice (e.g., because it is redundant given the already selected features), while a feature that appears irrelevant on its own may be quite useful in combination with others.
  • Resources: JMLR Special Issue on Variable and Feature Selection.
  • Example. Which features should be used for classifying a student as a good or bad one? The available features are: marks, height, sex, weight, IQ. 
    • Feature selection would select marks and IQ, and discard height, weight and sex.
    • Feature extraction would build a new feature such as marks + IQ², i.e., a combination of two basic features (see the sketch below).
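
To make the distinction concrete, here is a minimal scikit-learn sketch (the toy data, the column meanings, and the choice of SelectKBest/PCA are illustrative assumptions, not part of the notes above): selection keeps a subset of the original columns, while extraction builds new columns by combining them.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))              # toy columns: marks, height, sex, weight, IQ
    y = (X[:, 0] + X[:, 4] > 0).astype(int)    # label driven only by marks and IQ

    X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)   # selection: 2 original columns kept
    X_ext = PCA(n_components=2).fit_transform(X)              # extraction: 2 new combined columns
    print(X_sel.shape, X_ext.shape)            # both (100, 2), but with different meanings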

2. How good are my features?

It is important for us to assess or estimate the usefulness of a specific feature or feature subset. This can be done from the following points of view:

  1. Classical statistics viewpoint: perform a statistical test. => This captures "relevance" (independent of the specific problem) and is suitable for the "Filter" methods.
    • Assess the "statistical significance" of the relevance of given features. 
    • For a training set, let the ranking index be a random variable R. 
    • A feature is probably approximately irrelevant iff P(R > ε) ≤ δ,
      where ε and δ are the level of approximation and the risk of being wrong, respectively. Both values are positive.
    • Using the null hypothesis "the feature is irrelevant", we then apply statistical tests such as the Z-test, T-test, or ANOVA test to compute the significance, if the probability P(X) or P(Y|X) can be obtained. Otherwise, ranking indices such as the AUC or the Wilcoxon-Mann-Whitney statistic (i.e., based on the classification accuracy rate, etc.) may be used instead to derive the significance. (A small sketch of both viewpoints appears after this list.)
  2. Machine learning viewpoint: use a training set and a validation set. => This captures "usefulness" (dependent on the given problem) and is suitable for the "Wrapper" methods.
    • In general, this is realized by cross-validation.
    • Leave-one-out cross-validation has high variance.
    • Often, 10-fold cross-validation is a good choice.
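
A minimal sketch of both viewpoints on synthetic data (the data, the ANOVA F-test, and logistic regression are assumptions chosen for illustration): the F-test p-values address the null hypothesis "this feature is irrelevant", while 10-fold cross-validation scores the usefulness of a feature subset for a given model.

    import numpy as np
    from sklearn.feature_selection import f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] > 0).astype(int)              # only feature 0 is relevant

    # Statistical viewpoint: per-feature F statistics and p-values.
    F, p = f_classif(X, y)
    print(p)                                   # a very small p-value is expected only for feature 0

    # Machine-learning viewpoint: 10-fold CV accuracy with vs. without feature 0.
    clf = LogisticRegression()
    print(cross_val_score(clf, X, y, cv=10).mean())          # all features
    print(cross_val_score(clf, X[:, 1:], y, cv=10).mean())   # relevant feature removed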

3. Benefits of Feature Selection

The central assumption is that the data contain many redundant or irrelevant features. Redundant features are those which provide no more information than the currently selected features (e.g., highly correlated features). Irrelevant features provide no useful information in any context. Noisy features are those whose signal-to-noise ratio is too low to be useful for discriminating the outcome.

Why do feature selection? Consider two cases:

  • Case 1: We are interested in the features themselves; we want to know which are relevant; we don't necessarily want to do prediction. If we fit a model, it should be interpretable, e.g., which features (reasons) cause lung cancer? Here, feature selection is needed.
  • Case 2: We are interested in prediction; the features are not interesting in themselves, we just want to build a good classifier (or another kind of predictor). If the only concern is accuracy and the whole data set can be processed, feature selection is not needed (as long as there is regularization). If computational complexity is critical (e.g., embedded devices, web-scale data, a fancy learning algorithm), consider using feature selection.

What are the benefits of feature selection?

  • Alleviate the curse of dimensionality. As the number of features increases, the volume of the feature space grows exponentially, and data becomes increasingly sparse in that space.
  • Improved model interpretability. One motivation is to find a simple, "parsimonious" model. Occam's razor: the simplest explanation that accounts for the data is the best.
  • Shorter training times. Training with all features may be too expensive. In practice, it is not unusual to end up with more than 10^6 features.
  • Enhanced generalization by reducing overfitting due to many features (and few examples). The presence of irrelevant features hurts generalization (see the sketch after this list). Two morals:
    1. Moral 1: In the presence of many irrelevant features, we might just fit noise.
    2. Moral 2: Training error can lead us astray. 
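
A small sketch of the two morals on purely synthetic data (the random data and the decision tree are assumptions for illustration): with many irrelevant features and few examples, a flexible model can reach near-perfect training accuracy on labels that are pure noise, while cross-validated accuracy stays near chance.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 500))             # 50 examples, 500 irrelevant features
    y = rng.integers(0, 2, size=50)            # random labels: there is nothing to learn

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print("training accuracy:", tree.score(X, y))                              # ~1.0 (fitting noise)
    print("10-fold CV accuracy:", cross_val_score(tree, X, y, cv=10).mean())   # ~0.5 (chance level)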

4. Feature Subset - Search and Evaluation

  • Filter methods use a proxy measure instead of the error rate to score a feature subset; features are selected before the machine learning algorithm is run. This measure is chosen to be fast to compute, whilst still capturing the usefulness of the feature set. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation. A number of filter metrics are listed below, just to name a few. (A combined scikit-learn sketch of the filter, wrapper, and embedded approaches appears after this list.)
    • Regression: mutual information (= information gain), correlation
    • Classification with categorical data: Chi-squared, information gain, document frequency
    • Inter-class distance
    • Error probability
    • Probabilistic distance
    • Entropy
    • Minimum-redundancy-maximum-relevance (mRMR)
    The advantage of these methods is that they are fast and simple to apply. The disadvantage is that they do not take interactions between features into account; for example, apparently useless features can become useful when grouped with others.
  • Wrapper methods use a predictive model to score feature subsets: they use the machine learning algorithm as a black box to select the best subset of features. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model. In conclusion, wrappers can be computationally expensive and carry a risk of overfitting to the model.
    • Exhaustive search over subsets: too expensive to be used in practice.
    • Greedy search is common and effective, including random selection, forward selection and backward elimination. Backward elimination tends to find better feature subsets, but it is frequently too expensive, since it must fit models on the large subsets at the start of the search. Greedy methods also ignore some relationships between features. A sketch of the backward elimination algorithm (Python-style, with a hypothetical cv_score(S) helper that evaluates feature subset S by cross-validation) could be:
      S = set(range(n))                              # start with all n features
      while True:
          scores = {f: cv_score(S - {f}) for f in S} # score every single-feature removal
          f_best = max(scores, key=scores.get)
          if scores[f_best] <= cv_score(S):          # stop when no removal improves performance
              break
          S.remove(f_best)

    • Best-first search
    • Stochastic search
  • Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process; feature selection occurs naturally as part of the machine learning algorithm. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. These approaches tend to be between filters and wrappers in terms of computational complexity.
    • L1 regularization: linear models penalized with the L1 norm have sparse solutions, i.e., many of their estimated coefficients are zero. Hence, we can keep the features with non-zero coefficients. In particular, for regression we can use sparse regression (LASSO); for classification we can use L1-penalized logistic regression or LinearSVC.
    • Decision tree
    • Regularized trees, e.g., the regularized random forest
    • Many other machine learning methods that apply a pruning step
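
A combined scikit-learn sketch of the three categories on synthetic data (the dataset, the particular estimators, and all parameter values are illustrative assumptions): a mutual-information filter, backward elimination as a wrapper, and L1-penalized logistic regression as an embedded method.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                           SequentialFeatureSelector)
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=20,
                               n_informative=5, random_state=0)

    # Filter: rank features by mutual information, keep the top 5 (no model in the loop).
    X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

    # Wrapper: greedy backward elimination with a model and cross-validation in the loop.
    sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=5, direction="backward", cv=5)
    X_wrapper = sfs.fit_transform(X, y)

    # Embedded: the L1 penalty shrinks some coefficients to exactly zero; keep the rest.
    l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    X_embedded = X[:, l1.coef_[0] != 0]

    print(X_filter.shape, X_wrapper.shape, X_embedded.shape)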

5. Feature Extraction

  • Best results are achieved when an expert constructs a set of application-dependent features. The constructive operators commonly used to build features include:
    • The equality conditions {=, !=};
    • The arithmetic operators {+, -, x, /}; 
      • e.g. "Age" is defined as Age='Year of death' - 'Year of birth'. 
    • The array operators {max(S), min(S), average(S)}
    • Some others: 
      • count(S,C) that counts the number of features in the input space S satisfying some condition C; 
  • Nevertheless, if no such expert knowledge is available, general dimensionality reduction techniques may help, including:
    • Principal component analysis (PCA) = Karhunen-Loeve transform = Hotelling transform
      • PCA is the most popular feature extraction method, especially when the data are highly correlated and thus contain redundant information (a minimal scikit-learn sketch appears after this list).
      • PCA is a linear transformation to compress data
      • PCA has been successfully applied to human face recognition
      • Variants of PCA: kernel PCA, multilinear PCA
    • Linear discriminant analysis (LDA) = Fisher analysis
      • LDA is a linear transformation, also used in face recognition.
      • LDA seeks directions that are efficient for discrimination between classes.
      • In PCA, the subspace defined by the (feature) vectors is the one that best describes the data set as a whole.
      • LDA, by contrast, tries to discriminate between the different classes of data.
    • Independent component analysis (ICA)
      • ICA is a statistical technique that represents a multidimensional random vector as a linear combination of non-Gaussian random variables ('independent components') that are as independent as possible. Put simply, it decomposes a complex dataset into independent sub-parts. To do so, it assumes that the sources are independent.
      • ICA is somewhat similar to PCA
      • ICA has many applications in data analysis, source separation, and feature extraction. 
    • Non-negative matrix factorization (NMF)
      • NMF is a more recently developed technique for finding parts-based representations, and it is based on linear representations of non-negative data.
      • In contrast to PCA or ICA, the non-negativity constraints make the representation purely additive. The intuition is that, in most real systems, the variables are non-negative. PCA and ICA often produce results that are complicated to interpret.
    • Other approaches: semidefinite embedding, multifactor dimension reduction, multilinear subspace learning, nonlinear dimensionality reduction, isomap, latent semantic analysis, partial least squares, autoencoder, etc.
    • Implementations are available via http://scikit-learn.org/stable/modules/decomposition.html 
  • In the scikit-learn package, the feature extraction module focuses mainly on text and image feature extraction.
  • More feature extraction methods exist beyond those listed above.
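
A minimal PCA sketch with scikit-learn (the correlated toy data are an assumption for illustration): ten correlated input dimensions are projected onto the two directions of largest variance, and explained_variance_ratio_ shows how much of the variance the reduced representation keeps.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 2))                                         # two underlying factors
    X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))   # correlated 10-D data

    pca = PCA(n_components=2).fit(X)
    X_reduced = pca.transform(X)                             # new, combined features
    print(X_reduced.shape, pca.explained_variance_ratio_)    # (200, 2) plus the variance kept per component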