Study notes for Feature Engineering

1. Introduction

  • A feature is an individual measurable property or attribute of a phenomenon being observed. Choosing discriminating and independent features is key to the success of any pattern recognition algorithm at classification. Note that the concept of a "feature" is essentially the same as the concept of an explanatory variable used in statistics.
  • Feature selection is the process of selecting a subset of relevant features for use in model construction, and of throwing away irrelevant features.
  • Feature extraction transforms (or combines) the original input data (or features) into a new (reduced) feature space in which the pattern recognition problem will be easier to solve. For example, time-series inputs are usually transformed from the time domain to the frequency domain by applying a Fourier transform. Hence, feature extraction is often domain-specific.
  • Shallow data refers to data where the number of features (n) is much greater than the number of examples (m), i.e., n >> m; skinny data refers to data where m >> n; big data refers to data where both m and n are large.
  • A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.
  • Implementations are available in the scikit-learn Python package.
  • Study notes: the main objective of feature engineering (both selection and extraction) is to select or combine relevant features that better represent the raw input data, in order to reduce the dimensionality of the input (i.e., the original feature space) and hence let machine learning algorithms learn the data pattern more easily. Do remember that relevance does not imply usefulness, and usefulness does not imply relevance: a relevant feature may not be useful in practice (e.g., because it is redundant given the already selected features), while a feature that appears irrelevant on its own may be quite useful in combination with others.
  • Resources: JMLR Special Issue on Variable and Feature Selection.
  • Example. Which features should be used for classifying a student as a good or bad one? The available features are: marks, height, sex, weight, IQ. 
    • Feature selection would select marks and IQ, and discard height, weight and sex.
    • Feature extraction would build a new feature such as marks + IQ², i.e., a combination of two basic features (see the sketch below).
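
To make the distinction concrete, here is a minimal scikit-learn sketch (the toy data, the column meanings, and the choice of SelectKBest/PCA are illustrative assumptions, not part of the notes above): selection keeps a subset of the original columns, while extraction builds new columns by combining them.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))              # toy columns: marks, height, sex, weight, IQ
    y = (X[:, 0] + X[:, 4] > 0).astype(int)    # label driven only by marks and IQ

    X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)   # selection: 2 original columns kept
    X_ext = PCA(n_components=2).fit_transform(X)              # extraction: 2 new combined columns
    print(X_sel.shape, X_ext.shape)            # both (100, 2), but with different meanings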

2. How good are my features?

It is important for us to assess or estimate the usefulness of a specific feature or feature subset. This can be done from the following points of view:

  1. Classical statistics viewpoint: perform a statistical test. => This captures "relevance" (independent of the specific problem) and is suitable for the "Filter" methods.
    • Assess the "statistical significance" of the relevance of given features. 
    • For a training set, let the ranking index be a random variable R. 
    • A feature is probably approximately irrelevant iff P(R > ε) ≤ δ,
      where ε and δ are the level of approximation and the risk of being wrong, respectively. Both values are positive.
    • Using the null hypothesis "the feature is irrelevant", we then apply statistical tests such as the Z-test, T-test, or ANOVA test to compute the significance, if the probability P(X) or P(Y|X) can be obtained. Otherwise, ranking indices such as the AUC or the Wilcoxon-Mann-Whitney statistic (i.e., based on the classification accuracy rate, etc.) may be used instead to derive the significance. (A small sketch of both viewpoints appears after this list.)
  2. Machine learning viewpoint: use a training set and a validation set. => This captures "usefulness" (dependent on the given problem) and is suitable for the "Wrapper" methods.
    • In general, this is realized by cross-validation.
    • Leave-one-out cross-validation has high variance.
    • Often, 10-fold cross-validation is a good choice.
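
A minimal sketch of both viewpoints on synthetic data (the data, the ANOVA F-test, and logistic regression are assumptions chosen for illustration): the F-test p-values address the null hypothesis "this feature is irrelevant", while 10-fold cross-validation scores the usefulness of a feature subset for a given model.

    import numpy as np
    from sklearn.feature_selection import f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] > 0).astype(int)              # only feature 0 is relevant

    # Statistical viewpoint: per-feature F statistics and p-values.
    F, p = f_classif(X, y)
    print(p)                                   # a very small p-value is expected only for feature 0

    # Machine-learning viewpoint: 10-fold CV accuracy with vs. without feature 0.
    clf = LogisticRegression()
    print(cross_val_score(clf, X, y, cv=10).mean())          # all features
    print(cross_val_score(clf, X[:, 1:], y, cv=10).mean())   # relevant feature removed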

3. Benefits of Feature Selection

The central assumption is that the data contain many redundant or irrelevant features. Redundant features are those which provide no more information than the currently selected features (e.g., highly correlated features). Irrelevant features provide no useful information in any context. Noisy features are those whose signal-to-noise ratio is too low to be useful for discriminating the outcome.

Why do feature selection? Consider two cases:

  • Case 1: We are interested in the features themselves; we want to know which are relevant; we don't necessarily want to do prediction. If we fit a model, it should be interpretable, e.g., which features (reasons) cause lung cancer? Here, feature selection is needed.
  • Case 2: We are interested in prediction; the features are not interesting in themselves, we just want to build a good classifier (or another kind of predictor). If the only concern is accuracy and the whole data set can be processed, feature selection is not needed (as long as there is regularization). If computational complexity is critical (e.g., embedded devices, web-scale data, a fancy learning algorithm), consider using feature selection.

What are the benefits of feature selection?

  • Alleviate the curse of dimensionality. As the number of features increases, the volume of the feature space grows exponentially, and data becomes increasingly sparse in that space.
  • Improved model interpretability. One motivation is to find a simple, "parsimonious" model. Occam's razor: the simplest explanation that accounts for the data is the best.
  • Shorter training times. Training with all features may be too expensive. In practice, it is not unusual to end up with more than 10^6 features.
  • Enhanced generalization by reducing overfitting due to many features (and few examples). The presence of irrelevant features hurts generalization (see the sketch after this list). Two morals:
    1. Moral 1: In the presence of many irrelevant features, we might just fit noise.
    2. Moral 2: Training error can lead us astray. 
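
A small sketch of the two morals on purely synthetic data (the random data and the decision tree are assumptions for illustration): with many irrelevant features and few examples, a flexible model can reach near-perfect training accuracy on labels that are pure noise, while cross-validated accuracy stays near chance.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 500))             # 50 examples, 500 irrelevant features
    y = rng.integers(0, 2, size=50)            # random labels: there is nothing to learn

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print("training accuracy:", tree.score(X, y))                              # ~1.0 (fitting noise)
    print("10-fold CV accuracy:", cross_val_score(tree, X, y, cv=10).mean())   # ~0.5 (chance level)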

4. Feature Subset - Search and Evaluation

  • Filter methods use a proxy measure instead of the error rate to score a feature subset; features are selected before the machine learning algorithm is run. This measure is chosen to be fast to compute, whilst still capturing the usefulness of the feature set. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation. A number of filter metrics are listed below, just to name a few. (A combined scikit-learn sketch of the filter, wrapper, and embedded approaches appears after this list.)
    • Regression: mutual information (= information gain), correlation
    • Classification with categorical data: Chi-squared, information gain, document frequency
    • Inter-class distance
    • Error probability
    • Probabilistic distance
    • Entropy
    • Minimum-redundancy-maximum-relevance (mRMR)
    The advantage of these methods is that they are fast and simple to apply. The disadvantage is that they do not take interactions between features into account; for example, apparently useless features can become useful when grouped with others.
  • Wrapper methods use a predictive model to score feature subsets: they use the machine learning algorithm as a black box to select the best subset of features. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model. In conclusion, wrappers can be computationally expensive and carry a risk of overfitting to the model.
    • Exhaustive search over subsets: too expensive to be used in practice.
    • Greedy search is common and effective, including random selection, forward selection and backward elimination. Backward elimination tends to find better feature subsets, but it is frequently too expensive, since it must fit models on the large subsets at the start of the search. Greedy methods also ignore some relationships between features. A sketch of the backward elimination algorithm (Python-style, with a hypothetical cv_score(S) helper that evaluates feature subset S by cross-validation) could be:
      S = set(range(n))                              # start with all n features
      while True:
          scores = {f: cv_score(S - {f}) for f in S} # score every single-feature removal
          f_best = max(scores, key=scores.get)
          if scores[f_best] <= cv_score(S):          # stop when no removal improves performance
              break
          S.remove(f_best)

    • Best-first search
    • Stochastic search
  • Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process; feature selection occurs naturally as part of the machine learning algorithm. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. These approaches tend to be between filters and wrappers in terms of computational complexity.
    • L1 regularization: linear models penalized with the L1 norm have sparse solutions, i.e., many of their estimated coefficients are zero. Hence, we can keep the features with non-zero coefficients. In particular, for regression we can use sparse regression (LASSO); for classification we can use L1-penalized logistic regression or LinearSVC.
    • Decision tree
    • Regularized trees, e.g., the regularized random forest
    • Many other machine learning methods that apply a pruning step
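
A combined scikit-learn sketch of the three categories on synthetic data (the dataset, the particular estimators, and all parameter values are illustrative assumptions): a mutual-information filter, backward elimination as a wrapper, and L1-penalized logistic regression as an embedded method.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                           SequentialFeatureSelector)
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=20,
                               n_informative=5, random_state=0)

    # Filter: rank features by mutual information, keep the top 5 (no model in the loop).
    X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

    # Wrapper: greedy backward elimination with a model and cross-validation in the loop.
    sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=5, direction="backward", cv=5)
    X_wrapper = sfs.fit_transform(X, y)

    # Embedded: the L1 penalty shrinks some coefficients to exactly zero; keep the rest.
    l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    X_embedded = X[:, l1.coef_[0] != 0]

    print(X_filter.shape, X_wrapper.shape, X_embedded.shape)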

5. Feature Extraction

  • Best results are achieved when an expert constructs a set of application-dependent features. The constructive operators commonly used to build features include:
    • The equality conditions {=, !=};
    • The arithmetic operators {+, -, x, /}; 
      • e.g. "Age" is defined as Age='Year of death' - 'Year of birth'. 
    • The array operators {max(S), min(S), average(S)}
    • Some others: 
      • count(S,C) that counts the number of features in the input space S satisfying some condition C; 
  • Nevertheless, if no such expert knowledge is available, general dimensionality reduction techniques may help, including:
    • Principal component analysis (PCA) = Karhunen-Loeve transform = Hotelling transform
      • PCA is the most popular feature extraction method, especially when the data are highly correlated and thus contain redundant information (a minimal scikit-learn sketch appears after this list).
      • PCA is a linear transformation to compress data
      • PCA has been successfully applied to human face recognition
      • Variants of PCA: kernel PCA, multilinear PCA
    • Linear discriminant analysis (LDA) = Fisher analysis
      • LDA is a linear transformation, also used in face recognition.
      • LDA seeks directions that are efficient for discrimination between classes.
      • In PCA, the subspace defined by the (feature) vectors is the one that best describes the data set as a whole.
      • LDA, by contrast, tries to discriminate between the different classes of data.
    • Independent component analysis (ICA)
      • ICA is a statistical technique that represents a multidimensional random vector as a linear combination of non-Gaussian random variables ('independent components') that are as independent as possible. Put simply, it decomposes a complex dataset into independent sub-parts. To do so, it assumes that the sources are independent.
      • ICA is somewhat similar to PCA
      • ICA has many applications in data analysis, source separation, and feature extraction. 
    • Non-negative matrix factorization (NMF)
      • NMF is a more recently developed technique for finding parts-based representations, and it is based on linear representations of non-negative data.
      • In contrast to PCA or ICA, the non-negativity constraints make the representation purely additive. The intuition is that, in most real systems, the variables are non-negative. PCA and ICA often produce results that are complicated to interpret.
    • Other approaches: semidefinite embedding, multifactor dimension reduction, multilinear subspace learning, nonlinear dimensionality reduction, isomap, latent semantic analysis, partial least squares, autoencoder, etc.
    • Implementations are available via http://scikit-learn.org/stable/modules/decomposition.html 
  • In the scikit-learn package, the feature extraction module focuses mainly on text and image feature extraction.
  • More feature extraction methods exist beyond those listed above.
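
A minimal PCA sketch with scikit-learn (the correlated toy data are an assumption for illustration): ten correlated input dimensions are projected onto the two directions of largest variance, and explained_variance_ratio_ shows how much of the variance the reduced representation keeps.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 2))                                         # two underlying factors
    X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))   # correlated 10-D data

    pca = PCA(n_components=2).fit(X)
    X_reduced = pca.transform(X)                             # new, combined features
    print(X_reduced.shape, pca.explained_variance_ratio_)    # (200, 2) plus the variance kept per component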