# Help! How Do I Choose Features?

Oftentimes we’re not sure how to choose our features. This is just a small guide to help. (Disclaimer: for now, I’ll talk about binary classification.)

Many times, when we’re super-excited to predict using a fancy machine-learning algorithm, and we’re almost ready to apply our models to the test data-set, we don’t exactly know which features to pick. The number of features can range from tens to thousands, and it’s not obvious how to pick relevant features or how many we should select. Sometimes it’s not a bad idea to combine features together, also known as feature engineering. A common example you’ve probably heard of in machine learning is principal component analysis (PCA), where the data matrix X is factorized via its singular value decomposition (SVD), X = UΣVᵀ, where Σ is a diagonal matrix of singular values, and the number of singular values you keep determines how many principal components you get. You can think of principal components as a way to reduce the dimensions of your data-set. The awesome thing about PCA is that the new engineered features, or “principal components”, are linear combinations of the original features. And that’s great! We love linear combinations, because they only involve addition and scalar multiplication, and they’re not too hard to interpret. For example, if you did PCA on a dataset for house-price regression and kept only 2 principal components, the first component could be PC1 = c1\*(# of bedrooms) + c2\*(# sq. ft.), and PC2 could be something similar.
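Here’s a minimal sketch of that idea with scikit-learn’s `PCA` on a made-up, house-price-style dataset (the feature names and numbers are purely illustrative): the rows of `components_` are exactly the coefficients c1, c2, … of those linear combinations.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical house-price-style data: 100 samples, 4 features,
# where bedrooms is correlated with square footage
rng = np.random.default_rng(0)
sqft = rng.normal(1500, 300, 100)
bedrooms = sqft / 500 + rng.normal(0, 0.5, 100)
noise1 = rng.normal(0, 1, 100)
noise2 = rng.normal(0, 1, 100)
X = np.column_stack([sqft, bedrooms, noise1, noise2])

# Keep 2 principal components: each one is a linear
# combination of all 4 original features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)        # (100, 2)
print(pca.components_.shape)  # (2, 4): rows hold the combination coefficients
```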

The limitation with principal components is that the new features you make are *only* linear combinations of the old ones. That means you can’t take advantage of non-linear combinations of features. This is something neural networks are awesome at; they can create TONS of non-linear combinations/functions of features. But they have an even bigger problem: interpretability of the new features. The engineered features are basically hidden inside the weight-matrix multiplications between different layers of the network (which is just a composition of non-linear functions). And neural networks, with all that extra non-linearity, can often be brittle and break under adversarial attacks, such as one-pixel attacks on convolutional neural networks, or adding imperceptible noise to a panda image so that the network confidently mis-classifies it as something else entirely — weird, nonsense stuff like that.

So, what to do about features?? Well, if the ways we engineer new features are kinda limited, we could always just select a subset of the features we already have! But you need to be careful. There are many ways to do this, and not all of them are robust and consistent. For example, take random forests. It’s true that after training the classifier, scikit-learn will output the relevant features via the `feature_importances_` attribute of a random forest. But let’s think for a second: random forests work by training a bunch of decision trees, each one on a random subset of the training data. So if you kept re-fitting the RF model, you might get different feature importances each time, which is not robust or consistent. Wouldn’t it be confusing as a data scientist or ML engineer to see a different set of relevant features pop up each time? You clearly didn’t change the dataset, so why should you trust a different set of “important” features on each run? The problem is that the “important” features you’re picking are dependent on the random-forest model itself — and even if RFs have high accuracy, it makes more sense to choose features based on the dataset alone, rather than putting a heavy-duty model in front first.
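You can see this instability directly. Below is a small sketch (on a synthetic dataset, so the exact numbers are illustrative) fitting the same random forest twice with different random seeds: the importances always sum to 1, but the per-feature values shift between runs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data with a few informative features
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=2, random_state=0)

# Train the same forest twice, changing only the random seed
imp_a = RandomForestClassifier(n_estimators=20,
                               random_state=1).fit(X, y).feature_importances_
imp_b = RandomForestClassifier(n_estimators=20,
                               random_state=2).fit(X, y).feature_importances_

# The importances sum to 1, but the per-feature values differ run to run
print(np.round(imp_a, 3))
print(np.round(imp_b, 3))
```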

The key to selecting features that are consistent, not confusing, and robust might be this: select features independently of your model. The relevant features you select should be relevant whether you use a neural network, an RF, logistic regression, or any other supervised learning model. This way, you don’t have to worry about the predictive power of your machine-learning model while you’re trying to pick features at the same time, which can be unreliable.

So, how do you pick features independently of your model? Scikit-Learn has a few options. My favorite is called mutual information. It’s an important concept from probability theory. Basically, it quantifies the dependence between your feature variables and your label variable, relative to the assumption that they’re independent. An easier way of saying that: it measures how much your class labels depend on a specific feature.

So for example, say you’re predicting whether someone has a tumor by looking at a bunch of feature columns in your dataset, like geometric area, location, color hue, etc. If you’re trying to choose features relevant to your prediction, you can use mutual information to quantify how much the class label depends on the geometric area, location, and color hue of the tumor. And this is a measurement obtained directly from the data; it never involved a predictive model in the first place.
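Here’s a hedged sketch of that calculation with `mutual_info_classif`, using synthetic data in place of the hypothetical tumor dataset: one feature genuinely shifts with the label, the other is pure noise, and the mutual-information scores reflect exactly that — with no classifier involved.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)  # binary class labels

# One feature that depends on the label, one that is pure noise
informative = y * 2.0 + rng.normal(0, 0.5, n)
noise = rng.normal(0, 1, n)
X = np.column_stack([informative, noise])

# Mutual information is estimated from the data alone; no model is fit
mi = mutual_info_classif(X, y, random_state=0)
print(mi)  # the first score should be much larger than the second
```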

You can also use Scikit-Learn’s `chi2`, or “chi-squared”, to determine feature importance. What this does is run a chi-squared test between each feature and the label to determine which features are relevant to the label and which ones are independent of it. You can think of this method as testing a null hypothesis H0: is the feature independent of the classification label? To do this, you’d calculate a chi-squared statistic from the data table, get a p-value, and determine which features are independent or not. You then throw away the independent features (why? because they’re independent of the label according to your test, so they give no information) and keep the dependent ones.

This test is actually based on similar principles to the mutual-information calculation talked about above. However, Scikit-Learn’s `chi2` is designed for non-negative features such as counts, frequencies, or booleans — not for general continuous values — and the chi-squared approximation itself relies on having enough samples. Usually for big training sets this isn’t a problem, but for small training sets (or continuous-valued features), the assumptions can be violated, so calculating mutual information might be more reliable in those cases.
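Here’s a minimal sketch of the chi-squared test on synthetic count data (the Poisson setup is made up for illustration): one count feature is tied to the label, one isn’t, and `chi2` returns a statistic and p-value per feature so you can reject independence where the p-value is small.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)  # binary class labels

# chi2 in scikit-learn expects non-negative features, e.g. counts
dependent = rng.poisson(lam=1 + 4 * y)    # count feature tied to the label
independent = rng.poisson(lam=3, size=n)  # count feature unrelated to it
X = np.column_stack([dependent, independent])

# Returns one chi-squared statistic and one p-value per feature
stats, pvalues = chi2(X, y)
print(pvalues)  # tiny p-value -> reject independence -> keep the feature
```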

The basic point is this: the mutual-information and chi-squared ways of feature-selecting are independent of the predictive model. Your predictive model might be wildly inaccurate, but the data you’ve collected sits in a table that never changes, so computing feature relevance without the model is more consistent.

Other ways of feature selecting include Recursive Feature Elimination (RFE), which uses a pre-fixed model (say, logistic/linear regression, or a random forest), fits it, ranks the features by the model’s coefficients or importances, drops the weakest feature(s), and repeats until the desired number of features remains. (For random forests, that ranking comes from the `feature_importances_` attribute mentioned earlier, but I won’t get into that here.) Note that exhaustively testing every subset of features would be far worse — with K features there are 2-to-the-K subsets — and while RFE avoids that, it still has to refit the model many times, which can take a lot of computation.
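A quick sketch of RFE with a logistic-regression base model on synthetic data (sizes and parameters are illustrative): the `support_` mask shows which features survived, and `ranking_` records the order in which the rest were eliminated.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data with 8 features, 3 informative
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

# RFE refits the model, drops the weakest feature, and repeats
# until only n_features_to_select remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of kept features
print(rfe.ranking_)  # 1 = selected; higher numbers were eliminated earlier
```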

Another big objection I have against RFE and similar techniques is that they are fundamentally model-dependent feature-selection techniques. If your model is inaccurate, or overfits heavily, or does both and isn’t that interpretable to the user, then the features you selected weren’t actually chosen by you, but by the model. So the feature importances might not be an accurate representation of which features are actually predictive based on the dataset alone.

So what can we take away from all this? Well, in the end, feature selection is extremely important if you don’t know how to interpret the features you’d engineer using, say, principal component analysis. But when you do feature selection, it’s just as important to pay attention to *how* you’re selecting your features, as well as to computational time. Is your method taking too long to run? Is your feature selection based on fitting a particular model first? Ideally, you’d want to feature-select regardless of what model you use, so in your Jupyter Notebook, you’d want a cell for feature selection before the model — something like this:

```python
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# "mutual_info_classif" is the mutual-information way of selecting
# the K most dependent features based on the class-label
K = 3
selector = SelectKBest(mutual_info_classif, k=K)
X = new_df.iloc[:, :-1]
y = new_df.iloc[:, -1]
X_reduced = selector.fit_transform(X, y)
features_selected = selector.get_support()
```

First, I did the feature selection (above).

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y,
                                                    train_size=0.7)

# use logistic regression as a model
logreg = LogisticRegression(C=0.1, max_iter=1000, solver='lbfgs')
logreg.fit(X_train, y_train)
```

And then I trained the model (above)! :)
