Help! How Do I Feature Select?

Latest recommended article published on 2024-09-22 17:12:26

weixin_26752765



Original article: https://towardsdatascience.com/help-how-do-i-feature-select-eaf37e58fdaf


Oftentimes, we're not sure how to choose our features. This is just a small guide to help you choose. (Disclaimer: for now I'll only talk about binary classification.)


Many times, when we're super-excited to predict with a fancy machine-learning algorithm and almost ready to apply our models to classify the test data set, we don't exactly know what features to pick. The number of features can range from tens to thousands, and it's not clear how to pick the relevant ones or how many we should select. Sometimes it's not a bad idea to combine features together, which is one form of feature engineering. A common example you've probably heard of in machine learning is principal component analysis (PCA), where the centered data matrix X is factorized by its singular value decomposition (SVD) into U*∑*V^T, where ∑ is a diagonal matrix of singular values, and the number of singular values you keep determines how many principal components you get. You can think of principal components as a way to reduce the dimension of your data set. The awesome thing about PCA is that the new engineered features, the "principal components", are linear combinations of the original features. And that's great! We love linear combinations, because they only involve addition and scalar multiplication, and they're not too hard to interpret. For example, say you did PCA on a data set for house-price regression and kept only 2 principal components. Then the first component, PC1, could be: c1*(# of bedrooms) + c2*(# sq. ft.). And PC2 could be something similar with different weights.

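To make the "linear combination" point concrete, here is a minimal sketch using plain NumPy. The toy house data (bedrooms and square feet) is made up for illustration, and standardizing the columns before the SVD is my own choice so that the two units are comparable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 houses, 2 features (bedrooms, square feet).
bedrooms = rng.integers(1, 6, size=200).astype(float)
sqft = 400.0 * bedrooms + rng.normal(0.0, 150.0, size=200)
X = np.column_stack([bedrooms, sqft])

# PCA works on centered data; also scale so the units are comparable.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Thin SVD: X_std = U @ diag(S) @ Vt. Rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# PC1 is a linear combination of the standardized features:
# PC1 = c1 * bedrooms_std + c2 * sqft_std, with (c1, c2) = Vt[0].
c1, c2 = Vt[0]
pc1 = X_std @ Vt[0]

# Share of the total variance captured by each component.
explained_variance_ratio = S**2 / np.sum(S**2)
print("PC1 weights:", c1, c2)
print("explained variance ratio:", explained_variance_ratio)
```

The weights c1 and c2 are exactly the coefficients from the house-price example above, just computed on standardized columns.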

The limitation of principal components is that the new features you make are *only* linear combinations of the old ones. That means you can't take advantage of non-linear combinations of features. This is something neural networks are awesome at; they can create TONS of non-linear combinations/functions of features. But they have an even bigger problem: interpretability of the new features. The engineered features are basically hidden inside the weight-matrix multiplications between the layers of the network (which is just a composition of non-linear functions). And neural networks, with that extra non-linearity, can often be brittle and break under adversarial attacks, such as few-pixel attacks on convolutional neural networks, or tricking a network into classifying a panda plus a black square as a vulture. Weird, nonsense stuff like that.


So, what to do about features? Well, if the ways we engineer new features are kinda limited, we could always just select a subset of the features we already have! But you need to be careful. There are many ways to do this, but not all of them are robust and consistent. For example, take random forests. It's true that after fitting the classifier, scikit-learn will give you feature importances through the random forest's feature_importances_ attribute. But let's think for a second: random forests work by training a bunch of decision trees, each one on a random sample of the training data. So if you keep refitting the RF model without fixing the random seed, you might get different feature importances each time, and this is not robust or consistent. Wouldn't it be confusing as a data scientist or ML engineer to see a different set of relevant features pop up each time? You clearly didn't change the data set! So why should you trust different sets of "important" features? The problem is that the "important" features you're picking depend on the random-forest model itself, and even if RFs have high accuracy, it makes more sense to choose features based on the data set alone rather than running a heavy-duty model first.

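Here is a rough sketch of that instability. The data comes from scikit-learn's make_classification (synthetic, just for illustration), and the only thing that changes between fits is the forest's random seed. The small forest size (20 trees) is an assumption chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=2,
    random_state=0,
)

importances = []
for seed in (1, 2, 3):
    rf = RandomForestClassifier(n_estimators=20, random_state=seed)
    rf.fit(X, y)
    # feature_importances_ is an attribute that exists only after fitting.
    importances.append(rf.feature_importances_)
    print(seed, np.round(rf.feature_importances_, 3))

importances = np.array(importances)
# Largest seed-to-seed gap in any single feature's importance.
max_spread = (importances.max(axis=0) - importances.min(axis=0)).max()
print("max spread across seeds:", max_spread)
```

Same data, three seeds, three different importance vectors. Fixing random_state makes one run reproducible, but it doesn't make the ranking any less dependent on the model.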

The key to selecting features that are consistent, not confusing, and robust might be this: select features independently of your model. The features you select should be relevant whether you use a neural network, an RF, logistic regression, or any other supervised learning model. This way, you don't have to worry about the predictive power of your machine learning model while you're trying to pick features at the same time, which can be unreliable.


So, how do you pick features independently of your model? Scikit-learn has a few options. My favorite is called mutual information. It's an important concept from information theory, which builds on probability theory. Basically, it measures how far the joint distribution of a feature and the label is from what it would be if the two were independent. An easier way of saying that: it measures how much your class labels depend on a specific feature.


So for example, say you're predicting whether someone has a tumor by looking at a bunch of feature columns in your data set, like geometric area, location, color hue, etc. If you're trying to choose features relevant to your prediction, you can use mutual information to quantify how much the class label depends on the geometric area, location, and color hue of the tumor. And this is a measurement taken directly from the data; it never involves a predictive model in the first place.

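As a quick sanity check of that idea, here is a small sketch with scikit-learn's mutual_info_classif. The two features are synthetic and hypothetical: one shifts with the label, the other is pure noise, so the first should get the clearly higher score.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000

# Hypothetical binary label.
y = rng.integers(0, 2, size=n)

# "informative" shifts with the label; "noise" ignores the label entirely.
informative = y + rng.normal(0.0, 0.5, size=n)
noise = rng.normal(0.0, 1.0, size=n)
X = np.column_stack([informative, noise])

# random_state fixes the small jitter the estimator adds to continuous features.
mi = mutual_info_classif(X, y, random_state=0)
print("estimated MI (informative, noise):", mi)
```

No classifier was trained here; the scores come from the data table alone.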

You can also use scikit-learn's chi2, or "chi-squared", to score features. What this does is run a chi-squared test between each feature and the label to determine which features are related to the label and which ones are independent of it. You can think of this method as testing a "null hypothesis" H0: is the feature independent of the classification label? To do this, you calculate a chi-squared statistic from the data table, get a p-value, and decide which features are independent. You then throw away the independent features (why? because according to your test they're independent of the label, so they give no information about it) and keep the dependent ones.


This test is based on principles similar to the mutual-information calculation discussed above. However, the chi-squared test rests on assumptions that mutual information doesn't need: scikit-learn's chi2 expects non-negative feature values (such as counts or frequencies), and its p-values come from a large-sample approximation. For big training sets this usually isn't a problem, but for small ones the approximation can break down, so calculating mutual information might be more reliable in those cases.

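A minimal sketch of the chi-squared route, assuming count-valued features (scikit-learn's chi2 requires non-negative inputs). The Poisson-distributed columns and the 0.05 threshold are assumptions for illustration, not part of the original write-up.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# chi2 requires non-negative features, e.g. counts.
# "dependent" has a higher rate when y == 1; "independent" ignores y.
dependent = rng.poisson(lam=2.0 + 3.0 * y)
independent = rng.poisson(lam=3.0, size=n)
X = np.column_stack([dependent, independent])

scores, p_values = chi2(X, y)
alpha = 0.05  # conventional significance level, an assumption here
keep = p_values < alpha  # reject H0 (independence) -> keep the feature
print("chi2 scores:", scores)
print("p-values:", p_values)
print("keep mask:", keep)
```

A small p-value means the data contradict H0, so the feature looks dependent on the label and is worth keeping.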

The basic point is this: the mutual-information and chi-squared ways of selecting features don't depend on the predictive model. Your predictive model might be wildly inaccurate, but the data you've collected sits in a table that never changes, so scoring your features without the model is more consistent.


Other ways of selecting features include Recursive Feature Elimination (RFE), which uses a pre-fixed model (say, logistic/linear regression, or a random forest). It fits the model, drops the least important feature(s) according to the model's coefficients or feature importances (the feature_importances_ attribute, in the random-forest case), and repeats on the remaining features until the number you asked for is left. That is much cheaper than trying every subset (with K features there are 2-to-the-K subsets), but RFE still takes a lot of time, because the model has to be refit at every elimination step.

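For comparison, here is roughly what RFE looks like in scikit-learn: it fits the estimator, drops the weakest feature, and repeats until the requested number remains. The data set is synthetic, and logistic regression as the pre-fixed model is just one possible choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

# The pre-fixed model whose coefficients decide what gets eliminated.
estimator = LogisticRegression(max_iter=1000)

# step=1 removes one feature per round, so the model is refit several times.
rfe = RFE(estimator, n_features_to_select=3, step=1)
rfe.fit(X, y)

print("selected mask:", rfe.support_)
print("ranking (1 = kept):", rfe.ranking_)
```

Swap the estimator and the selected mask can change, which is exactly the model dependence at issue here.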

Another big objection I have to RFE and similar techniques is that they are fundamentally model-dependent ways of selecting features. If your model is inaccurate, or overfits heavily, or does both and isn't very interpretable, then the features you selected weren't actually chosen by you, but by the model. So the feature importance might not accurately represent which features are actually predictive based on the data set alone.


So what can we take away from all this? Well, in the end, feature selection is extremely important if you don't know how to interpret features engineered with, say, principal component analysis. However, when you do feature selection, it's just as important to pay attention to how you're selecting your features, as well as to the computational time. Is your method taking too much time on the computer? Does your feature selection depend on fitting a particular model first? Ideally, you want to select features regardless of what model you use, so in your Jupyter Notebook you'd make a cell for feature selection before the model, something like this:


from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# "mutual_info_classif" is the mutual-information way of selecting
# the K features the class label depends on most
K = 3
selector = SelectKBest(mutual_info_classif, k=K)
# new_df: DataFrame with the features in every column but the last,
# and the class label in the last column
X = new_df.iloc[:, :-1]
y = new_df.iloc[:, -1]
X_reduced = selector.fit_transform(X, y)

# boolean mask marking which columns of X were kept
features_selected = selector.get_support()

First, I did the feature selection (above).


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, train_size=0.7)
# use logistic regression as a model
logreg = LogisticRegression(C=0.1, max_iter=1000, solver='lbfgs')
logreg.fit(X_train, y_train)

And then I trained the model (above)! :)
