Feature Selection in Machine Learning

Machine Learning

In the real world, data is rarely as clean as it is often assumed to be. That's where all the data mining and wrangling come in: to build insights out of data that has been structured using queries, that probably contains missing values, and that exhibits patterns unseen to the naked eye. That's where Machine Learning comes in: to find those patterns and make use of the newly understood relationships in the data to predict outcomes.


To understand the depth of the algorithm, one needs to read through the variables in the data and what those variables represent. Understanding this is important because you need to justify your outcomes based on your understanding of the data. If your data contains five, or even fifty, variables, let's say you're able to go through them all. But what if it contains 200 variables? You don't have the time to go through each one. On top of that, various algorithms will not work with categorical data, so you have to convert all the categorical columns into quantitative variables (the encoded columns look quantitative, but the metrics will confirm that they still represent categories) to push them into the model. This increases the number of variables in your data, and now you're hanging around with 500 variables. How do you deal with them? You might think that dimensionality reduction is the answer right away. Dimensionality reduction algorithms will reduce the dimensions, but the interpretability isn't that good. What if I told you that there are other techniques that can eliminate features, while the retained features remain easy to understand and interpret?

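As a small aside, here is a minimal sketch of that categorical-to-quantitative encoding step, assuming pandas and a hypothetical column named "browser" (the data is made up purely for illustration):

```python
import pandas as pd

# Hypothetical data-set with one categorical and one numeric column.
df = pd.DataFrame({
    "browser": ["Chrome", "Firefox", "Safari", "Chrome"],
    "browsing_time": [12.5, 3.2, 7.8, 9.1],
})

# One-hot encode the categorical column: the new 0/1 columns look
# quantitative, but they still represent categories.
encoded = pd.get_dummies(df, columns=["browser"])
print(encoded.columns.tolist())
```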

Depending on whether the analysis is regression- or classification-based, feature selection techniques can vary, but the general idea of how to implement them remains the same.


Here are some Feature Selection techniques to tackle this issue:


1. Highly Correlated Variables

Variables which are highly correlated with each other give the same information to the model, and hence it becomes unnecessary to include all of them in our analysis. For example: if a dataset contains a feature "Browsing Time" and another called "Data Used while Browsing", you can imagine that these two variables will be correlated to some extent, and we would see this high correlation even if we picked up an unbiased sample of the data. In such a case, we require only one of these variables to be present as a predictor in the model, because if we use both, the model will over-fit and become biased towards these particular features.

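As a rough sketch of this idea, assuming a pandas DataFrame `X` of numeric predictors (the 0.9 threshold is an arbitrary choice for illustration), one way to drop one variable out of each highly correlated pair is:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column out of every pair whose absolute correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```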

Photo by Akin Cakiner on Unsplash

2. P-Values

In algorithms like Linear Regression, an initial statistical model is always a good idea, as it helps in assessing the importance of features through the P-values obtained from that model. After setting a level of significance, we check the P-value obtained for each feature; if this value is less than the level of significance, the feature is significant, i.e. a change in its value is likely to correspond to a change in the value of the target.

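A minimal sketch of reading those P-values from an initial model, assuming statsmodels, a pandas DataFrame `X` of predictors, and a target `y`:

```python
import statsmodels.api as sm

# Fit an initial OLS model on all candidate features (an intercept is added explicitly).
model = sm.OLS(y, sm.add_constant(X)).fit()

# Keep the features whose P-value is below the chosen significance level.
significance_level = 0.05
pvals = model.pvalues.drop("const")
significant = pvals[pvals < significance_level].index.tolist()
print("Significant features:", significant)
```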

Photo by Joshua Eckstein on Unsplash

3. Forward Selection

Forward Selection is a technique that involves the use of step-wise regression. The model starts building from ground zero, i.e. an empty model, and each iteration adds a variable such that there is an improvement in the model being built. The variable to be added in each iteration is determined using its significance, which can be calculated using various metrics, a common one being the P-value obtained from an initial statistical model built using all the variables. At times, Forward Selection can cause an over-fit, because it can add highly correlated variables to the model even when they provide the same information (as long as the metric shows an improvement).

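A rough sketch of P-value-based forward selection, assuming statsmodels, a pandas DataFrame `X`, and a target `y` (scikit-learn's SequentialFeatureSelector offers a cross-validation-score-based alternative):

```python
import statsmodels.api as sm

def forward_selection(X, y, significance_level=0.05):
    """Add, one per iteration, the remaining feature with the lowest P-value."""
    selected, remaining = [], list(X.columns)
    while remaining:
        # P-value of each candidate when added to the current model.
        pvals = {
            c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
            for c in remaining
        }
        best = min(pvals, key=pvals.get)
        if pvals[best] >= significance_level:
            break  # no remaining candidate is significant
        selected.append(best)
        remaining.remove(best)
    return selected
```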

Photo by Edu Grande on Unsplash

4. Backward Elimination

Backward Elimination also involves step-wise feature selection, but in the opposite direction to Forward Selection. In this case, the initial model starts out with all the independent variables, and these variables are eliminated one per iteration if they do not add value to the newly formed regression model. This is again based on the P-values obtained using the initial statistical model, and based on these P-values, features are eliminated from the model. With this method as well, there is no guarantee that highly correlated variables will be removed.

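A corresponding sketch of P-value-based backward elimination under the same assumptions (statsmodels, DataFrame `X`, target `y`):

```python
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Start with all features and drop the least significant one per iteration."""
    selected = list(X.columns)
    while selected:
        pvals = sm.OLS(y, sm.add_constant(X[selected])).fit().pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= significance_level:
            break  # every remaining feature is significant
        selected.remove(worst)
    return selected
```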

Photo by Markus Spiske on Unsplash

5. Recursive Feature Elimination (RFE)

RFE is a widely used technique/algorithm to select an exact number of significant features, sometimes to explain a particular number of "most important" features impacting the business, and sometimes as a method to reduce a very high number of features (say around 200–400) down to only the ones that create even a bit of impact on the model, eliminating the rest. RFE uses a rank-based system: it shows the ranks of the features in the dataset, and these ranks are used to eliminate features in a recursive loop, based on the collinearity present among them and, of course, on the significance of these features in the model. Apart from ranking the features, RFE can show whether the selected features are actually important or not, because it is entirely possible that the number we chose does not represent the optimal number of important features, and that the optimal number is more or less than the one chosen by the user.

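A minimal sketch using scikit-learn's RFE, assuming a DataFrame `X` and a target `y` (the choice of estimator and of 10 retained features is arbitrary here):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively fit the estimator and drop the weakest feature until 10 remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=10, step=1)
selector.fit(X, y)

# ranking_ is 1 for selected features and grows for those eliminated earlier.
for name, rank, kept in zip(X.columns, selector.ranking_, selector.support_):
    print(f"{name}: rank={rank}, selected={kept}")
```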

Photo by Andrew Seaman on Unsplash

6. Charted Feature Importance

When we talk about the interpretability of machine learning algorithms, we usually discuss linear regression (since we can analyze feature importance using the P-values) and decision trees (which practically show feature importance in the form of a tree, along with the hierarchy of importance). On the other hand, we often use the variable importance chart to plot the variables and the "amount of their importance" for algorithms such as the Random Forest Classifier, Light Gradient Boosting Machine, and XGBoost. This is particularly useful when a well-structured ranking of feature importance needs to be presented to the business being analyzed.

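A short sketch of such a variable-importance chart with a random forest, assuming matplotlib, a DataFrame `X`, and a target `y`:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Sort the impurity-based importances and plot them as a horizontal bar chart.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", title="Variable importance")
plt.tight_layout()
plt.show()
```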

Photo by Robert Anasch on Unsplash

7. Regularization

Regularization is done to manage the trade-off between bias and variance. Bias tells us how far the model's predictions are from the true values, i.e. how much it under-fits the training data-set, while variance tells us how different the predictions are between the training and testing data-sets, i.e. how much it over-fits. Ideally, both bias and variance need to be reduced. Regularization comes to save the day here! There are mainly two types of regularization techniques:


L1 Regularization - Lasso: Lasso penalizes the model's beta coefficients to change their importance in the model, and may even shrink some of them all the way to zero (i.e. basically remove these variables from the final model). Generally, Lasso is used when you observe that your data-set has a large number of variables and you need to remove some of them to better understand how the important features affect your model (i.e. the features that Lasso finally selects, with their assigned importance).


L2 Regularization - Ridge: The function of Ridge is to keep all the variables, i.e. use all the variables to build the model, while assigning them importance in a way that improves the model's performance. Ridge is a great choice when the number of variables in the data-set is low, and hence all of those variables are needed to interpret the insights and predicted target results obtained.


Since Ridge keeps all the variables intact and Lasso does a better job at assigning importance to the variables, a combination of the two, known as Elastic-Net, was developed to combine the best features of Ridge and Lasso. Elastic-Net thus becomes the ideal choice in such cases.

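A compact sketch comparing the three, assuming scikit-learn and a numeric DataFrame `X` with a continuous target `y` (the alpha values are arbitrary illustrations; in practice they are tuned, e.g. with cross-validation):

```python
import pandas as pd
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Regularized models are sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

models = {
    "Lasso (L1)": Lasso(alpha=0.1),            # can shrink coefficients exactly to zero
    "Ridge (L2)": Ridge(alpha=1.0),            # shrinks coefficients but keeps all variables
    "Elastic-Net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # mixes both penalties
}

for name, model in models.items():
    coefs = pd.Series(model.fit(X_scaled, y).coef_, index=X.columns)
    print(name, "zeroed features:", list(coefs[coefs == 0].index))
```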

Photo by Hunter Harritt on Unsplash

There are more ways to select features while performing machine learning, but the basic idea usually remains the same: showcase the feature importance and then eliminate variables based on the obtained "importance". Importance here is a rather subjective term, since it is not a single metric but a collection of metrics and graphs that can be used to check for the most important features.


Thank you for reading! Happy learning!


Translated from: https://medium.com/towards-artificial-intelligence/feature-selection-in-machine-learning-3b2902852933
