Feature Selection for Machine Learning with Filter Methods in Python

Too many features can cause model performance to plateau. This article explores how feature selection can improve model performance, emphasizing that predictors should be relevant to the output data, and provides hands-on examples in Python.

Too many cooks spoil the broth.


Even back in 1575, George Gascoigne already knew that a sumptuous bowl of broth can't be achieved with too many cooks in the kitchen. That proverb still holds today, even in machine learning.


Have you ever wondered why the performance of your model hits a plateau no matter how you fine-tune those hyperparameters? Or, even worse, why you see only a mediocre improvement in performance after using the most accurate dataset you could find? Well, the culprit might actually be the predictors (columns) you use to train your models.


Ideally, predictors should be statistically relevant to the output data a model intends to predict, and they should be carefully hand-picked to ensure the best expected performance. This article gives you a brief walkthrough of what feature selection is all about, accompanied by some practical examples in Python.


Why Does Feature Selection Matter?


(Figure: self-illustrated by the author)

Feature selection is primarily focused on removing redundant or non-informative predictors from the model. [1]


On the surface, feature selection simply means discarding predictors and paring them down to an optimal subset. Here are some reasons why feature selection matters in machine learning:


  • Parsimony (or simplicity) — simple models are easier to interpret than complex models, especially when making inferences.

  • Time is money. Fewer features mean less calculation time, which directly results in shorter training times.

  • Avoiding the curse of dimensionality — A high-accuracy model trained with many features can be deceptive, as it may be a sign of overfitting and won't generalize to new samples.


Approaches for Feature Selection


There are generally three methods for feature selection:


Filter methods use statistical calculations to evaluate the relevance of the predictors outside of the predictive models and keep only the predictors that pass some criterion. [2] When choosing a filter method, consider the types of data involved in both the predictors and the outcome, which may be numerical or categorical.

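Here is a minimal sketch of a filter method using scikit-learn's SelectKBest; the dataset and the choice of k = 10 are illustrative assumptions, not prescriptions from the article.

```python
# A minimal sketch of a filter method with scikit-learn's SelectKBest.
# The breast cancer dataset and k=10 are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)  # numerical predictors, categorical outcome

# f_classif scores each predictor with an ANOVA F-test against the class label.
# For other data types, swap the scoring function: chi2 suits non-negative
# categorical predictors, f_regression suits a numerical outcome.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)     # (569, 30) -> (569, 10)
print(selector.get_support(indices=True))  # indices of the predictors kept
```

Because each predictor is scored independently of any model, filter methods are fast, but they can overlook interactions between predictors.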

Wrapper methods, by contrast, use the predictive model itself to evaluate feature subsets: models are trained on candidate subsets of predictors, and the subset yielding the best performance is kept. Embedded methods go one step further and fold the selection into model training itself, as with the penalized coefficients of lasso regression.
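A minimal sketch of a wrapper method, using scikit-learn's recursive feature elimination (RFE); the estimator and the number of predictors to keep are illustrative assumptions:

```python
# A minimal sketch of a wrapper method via recursive feature elimination (RFE).
# The logistic regression estimator and n_features_to_select=10 are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so coefficient magnitudes are comparable

# RFE repeatedly fits the estimator, drops the weakest predictor by
# coefficient magnitude, and refits until the requested number remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask over the original predictors
print(rfe.ranking_)   # rank 1 = kept; larger ranks were eliminated earlier
```

Retraining the model on every candidate subset makes wrapper methods far more expensive than filters, but they can account for interactions between predictors.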
