Feature Selection for Regression: How to Perform Feature Selection for Regression Problems

1. Introduction

What is feature selection?

Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable (that we wish to predict).

Target variable here refers to the variable that we wish to predict.

For this article, we assume that all input variables are numerical and that the target is numerical as well, that is, a regression predictive modeling problem. Under that assumption, we can estimate the relationship between each input variable and the target variable by computing a metric such as the correlation coefficient.
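
As a rough illustration of estimating the relationship between each input variable and the target, the sketch below computes the Pearson correlation of every column with the target. The synthetic data, the coefficients, and the variable names are assumptions made for this example only; they are not the article's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic data (an assumption for illustration): three numerical inputs,
# of which only the first two influence the numerical target.
X = rng.normal(size=(n_samples, 3))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

# Pearson correlation between each input column and the target.
correlations = np.array(
    [np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])]
)

# Rank features by absolute correlation, strongest first.
ranking = np.argsort(-np.abs(correlations))

print("correlations:", np.round(correlations, 3))
print("ranking (best first):", ranking)
```

Ranking by the absolute value matters here: a strongly negative correlation is just as informative for prediction as a strongly positive one.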

2. The main numerical feature selection methods

The two best-known feature selection techniques for numerical input data and a numerical target variable are the following:

  • Correlation (Pearson, Spearman)
  • Mutual Information (MI, normalized MI)

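Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which detects linear relationships between numerical variables; the significance test usually paired with it assumes that each variable is approximately Gaussian. Spearman's correlation is the rank-based alternative and captures monotonic relationships that are not necessarily linear.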
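
To make the difference between the two correlation measures concrete, here is a minimal sketch that compares them on a monotonic but non-linear relationship. The helper names `pearson` and `spearman` and the exponential test data are assumptions for illustration; the rank trick used for Spearman is only valid when there are no tied values.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks.

    argsort of argsort yields ranks only when all values are distinct,
    which holds for the continuous random data used below.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)


rng = np.random.default_rng(1)
x = rng.uniform(0.0, 5.0, size=300)
y = np.exp(x)  # strictly increasing in x, but far from linear

pearson_xy = pearson(x, y)
spearman_xy = spearman(x, y)

print("Pearson: ", round(pearson_xy, 3))
print("Spearman:", round(spearman_xy, 3))
```

Because `y` is a strictly increasing function of `x`, the ranks agree exactly and Spearman's coefficient is essentially 1, while Pearson's coefficient is noticeably lower since the relationship is not a straight line.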

Scoring features by correlation is done in 2 steps:

  1. The correlation between each regressor and the target is computed, that is, mean((X[:, i] - mean(X[:, i])) * (y - mean(y))) / (std(X[:, i]) * std(y)).

  2. It is converted to an F score then to a p-value.

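
The two steps above can be sketched as follows. This assumes scikit-learn's `f_regression`, whose documentation describes the same two steps; the article does not name a library, so treating it as the implementation is an assumption, as is the synthetic data. The manual F score uses the standard conversion F = r² / (1 - r²) × (n - 2) and is compared against the library result.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
n_samples = 200

# Synthetic data (an assumption for illustration): features 0 and 2 carry
# signal, features 1 and 3 are pure noise.
X = rng.normal(size=(n_samples, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n_samples)

# Step 1: correlation between each regressor and the target.
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()
r = (X_centered * y_centered[:, None]).mean(axis=0) / (X.std(axis=0) * y.std())

# Step 2a: convert each correlation to an F score with (1, n - 2)
# degrees of freedom.
dof = n_samples - 2
f_manual = r**2 / (1.0 - r**2) * dof

# Step 2b: f_regression returns the F scores together with the p-values
# derived from the F distribution.
f_sklearn, p_values = f_regression(X, y)

for i in range(X.shape[1]):
    print(f"feature {i}: r={r[i]:+.3f}  F={f_sklearn[i]:.2f}  p={p_values[i]:.3g}")
```

A feature with a small p-value is unlikely to be linearly unrelated to the target, so features are typically kept when their F score is high (or, equivalently, their p-value is low).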

Mutual information originates from the field of information theory. The idea is to apply information gain (typically used in the construction of decision trees) to feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty about one variable given a known value of the other variable.
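
A short sketch of mutual information as a feature score, assuming scikit-learn's `mutual_info_regression` (a nearest-neighbor based estimator; the article does not say which estimator it uses, so this choice is an assumption). The quadratic synthetic target is also an assumption, picked because it is a dependency that Pearson's correlation misses.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(7)
n_samples = 1000

# Synthetic data (an assumption for illustration): the target depends on
# feature 0 through a symmetric, non-linear function; features 1 and 2
# are unrelated to the target.
X = rng.uniform(-2.0, 2.0, size=(n_samples, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n_samples)

# Estimated mutual information between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)

# Pearson correlation for comparison.
pearson = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])

for i in range(X.shape[1]):
    print(f"feature {i}: MI={mi[i]:.3f}  Pearson r={pearson[i]:+.3f}")
```

Knowing feature 0 greatly reduces the uncertainty about the target, so its mutual information is clearly the largest, even though its Pearson correlation is close to zero because the relationship is symmetric around zero.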

3. The dataset

We
