Feature selection for regression
1. Introduction
What is feature selection?
Feature selection is the process of selecting the subset of the available input variables that is most relevant to the target variable. The target variable here is the variable that we wish to predict.
For this article we assume that we only have numerical input variables and a numerical target, that is, a regression predictive modeling problem. Under that assumption, we can estimate the relationship between each input variable and the target variable directly, for example by calculating a metric such as the correlation between the two.
2. The main numerical feature selection methods
The two best-known feature selection techniques for numerical input data and a numerical target variable are the following:
- Correlation (Pearson, Spearman)
- Mutual information (MI, normalized MI)
Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which detects linear relationships between numerical variables; its significance tests assume that each variable is roughly Gaussian.
Correlation-based feature selection is done in two steps:
- First, the correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y)).
- Second, it is converted to an F score and then to a p-value.
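The two steps above can be sketched with NumPy and SciPy as follows. This is a minimal illustration rather than the exact routine of any particular library: the function name `correlation_f_test` and the synthetic data are assumptions made for this example, and the F score uses the standard relation F = r^2 / (1 - r^2) * (n - 2) for a single regressor with an intercept.

```python
import numpy as np
from scipy import stats


def correlation_f_test(X, y):
    """Return (r, F score, p-value) for each column of X against y.

    Hypothetical helper written for this example.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]

    # Step 1: Pearson correlation between each regressor and the target.
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    covariance = (X_centered * y_centered[:, None]).mean(axis=0)
    r = covariance / (X.std(axis=0) * y.std())

    # Step 2: convert the correlation to an F score, then to a p-value.
    dof = n_samples - 2  # one regressor plus an intercept
    f_score = r**2 / (1.0 - r**2) * dof
    p_value = stats.f.sf(f_score, 1, dof)  # upper tail of the F distribution
    return r, f_score, p_value


# Synthetic data (an assumption for illustration): the target depends
# strongly on column 0, weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

r, f_score, p_value = correlation_f_test(X, y)
for i in range(X.shape[1]):
    print(f"feature {i}: r={r[i]:.3f}  F={f_score[i]:.2f}  p={p_value[i]:.3g}")
```

Features can then be ranked by F score, or kept only when the p-value falls below a chosen threshold.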
Mutual information originates from the field of information theory. The idea is to apply information gain (typically used in the construction of decision trees) to feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty about one variable given a known value of the other.
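To make the idea of "reduction in uncertainty" concrete, here is a rough sketch that estimates mutual information by binning both variables into a 2D histogram. The function name `binned_mutual_information`, the bin count, and the synthetic data are assumptions for illustration; library implementations typically use more refined estimators (for example, nearest-neighbor based ones), so the values will not match them exactly.

```python
import numpy as np


def binned_mutual_information(x, y, bins=10):
    """Estimate I(X; Y) in nats from a 2D histogram of x and y.

    Hypothetical helper: a simple plug-in estimator, which is biased
    upward for small samples or a large number of bins.
    """
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()   # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    independent = p_x @ p_y                    # what p_xy would be if independent
    nonzero = p_xy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / independent[nonzero])))


# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y_dependent = x**2 + 0.05 * rng.normal(size=2000)   # non-linear dependence on x
y_independent = rng.normal(size=2000)               # unrelated to x

pearson_dependent = np.corrcoef(x, y_dependent)[0, 1]
mi_dependent = binned_mutual_information(x, y_dependent)
mi_independent = binned_mutual_information(x, y_independent)

print(f"Pearson r (x, x^2 + noise): {pearson_dependent:.3f}")
print(f"MI (x, x^2 + noise):        {mi_dependent:.3f} nats")
print(f"MI (x, unrelated noise):    {mi_independent:.3f} nats")
```

The quadratic relationship is a case where Pearson's correlation stays near zero because the dependence is not linear, while the mutual information estimate is clearly larger than for the unrelated variable.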
3. The dataset
We