A Second Step into Feature Engineering: Feature Selection

We are ready to start with the second part of Feature Engineering (if you’ve missed the previous article, you can find it here). In this short article we’ll go through a few simple techniques in Feature Selection and Extraction.

Not all features are created equal

Zhe Chen

Feature Selection

There will always be some features that are less important with respect to a specific problem. Those irrelevant features need to be removed. Feature selection addresses this by automatically selecting the subset of features that is most useful to the problem.

Most of the time, reducing the number of input variables shrinks the computational cost of modeling, and sometimes it also improves the performance of the model.

Among the large number of feature selection methods, we’ll focus mainly on statistical ones. They evaluate the relationship between each input variable and the target variable using statistical measures. These methods are usually fast and effective; the only caveat is that the appropriate statistical measure depends on the data type of both the input and output variables.

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets.

Whenever you want to go for a simple approach, there’s always a threshold involved. VarianceThreshold is a simple baseline approach to feature selection: it removes all features whose variance doesn’t reach a certain threshold.

from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
# Load data
iris = datasets.load_iris()
# Create features and target
X = iris.data
y = iris.target
# Conduct variance thresholding
thresholder = VarianceThreshold(threshold=.6)
X_high_variance = thresholder.fit_transform(X)
# View the high-variance features next to the original ones
print(X_high_variance[0:5])
print(X[0:5])

Univariate Feature Selection

Univariate feature selection examines each feature individually to determine the strength of its relationship with the response variable.

There are a few different options for univariate selection: for example, scikit-learn provides SelectKBest, which keeps the k highest-scoring features, and SelectPercentile, which keeps a user-specified percentage of them.

We can perform a chi-squared (𝝌²) test on the samples to retrieve only the two best features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris dataset
X, y = load_iris(return_X_y=True)
print(X.shape)
# retrieve the two best features by chi-squared test
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)

We have different scoring functions for regression and classification; for example, chi2, f_classif and mutual_info_classif are meant for classification targets, while f_regression and mutual_info_regression are meant for regression targets.
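
To illustrate a regression-oriented scoring function with the same SelectKBest API, here is a minimal sketch; the diabetes dataset and k=4 are arbitrary choices made purely for this example.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
# Load a regression dataset (used here purely for illustration)
X, y = load_diabetes(return_X_y=True)
print(X.shape)
# Keep the 4 features with the highest F-statistic w.r.t. the target
X_new = SelectKBest(f_regression, k=4).fit_transform(X, y)
print(X_new.shape)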

Recursive Feature Elimination

Recursive Feature Elimination (RFE), as its name suggests, recursively removes features, builds a model using the remaining attributes and calculates model accuracy. RFE is able to work out the combination of attributes that contributes to predicting the target variable.

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn import datasets
# Load dataset
dataset = datasets.load_iris()
# Support Vector Machine classifier
svm = LinearSVC()
# Create the RFE model for the SVM classifier
# and select the two best attributes
rfe = RFE(svm, n_features_to_select=2)
rfe = rfe.fit(dataset.data, dataset.target)
# Print summaries for the selection of attributes
print(rfe.support_)
print(rfe.ranking_)

Feature Extraction (Bonus)

Feature extraction is very different from feature selection: the former consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning; the latter is a machine learning technique applied to those features.

We’ve decided to show you a standard technique from sklearn.

Loading Features from Dicts

The class DictVectorizer transforms lists of feature-value mappings into vectors.

In particular, it turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators.

While not particularly fast to process, Python’s dict has the advantages of being convenient to use, being sparse (absent features need not be stored) and storing feature names in addition to values.

from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Milano', 'temperature': 33.},
    {'city': 'Torino', 'temperature': 12.},
    {'city': 'Roma', 'temperature': 18.},
]

vec = DictVectorizer()
# The categorical 'city' feature is one-hot encoded, 'temperature' stays numeric
print(vec.fit_transform(measurements).toarray())
print(vec.get_feature_names_out())

DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing (NLP).
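
As a minimal sketch of that idea, the snippet below builds toy per-token feature dicts (the lowercased word, its last two characters and the previous word; an illustrative feature set, not a prescribed one) and vectorizes them:

from sklearn.feature_extraction import DictVectorizer
# Toy per-token features for a sequence-labelling task (e.g. POS tagging)
sentence = ['The', 'cat', 'sat']
token_features = [
    {
        'word': word.lower(),
        'suffix': word[-2:],
        'prev': sentence[i - 1].lower() if i > 0 else '<START>',
    }
    for i, word in enumerate(sentence)
]
vec = DictVectorizer()
X = vec.fit_transform(token_features)  # sparse matrix, one row per token
print(vec.get_feature_names_out())
print(X.toarray())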

Feature Hashing

Named one of the best hacks in machine learning, feature hashing is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. For this topic sklearn’s documentation is exhaustive; you can find it at the link above.
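
Here is a minimal sketch using sklearn’s FeatureHasher on toy word-count dictionaries; n_features=8 is an arbitrary choice, and in practice you would pick a much larger value to limit hash collisions.

from sklearn.feature_extraction import FeatureHasher
# Hash dict features into a fixed-size vector of 8 columns
hasher = FeatureHasher(n_features=8, input_type='dict')
word_counts = [
    {'dog': 1, 'cat': 2, 'elephant': 4},
    {'dog': 2, 'run': 5},
]
X = hasher.transform(word_counts)
print(X.shape)
print(X.toarray())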

Feature Construction

There’s no strict recipe for feature construction; I personally consider it 99% creativity. We’re going to take a look at some use cases in the next lectures, though.
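
To make that concrete, here is a tiny, purely illustrative sketch of constructing new features from existing columns with pandas; the column names are invented for this example.

import pandas as pd
# Hypothetical raw data; the column names are made up for illustration
df = pd.DataFrame({
    'price': [120000., 250000., 90000.],
    'area_m2': [40, 80, 35],
    'listed_on': pd.to_datetime(['2021-01-10', '2021-06-21', '2021-11-02']),
})
# Construct new features by combining or decomposing existing ones
df['price_per_m2'] = df['price'] / df['area_m2']        # ratio feature
df['listing_month'] = df['listed_on'].dt.month          # date component
df['is_summer'] = df['listing_month'].isin([6, 7, 8])   # boolean flag
print(df)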

For now, you should take a look at the Feature Extraction part of this marvellous notebook from Beluga, one of the best Competitions Grandmasters on Kaggle.

In the last two articles we’ve been introducing Feature Engineering as a step subsequent to Feature Processing. As you can see, we’re building a data processing pipeline; indeed, the next step will be finding a way to deal with missing values. Stay tuned for the next article, and don’t forget to take a look at our GitHub page, where you’ll find the code related to this series of articles.

Translated from: https://medium.com/mljcunito/a-second-step-into-feature-engineering-feature-selection-d597619e6e2b
