On Feature Extraction and Feature Engineering

Why is feature engineering important?

  • Better features means flexibility
  • Better features means simpler models
  • Better features means better results

What is feature engineering?

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

It mainly depends on four things:

  • The performance measure used to evaluate the model (RMSE, AUC)
  • The framing of the problem (classification, regression)
  • The model you choose (e.g. SVM)
  • The raw data you have prepared (does it need sampling, normalization, cleaning?)

Feature Engineering Is a Representation Problem

Machine learning applies algorithms to raw data to produce a solution to a problem. For feature engineering, the question is how to make the best of the sample data passed to the machine learning algorithm, that is, how to represent the raw data as well as possible.

you have to turn your inputs into things the algorithm can understand
— Shayne Miel, answer to “What is the intuitive explanation of feature engineering in machine learning?”

Feature Importance: An estimate of the usefulness of a feature

You can objectively estimate the usefulness of a feature.
Features differ in importance; you generally keep the features that matter most to the model and its results, and ignore the rest.
A feature that is highly correlated with the dependent variable is likely to be important.
More complex predictive modeling algorithms perform feature importance estimation and selection internally while constructing the model, for example MARS, Random Forest and Gradient Boosted Machines.
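As an illustration (a minimal sketch assuming scikit-learn and one of its bundled toy datasets), feature importance can be estimated with a simple correlation against the target, or read off a fitted Random Forest:

```python
# A minimal sketch of estimating feature importance (assumes scikit-learn is installed).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Univariate view: absolute correlation of each feature with the target.
correlations = np.abs([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])

# Model-based view: a Random Forest estimates importance internally while fitting.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Print the five features the forest considers most useful.
ranked = sorted(zip(data.feature_names, correlations, forest.feature_importances_),
                key=lambda t: -t[2])
for name, corr, imp in ranked[:5]:
    print(f"{name}: |corr|={corr:.2f}  forest importance={imp:.3f}")
```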

Feature Extraction: The automatic construction of new features from raw data

If raw data is fed directly into training, some of its attributes will be superfluous or redundant.
Feature extraction reduces the dimensionality of the raw data and represents it as a much smaller set of features that can be used for modeling.

Feature extraction is a process of automatically reducing the dimensionality of these types of observations into a much smaller set that can be modelled.

Key points: it is an automatic process, and it tackles the high dimensionality of the data.
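As one concrete (if simplified) illustration, assuming scikit-learn, PCA is such an automatic procedure: it projects high-dimensional observations onto a much smaller set of components that can be modelled:

```python
# A minimal sketch of feature extraction as dimensionality reduction (assumes scikit-learn).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # 64 raw pixel values per image
pca = PCA(n_components=10)            # project onto a much smaller set of components
X_small = pca.fit_transform(X)

print(X.shape, "->", X_small.shape)   # (1797, 64) -> (1797, 10)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
```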

Feature Selection: From many features to a few that are useful

Feature selection addresses these problems by automatically selecting a subset of features that are most useful to the problem.


Feature selection algorithms may use a scoring method to rank and choose features, such as correlation or other feature importance methods.
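For instance (a sketch assuming scikit-learn), a univariate score such as the ANOVA F-value can rank all features against the target and keep only the highest-scoring subset:

```python
# A minimal sketch of feature selection by a scoring method (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score every feature against the target and keep the 5 most useful ones.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)                     # (569, 30) -> (569, 5)
print("kept feature indices:", selector.get_support(indices=True))
```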

Process of Machine Learning


  1. (tasks before here…)
  2. Select Data: Integrate data, de-normalize it into a dataset, collect it together.
  3. Preprocess Data: Format it, clean it, sample it so you can work with it.
  4. Transform Data: Feature engineering happens here.
  5. Model Data: Create models, evaluate them and tune them.
  6. (tasks after here…)

Iterative Process of Feature Engineering

  1. Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal.
  2. Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two.
  3. Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon.
  4. Evaluate models: Estimate model accuracy on unseen data using the chosen features (a minimal sketch of this step follows the list).
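A rough sketch of steps 3 and 4, assuming scikit-learn and a toy dataset: one "view" of the features is prepared by a selection method, and cross-validation estimates how the model performs on data it has not seen:

```python
# A minimal sketch of selecting a feature "view" and evaluating it on unseen data
# (assumes scikit-learn; the dataset and model choices are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One "view": scale, keep the 10 highest-scoring features, then fit a simple model.
view = make_pipeline(StandardScaler(),
                     SelectKBest(f_classif, k=10),
                     LogisticRegression(max_iter=1000))

# Cross-validation estimates accuracy on held-out folds the model never trained on.
scores = cross_val_score(view, X, y, cv=5)
print("mean accuracy:", round(scores.mean(), 3))
```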

General Examples of Feature Engineering

Which features should you use? You can only try and see.

Which of these is best? You cannot know beforehand. You must try them and evaluate the results to see what works for your algorithm and performance measures.

Decompose Categorical Attributes

For example, suppose your data has an attribute Item_Color whose values are Red, Blue and Unknown, and you want to engineer features from it.
One idea is to recode it as Has_Color: 1 if the item has a known color, 0 otherwise.
Another idea is to split it into Is_Red, Is_Blue and Is_Unknown.
The first is simpler; if you want richer features, choose the second.
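A quick sketch of both ideas, assuming pandas and a made-up Item_Color column:

```python
# A minimal sketch of decomposing a categorical attribute (assumes pandas).
import pandas as pd

df = pd.DataFrame({"Item_Color": ["Red", "Blue", "Unknown", "Red"]})

# Idea 1: a single flag -- does the item have a known color at all?
df["Has_Color"] = (df["Item_Color"] != "Unknown").astype(int)

# Idea 2: one boolean column per value (Is_Red, Is_Blue, Is_Unknown).
dummies = pd.get_dummies(df["Item_Color"], prefix="Is").astype(int)
df = pd.concat([df, dummies], axis=1)

print(df)
```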

Decompose a Date-Time

Date-times usually contain a lot of information that models cannot exploit directly, for example an ISO 8601 timestamp (i.e. 2014-09-20T20:45:40Z).
If you suspect the time attribute is related to other attributes, you can decompose it further, for example into day, year and so on.
For instance, if a model depends on the time of day, you could first extract the hour of the day and then bucket it into Morning, Midday, Afternoon, Night and so on (this kind of discretization may work particularly well for decision trees).
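A sketch of such a decomposition, assuming pandas and a hypothetical timestamp column; the cut points for parts of the day are only one possible choice:

```python
# A minimal sketch of decomposing a date-time attribute (assumes pandas).
import pandas as pd

df = pd.DataFrame({"timestamp": ["2014-09-20T20:45:40Z", "2014-09-21T08:05:12Z"]})
ts = pd.to_datetime(df["timestamp"])

# Pull out calendar components the model can use directly.
df["year"] = ts.dt.year
df["day_of_week"] = ts.dt.dayofweek
df["hour"] = ts.dt.hour

# Bucket the hour of day into rough parts of the day (illustrative cut points).
df["part_of_day"] = pd.cut(df["hour"], bins=[0, 6, 12, 18, 24], right=False,
                           labels=["Night", "Morning", "Midday", "Afternoon"])

print(df)
```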

Reframe Numerical Quantities

Most datasets contain numerical fields, and with domain knowledge you can derive new fields from them or decompose existing ones.
For example, suppose the data has an Item_Weight field with the value 6289 (grams). You could represent it as 6.289 or as 6; only the unit or precision changes, just as one metre can be written as 1 m or as 100 cm.
Item_Weight could also be split into two fields: Item_Weight_Kilograms and Item_Weight_Remainder_Grams.
In many cases a number carries domain meaning. For example, if items above 4 kg attract a higher tax rate, then for the 6.289 kg item above you could add a field Item_Above_4kg,
set to 1 if the weight exceeds 4 kg and 0 otherwise.
Similarly, in a shopping setting, a Num_Customer_Purchases field could be decomposed into four fields: Purchases_Summer, Purchases_Fall, Purchases_Winter and Purchases_Spring.
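A sketch of the Item_Weight reframings described above, assuming pandas; the field names follow the text and the weights are in grams:

```python
# A minimal sketch of reframing a numerical quantity (assumes pandas).
import pandas as pd

df = pd.DataFrame({"Item_Weight": [6289, 3150, 12040]})   # weights in grams

# Split the quantity into a coarse part and a remainder.
df["Item_Weight_Kilograms"] = df["Item_Weight"] // 1000
df["Item_Weight_Remainder_Grams"] = df["Item_Weight"] % 1000

# Encode a domain threshold (here: a higher tax rate above 4 kg) as a flag.
df["Item_Above_4kg"] = (df["Item_Weight"] > 4000).astype(int)

print(df)
```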

Concrete Examples of Feature Engineering

A good result from an algorithm or method usually goes hand in hand with a good data representation, and you can learn feature engineering by studying such examples.

We touch on a few examples of interesting and notable post-competition write-ups that focus on feature engineering.

Predicting Student Test Performance in KDD Cup 2010

KDD Cup is an annual machine learning competition. The 2010 competition asked participants to predict student test performance.

Their approach is described in the paper “Feature Engineering and Classifier Ensemble for KDD Cup 2010”. The paper credits feature engineering as a key method in winning.
The paper describes in detail how they engineered features from the data.

Predicting Patient Admittance in the Heritage Health Prize

More Resources on Feature Engineering

Books on Feature Engineering

Much Deeper

See the reference at the end of the article.


Reference: Discover Feature Engineering, How to Engineer Features and How to Get Good at It
