Feature Extraction and Feature Engineering

Contents

Part I: Feature Engineering on Numeric Data

1.1 Counts

1.2 Binarization

1.3 Rounding

1.4 Interactions

1.5 Binning

1.6 Statistical Transformations

1.7 Log Transform

1.8 Box-Cox Transform

Part II: Feature Engineering on Categorical Data

2.1 Transforming Nominal Features

2.2 Transforming Ordinal Features

2.3 Encoding Categorical Features

Part III: Feature Engineering on Temporal Data

3.1 Date-Based Features

3.2 Time-Based Features

Part IV: Feature Engineering on Image Data

4.1 Raw Image and Channel Pixels

4.2 Grayscale Image Pixels

4.3 Binning Image Intensity Distribution

4.4 Image Aggregation Statistics

4.5 Edge Detection

4.6 Object Detection

4.7 Localized Feature Extraction


Data scientists are often said to spend roughly 70% to 80% of their time on data processing, wrangling, and feature engineering when building a Machine Learning model.

Feature extraction and feature engineering are typically used as synonyms for the process of combining domain knowledge, hand-crafted techniques, and mathematical transformations to convert data into features. From here on, we use the term feature engineering to refer to all aspects of extracting or creating new features from data.

What is feature engineering?

“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”

For solving any Machine Learning problem, feature engineering is the key.

Why Feature Engineering?

  • Better representation of data
  • Better performing models
  • Essential for model building and evaluation
  • More flexibility on data types
  • Emphasis on the business and domain

 

Part I: Feature Engineering on Numeric Data

1.1 Counts

1.2 Binarization

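In [6]: # np.array copies the column, so the raw counts stay untouched
   ...: watched = np.array(popsong_df['listen_count'])
   ...: # any song listened to at least once is marked as 1
   ...: watched[watched >= 1] = 1
   ...: popsong_df['watched'] = watched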

Ready-made alternative: scikit-learn’s Binarizer

from sklearn.preprocessing import Binarizer

bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)
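To compare the two approaches above side by side, here is a minimal self-contained sketch. The `toy_df` frame and its `listen_count` values are made up for illustration and are not taken from the original dataset. The sketch assumes scikit-learn is installed and that any count of 1 or more should map to 1.

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Hypothetical toy data standing in for popsong_df (values are invented).
toy_df = pd.DataFrame({'listen_count': [0, 1, 5, 0, 12]})

# Manual approach: work on a copy so the raw counts are preserved.
manual = toy_df['listen_count'].to_numpy().copy()
manual[manual >= 1] = 1

# Binarizer sets values strictly greater than the threshold to 1 and the
# rest to 0. It expects 2-D input of shape (n_samples, n_features), so a
# one-column frame is passed and the result is flattened afterwards.
bn = Binarizer(threshold=0.9)
via_sklearn = bn.fit_transform(toy_df[['listen_count']]).ravel()

print(manual.tolist())
print(via_sklearn.tolist())
```

For integer counts, a threshold of 0.9 behaves the same as the manual `>= 1` rule, since no count lies between 0 and 1.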

1.3 Rounding

When dealing with numeric attributes such as proportions or percentages, we often do not need high precision. In such cases it makes sense to scale these values and round them to integers.

In [8]: items_popularity = pd.read_csv('datasets/item_popularity.csv', encoding='utf-8')
   ...: # scale the proportions to 0-10 and 0-100, then round to integers
   ...: items_popularity['popularity_scale_10'] = np.array(
   ...:     np.round(items_popularity['pop_percent'] * 10), dtype='int')
   ...: items_popularity['popularity_scale_100'] = np.array(
   ...:     np.round(items_popularity['pop_percent'] * 100), dtype='int')
   ...: items_popularity
Out[8]:
    item_id  pop_percent  popularity_scale_10  popularity_scale_100
0  it_01345      0.98324                   10                    98
1  it_03431      0.56123                    6                    56
2  it_04572      0.12098                    1                    12
3  it_98021      0.35476                    4                    35
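The same rounding can be tried without the CSV file. The sketch below rebuilds the four rows shown in the output above inline. The frame name `items` is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# The four rows from the output above, rebuilt inline so no file is needed.
items = pd.DataFrame({
    'item_id': ['it_01345', 'it_03431', 'it_04572', 'it_98021'],
    'pop_percent': [0.98324, 0.56123, 0.12098, 0.35476],
})

# Scale the proportion, round to the nearest whole number, then cast to int.
items['popularity_scale_10'] = np.round(items['pop_percent'] * 10).astype(int)
items['popularity_scale_100'] = np.round(items['pop_percent'] * 100).astype(int)

print(items)
```

Note that `np.round` rounds exact halves to the nearest even integer (for example, 2.5 becomes 2), which can matter when many values sit exactly on a boundary.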

1.4 Interactions

In many real-world datasets and scenarios, it also makes sense to capture the interactions between feature variables as part of the input feature set.

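In [10]: from sklearn.preprocessing import PolynomialFeatures
    ...:
    ...: # atk_def is a two-column numeric input defined earlier (not shown here)
    ...: # output columns for inputs a, b: a, b, a^2, a*b, b^2
    ...: pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
    ...: res = pf.fit_transform(atk_def)
    ...: res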
Out[10]:
array([[    49.,     49.,   2401.,   2401.,   2401.],
       [    62.,     63.,   3844.,   3906.,   3969.],
       [    82.,     83.,   6724.,   6806.,   6889.],
       ...,
       [   110.,     60.,  12100.,   6600.,   3600.],
       [   160.,     60.,  25600.,   9600.,   3600.],
       [   110.,    120.,  12100.,  13200.,  14400.]])

1.5 Binning

Binning, also known as quantization, transforms continuous numeric values into discrete ones. These discrete values can be thought of as bins into which the raw values are grouped. Each bin represents a specific degree of intensity and covers a specific range of values; any raw value in that range falls into that bin. There are several ways of binning data, including fixed-width and adaptive binning.

Fixed-Width Binning

fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
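The histogram only shows the distribution; the binning step itself could look like the sketch below. The ages, the bin width of 10, and the custom edges and labels are all made-up choices for illustration, not values from the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (invented for illustration).
ages = pd.Series([9, 15, 23, 37, 45, 61, 78])

# Option 1: divide by the bin width and floor the result.
# Bin 0 covers ages 0-9, bin 1 covers 10-19, and so on.
age_bin_floor = np.floor(ages / 10.).astype(int)

# Option 2: explicit bin edges with pd.cut.
# right=False makes each bin closed on the left, e.g. [0, 20), [20, 40).
bin_edges = [0, 20, 40, 60, 80]
bin_labels = ['0-19', '20-39', '40-59', '60-79']
age_bin_cut = pd.cut(ages, bins=bin_edges, labels=bin_labels, right=False)

print(age_bin_floor.tolist())
print(age_bin_cut.tolist())
```

The floor-based variant is compact when all bins share one width, while `pd.cut` is more convenient when the ranges are chosen by hand.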

Adaptive Binning

In fixed-width binning, we decide the bin widths and ranges ourselves. However, this can lead to irregular bins, with very different numbers of data points falling into each one: some bins may be densely populated while others are sparse or empty. Adaptive binning is often a safer approach, because it uses the data distribution itself to decide the bin boundaries.
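The difference is easy to see on a skewed sample. The sketch below uses a handful of made-up income values (not the survey data) and compares how many points land in each bin under equal-width and quantile-based binning.

```python
import pandas as pd

# Hypothetical right-skewed incomes (invented for illustration).
incomes = pd.Series([6000, 8000, 10000, 12000, 15000, 20000, 30000, 200000])

# Fixed-width: four equal-width bins spanning the full range.
fixed_counts = pd.cut(incomes, bins=4).value_counts().sort_index()

# Adaptive: four quantile-based bins, so edges follow the data.
quantile_counts = pd.qcut(incomes, q=4).value_counts().sort_index()

print(fixed_counts.tolist())
print(quantile_counts.tolist())
```

With this sample, the single large value stretches the equal-width bins so that almost everything falls into the first one, while the quantile bins each hold the same number of points.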

Let’s take a 4-quantile, or quartile-based, adaptive binning scheme. The following snippet obtains the income values at the quartile boundaries of the distribution: the minimum, the three quartiles, and the maximum.

In [21]: quantile_list = [0, .25, .5, .75, 1.]
    ...: quantiles = fcc_survey_df['Income'].quantile(quantile_list)
    ...: quantiles
Out[21]:
0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0

To visualize the quartiles obtained in this output better, we can plot them in our data distribution using the following code snippet.

In [22]: fig, ax = plt.subplots()
    ...: fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
    ...:
    ...: for quantile in quantiles:
    ...:     qvl = plt.axvline(quantile, color='r')
    ...: ax.legend([qvl], ['Quantiles'], fontsize=10)
    ...:
    ...: ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
    ...: ax.set_xlabel('Developer Income', fontsize=12)
    ...: ax.set_ylabel('Frequency', fontsize=12)

The 4-Quantile values for the income attribute are depicted by red vertical lines. Let’s now use quantile binning to bin each of the developer income values into specific bins using the following code.

In [23]: quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
    ...: fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
    ...:                                                 q=quantile_list)
    ...: fcc_survey_df['Income_quantile_la