Contents
1.6 Statistical Transformations
2.1 Transforming Nominal Features
2.2 Transforming Ordinal Features
2.3 Encoding Categorical Features
3.1 Date-Based Features
3.2 Time-Based Features
4.1 Raw Image and Channel Pixels
4.2 Grayscale Image Pixels
4.3 Binning Image Intensity Distribution
4.4 Image Aggregation Statistics
4.7 Localized Feature Extraction
A data scientist typically spends around 70% to 80% of their time on data processing, wrangling, and feature engineering when building a Machine Learning model.
Feature extraction and feature engineering are usually synonyms: both denote the process of using a combination of domain knowledge, hand-crafted techniques, and mathematical transformations to convert raw data into features. Henceforth we will use the term feature engineering to refer to all aspects of extracting or creating new features from data.
What is feature engineering?
“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”
For solving any Machine Learning problem, feature engineering is the key.
Why Feature Engineering?
- Better representation of data
- Better performing models
- Essential for model building and evaluation
- More flexibility on data types
- Emphasis on the business and domain
Part I: Feature Engineering on Numeric Data
1.1 Counts
1.2 Binarization
In [6]: watched = np.array(popsong_df['listen_count'])
...: watched[watched >= 1] = 1
...: popsong_df['watched'] = watched
The same result with a ready-made utility, scikit-learn’s Binarizer:
from sklearn.preprocessing import Binarizer
bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)
1.3 Rounding
Often when dealing with numeric attributes like proportions or percentages, we do not need high-precision values. Hence it makes sense to round these high-precision percentages off to integers on a coarser scale.
In [8]: items_popularity = pd.read_csv('datasets/item_popularity.csv', encoding='utf-8')
   ...: # rounding off percentages
   ...: items_popularity['popularity_scale_10'] = np.array(np.round(items_popularity['pop_percent'] * 10), dtype='int')
   ...: items_popularity['popularity_scale_100'] = np.array(np.round(items_popularity['pop_percent'] * 100), dtype='int')
   ...: items_popularity
Out[8]:
item_id pop_percent popularity_scale_10 popularity_scale_100
0 it_01345 0.98324 10 98
1 it_03431 0.56123 6 56
2 it_04572 0.12098 1 12
3 it_98021 0.35476 4 35
1.4 Interactions
In many real-world datasets and scenarios, it makes sense to also capture the interactions between feature variables as part of the input feature set.
In [10]: from sklearn.preprocessing import PolynomialFeatures
...:
...: pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
...: res = pf.fit_transform(atk_def)
...: res
Out[10]:
array([[ 49., 49., 2401., 2401., 2401.],
[ 62., 63., 3844., 3906., 3969.],
[ 82., 83., 6724., 6806., 6889.],
...,
[ 110., 60., 12100., 6600., 3600.],
[ 160., 60., 25600., 9600., 3600.],
[ 110., 120., 12100., 13200., 14400.]])
1.5 Binning
Binning, also known as quantization, transforms continuous numeric values into discrete ones. These discrete values can be thought of as bins into which the raw values are grouped. Each bin represents a specific degree of intensity and covers a specific range of values. There are various ways of binning data, including fixed-width and adaptive binning.
Fixed-Width Binning
fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
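The histogram only motivates the choice of bins; the fixed-width assignment itself is typically done by integer division on the bin width, or with `pd.cut` for labelled ranges. A minimal sketch on a few hypothetical ages standing in for `fcc_survey_df['Age']`:

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for fcc_survey_df['Age']
ages = pd.Series([13, 27, 35, 49, 68])

# Fixed-width bins of width 10: floor(age / 10) gives the bin index
age_bin = np.floor(ages / 10).astype(int)

# The same bins expressed as labelled interval ranges with pd.cut
bin_ranges = [0, 10, 20, 30, 40, 50, 60, 70]
age_range = pd.cut(ages, bins=bin_ranges)

print(pd.DataFrame({'Age': ages, 'Age_bin': age_bin, 'Age_range': age_range}))
```

Note that `pd.cut` uses right-closed intervals by default, so an age of 13 falls into `(10, 20]`.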
Adaptive Binning
In fixed-width binning we decided the bin widths and ranges ourselves. However, this can lead to irregular bins that are far from uniform in the number of data points falling into each one. Adaptive binning is a safer and better approach, where we use the data distribution itself to decide what the appropriate bins should be.
Let’s take a 4-Quantile or a quartile based adaptive binning scheme. The following snippet helps us obtain the income values that fall on the four quartiles in the distribution.
In [21]: quantile_list = [0, .25, .5, .75, 1.]
...: quantiles = fcc_survey_df['Income'].quantile(quantile_list)
...: quantiles
Out[21]:
0.00 6000.0
0.25 20000.0
0.50 37000.0
0.75 60000.0
1.00 200000.0
To visualize the quartiles obtained in this output better, we can plot them in our data distribution using the following code snippet.
In [22]: fig, ax = plt.subplots()
...: fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
...:
...: for quantile in quantiles:
...: qvl = plt.axvline(quantile, color='r')
...: ax.legend([qvl], ['Quantiles'], fontsize=10)
...:
...: ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
...: ax.set_xlabel('Developer Income', fontsize=12)
...: ax.set_ylabel('Frequency', fontsize=12)
The 4-Quantile values for the income attribute are depicted by red vertical lines. Let’s now use quantile binning to bin each of the developer income values into specific bins using the following code.
In [23]: quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
    ...: fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
    ...:                                                  q=quantile_list)
    ...: fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
    ...:                                                  q=quantile_list, labels=quantile_labels)
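The same quantile binning can be sketched end-to-end on synthetic data (the income values below are hypothetical, not the FCC survey data):

```python
import pandas as pd

# Hypothetical incomes standing in for fcc_survey_df['Income']
income = pd.Series([6000, 20000, 37000, 60000, 200000, 15000, 45000, 80000])

quantile_list = [0, .25, .5, .75, 1.]
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']

# pd.qcut assigns each value to its quartile bin; passing labels replaces
# the interval notation with the readable names above
income_quartile = pd.qcut(income, q=quantile_list, labels=quantile_labels)
print(pd.DataFrame({'Income': income, 'Income_quartile': income_quartile}))
```

Unlike `pd.cut`, which fixes the bin edges, `pd.qcut` places (roughly) the same number of observations in each bin, which is what makes the binning adaptive to the distribution.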