Contents
1.6 Statistical Transformations
2.1 Transforming Nominal Features
2.2 Transforming Ordinal Features
2.3 Encoding Categorical Features
3.1 Date-Based Features
3.2 Time-Based Features
4.1 Raw Image and Channel Pixels
4.2 Grayscale Image Pixels
4.3 Binning Image Intensity Distribution
4.4 Image Aggregation Statistics
4.7 Localized Feature Extraction
A data scientist typically spends around 70% to 80% of their time on data processing, wrangling, and feature engineering when building a Machine Learning model.
Feature extraction and feature engineering are usually synonyms: both denote the process of using a combination of domain knowledge, hand-crafted techniques, and mathematical transformations to convert raw data into features. Henceforth we will use the term feature engineering to refer to all aspects of extracting or creating new features from data.
What is feature engineering?
“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”
For solving any Machine Learning problem, feature engineering is the key.
Why Feature Engineering?
- Better representation of data
- Better performing models
- Essential for model building and evaluation
- More flexibility on data types
- Emphasis on the business and domain
Part I: Feature Engineering on Numeric Data
1.1 Counts
1.2 Binarization
In [6]: watched = np.array(popsong_df['listen_count'])
...: watched[watched >= 1] = 1
...: popsong_df['watched'] = watched
The same result with a ready-made utility, scikit-learn’s Binarizer:
from sklearn.preprocessing import Binarizer
bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)
1.3 Rounding
Often when dealing with numeric attributes like proportions or percentages, we do not need high-precision values. Hence it makes sense to round these high-precision percentages off to integers on a coarser scale.
In [8]: items_popularity = pd.read_csv('datasets/item_popularity.csv', encoding='utf-8')
   ...: # rounding off percentages
   ...: items_popularity['popularity_scale_10'] = np.array(np.round(items_popularity['pop_percent'] * 10), dtype='int')
   ...: items_popularity['popularity_scale_100'] = np.array(np.round(items_popularity['pop_percent'] * 100), dtype='int')
   ...: items_popularity
Out[8]:
item_id pop_percent popularity_scale_10 popularity_scale_100
0 it_01345 0.98324 10 98
1 it_03431 0.56123 6 56
2 it_04572 0.12098 1 12
3 it_98021 0.35476 4 35
1.4 Interactions
In many real-world datasets and scenarios, it makes sense to also capture the interactions between feature variables as part of the input feature set.
In [10]: from sklearn.preprocessing import PolynomialFeatures
...:
...: pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
...: res = pf.fit_transform(atk_def)
...: res
Out[10]:
array([[ 49., 49., 2401., 2401., 2401.],
[ 62., 63., 3844., 3906., 3969.],
[ 82., 83., 6724., 6806., 6889.],
...,
[ 110., 60., 12100., 6600., 3600.],
[ 160., 60., 25600., 9600., 3600.],
[ 110., 120., 12100., 13200., 14400.]])
1.5 Binning
Binning, also known as quantization, transforms continuous numeric values into discrete ones. These discrete values can be thought of as bins into which the raw values are grouped. Each bin represents a specific degree of intensity and covers a specific range of values. There are various ways of binning data, including fixed-width and adaptive binning.
Fixed-Width Binning
fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
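The histogram only motivates the choice of bins; the fixed-width assignment itself is typically done by integer division on the bin width, or with `pd.cut` for labelled ranges. A minimal sketch on a few hypothetical ages standing in for `fcc_survey_df['Age']`:

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for fcc_survey_df['Age']
ages = pd.Series([13, 27, 35, 49, 68])

# Fixed-width bins of width 10: floor(age / 10) gives the bin index
age_bin = np.floor(ages / 10).astype(int)

# The same bins expressed as labelled interval ranges with pd.cut
bin_ranges = [0, 10, 20, 30, 40, 50, 60, 70]
age_range = pd.cut(ages, bins=bin_ranges)

print(pd.DataFrame({'Age': ages, 'Age_bin': age_bin, 'Age_range': age_range}))
```

Note that `pd.cut` uses right-closed intervals by default, so an age of 13 falls into `(10, 20]`.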
Adaptive Binning
In fixed-width binning we decided the bin widths and ranges ourselves. However, this can lead to irregular bins that are far from uniform in the number of data points falling into each one. Adaptive binning is a safer and better approach, where we use the data distribution itself to decide what the appropriate bins should be.
Let’s take a 4-Quantile or a quartile based adaptive binning scheme. The following snippet helps us obtain the income values that fall on the four quartiles in the distribution.
In [21]: quantile_list = [0, .25, .5, .75, 1.]
...: quantiles = fcc_survey_df['Income'].quantile(quantile_list)
...: quantiles
Out[21]:
0.00 6000.0
0.25 20000.0
0.50 37000.0
0.75 60000.0
1.00 200000.0
To visualize the quartiles obtained in this output better, we can plot them in our data distribution using the following code snippet.
In [22]: fig, ax = plt.subplots()
...: fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
...:
...: for quantile in quantiles:
...: qvl = plt.axvline(quantile, color='r')
...: ax.legend([qvl], ['Quantiles'], fontsize=10)
...:
...: ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
...: ax.set_xlabel('Developer Income', fontsize=12)
...: ax.set_ylabel('Frequency', fontsize=12)
The 4-Quantile values for the income attribute are depicted by red vertical lines. Let’s now use quantile binning to bin each of the developer income values into specific bins using the following code.
In [23]: quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
    ...: fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
    ...:                                                  q=quantile_list)
    ...: fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
    ...:                                                  q=quantile_list, labels=quantile_labels)
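The same quantile binning can be sketched end-to-end on synthetic data (the income values below are hypothetical, not the FCC survey data):

```python
import pandas as pd

# Hypothetical incomes standing in for fcc_survey_df['Income']
income = pd.Series([6000, 20000, 37000, 60000, 200000, 15000, 45000, 80000])

quantile_list = [0, .25, .5, .75, 1.]
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']

# pd.qcut assigns each value to its quartile bin; passing labels replaces
# the interval notation with the readable names above
income_quartile = pd.qcut(income, q=quantile_list, labels=quantile_labels)
print(pd.DataFrame({'Income': income, 'Income_quartile': income_quartile}))
```

Unlike `pd.cut`, which fixes the bin edges, `pd.qcut` places (roughly) the same number of observations in each bin, which is what makes the binning adaptive to the distribution.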