Machine Learning Feature Engineering -- Feature Preprocessing (Part 2)

Continuing from the previous part, this article covers: encoding categorical (discrete) features, discretizing continuous features, imputing missing values, and generating polynomial features.

Contents

1.Encoding categorical features

2.Discretization

2.1 K-bins discretization

2.2 Feature binarization

3.Imputation of missing values

3.1 Univariate vs. Multivariate Imputation

3.2 Univariate feature imputation

3.3 Multivariate feature imputation

3.4 Multiple vs. Single Imputation

3.5 Nearest neighbors imputation

4.Generating polynomial features

4.1 Polynomial features

4.2 Spline transformer

4.3 Appendix: Bezier curves and B-spline curves

Bezier curves

B-spline curves


1.Encoding categorical features

That is, one-hot encoding of categorical (discrete) features.

One option is OrdinalEncoder, which transforms each categorical feature to one new feature of integers (0 to n_categories - 1). Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.
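A minimal sketch of this integer encoding (the categories and feature values below are made-up illustrations):

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy data: two samples, three categorical features
X = [["male", "from US", "uses Safari"],
     ["female", "from Europe", "uses Firefox"]]

enc = OrdinalEncoder()
enc.fit(X)

# Each feature's categories are sorted alphabetically and mapped to 0..n_categories-1,
# e.g. "female" -> 0, "from US" -> 1, "uses Safari" -> 1
print(enc.transform([["female", "from US", "uses Safari"]]))
# -> [[0. 1. 1.]]
```

Note that the resulting integers 0 and 1 carry an artificial ordering, which is exactly the problem described above.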

To address this, use a one-of-K scheme, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.

Note that you also need to decide how to handle category values that appear only in future (test-set) data and were never seen during fitting.
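One common way to handle unseen categories is OneHotEncoder's handle_unknown="ignore" option, which encodes an unknown category as an all-zero row for that feature (the example data below is made up):

```python
from sklearn.preprocessing import OneHotEncoder

X = [["male", "from US"],
     ["female", "from Europe"]]

# handle_unknown="ignore": unseen test-time categories become all zeros
# instead of raising an error
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X)

# "from Asia" was never seen during fit, so both location columns are 0
print(enc.transform([["female", "from Asia"]]).toarray())
# -> [[1. 0. 0. 0.]]
```

Alternatively, handle_unknown="infrequent_if_exist" (in recent scikit-learn versions) maps unseen values to an "infrequent" category.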

2.Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models. 

2.1 K-bins discretization

Discretization is similar to constructing histograms for continuous data. However, histograms focus on counting features which fall into particular bins, whereas discretization focuses on assigning feature values to these bins.

KBinsDiscretizer implements different binning strategies, which can be selected with the strategy parameter. The ‘uniform’ strategy uses constant-width bins. The ‘quantile’ strategy uses the quantiles values to have equally populated bins in each feature. The ‘kmeans’ strategy defines bins based on a k-means clustering procedure performed on each feature independently.
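The three strategies can be compared on a small made-up feature; note how the bin assignments differ between equal-width and equal-population binning:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One continuous feature with a skewed value (10.0)
X = np.array([[-3.0], [-1.0], [0.5], [2.0], [10.0]])

# 'uniform' -> constant-width bins; 'quantile' -> equally populated bins;
# 'kmeans' -> bin edges from 1-D k-means cluster centers
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, est.fit_transform(X).ravel())
```

With encode="ordinal" each value is replaced by its bin index; encode="onehot" (the default) would instead one-hot encode the bin index, which combines discretization with the encoding from section 1.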

2.2 Feature binarization

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that make the assumption that the input data is distributed according to a multi-variate Bernoulli distribution.
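A minimal sketch with scikit-learn's Binarizer (the data and threshold are arbitrary illustrations):

```python
from sklearn.preprocessing import Binarizer

X = [[1.0, -1.0, 2.0],
     [2.0, 0.0, 0.0],
     [0.0, 1.0, -1.0]]

# Values strictly greater than the threshold map to 1, all others to 0
binarizer = Binarizer(threshold=0.5)
print(binarizer.fit_transform(X))
# -> [[1. 0. 1.]
#     [1. 0. 0.]
#     [0. 1. 0.]]
```

Binarizer is stateless (fit does nothing), so it can also be used directly via transform inside a pipeline.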
