Machine Learning Feature Engineering -- Feature Preprocessing (Part 2)

Continuing from the previous part, this article covers: encoding categorical (discrete) features, discretizing continuous features, imputing missing values, and generating polynomial features.

Contents

1.Encoding categorical features

2.Discretization

2.1 K-bins discretization

2.2 Feature binarization

3.Imputation of missing values

3.1 Univariate vs. Multivariate Imputation

3.2 Univariate feature imputation

3.3 Multivariate feature imputation

3.4 Multiple vs. Single Imputation

3.5 Nearest neighbors imputation

4.Generating polynomial features

4.1 Polynomial features

4.2 Spline transformer

4.3 Appendix: Bezier curves and B-spline curves

Bezier curves

B-spline curves


1.Encoding categorical features

That is, one-hot encoding of categorical (discrete) features.

One option is OrdinalEncoder, which transforms each categorical feature to one new feature of integers (0 to n_categories - 1). Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.
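A minimal sketch of this integer encoding (the categories and feature values below are made-up illustrations):

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy data: two samples, three categorical features
X = [["male", "from US", "uses Safari"],
     ["female", "from Europe", "uses Firefox"]]

enc = OrdinalEncoder()
enc.fit(X)

# Each feature's categories are sorted alphabetically and mapped to 0..n_categories-1,
# e.g. "female" -> 0, "from US" -> 1, "uses Safari" -> 1
print(enc.transform([["female", "from US", "uses Safari"]]))
# -> [[0. 1. 1.]]
```

Note that the resulting integers 0 and 1 carry an artificial ordering, which is exactly the problem described above.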

To address this, use a one-of-K scheme, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.

Note that you also need to decide how to handle category values that appear only in future (test-set) data and were never seen during fitting.
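One common way to handle unseen categories is OneHotEncoder's handle_unknown="ignore" option, which encodes an unknown category as an all-zero row for that feature (the example data below is made up):

```python
from sklearn.preprocessing import OneHotEncoder

X = [["male", "from US"],
     ["female", "from Europe"]]

# handle_unknown="ignore": unseen test-time categories become all zeros
# instead of raising an error
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X)

# "from Asia" was never seen during fit, so both location columns are 0
print(enc.transform([["female", "from Asia"]]).toarray())
# -> [[1. 0. 0. 0.]]
```

Alternatively, handle_unknown="infrequent_if_exist" (in recent scikit-learn versions) maps unseen values to an "infrequent" category.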

2.Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models. 

2.1 K-bins discretization

Discretization is similar to constructing histograms for continuous data. However, histograms focus on counting features which fall into particular bins, whereas discretization focuses on assigning feature values to these bins.

KBinsDiscretizer implements different binning strategies, which can be selected with the strategy parameter. The ‘uniform’ strategy uses constant-width bins. The ‘quantile’ strategy uses the quantiles values to have equally populated bins in each feature. The ‘kmeans’ strategy defines bins based on a k-means clustering procedure performed on each feature independently.
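The three strategies can be compared on a small made-up feature; note how the bin assignments differ between equal-width and equal-population binning:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One continuous feature with a skewed value (10.0)
X = np.array([[-3.0], [-1.0], [0.5], [2.0], [10.0]])

# 'uniform' -> constant-width bins; 'quantile' -> equally populated bins;
# 'kmeans' -> bin edges from 1-D k-means cluster centers
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, est.fit_transform(X).ravel())
```

With encode="ordinal" each value is replaced by its bin index; encode="onehot" (the default) would instead one-hot encode the bin index, which combines discretization with the encoding from section 1.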

2.2 Feature binarization

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that make the assumption that the input data is distributed according to a multi-variate Bernoulli distribution.
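A minimal sketch with scikit-learn's Binarizer (the data and threshold are arbitrary illustrations):

```python
from sklearn.preprocessing import Binarizer

X = [[1.0, -1.0, 2.0],
     [2.0, 0.0, 0.0],
     [0.0, 1.0, -1.0]]

# Values strictly greater than the threshold map to 1, all others to 0
binarizer = Binarizer(threshold=0.5)
print(binarizer.fit_transform(X))
# -> [[1. 0. 1.]
#     [1. 0. 0.]
#     [0. 1. 0.]]
```

Binarizer is stateless (fit does nothing), so it can also be used directly via transform inside a pipeline.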
