Feature Engineering for Machine Learning -- Feature Preprocessing (Part 1)

I recently went through the common feature preprocessing methods again, mainly based on the official sklearn documentation. These are my key notes, kept for reference; they are still somewhat rough and scattered, and I will tidy them up when I find time.

This is the first part, covering linear transformations, non-linear transformations, and per-sample normalization; each part also discusses whether the technique should be applied at all.

One line to keep in mind up front: "There are no rules of thumb that apply to all applications."

Table of Contents

1. Linear transformations
   • Standardization
   • Scaling features to a range
   • Appendix: Scaling vs Whitening
   • Should I standardize the input variables (column vectors / features)?
   • Should I standardize the target variables (column vectors / labels)?
   • Should I standardize the variables (column vectors) for unsupervised learning?
   • Appendix: Saturation
2. Non-linear transformations
   • Quantile transforms
   • Appendix: Quantile function
   • Power transforms
   • Should I nonlinearly transform the data?
3. Normalization (per-sample)
   • Should I standardize the input cases (row vectors / samples)?
   • Appendix: Compare the effect of different scalers on data with outliers


1. Linear transformations

Standardization

Overview:

Standardization, or mean removal and variance scaling, transforms the data by removing the mean value of each feature to center it, and then scales it by dividing non-constant features by their standard deviation.

Motivation:

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly as expected.

Caveats:

  • Estimators might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
  • Sensitive to outliers: both the mean and the standard deviation are easily skewed by extreme values.
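
A minimal sketch of standardization with scikit-learn's StandardScaler (the toy data here is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])

# Fit on training data only, then reuse the same scaler on test data.
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

print(scaler.mean_)            # per-feature mean that was removed
print(scaler.scale_)           # per-feature standard deviation
print(X_scaled.mean(axis=0))   # ~0 for every feature
print(X_scaled.std(axis=0))    # ~1 for every feature
```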

Scaling features to a range

Overview:

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

  • MinMaxScaler: (X - X.min) / (X.max - X.min), scales each feature to the [0, 1] range.
  • MaxAbsScaler: scales in a way that the training data lies within the range [-1, 1] by dividing by the maximum absolute value in each feature. It is meant for data that is already centered at zero or for sparse data. (See the sketch after this list.)
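
A minimal sketch contrasting the two scalers on the same toy data as above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])

# Maps each feature to [0, 1] using its per-feature min and max.
print(MinMaxScaler().fit_transform(X_train))

# Divides each feature by its maximum absolute value, so the result
# lies in [-1, 1] and zero entries stay zero (sparsity is preserved).
print(MaxAbsScaler().fit_transform(X_train))
```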

Motivation:

  • The motivation for this kind of scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.
  • Bringing features onto a common scale.

Caveats:

  • First, the original sparsity of the features: centering sparse data would destroy the sparseness structure in the data, and is thus rarely a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales. MaxAbsScaler was specifically designed for scaling sparse data and is the recommended way to do this.
  • Second, outliers: both MinMaxScaler and MaxAbsScaler are sensitive to outliers. If your data contains many outliers, scaling using the minimum and maximum (or the mean and variance) of the data is likely to not work very well. In these cases, you can use RobustScaler as a drop-in replacement instead; it uses more robust estimates (the median and the interquartile range) for the center and range of your data, as the sketch below illustrates.
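
A minimal sketch, assuming synthetic data with a few injected outliers, showing why RobustScaler holds up better than MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
X[:5] += 100.0   # inject a handful of extreme outliers

X_minmax = MinMaxScaler().fit_transform(X)
X_robust = RobustScaler().fit_transform(X)

# MinMaxScaler squashes the 95 inliers into a tiny interval near 0,
# because the outliers dominate the per-feature max.
print(X_minmax[5:].std(axis=0))

# RobustScaler centers with the median and scales with the IQR,
# so the inliers keep a usable spread.
print(X_robust[5:].std(axis=0))
```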

Appendix: Scaling vs Whitening

A side note on the difference between whitening and scaling.

Overview:

A whitening transformation or sphering transformation is a linear transformation that transforms a vector of random variables with a known covariance matrix into a set of new variables whose covariance matrix is the identity, meaning that they are uncorrelated and each have variance 1. Unlike the per-feature scalers above, which use only per-feature statistics, whitening uses the full covariance matrix and therefore also decorrelates the features.
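
A minimal sketch of one common way to whiten, assuming scikit-learn's PCA with whiten=True (ZCA whitening is another option not shown here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Correlated 2-D data: a linear mix of independent Gaussians.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0],
                                          [0.0, 1.0]])

X_white = PCA(whiten=True).fit_transform(X)

# After whitening, the empirical covariance is close to the identity:
# the new variables are uncorrelated with (roughly) unit variance.
print(np.cov(X_white, rowvar=False).round(2))
```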
