Sklearn: Feature Preprocessing

Latest recommended article published on 2022-03-28 17:49:43

BigPanl



Category: ML-Arithmetic-Learning  Tags: python, machine learning

Article link: https://blog.csdn.net/weixin_40815637/article/details/109700494


1. Why preprocess features (Preprocessing data)

A classic saying goes: "Garbage in, garbage out." If the input data has not been processed well, training will not produce good results either.

The scikit-learn documentation says:
In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate.
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

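In short, machine learning algorithms generally benefit from standardized data, and when the data set contains outliers, robust scalers or transformers are the more appropriate choice. Preprocessing (or standardization) is the process of applying transformation functions to turn raw feature data into a form better suited to the model.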
Compare the effect of different scalers on data with outliers
This is an example from the scikit-learn documentation that visualizes data containing outliers before and after applying different scalers.
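
As a rough illustration of why robust scalers are suggested when outliers are present, the sketch below scales a small made-up feature column with StandardScaler and RobustScaler. The data values and variable names are assumptions chosen for illustration; they do not come from the scikit-learn example mentioned above.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single-feature data: five ordinary values plus one outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

# StandardScaler uses the mean and standard deviation, both pulled by the outlier.
X_standard = StandardScaler().fit_transform(X)

# RobustScaler uses the median and interquartile range, which the outlier barely moves.
X_robust = RobustScaler().fit_transform(X)

# Spread of the five ordinary values after scaling.
inlier_spread_standard = X_standard[:5].max() - X_standard[:5].min()
inlier_spread_robust = X_robust[:5].max() - X_robust[:5].min()

print("StandardScaler inlier spread:", inlier_spread_standard)
print("RobustScaler inlier spread:", inlier_spread_robust)
```

With StandardScaler the ordinary values end up squeezed into a narrow band because the outlier inflates the standard deviation, while RobustScaler keeps them spread out.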

This post first introduces two scaler-based preprocessing methods; others will be added as I learn them.

2. Scalers

2.1 Normalization: MinMaxScaler

Normalization transforms the raw data so that it is mapped into a specified range [mi, mx] (default [0, 1]).
Min-max normalization formula:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range. This transformation is often used as an alternative to zero mean, unit variance scaling. It is commonly used to scale features to the 0-1 range.


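The transformation is applied to each column independently: max and min are that column's largest and smallest values, X'' is the final result (X_scaled in the formula above), and mx and mi are the bounds of the target range, with defaults mx = 1 and mi = 0.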
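
To make the formula concrete, here is a minimal sketch that applies it by hand with NumPy and compares the result with MinMaxScaler. The names mi, mx, X_manual and X_sklearn are chosen here for illustration, and the data is the same small array used in the example that follows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
mi, mx = 0.0, 1.0  # target range, same as the default feature_range=(0, 1)

# Step 1: rescale each column to [0, 1] using that column's own min and max.
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Step 2: stretch and shift into the target range [mi, mx].
X_manual = X_std * (mx - mi) + mi

# The same transformation through scikit-learn.
X_sklearn = MinMaxScaler(feature_range=(mi, mx)).fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Changing mi and mx (for example to -1 and 1) and passing the same pair as feature_range should keep the two results in agreement.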


>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>