StandardScaler
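Purpose: remove the mean and scale to unit variance. This is done per feature dimension (per column), not per sample.
Not every kind of standardization benefits an estimator.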
“Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).”
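To make the "per feature, not per sample" point concrete, here is a minimal sketch. The toy matrix `X` and its values are made up for illustration and are not part of the original example. After scaling, the column statistics are normalized while the row statistics are not.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, shape [n_samples, n_features] = [3, 2].
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

X_scaled = StandardScaler().fit_transform(X)

# Per-feature statistics (axis=0): each column has mean ~0 and std ~1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# Per-sample statistics (axis=1): rows are NOT normalized.
row_means = X_scaled.mean(axis=1)

print('column means:', col_means)
print('column stds: ', col_stds)
print('row means:   ', row_means)
```

The column means come out at (numerically) zero and the column standard deviations at one, while the row means stay clearly non-zero.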
from sklearn.preprocessing import StandardScaler
import numpy as np


def test_algorithm():
    np.random.seed(123)
    print('use sklearn')
    # Note: data has shape [n_samples, n_features]
    data = np.random.randn(10, 4)
    scaler = StandardScaler()
    scaler.fit(data)
    trans_data = scaler.transform(data)
    print('original data: ')
    print(data)
    print('transformed data: ')
    print(trans_data)
    print('scaler info: scaler.mean_: {}, scaler.var_: {}'.format(scaler.mean_, scaler.var_))
    print('\n')

    print('use numpy by self')
    # Per-feature statistics: axis=0 reduces over the samples
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    var = std * std
    print('mean: {}, std: {}, var: {}'.format(mean, std, var))
    # NumPy broadcasting: the per-feature mean is subtracted from every row
    another_trans_data = data - mean
    # Note: divide by the standard deviation, not the variance
    another_trans_data = another_trans_data / std
    print('another_trans_data: ')
    print(another_trans_data)


if __name__ == '__main__':
    test_algorithm()
Output:
use sklearn
original data:
[[-1.0856306 0.99734545 0.2829785 -1.50629471]
[-0.57860025 1.65143654 -2.42667924 -0.42891263]
[ 1.26593626 -0.8667404 -0.67888615 -0.09470897]
[ 1.49138963 -0.638902 -0.44398196 -0.43435128]
[ 2.20593008 2.18678609 1.0040539 0.3861864 ]
[ 0.73736858 1.49073203 -0.93583387 1.17582904]
[-1.25388067 -0.6377515 0.9071052 -1.4286807 ]
[-0.14006872 -0.8617549 -0.25561937 -2.79858911]
[-1.7715331 -0.69987723 0.92746243 -0.17363568]
[ 0.00284592 0.68822271 -0.87953634 0.28362732]]
transformed data:
[[-0.94511643 0.58665507 0.5223171 -0.93064483]
[-0.53659117 1.16247784 -2.13366794 0.06768082]
[ 0.9495916 -1.05437488 -0.42049501 0.3773612 ]
[ 1.13124423 -0.85379954 -0.19024378 0.06264126]
[ 1.70696485 1.63376764 1.22910949 0.8229693 ]
[ 0.52371324 1.02100318 -0.67235312 1.55466934]
[-1.08067913 -0.85278672 1.13408114 -0.858726 ]
[-0.18325687 -1.04998594 -0.00561227 -2.1281129 ]
[-1.49776284 -0.9074785 1.15403514 0.30422599]
[-0.06810748 0.31452186 -0.61717074 0.72793583]]
scaler info: scaler.mean_: [ 0.08737571 0.33094968 -0.24989369 -0.50195303], scaler.var_: [1.54038781 1.29032409 1.04082479 1.16464894]
use numpy by self
mean: [ 0.08737571 0.33094968 -0.24989369 -0.50195303], std: [1.24112361 1.13592433 1.02020821 1.07918902], var: [1.54038781 1.29032409 1.04082479 1.16464894]
another_trans_data:
[[-0.94511643 0.58665507 0.5223171 -0.93064483]
[-0.53659117 1.16247784 -2.13366794 0.06768082]
[ 0.9495916 -1.05437488 -0.42049501 0.3773612 ]
[ 1.13124423 -0.85379954 -0.19024378 0.06264126]
[ 1.70696485 1.63376764 1.22910949 0.8229693 ]
[ 0.52371324 1.02100318 -0.67235312 1.55466934]
[-1.08067913 -0.85278672 1.13408114 -0.858726 ]
[-0.18325687 -1.04998594 -0.00561227 -2.1281129 ]
[-1.49776284 -0.9074785 1.15403514 0.30422599]
[-0.06810748 0.31452186 -0.61717074 0.72793583]]
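The two transformed arrays above agree digit for digit. That can also be checked programmatically rather than by eye, as in the sketch below. It reuses the same seed and data shape as the example. The comparison against `ddof=1` is an addition for illustration: `StandardScaler` uses the population standard deviation (the `np.std` default, `ddof=0`), so the sample standard deviation gives a different result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

np.random.seed(123)
data = np.random.randn(10, 4)

sk_result = StandardScaler().fit_transform(data)

# Manual version with the population std (ddof=0, the np.std default).
manual_pop = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)

# Manual version with the sample std (ddof=1), for comparison only.
manual_sample = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

matches_pop = np.allclose(sk_result, manual_pop)
matches_sample = np.allclose(sk_result, manual_sample)
print(matches_pop, matches_sample)  # True False
```

With only 10 samples, the `ddof=1` result differs from the `ddof=0` result by a factor of sqrt(9/10), which is why the second comparison fails.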