Min-max normalization
Maps all data into the range [0, 1]. Suited to data whose distribution has clear boundaries; because the mapping depends on the min and max, it is easily distorted by outliers.
Python implementation
import numpy as np

np.random.seed(123)
X = np.random.randint(0, 100, 10)  # 10 random integers in [0, 100)
print(X)
X = (X - np.min(X)) / (np.max(X) - np.min(X))  # map each value into [0, 1]
print(X)
[66 92 98 17 83 57 86 97 96 47]
[0.60493827 0.92592593 1. 0. 0.81481481 0.49382716 0.85185185 0.98765432 0.97530864 0.37037037]
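The outlier sensitivity mentioned above is easy to demonstrate with a small made-up array (not from the original notes): one extreme value dominates the max - min range and squeezes every other value toward 0.

```python
import numpy as np

# A single large outlier dominates (max - min), so after min-max scaling
# the remaining values are all crammed near 0.
X = np.array([1, 2, 3, 4, 1000], dtype=float)
X_scaled = (X - np.min(X)) / (np.max(X) - np.min(X))
print(X_scaled)  # first four values are all below 0.004; the outlier maps to 1
```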
sklearn implementation
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
minMaxScaler = MinMaxScaler()
minMaxScaler.fit(X_train)  # learn per-feature min/max from the training set only
X_train = minMaxScaler.transform(X_train)
X_test = minMaxScaler.transform(X_test)  # reuse the training statistics to avoid leakage
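As a quick sanity check (a sketch, not part of the original notes): after fitting on the training split, every feature of the transformed training set lies in [0, 1], while the test set can fall slightly outside that range because the scaler never saw it.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

scaler = MinMaxScaler().fit(X_train)  # stores per-feature data_min_ / data_max_
X_train_s = scaler.transform(X_train)

# The training data is mapped exactly into [0, 1].
print(X_train_s.min(), X_train_s.max())
```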
Mean-variance normalization (standardization)
Rescales all data to a distribution with mean 0 and variance 1. Suited to data whose distribution has no clear boundaries.
Python implementation
import numpy as np

np.random.seed(123)
X = np.random.randint(0, 100, 10)  # 10 random integers in [0, 100)
print(X)
X = (X - np.mean(X)) / np.std(X)  # shift to zero mean, scale to unit variance
print(X)
[66 92 98 17 83 57 86 97 96 47]
[-0.31060745 0.71164492 0.94754932 -2.23716001 0.35778833 -0.66446405 0.47574053 0.90823192 0.86891452 -1.05763804]
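As a quick check (not in the original notes), the transformed array really does have mean ≈ 0 and standard deviation ≈ 1, up to floating-point error:

```python
import numpy as np

np.random.seed(123)
X = np.random.randint(0, 100, 10)
X = (X - np.mean(X)) / np.std(X)

# After standardization: sample mean ~ 0, sample std ~ 1 (floating-point error aside).
print(X.mean(), X.std())
```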
sklearn implementation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
standardScaler = StandardScaler()
standardScaler.fit(X_train)  # learn per-feature mean/std from the training set only
X_train = standardScaler.transform(X_train)
X_test = standardScaler.transform(X_test)  # reuse the training statistics to avoid leakage
print(X_train)
print(X_test)
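The reason both examples fit the scaler on the training split only is to avoid leaking test-set statistics. A Pipeline (a sketch; KNeighborsClassifier is used here purely as an illustrative downstream model, it is not in the original notes) bundles the scaler and the model so this discipline is enforced automatically:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# fit() learns the scaler statistics and trains the model on the scaled
# training data; score() reuses those same statistics on the test data.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```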