Some features differ greatly in scale because their units differ; the features with large values can then dominate the computation.
To remove this effect, the data needs to be normalized.
Normalization: map all data onto the same scale.
Min-max normalization: map all values into the range [0, 1], via (x - x_min) / (x_max - x_min).
It suits features with clear boundaries; extreme outliers will distort the result.
import numpy as np

x = np.random.randint(0, 100, size=100)   # 100 random integers in [0, 100)
(x - x.min()) / (x.max() - x.min())       # min-max scaling into [0, 1]
# Min-max scaling for a matrix with two feature columns
x = np.random.randint(0, 100, (50, 2))
x = np.array(x, dtype=float)              # cast to float so the scaled values can be stored
x[:, 0] = (x[:, 0] - x[:, 0].min()) / (x[:, 0].max() - x[:, 0].min())
x[:, 1] = (x[:, 1] - x[:, 1].min()) / (x[:, 1].max() - x[:, 1].min())
# With more columns, use a loop (see the sketch below)
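A minimal sketch of that loop, assuming a hypothetical matrix X with several columns:
X = np.random.randint(0, 100, (50, 5)).astype(float)   # hypothetical 5-column example
for col in range(X.shape[1]):
    col_min, col_max = X[:, col].min(), X[:, col].max()
    X[:, col] = (X[:, col] - col_min) / (col_max - col_min)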
Mean-variance normalization (standardization)
Map all data onto a distribution with mean 0 and variance 1, via (x - mean) / std.
It suits features without clear boundaries where outliers may occur (this is the form used most often).
x = np.random.randint(0, 100, (50, 2))
x = np.array(x, dtype=float)
x[:, 0] = (x[:, 0] - np.mean(x[:, 0])) / np.sqrt(np.var(x[:, 0]))   # sqrt(var) is the std
x[:, 1] = (x[:, 1] - np.mean(x[:, 1])) / np.std(x[:, 1])
# Each column of x now has mean 0 and variance 1
When normalizing the test set, use the mean and variance of the training set, not the test set's own statistics: the model was fitted to data scaled with the training statistics, and new data should be treated the same way.
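A minimal sketch of this, assuming x_train and x_test have already been split out:
mean_train = x_train.mean(axis=0)                  # statistics come from the training set only
std_train = x_train.std(axis=0)
x_train_scaled = (x_train - mean_train) / std_train
x_test_scaled = (x_test - mean_train) / std_train  # the test set reuses the training statistics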
Normalization with sklearn
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
# sklearn's preprocessing module handles normalization
from sklearn.preprocessing import StandardScaler
stand = StandardScaler()
stand.fit(x_train)                    # learns the training set's mean and std (stand.mean_, stand.scale_)
x_train = stand.transform(x_train)
x_test = stand.transform(x_test)      # the test set is scaled with the training statistics
# The data is now scaled and ready to use
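A quick sanity check (the exact numbers depend on the split):
print(stand.mean_)                                 # per-column means learned from x_train
print(stand.scale_)                                # per-column standard deviations
print(x_train.mean(axis=0), x_train.std(axis=0))   # roughly 0 and 1 after scaling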
import numpy as np

class StandardScal:
    # A hand-written version of sklearn's StandardScaler
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        assert X.ndim == 2, 'X must be a 2-D array'
        self.mean_ = [np.mean(X[:, i]) for i in range(X.shape[1])]
        self.scale_ = [np.std(X[:, i]) for i in range(X.shape[1])]
        return self

    def transform(self, X):
        assert X.ndim == 2, 'X must be a 2-D array'
        assert self.mean_ is not None and self.scale_ is not None, 'fit must be called before transform'
        assert X.shape[1] == len(self.mean_), 'X must have the same number of columns as the fitted data'
        res = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            res[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return res
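Usage mirrors the sklearn API; a minimal sketch, assuming x_train and x_test are the raw (unscaled) arrays from the split above:
my_stand = StandardScal()
my_stand.fit(x_train)                              # learns mean_ and scale_ from the training set
x_train_std = my_stand.transform(x_train)
x_test_std = my_stand.transform(x_test)            # again scaled with the training statistics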