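Data Normalization
Solution: map all data onto the same scale.
Min-max normalization: map all data into the range 0 to 1.
Xscale = (X - Xmin) / (Xmax - Xmin)
Suitable when the data has clear bounds (for example, student scores range from 0 to 100); strongly affected by outliers.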
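To make the outlier sensitivity concrete, here is a minimal sketch. The helper `min_max_scale` and the two sample arrays are hypothetical, introduced only for illustration; they are not part of the original walkthrough.

```python
import numpy as np

def min_max_scale(values):
    """Map a 1-D array into [0, 1] using its own minimum and maximum."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are identical, so the formula would divide by zero.
        return np.zeros_like(values)
    return (values - values.min()) / span

# Hypothetical sample data: bounded scores, then the same data with one extreme value.
scores = np.array([50, 60, 70, 80, 90])
with_outlier = np.array([50, 60, 70, 80, 1000])

scaled_scores = min_max_scale(scores)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_scores)   # evenly spread over [0, 1]
print(scaled_outlier)  # the four ordinary values are squeezed close to 0
```

With a single extreme value, the ordinary points end up crowded near 0, which is why min-max normalization works best when the data has natural bounds.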
### Data normalization
import numpy as np
import matplotlib.pyplot as plt
### Min-max normalization
x = np.random.randint(0,100,size=100)
x
Output:
array([ 8, 57, 61, 4, 46, 86, 41, 68, 51, 0, 49, 91, 15, 74, 96, 50, 70,
25, 24, 93, 1, 73, 23, 92, 92, 71, 88, 3, 81, 83, 10, 83, 31, 38,
1, 80, 43, 10, 45, 28, 49, 15, 98, 0, 8, 1, 26, 57, 42, 43, 81,
81, 97, 21, 54, 61, 30, 87, 69, 30, 55, 82, 52, 67, 33, 14, 61, 89,
87, 40, 51, 7, 26, 87, 26, 36, 20, 29, 84, 98, 17, 50, 75, 11, 61,
70, 24, 91, 30, 3, 47, 9, 29, 80, 88, 18, 22, 97, 33, 13])
(x - np.min(x))/(np.max(x) - np.min(x))
Output:
array([0.08163265, 0.58163265, 0.62244898, 0.04081633, 0.46938776,
0.87755102, 0.41836735, 0.69387755, 0.52040816, 0. ,
0.5 , 0.92857143, 0.15306122, 0.75510204, 0.97959184,
0.51020408, 0.71428571, 0.25510204, 0.24489796, 0.94897959,
0.01020408, 0.74489796, 0.23469388, 0.93877551, 0.93877551,
0.7244898 , 0.89795918, 0.03061224, 0.82653061, 0.84693878,
0.10204082, 0.84693878, 0.31632653, 0.3877551 , 0.01020408,
0.81632653, 0.43877551, 0.10204082, 0.45918367, 0.28571429,
0.5 , 0.15306122, 1. , 0. , 0.08163265,
0.01020408, 0.26530612, 0.58163265, 0.42857143, 0.43877551,
0.82653061, 0.82653061, 0.98979592, 0.21428571, 0.55102041,
0.62244898, 0.30612245, 0.8877551 , 0.70408163, 0.30612245,
0.56122449, 0.83673469, 0.53061224, 0.68367347, 0.33673469,
0.14285714, 0.62244898, 0.90816327, 0.8877551 , 0.40816327,
0.52040816, 0.07142857, 0.26530612, 0.8877551 , 0.26530612,
0.36734694, 0.20408163, 0.29591837, 0.85714286, 1. ,
0.17346939, 0.51020408, 0.76530612, 0.1122449 , 0.62244898,
0.71428571, 0.24489796, 0.92857143, 0.30612245, 0.03061224,
0.47959184, 0.09183673, 0.29591837, 0.81632653, 0.89795918,
0.18367347, 0.2244898 , 0.98979592, 0.33673469, 0.13265306])
X = np.random.randint(0,100,(50,2))
X[:10,:]
Output: array([[48, 55],
[44, 98],
[51, 35],
[57, 21],
[61, 43],
[46, 54],
[99, 65],
[96, 24],
[82, 57],
[86, 94]])
X = np.array(X,dtype=float)
X[:10,:]
Output: array([[48., 55.],
[44., 98.],
[51., 35.],
[57., 21.],
[61., 43.],
[46., 54.],
[99., 65.],
[96., 24.],
[82., 57.],
[86., 94.]])
X[:,0] = (X[:,0] - np.min(X[:,0])) / (np.max(X[:,0]) - np.min(X[:,0]))
X[:,1] = (X[:,1] - np.min(X[:,1])) / (np.max(X[:,1]) - np.min(X[:,1]))
X[:10,:]
Output: array([[0.46875 , 0.55555556],
[0.42708333, 0.98989899],
[0.5 , 0.35353535],
[0.5625 , 0.21212121],
[0.60416667, 0.43434343],
[0.44791667, 0.54545455],
[1. , 0.65656566],
[0.96875 , 0.24242424],
[0.82291667, 0.57575758],
[0.86458333, 0.94949495]])
plt.scatter(X[:,0],X[:,1])
plt.show()
Resulting plot:
np.mean(X[:,0])  # mean of column 0 of X
Output: 0.5339583333333333
np.std(X[:,0])  # standard deviation of column 0 of X
Output: 0.29990457104904406
np.mean(X[:,1])  # mean of column 1 of X
Output: 0.5018181818181818
np.std(X[:,1])  # standard deviation of column 1 of X
Output: 0.2829035974757983
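The per-column min-max scaling shown earlier can also be written as a single broadcast expression, which avoids repeating one line per column. This is a sketch under assumptions: the names `X_demo`, `col_min`, `col_max`, and `X_scaled` and the seed are hypothetical, and the code assumes no column is constant (otherwise the denominator would be zero).

```python
import numpy as np

# Arbitrary seed so the sketch is reproducible; not from the original walkthrough.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(50, 2)).astype(float)

# axis=0 reduces over rows, giving one minimum and one maximum per column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Broadcasting applies each column's min and range to every row at once.
X_scaled = (X_demo - col_min) / (col_max - col_min)

print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

After scaling, every column has minimum 0 and maximum 1, however many columns the matrix has.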
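Mean-variance normalization (standardization): rescale all data to a distribution with mean 0 and variance 1.
Suitable when the data distribution has no clear bounds and extreme values may be present.
Xscale = (X - Xmean) / S, where S is the standard deviation of X.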
### Mean-variance normalization (standardization)
X2 = np.random.randint(0,100,(50,2))
X2 = np.array(X2,dtype=float)
X2[:,0] = (X2[:,0] - np.mean(X2[:,0])) / np.std(X2[:,0])
X2[:,1] = (X2[:,1] - np.mean(X2[:,1])) / np.std(X2[:,1])
plt.scatter(X2[:,0],X2[:,1])
plt.show()
The resulting scatter plot:
np.mean(X2[:,0])
Output: -8.43769498715119e-17
np.std(X2[:,0])
Output: 1.0
np.mean(X2[:,1])
Output: -3.9412917374193055e-17
np.std(X2[:,1])
Output: 1.0
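The means above come out as tiny values near 1e-17 rather than exactly 0 because of floating-point rounding. The same standardization can be sketched for all columns at once. The names `X2_demo`, `col_mean`, `col_std`, and `X2_std` and the seed are hypothetical, and the code assumes no column has zero standard deviation.

```python
import numpy as np

# Arbitrary seed so the sketch is reproducible; not from the original walkthrough.
rng = np.random.default_rng(1)
X2_demo = rng.integers(0, 100, size=(50, 2)).astype(float)

# One mean and one standard deviation per column.
col_mean = X2_demo.mean(axis=0)
col_std = X2_demo.std(axis=0)

# Subtract each column's mean and divide by its standard deviation.
X2_std = (X2_demo - col_mean) / col_std

# Compare with a tolerance instead of testing for exact equality with 0 and 1.
print(np.allclose(X2_std.mean(axis=0), 0.0))
print(np.allclose(X2_std.std(axis=0), 1.0))
```

Unlike min-max normalization, the standardized values are not confined to a fixed interval; only their mean and standard deviation are fixed.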