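- Standardization and normalization are both data preprocessing methods. Because the two terms are translated inconsistently into Chinese, they are often confused, so this post spells out the difference.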
1. Four feature scaling methods
- According to Wikipedia, standardization and normalization are two of the four feature scaling methods.
1.1 Rescaling (min-max normalization)
- Rescaling (min-max normalization), sometimes shortened to normalization, is what "normalization" (归一化) usually refers to.
$$x' = \frac{x-\min(x)}{\max(x)-\min(x)}$$
- This maps every value into the range $[0,1]$, removes the effect of units, and preserves the relative distances between samples along each feature.
- A generalization maps the data into an arbitrary range $[a,b]$:
$$x' = a+(b-a)\frac{x-\min(x)}{\max(x)-\min(x)}$$
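The rescaling formula can be sketched in NumPy as follows. This is a minimal illustration, not code from the original post: the function name `min_max_rescale` is hypothetical, features are assumed to be columns, and mapping a constant column to $a$ is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def min_max_rescale(x, a=0.0, b=1.0):
    """Linearly rescale each column of x so that its values span [a, b]."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    span = x.max(axis=0) - x_min
    # Assumption: a constant column (span == 0) is mapped to a.
    safe_span = np.where(span == 0, 1.0, span)
    return a + (b - a) * (x - x_min) / safe_span

# Two features on very different scales
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [4.0, 800.0]])

scaled_01 = min_max_rescale(data)           # default range [0, 1]
scaled_pm1 = min_max_rescale(data, -1, 1)   # custom range [-1, 1]
print(scaled_01)
print(scaled_pm1)
```

With $a=0$ and $b=1$ the function reduces to plain min-max normalization, so one helper covers both formulas.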
1.2 Mean normalization
- Mean normalization, often translated into Chinese as 均值归一化.
$$x' = \frac{x-\text{mean}(x)}{\max(x)-\min(x)}$$
- This centers every feature at 0, removes the effect of units, and preserves the relative distances between samples along each feature.
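A minimal NumPy sketch of mean normalization, assuming features are stored as columns and no column is constant. The helper name `mean_normalize` is hypothetical and is not taken from the original post.

```python
import numpy as np

def mean_normalize(x):
    """Subtract each column's mean and divide by that column's range."""
    x = np.asarray(x, dtype=float)
    span = x.max(axis=0) - x.min(axis=0)  # assumed non-zero for every column
    return (x - x.mean(axis=0)) / span

data = np.array([[1.0, 10.0],
                 [2.0, 30.0],
                 [6.0, 50.0]])

normalized = mean_normalize(data)
print(normalized)
print(normalized.mean(axis=0))                           # column means
print(normalized.max(axis=0) - normalized.min(axis=0))   # column ranges
```

After the transform each column has mean 0 and a total range of 1, so the values always lie inside $[-1,1]$.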
1.3 Standardization (Z-score normalization)
- Standardization (Z-score normalization) is what "standardization" (标准化) usually refers to.
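$$x' = \frac{x-\text{mean}(x)}{\sigma(x)}$$
- Here $\sigma(x)$ is the standard deviation. This rescales every feature to mean 0 and variance 1. Standardizing a normal distribution yields the standard normal distribution, but that does not mean only normally distributed data can be standardized, nor that the result is always standard normal. Any distribution with a finite, non-zero variance can be standardized. The values change, but the type of distribution does not, because the operation is only a shift followed by a rescale.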
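Note: for a multivariate normal distribution, the standardized distribution is a perfect circle/sphere (isotropic) only when the features are mutually independent. If the features are correlated, the standardized distribution is still not a circle/sphere.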
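Both points can be checked numerically. The sketch below is an illustration rather than code from the original post: the helper names `standardize` and `skewness` are hypothetical, and the exponential sample is simply one convenient non-normal distribution. It shows that standardizing leaves the skewness unchanged (the shape is preserved) and that the correlation between two features survives standardization.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(x):
    """Shift each column to mean 0 and scale it to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def skewness(v):
    """Sample skewness: the third moment of the z-scores."""
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 3)

# 1) A skewed, non-normal sample stays skewed after standardization.
raw = rng.exponential(scale=3.0, size=100_000)
z = standardize(raw)
print(z.mean(), z.std())            # close to 0 and 1
print(skewness(raw), skewness(z))   # same value before and after

# 2) Correlated features stay correlated after standardization.
cov = np.array([[0.6, 0.2], [0.2, 0.1]])
pts = rng.multivariate_normal([-1, 2], cov, size=100_000)
corr_before = np.corrcoef(pts, rowvar=False)[0, 1]
corr_after = np.corrcoef(standardize(pts), rowvar=False)[0, 1]
print(corr_before, corr_after)      # identical up to rounding
```

Because the correlation is unchanged, the standardized cloud of correlated points is an ellipse stretched along a diagonal rather than a circle.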
1.4 Scaling to unit length
- Scaling to unit length, often translated into Chinese as 单位化.
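$$x' = \frac{x}{\|x\|}$$
- Here $x$ is the feature vector of a single sample and $\|x\|$ is its norm. This projects every sample onto the unit hypersphere centered at the origin.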
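Unlike the three methods above, this one works per sample (row) rather than per feature (column). A minimal NumPy sketch, assuming the Euclidean norm; the name `scale_to_unit_length` is hypothetical, and leaving all-zero rows unchanged is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def scale_to_unit_length(x):
    """Divide every row (sample) of x by its Euclidean norm."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Assumption: an all-zero row has no direction, so it is left as zeros.
    safe_norms = np.where(norms == 0, 1.0, norms)
    return x / safe_norms

data = np.array([[3.0, 4.0],
                 [0.0, 2.0],
                 [-1.0, -1.0]])

unit = scale_to_unit_length(data)
print(unit)
print(np.linalg.norm(unit, axis=1))  # every row now has length 1
```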
2. Examples
2.1 A 2D Gaussian with correlated dimensions
- Generate a 2D normal distribution with mean $\pmb{\mu} = \begin{bmatrix}-1\\2 \end{bmatrix}$ and covariance matrix $\pmb{B} = \begin{bmatrix}0.6 &0.2\\0.2 &0.1 \end{bmatrix}$.
- Plot the distribution as follows. Note that the two dimensions are not independent.
```python
# %matplotlib notebook  # Jupyter-only magic; uncomment when running in a notebook
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 5))

# Mean vector and covariance matrix of the 2D Gaussian
mu = np.array([-1, 2])
sigma = np.array([[0.6, 0.2], [0.2, 0.1]])

# Draw 10000 samples; each row is one sample, each column one dimension
points = np.random.multivariate_normal(mu, sigma, 10000)

a0 = fig.add_subplot(1, 1, 1, label='a0')
a0.scatter(points[:, 0], points[:, 1], s=1, alpha=0.5)
a0.grid(which='major', alpha=0.5)
a0.grid(which='minor', alpha=0.5)
plt.show()
```
- Apply the four feature scaling methods and visualize the results:
```python
# %matplotlib notebook  # Jupyter-only magic; uncomment when running in a notebook
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# The first three functions rescale the array in place, one feature (column) at a time.
def MinMaxNormalization(px):
    for i in range(px.shape[1]):
        t = px[:, i]
        tmin, tmax = np.min(t), np.max(t)
        t[:] = (t - tmin) / (tmax - tmin)

def MeanNormalization(px):
    for i in range(px.shape[1]):
        t = px[:, i]
        tmin, tmax, tmean = np.min(t), np.max(t), np.mean(t)
        t[:] = (t - tmean) / (tmax - tmin)

def Standardization(px):
    for i in range(px.shape[1]):
        t = px[:, i]
        tmean, tstd = np.mean(t), np.std(t)
        t[:] = (t - tmean) / tstd

def Scaling2Unit(px):
    # Divide each sample (row) by its Euclidean norm, in place
    norm = np.linalg.norm(px, axis=1)
    for i in range(px.shape[1]):
        t = px[:, i]
        t[:] = t / norm

majorLocator = MultipleLocator(2)  # place major ticks at multiples of 2

# `points` is the sample array generated in the previous block
p1 = points.copy()
p2 = points.copy()
p3 = points.copy()
p4 = points.copy()
MinMaxNormalization(p1)
MeanNormalization(p2)
Standardization(p3)
Scaling2Unit(p4)

fig = plt.figure(figsize=(12, 3))
a1 = fig.add_subplot(1, 4, 1, label='a1')
a2 = fig.add_subplot(1, 4, 2, label='a2')
a3 = fig.add_subplot(1, 4, 3, label='a3')
a4 = fig.add_subplot(1, 4, 4, label='a4')

subplot = {a1: (p1, 'min-max normalization'),
           a2: (p2, 'mean normalization'),
           a3: (p3, 'standardization'),
           a4: (p4, 'scaling to unit')}

for ax, (px, title) in subplot.items():
    ax.scatter(px[:, 0], px[:, 1], s=1, alpha=0.5)
    ax.axis([-4, 4, -4, 4])
    ax.xaxis.set_major_locator(majorLocator)
    ax.yaxis.set_major_locator(majorLocator)
    ax.grid(which='major', alpha=0.5)
    ax.set_title(title)
plt.show()
```
2.2 A 2D Gaussian with independent dimensions
- Finally, consider the case where the two dimensions are independent, with mean $\pmb{\mu} = \begin{bmatrix}-1\\2 \end{bmatrix}$ and covariance matrix $\pmb{B} = \begin{bmatrix}0.6 &0\\0 &0.1 \end{bmatrix}$.
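Because the first three operations all remove the effect of units and the two dimensions are independent, each of them produces a roughly circular distribution.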
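The independent case can also be checked numerically instead of visually. The sketch below is an assumption-laden illustration, not the post's own code: the helper `std_ratio` is hypothetical, and "circular" is approximated by comparing the standard deviations of the two dimensions (a ratio near 1 means similar spread along both axes).

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([-1.0, 2.0])
cov = np.array([[0.6, 0.0], [0.0, 0.1]])   # diagonal: independent dimensions
pts = rng.multivariate_normal(mu, cov, size=10_000)

def std_ratio(p):
    """Ratio of the spread along dimension 0 to the spread along dimension 1."""
    s = p.std(axis=0)
    return s[0] / s[1]

span = pts.max(axis=0) - pts.min(axis=0)
min_max = (pts - pts.min(axis=0)) / span
mean_norm = (pts - pts.mean(axis=0)) / span
standardized = (pts - pts.mean(axis=0)) / pts.std(axis=0)

print("raw          :", std_ratio(pts))          # theoretical value is sqrt(6)
print("min-max      :", std_ratio(min_max))      # close to 1, not exactly 1
print("mean norm    :", std_ratio(mean_norm))    # same ratio as min-max
print("standardized :", std_ratio(standardized)) # 1 by construction

# The two standardized dimensions remain (nearly) uncorrelated
print(np.corrcoef(standardized, rowvar=False)[0, 1])
```

Only standardization equalizes the two spreads exactly. Min-max and mean normalization divide by the sample range, which depends on the extreme values drawn, so their result is circular only approximately.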