Standardization vs. Normalization: Sorting Out the Concepts (with Code)

云端FFF

Last modified 2022-09-06 14:32:15


Category: Machine Learning. Tags: 标准化 (standardization), 归一化 (normalization)

First published 2022-01-29 00:58:56

Article link: https://blog.csdn.net/wxc971231/article/details/122735356


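Standardization and normalization are both data preprocessing methods. Because of how the terms have been translated, they are often mixed up in Chinese-language material, so this post pins down what each one means.

1. Four feature scaling methods

Following Wikipedia, standardization and normalization are both instances of feature scaling, which comes in four common forms.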

1.1 Rescaling (min-max normalization)

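Rescaling (min-max normalization) is sometimes just called normalization; it is what the Chinese term 归一化 usually refers to.
$\frac{x-\min(x)}{\max(x)-\min(x)}$
After this transform every value of a feature lies in $[0, 1]$. The unit of measurement no longer matters, and the relative spacing of the values within each feature is preserved, because the map is only a shift followed by a positive scaling.
A generalization maps the data into any chosen range $[a, b]$:
$a+(b-a)\frac{x-\min(x)}{\max(x)-\min(x)}$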
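
To make the two formulas concrete, here is a minimal NumPy sketch. The helper name `min_max_scale` and the toy data are assumptions made for illustration, and the scaling is applied per feature (column), which matches how the example code later in this post treats the data.

```python
import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    """Linearly rescale each column of x so its minimum maps to a and its maximum to b."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # Assumes every column has max > min; a constant column would divide by zero.
    return a + (b - a) * (x - x_min) / (x_max - x_min)

# Hypothetical toy data: two features measured on very different scales
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [4.0, 1000.0]])

scaled_01 = min_max_scale(data)           # each column now spans [0, 1]
scaled_pm1 = min_max_scale(data, -1, 1)   # each column now spans [-1, 1]
print(scaled_01)
print(scaled_pm1)
```

With the defaults `a=0, b=1` the function reduces to the first formula; any other pair gives the generalized form.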

1.2 Mean normalization

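Mean normalization is usually translated into Chinese as 均值归一化.
$\frac{x-\text{mean} (x)}{\max(x)-\min(x)}$
This shifts each feature so that its mean is 0 and scales it so that its range has width 1. Like min-max rescaling, it removes the effect of units and preserves the relative spacing of the values within each feature.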
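
A small sketch of mean normalization in NumPy follows. The function name `mean_normalize` and the toy data are illustrative assumptions; the two printed quantities check the properties described above (per-feature mean of 0, per-feature range of 1).

```python
import numpy as np

def mean_normalize(x):
    """Subtract the column mean, then divide by the column range (max - min)."""
    x = np.asarray(x, dtype=float)
    # Assumes every column has max > min.
    return (x - x.mean(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Hypothetical toy data
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [4.0, 1000.0]])

normed = mean_normalize(data)
col_means = normed.mean(axis=0)                       # close to 0 for every column
col_ranges = normed.max(axis=0) - normed.min(axis=0)  # 1 for every column
print(col_means, col_ranges)
```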

1.3 Standardization (Z-score normalization)

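Standardization (Z-score normalization) is what the Chinese term 标准化 usually refers to.
$\frac{x-\text{mean}(x)}{\sigma(x)}$
This gives every feature mean 0 and variance 1. Standardizing a normal distribution yields the standard normal distribution, but that does not mean only normally distributed data can be standardized, nor that the result is always standard normal. Any distribution with finite, non-zero variance can be standardized. The parameters of the distribution change, but its type and shape do not, because the transform is only a shift and a scaling.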
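
The point that standardization does not turn data into a normal distribution can be checked numerically. The sketch below standardizes a sample from an exponential distribution (chosen here only as a convenient skewed example, with an arbitrary seed and scale) and compares the skewness before and after. The `skewness` helper is a hand-written assumption, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, chosen arbitrarily
x = rng.exponential(scale=3.0, size=100_000)   # clearly non-normal, right-skewed

# Standardize: zero mean, unit variance (population std, ddof=0)
z = (x - x.mean()) / x.std()

def skewness(v):
    """Third standardized moment; 0 for a symmetric distribution."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

print("mean, std after :", z.mean(), z.std())
print("skewness before :", skewness(x))
print("skewness after  :", skewness(z))
```

The mean and standard deviation become 0 and 1, while the skewness is unchanged, so the standardized sample is still exponential in shape rather than standard normal.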

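Note: for a multivariate normal distribution, per-feature standardization produces a circular / spherical (isotropic) cloud only when the features are mutually independent, i.e. when the covariance matrix is diagonal. If the features are correlated, the correlation survives standardization and the cloud remains an ellipse / ellipsoid.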
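
The note above can be illustrated with a short sketch: per-feature standardization rescales the diagonal of the covariance matrix to 1 but leaves the correlation coefficient untouched. The mean, covariance, seed, and sample size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed, chosen arbitrarily
mean = [-1.0, 2.0]
cov = np.array([[0.6, 0.2],
                [0.2, 0.1]])                 # off-diagonal terms: correlated features
x = rng.multivariate_normal(mean, cov, size=50_000)

# Standardize each feature (column) separately
z = (x - x.mean(axis=0)) / x.std(axis=0)

corr_before = np.corrcoef(x, rowvar=False)[0, 1]
corr_after = np.corrcoef(z, rowvar=False)[0, 1]
cov_after = np.cov(z, rowvar=False, ddof=0)  # diagonal is 1, off-diagonal equals the correlation

print("correlation before:", corr_before)
print("correlation after :", corr_after)
print(cov_after)
```

Because the off-diagonal entry of `cov_after` is not zero, the standardized cloud is still an ellipse tilted along the diagonal rather than a circle.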

1.4 Scaling to unit length

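Scaling to unit length is usually translated into Chinese as 单位化.
$\frac{x}{||x||}$
Here $x$ is the feature vector of a single sample and $||x||$ is its norm (usually the Euclidean norm). Every non-zero sample is mapped onto the unit hypersphere centered at the origin.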
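
Unlike the three previous methods, this one works per sample (row) rather than per feature (column). A minimal sketch with the Euclidean norm, using made-up sample values:

```python
import numpy as np

# Hypothetical samples, one per row
x = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [-2.0, 2.0]])

# Euclidean norm of each row; keepdims=True keeps shape (n, 1) so it broadcasts over columns.
norms = np.linalg.norm(x, axis=1, keepdims=True)
# Assumes no row is the zero vector, which would divide by zero.
unit = x / norms

print(unit)
print(np.linalg.norm(unit, axis=1))   # every row now has length 1
```

The direction of each sample is kept and only its length is discarded, which is why all points land on the unit circle in two dimensions.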

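2. Examples

2.1 A two-dimensional Gaussian with correlated dimensions

Generate a two-dimensional normal distribution with mean $\pmb{\mu} = \begin{bmatrix}-1\\2 \end{bmatrix}$ and covariance matrix $\pmb{B} = \begin{bmatrix}0.6 &0.2\\0.2 &0.1 \end{bmatrix}$.

The distribution is plotted below. Note that the two dimensions are not independent.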

%matplotlib notebook
# The line above is a Jupyter magic; drop it when running as a plain script.
import numpy as np
import matplotlib.pyplot as plt

# Mean vector and covariance matrix of the 2-D Gaussian
mu = np.array([-1, 2])
sigma = np.array([[0.6, 0.2], [0.2, 0.1]])
# Draw 10000 samples; points has shape (10000, 2)
points = np.random.multivariate_normal(mu, sigma, 10000)

fig = plt.figure(figsize=(5, 5))
a0 = fig.add_subplot(1, 1, 1, label='a0')
a0.scatter(points[:, 0], points[:, 1], s=1, alpha=0.5)
a0.grid(which='major', alpha=0.5)

[Figure: scatter plot of the samples drawn from this distribution]

Applying the four feature scaling methods and visualizing the results gives the following.

%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# Each function rescales px (shape: n_samples x n_features) in place.

def MinMaxNormalization(px):
    # Per feature (column): map [min, max] to [0, 1]
    pmin, pmax = px.min(axis=0), px.max(axis=0)
    px[:] = (px - pmin) / (pmax - pmin)

def MeanNormalization(px):
    # Per feature: subtract the mean, divide by the range
    pmin, pmax, pmean = px.min(axis=0), px.max(axis=0), px.mean(axis=0)
    px[:] = (px - pmean) / (pmax - pmin)

def Standardization(px):
    # Per feature: zero mean, unit variance (population std, ddof=0)
    pmean, pstd = px.mean(axis=0), px.std(axis=0)
    px[:] = (px - pmean) / pstd

def Scaling2Unit(px):
    # Per sample (row): divide by the Euclidean norm
    norm = np.linalg.norm(px, axis=1, keepdims=True)
    px[:] = px / norm

majorLocator = MultipleLocator(2)  # major ticks at multiples of 2

# `points` comes from the previous cell; work on copies so it stays unchanged
p1, p2, p3, p4 = (points.copy() for _ in range(4))

MinMaxNormalization(p1)
MeanNormalization(p2)
Standardization(p3)
Scaling2Unit(p4)

fig = plt.figure(figsize=(12, 3))
panels = [(p1, 'min-max normalization'),
          (p2, 'mean normalization'),
          (p3, 'standardization'),
          (p4, 'scaling to unit')]

for i, (px, title) in enumerate(panels, start=1):
    ax = fig.add_subplot(1, 4, i, label=f'a{i}')
    ax.scatter(px[:, 0], px[:, 1], s=1, alpha=0.5)
    ax.axis([-4, 4, -4, 4])   # same axis limits in every panel for comparison
    ax.xaxis.set_major_locator(majorLocator)
    ax.yaxis.set_major_locator(majorLocator)
    ax.grid(which='major', alpha=0.5)
    ax.set_title(title)

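[Figure: four panels showing the samples after min-max normalization, mean normalization, standardization, and scaling to unit]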

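2.2 A two-dimensional Gaussian with independent dimensions

Finally, consider the case where the two dimensions are independent. The mean is $\pmb{\mu} = \begin{bmatrix}-1\\2 \end{bmatrix}$ and the covariance matrix is $\pmb{B} = \begin{bmatrix}0.6 &0\\0 &0.1 \end{bmatrix}$.

Because the first three operations all remove the difference in scale between the two features, each of them yields a roughly circular cloud. Standardization equalizes the two variances exactly. Min-max and mean normalization divide by the sample range, which for Gaussian samples of the same size is approximately proportional to the standard deviation, so their result is only approximately circular.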
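
As a numerical check of this claim, the sketch below compares the ratio of the two per-feature standard deviations before and after each of the first three methods. A ratio near 1 means equal spread along both axes, i.e. a circular cloud when the features are uncorrelated. The seed, sample size, and the helper `std_ratio` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, chosen arbitrarily
mu = np.array([-1.0, 2.0])
cov = np.array([[0.6, 0.0],
                [0.0, 0.1]])                   # diagonal: independent dimensions
pts = rng.multivariate_normal(mu, cov, size=10000)

def std_ratio(p):
    """Ratio of the two per-feature standard deviations; 1 means equal spread."""
    s = p.std(axis=0)
    return s[0] / s[1]

span = pts.max(axis=0) - pts.min(axis=0)       # per-feature range
min_max = (pts - pts.min(axis=0)) / span
mean_norm = (pts - pts.mean(axis=0)) / span
standardized = (pts - pts.mean(axis=0)) / pts.std(axis=0)

print("raw            :", std_ratio(pts))
print("min-max        :", std_ratio(min_max))
print("mean norm.     :", std_ratio(mean_norm))
print("standardization:", std_ratio(standardized))
```

The raw ratio is well above 1 because the first feature has the larger variance. After standardization it equals 1 up to rounding, while for the two range-based methods it is only close to 1 and varies from sample to sample.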