Normalization and Standardization

好运来2333

Category: MachineLearning

Article link: https://blog.csdn.net/qq_33254870/article/details/92089797


Reference: https://en.wikipedia.org/wiki/Feature_scaling
Before discussing normalization and standardization, let's first look at what feature scaling is.
Feature scaling is a method used to normalize the range of independent variables or features of data.
So why is feature scaling needed?

The range of all features should be normalized so that each feature contributes approximately proportionately to the final distance. (In other words, dimensional quantities become dimensionless, and every feature carries a comparable weight in the objective function.)
Gradient descent converges much faster with feature scaling than without it.
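To make the first point concrete, here is a minimal sketch with made-up data (the heights and incomes are hypothetical, and `min_max` and `dist` are helper names introduced only for this illustration). Before scaling, the Euclidean distance is determined almost entirely by the income column; after min-max rescaling, the height column contributes as well.

```python
import numpy as np

# Hypothetical data: column 0 is height in metres, column 1 is income in dollars.
X = np.array([
    [1.60, 30000.0],
    [1.90, 30500.0],   # very different height, similar income
    [1.62, 90000.0],   # similar height, very different income
])

def min_max(X):
    """Rescale every column of X to the range [0, 1]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def dist(a, b):
    """Euclidean distance between two sample vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances: the income column dominates, the height difference is invisible.
raw_d01 = dist(X[0], X[1])
raw_d02 = dist(X[0], X[2])

# Distances after rescaling: both features now matter.
Xs = min_max(X)
scaled_d01 = dist(Xs[0], Xs[1])
scaled_d02 = dist(Xs[0], Xs[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

On the raw data, sample 1 looks far closer to sample 0 than sample 2 does, simply because income is measured in larger numbers. After rescaling, the two distances are of the same order.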

The terms "normalization" and "standardization" refer to the following four feature scaling methods.

Rescaling (min-max normalization)
$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
Mean normalization
$x' = \frac{x - \operatorname{mean}(x)}{\max(x) - \min(x)}$
Standardization (Z-score normalization)
$x' = \frac{x - \mu}{\sigma}$
Scaling to unit length
$x' = \frac{x}{\|x\|}$
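The four formulas above translate almost directly into NumPy. The following is a minimal sketch on a one-dimensional array; the function names are my own, and `standardize` assumes the population standard deviation (NumPy's default, `ddof=0`) and the Euclidean norm for $\|x\|$.

```python
import numpy as np

def rescale(x):
    """Min-max normalization: maps x onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def mean_normalize(x):
    """Mean normalization: centres x at 0 and divides by its range."""
    return (x - x.mean()) / (x.max() - x.min())

def standardize(x):
    """Z-score: zero mean, unit standard deviation (population std, ddof=0)."""
    return (x - x.mean()) / x.std()

def unit_length(x):
    """Divide x by its Euclidean norm so the result has length 1."""
    return x / np.linalg.norm(x)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(rescale(x))         # values between 0 and 1
print(mean_normalize(x))  # values centred on 0
print(standardize(x))     # mean 0, std 1
print(unit_length(x))     # Euclidean norm 1
```

Note that none of these functions guards against a constant input, where `max - min` or the standard deviation is zero; real code would need to handle that case.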

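Note: rescaling, mean normalization, and standardization are applied per feature, that is, to each column of the data rather than to each sample row. Scaling to unit length is the usual exception: it is typically applied to the feature vector of a single sample.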
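In NumPy, the difference between working on columns and working on rows comes down to the `axis` argument. The sketch below uses a small made-up matrix; it standardizes each column with `axis=0` and, for contrast, scales each row to unit length with `axis=1`, which is how unit-length scaling is commonly applied.

```python
import numpy as np

# Hypothetical data: 3 samples (rows), 2 features (columns).
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Per feature: statistics are computed down each column (axis=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Per sample: one norm per row (axis=1); keepdims=True keeps the shape (3, 1)
# so that the division broadcasts across each row.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

print(X_std)
print(X_unit)
```

After column-wise standardization, the two features are on the same scale even though the raw values differ by two orders of magnitude.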

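There is heated debate online about the difference between normalization and standardization (for example, the one-sided claim that normalization changes the distribution of the original data while standardization does not). In my view, starting from the definitions makes the answer fall out naturally, and lets you judge which claims are sound and which are misleading.
A question for the reader: which factors determine whether normalization or standardization changes the distribution of the data? Verify your answer with code!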
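One possible way to check this numerically, offered as a sketch rather than the definitive answer: min-max normalization and standardization are both affine maps of the form $ax + b$ with $a > 0$, so they shift and stretch the data without changing its shape. The code below uses a synthetic right-skewed sample (an assumption, since the original data file is not available) and compares the sample skewness before and after each transform. A log transform is included as a nonlinear contrast; `skewness` is a helper written for this example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed data; the seed is fixed so the run is repeatable.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)

x_minmax = (x - x.min()) / (x.max() - x.min())  # affine transform
x_zscore = (x - x.mean()) / x.std()             # affine transform
x_log = np.log1p(x)                             # nonlinear transform

print("original:", skewness(x))
print("min-max: ", skewness(x_minmax))
print("z-score: ", skewness(x_zscore))
print("log1p:   ", skewness(x_log))
```

The first three values agree up to floating-point error, while the log-transformed data is noticeably less skewed. In this sense both methods change the location and scale of the distribution but not its shape; whether that counts as "changing the distribution" depends on which of those properties one has in mind.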

[Figure: sample data]
Now apply min-max normalization to the first column of the data above:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

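datafile = 'data.xlsx'  # path to the input spreadsheet
data = pd.read_excel(datafile, header=None)  # read the data; the file has no header row

column = data.iloc[:, 0]  # first column, i.e. the first feature

new_column = (column - column.min()) / (column.max() - column.min())  # min-max normalization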

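# Plot the distribution before and after scaling, side by side.
# sns.distplot is deprecated; sns.histplot with kde=True is the replacement.
fig, ax = plt.subplots(1, 2)
sns.histplot(column, kde=True, ax=ax[0])
ax[0].set_title('Original Data')
sns.histplot(new_column, kde=True, ax=ax[1])
ax[1].set_title('Normalized Data')
plt.show()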