Machine Learning: Normalization and Standardization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. Normalization is also required for some algorithms to model the data correctly.
For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.
Normalization avoids these problems by creating new values that maintain the general distribution and ratios in the source data, while keeping values within a scale applied across all numeric columns used in the model.
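As a minimal sketch of the idea, the snippet below rescales two columns with very different ranges onto a shared [0, 1] scale. It assumes min-max scaling, which is only one common way to normalize and is not named in the text above; the helper `min_max_scale` and the sample columns are hypothetical and chosen to match the ranges in the example.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns: one in [0, 1], one in [10,000, 100,000].
small = [0.0, 0.25, 0.5, 1.0]
large = [10_000, 32_500, 55_000, 100_000]

small_scaled = min_max_scale(small)
large_scaled = min_max_scale(large)

print(small_scaled)
print(large_scaled)
```

After scaling, both columns lie on the same scale, and the relative spacing between values within each column is preserved, so neither column dominates simply because its raw numbers are larger.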
There are several ways to normalize data. Some of them are as follows.
Log transformation
A log transformation is a very useful tool when your data clearly does not follow a normal distribution. It can help reduce skewness in skewed data and can help reduce the variability of the data. Make sure your data contains only positive, non-zero numbers, because the logarithm of zero or of a negative number is undefined. For non-negative data that might contain zeros, there is a log(1 + p) transformation that, as you might have guessed, adds 1 to all the numbers and then does