Python Normalization

Min-max scaling with MinMaxScaler

Standardization & scaling with the median and interquartile range

Normalization with the Manhattan and Euclidean norms

4.1 Rescaling a Feature
Use scikit-learn's MinMaxScaler to rescale a feature array.

# Data scaling: min-max normalization
import numpy as np
from sklearn import preprocessing

# Create a feature
feature = np.array([
    [-500.5],
    [-100.1],
    [0],
    [100.1],
    [900.9]
])

# Create scaler with the target range 0-1
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Scale the feature
scaled_feature = minmax_scaler.fit_transform(feature)

scaled_feature
array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])
Discussion
Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specifically, min-max calculates:
x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}
where x is the feature vector, x_i is an individual element of feature x, and x'_i is the rescaled element.
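
To make the formula concrete, the following sketch (not part of the original recipe) reproduces MinMaxScaler's output by applying the calculation directly in NumPy, using the feature array defined above:

# Manual min-max scaling, applying the formula directly
manual_scaled = (feature - feature.min()) / (feature.max() - feature.min())

# Should print True: the manual result matches MinMaxScaler's output
print(np.allclose(manual_scaled, scaled_feature))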

4.2 Standardizing a Feature
scikit-learn's StandardScaler transforms a feature to have a mean of 0 and a standard deviation of 1.

# Standardization
import numpy as np
from sklearn import preprocessing

# Create a feature
feature = np.array([
    [-1000.1],
    [-200.2],
    [500.5],
    [600.6],
    [9000.9]
])

# Create scaler
scaler = preprocessing.StandardScaler()

# Transform the feature
standardized = scaler.fit_transform(feature)

standardized

# Each value below is the number of standard deviations the original value lies from the mean
array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])
Discussion
A common alternative to min-max scaling is rescaling features to be approximately standard normally distributed. To achieve this, we use standardization to transform the data such that it has a mean, x̄, of 0 and a standard deviation, σ, of 1. Specifically, each element in the feature is transformed so that:
x'_i = \frac{x_i - \bar{x}}{\sigma}
where x'_i is our standardized form of x_i. The transformed feature represents the number of standard deviations the original value is away from the feature's mean value (also called a z-score in statistics).
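
As a quick sanity check (a sketch, not scikit-learn's internal code), the same z-scores can be computed directly in NumPy; note that StandardScaler divides by the population standard deviation, which is NumPy's default:

# Manual z-scores: subtract the mean, divide by the population standard deviation
manual_standardized = (feature - feature.mean()) / feature.std()

# Should print True: matches StandardScaler's output
print(np.allclose(manual_standardized, standardized))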

Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling. However, it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural networks. As a general rule, I'd recommend defaulting to standardization unless you have a specific reason to use an alternative.

We can see the effect of standardization by looking at the mean and standard deviation of our solution's output:

# Check the mean and standard deviation
print("Mean: {}".format(round(standardized.mean())))
print("Standard Deviation: {}".format(standardized.std()))
Mean: 0.0
Standard Deviation: 1.0
If our data has significant outliers, it can negatively impact our standardization by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range. In scikit-learn, we do this using the RobustScaler method:

 
# Create scaler that uses the median and interquartile range
robust_scaler = preprocessing.RobustScaler()

# Transform the feature
robust_scaler.fit_transform(feature)
array([[-1.87387612],
       [-0.875     ],
       [ 0.        ],
       [ 0.125     ],
       [10.61488511]])
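
To see what RobustScaler is doing, here is a minimal sketch (not scikit-learn's internal implementation) that reproduces its default behavior with NumPy's median and percentile functions:

# Manual robust scaling: center on the median, scale by the interquartile range
iqr = np.percentile(feature, 75) - np.percentile(feature, 25)
manual_robust = (feature - np.median(feature)) / iqr

# Should print True: matches RobustScaler's default settings
print(np.allclose(manual_robust, robust_scaler.fit_transform(feature)))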


4.3 Normalizing Observations
Use scikit-learn's Normalizer to rescale an observation's feature values to have unit norm (a total length of 1).

# Normalize observations: rescale each row to unit norm
import numpy as np
from sklearn.preprocessing import Normalizer

# Create feature matrix
features = np.array([
    [0.5, 0.5],
    [1.1, 3.4],
    [1.5, 20.2],
    [1.63, 34.4],
    [10.9, 3.3]
])

# L2 norm: the straight-line distance between two points
# L1 norm: the distance a person walks along city streets
# Create normalizer
normalizer = Normalizer(norm="l2")

# Transform the feature matrix
normalizer.transform(features)
array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])
Discussion
Many rescaling methods operate on features; however, we can also rescale across individual observations. Normalizer rescales the values of individual observations to have unit norm (a total length of 1). This type of rescaling is often used when we have many equivalent features (e.g., text classification, where every word or n-word group is a feature).

Normalizer provides three norm options with Euclidean norm (often called L2) being the default:
\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}
where x is an individual observation and x_n is that observation's value for the nth feature.
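
The same result can be computed by hand, as in this sketch (assuming the features matrix above), by dividing each row by its Euclidean length:

# Manual L2 normalization: divide each observation by its Euclidean norm
l2_norms = np.linalg.norm(features, axis=1, keepdims=True)
manual_l2 = features / l2_norms

# Should print True: matches Normalizer(norm="l2")
print(np.allclose(manual_l2, normalizer.transform(features)))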

Alternatively, we can specify Manhattan norm (L1):
\|x\|_1 = \sum_{i=1}^{n} |x_i|
Intuitively, the L2 norm can be thought of as the distance between two points in New York for a bird (i.e., a straight line), while the L1 norm can be thought of as the distance for a human walking on the street (walk north one block, east one block, north one block, east one block, etc.), which is why it is called the "Manhattan norm" or "Taxicab norm".

Practically, notice that norm='l1' rescales an observation's values so they sum to 1, which can sometimes be a desirable quality.

# Transform the feature matrix with the L1 norm
features_l1_norm = Normalizer(norm="l1").transform(features)

print("Sum of the first observation's values: {}".format(features_l1_norm[0,0] + features_l1_norm[0,1]))
Sum of the first observation's values: 1.0
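
The L1 result can likewise be reproduced by hand (a sketch using the features matrix above), dividing each row by the sum of its absolute values:

# Manual L1 normalization: divide each observation by the sum of its absolute values
manual_l1 = features / np.abs(features).sum(axis=1, keepdims=True)

# Should print True: matches Normalizer(norm="l1")
print(np.allclose(manual_l1, features_l1_norm))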