[Machine Learning] Feature Scaling

昱萱

Article link: https://blog.csdn.net/duxinyuhi/article/details/53100320

Feature scaling formula

X' = (X - Xmin) / (Xmax - Xmin)

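The advantage of this formula is that the scaled values are well behaved: they always fall in the range [0, 1].

The drawback is that outliers make feature scaling tricky, because Xmin and Xmax may themselves be extreme values.

If Xmin and Xmax are equal, the denominator is 0.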
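To make the outlier drawback concrete, here is a minimal sketch. The helper name `min_max_scale` and the extra value 1000 are made up for illustration; only the three weights come from the example used later in this post.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# The three weights used in this post
clean = [115, 140, 175]
# The same weights plus one made-up extreme value (an assumption for illustration)
with_outlier = [115, 140, 175, 1000]

print(min_max_scale(clean))
print(min_max_scale(with_outlier))
```

With the outlier present, Xmax becomes 1000, so the three ordinary weights are squeezed into a narrow band below 0.07 and become hard to tell apart.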

""" quiz materials for feature scaling clustering """

### FYI, the most straightforward implementation might 
### throw a divide-by-zero error, if the min and max
### values are the same
### but think about this for a second--that means that every
### data point has the same value for that feature!  
### why would you rescale it?  Or even use it at all?

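def featureScaling(arr):
    """Min-max scale a list of numbers to [0, 1] and return a new list."""
    maxarr = max(arr)
    minarr = min(arr)
    # Raises ZeroDivisionError when every value is the same (maxarr == minarr).
    return [(x - minarr) / float(maxarr - minarr) for x in arr]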


# tests of your feature scaler--line below is input data
data = [115, 140, 175]
print(featureScaling(data))

The output is [0.0, 0.4166666666666667, 1.0]
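The divide-by-zero case mentioned in the quiz comments could be guarded as in the sketch below. The function name `feature_scaling_safe` is hypothetical, and mapping a constant feature to all zeros is just one possible convention, not something the quiz prescribes.

```python
def feature_scaling_safe(arr):
    """Min-max scale arr; a constant feature maps to all zeros (an assumed convention)."""
    lo, hi = min(arr), max(arr)
    span = hi - lo
    if span == 0:
        # Every value is identical, so the feature carries no information.
        # Returning 0.0 for each point is one option; dropping the feature is another.
        return [0.0 for _ in arr]
    return [(x - lo) / span for x in arr]

print(feature_scaling_safe([115, 140, 175]))  # [0.0, 0.4166666666666667, 1.0]
print(feature_scaling_safe([7, 7, 7]))        # [0.0, 0.0, 0.0]
```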

Using a library function in Python (MinMaxScaler from scikit-learn):

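from sklearn.preprocessing import MinMaxScaler
import numpy
# MinMaxScaler expects a 2-D float array: one row per sample, one column per feature
weights = numpy.array([[115.],[140.],[175.]])
scaler = MinMaxScaler()
# fit_transform learns the min and max, then applies the scaling in one step
rescaled_weight = scaler.fit_transform(weights)
print(rescaled_weight)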

The output is (exact spacing depends on the NumPy version):

[[0.        ]
 [0.41666667]
 [1.        ]]

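Some algorithms are affected by feature scaling, for example SVM and K-means.

An SVM trades off one dimension against another when it computes distances. It looks for a separating surface between the points, and finding that surface involves the interaction of both dimensions.

K-means computes the distance from each point to every cluster center. If you double one variable, its values double too, so that variable contributes more to the distance and the cluster assignments can change.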
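The effect on distance-based assignment can be sketched in a few lines. The point, the two cluster centers, and the helper names below are made up for illustration; this shows only the assignment step of K-means, not the full algorithm.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_center(point, centers):
    """Return the index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

def scale_feature(p, index, factor):
    """Multiply one coordinate of p by factor, leaving the others unchanged."""
    return tuple(v * factor if i == index else v for i, v in enumerate(p))

point = (0, 0)
centers = [(3, 0), (0, 2)]
print(nearest_center(point, centers))    # 1: the distances are 3 and 2

# Double the second feature for the point and for both centers
point2 = scale_feature(point, 1, 2)
centers2 = [scale_feature(c, 1, 2) for c in centers]
print(nearest_center(point2, centers2))  # 0: the distances are now 3 and 4
```

Nothing about the data changed except the unit of one feature, yet the point switches clusters.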

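Decision trees and (ordinary least-squares) linear regression, by contrast, are not affected by feature scaling. A decision tree's splits are a series of horizontal and vertical lines, so there is no trade-off between the two dimensions. When the tree considers one dimension, it does not need to look at the values of the other.

If you rescale one dimension, the position of each split changes, but the ordering of the points does not, so the same points end up on the same side.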
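A minimal sketch of this for a single split: the function `best_split` and the data are hypothetical, and the error-count criterion is a simplification of what real decision-tree libraries use.

```python
def best_split(xs, labels):
    """Return (threshold, left_indices) for the one-feature split with the fewest errors.

    Candidate thresholds are midpoints between consecutive sorted values.
    Points with x <= threshold go left and are predicted as class 0.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for a, b in zip(order, order[1:]):
        threshold = (xs[a] + xs[b]) / 2
        left = [i for i in range(len(xs)) if xs[i] <= threshold]
        errors = sum(1 for i in range(len(xs))
                     if (0 if xs[i] <= threshold else 1) != labels[i])
        if best is None or errors < best[0]:
            best = (errors, threshold, left)
    return best[1], best[2]

xs = [1.0, 2.0, 3.0, 8.0, 9.0]      # made-up feature values
labels = [0, 0, 0, 1, 1]            # made-up class labels
scaled = [x * 10 for x in xs]       # the same feature in a different unit

print(best_split(xs, labels))       # (5.5, [0, 1, 2])
print(best_split(scaled, labels))   # (55.0, [0, 1, 2])
```

The threshold moves from 5.5 to 55.0, but the same three points fall on the left side, so the predictions are unchanged.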

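In linear regression, each feature has its own coefficient, and a coefficient always appears together with its feature. Rescaling feature A therefore does not affect the coefficient of feature B.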
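This can be checked numerically with a small sketch. The data and the helper `fit` are made up for illustration, and the fit is plain unregularized least squares via `numpy.linalg.lstsq`; the argument is not guaranteed to carry over to regularized variants.

```python
import numpy as np

# Made-up data generated exactly from y = 2*A + 3*B + 1
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2 * A + 3 * B + 1

def fit(a, b, target):
    """Ordinary least squares with an intercept; returns [coef_a, coef_b, intercept]."""
    X = np.column_stack([a, b, np.ones_like(a)])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

coef_raw = fit(A, B, y)
coef_scaled = fit(A * 10, B, y)   # rescale feature A only

print(np.round(coef_raw, 6))
print(np.round(coef_scaled, 6))
```

Up to floating-point error, the coefficient of A shrinks by the same factor of 10 that A grew by, while the coefficient of B and the intercept stay the same, so the fitted predictions do not change.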

In short, for some algorithms feature scaling changes the result, while others are unaffected.