Why do we need feature scaling?
"Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Euclidean distance between two data points in their computations, this is a problem.
If left alone, these algorithms only take in the magnitude of features neglecting the units. The results would vary greatly between different units, 5kg and 5000gms. The features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes.
To suppress this effect, we need to bring all features to the same level of magnitudes. This can be achieved by scaling."
The passage above is quoted from https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
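The effect described in the quotation can be seen with two made-up samples. The sketch below is only an illustration: the feature values and variable names are assumptions, not data from this article. It computes the Euclidean distance between the same two samples twice, once with the weight in kilograms and once in grams.

```python
import numpy as np

# Two hypothetical samples: (weight, height in metres). Values are made up.
a_kg = np.array([5.0, 1.60])
b_kg = np.array([6.0, 1.90])

# The same two samples with the weight expressed in grams instead.
unit_change = np.array([1000.0, 1.0])
a_g = a_kg * unit_change
b_g = b_kg * unit_change

# Euclidean distance in each unit system.
dist_kg = np.linalg.norm(a_kg - b_kg)
dist_g = np.linalg.norm(a_g - b_g)

# Share of the squared distance contributed by the height difference.
height_share_kg = (a_kg[1] - b_kg[1]) ** 2 / dist_kg ** 2
height_share_g = (a_g[1] - b_g[1]) ** 2 / dist_g ** 2

print(dist_kg, height_share_kg)
print(dist_g, height_share_g)
```

Nothing about the samples changed, yet in grams the height difference contributes almost nothing to the distance.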
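Using linear regression as an example, this article briefly shows how feature scaling is applied when gradient descent is used to solve for the parameters of a multivariate linear regression equation.
Note: the data used in the examples comes from Andrew Ng's Machine Learning open course.
Here is the code: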
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d
path = 'ex1data2.txt'
data = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
Part of the data in data is shown in the figure below:
# Normalize the features in data (subtract the mean, then divide by max - min):
max_min_scaler = lambda x : (x-np.average(x))/(np.max(x)-np.min(x))
data[['Size']] = data[['Size']].apply(max_min_scaler)
data[['Bedrooms']] = data[['Bedrooms']].apply(max_min_scaler)
After feature scaling, part of the data in data is shown in the figure below:
The descriptive statistics of data are as follows:
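A summary like this can be produced with pandas' `describe()`. The sketch below uses a small made-up frame in place of the real `ex1data2.txt` columns, so the numbers it prints are not the article's; it only shows what to look for after scaling, namely a mean close to 0 and a max-to-min spread of 1 in each scaled column.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the real Size and Bedrooms columns.
demo = pd.DataFrame({'Size': [2104.0, 1600.0, 2400.0, 1416.0, 3000.0],
                     'Bedrooms': [3.0, 3.0, 3.0, 2.0, 4.0]})

# Same transformation as in the article: (x - mean) / (max - min).
scaled = demo.apply(lambda x: (x - np.average(x)) / (np.max(x) - np.min(x)))

# count, mean, std, min, quartiles and max for each column.
stats = scaled.describe()
print(stats)
```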
Use gradient descent to solve for the unknown parameters of the multivariate regression equation:
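For reference, the cost function and update rule that `computeCost` and `gradientDescent` implement can be written as follows. The notation is an assumption chosen to match the usual course convention: m is the number of training examples, x^{(i)} is the i-th row of X (including the leading 1), and \alpha is the learning rate.

```latex
\begin{align}
h_\theta(x) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 \\
J(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
\theta_j &:= \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 0, 1, 2
\end{align}
```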
# Add a helper column of ones to data to simplify the matrix computation:
data.insert(0, 'Ones', 1)
# Define the cost function:
def computeCost(X, y, theta):
    # Squared residuals of the predictions X * theta.T against y
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))
# Define the gradient descent function:
def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    parameters = int(theta.ravel().shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        # Residuals are computed once per iteration, before any parameter changes
        error = (X * theta.T) - y
        for j in range(parameters):
            term = np.multiply(error, X[:,j])
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        # Record the cost after this iteration's update
        cost[i] = computeCost(X, y, theta)
    return theta, cost
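The inner loop over j above can also be written as a single matrix product. The following is a minimal sketch of that idea using plain ndarrays instead of `np.matrix`; the function name `gradient_descent_vectorized` and the toy data are assumptions for illustration, not part of the original code.

```python
import numpy as np

def gradient_descent_vectorized(X, y, theta, alpha, iters):
    """Batch gradient descent with plain ndarrays.

    X: (m, n) design matrix including the column of ones
    y: (m,) targets
    theta: (n,) initial parameters
    """
    m = len(y)
    theta = theta.astype(float).copy()
    cost = np.zeros(iters)
    for i in range(iters):
        error = X @ theta - y                   # (m,) residuals
        theta -= (alpha / m) * (X.T @ error)    # update every parameter at once
        cost[i] = np.sum((X @ theta - y) ** 2) / (2 * m)
    return theta, cost

# Toy data lying exactly on the line y = 1 + 2x, with x already in [-0.5, 0.5].
x = np.linspace(-0.5, 0.5, 11)
X_demo = np.column_stack([np.ones_like(x), x])
y_demo = 1.0 + 2.0 * x

theta_demo, cost_demo = gradient_descent_vectorized(
    X_demo, y_demo, np.zeros(2), alpha=0.5, iters=2000)
print(theta_demo)
```

Because the residuals are computed before any parameter changes, this performs the same simultaneous update as the loop version.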
# Get the features and labels of the training set:
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
# Convert the features and labels to matrix form:
X = np.matrix(X.values)
y = np.matrix(y.values)
# Initialize the parameters
alpha = 0.01
iters = 10000
theta = np.matrix(np.array([0,0,0]))
# Call the gradient descent function to solve for the unknown parameters of the multivariate linear regression equation:
g, cost = gradientDescent(X, y, theta, alpha, iters)
# The value of g is matrix([[340412.65957447, 468817.94834267, 10324.476191 ]])
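To see why the scaling step matters for gradient descent in particular, it helps to run the same update with and without it. The sketch below does so on synthetic housing-style data; the data, the helper names `run_gd` and `mean_normalize`, and the choice of 10 iterations are all assumptions for illustration and are not taken from `ex1data2.txt`.

```python
import numpy as np

def run_gd(X, y, alpha, iters):
    """Plain batch gradient descent; returns final theta and the cost history."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    cost = np.zeros(iters)
    for i in range(iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
        cost[i] = np.sum((X @ theta - y) ** 2) / (2 * m)
    return theta, cost

def mean_normalize(x):
    """Same formula as the article's lambda: (x - mean) / (max - min)."""
    return (x - x.mean()) / (x.max() - x.min())

# Synthetic data: sizes in the thousands, bedroom counts between 1 and 5.
rng = np.random.default_rng(0)
size = rng.uniform(1000.0, 4000.0, 50)
bedrooms = rng.integers(1, 6, 50).astype(float)
price = 50000.0 + 100.0 * size + 5000.0 * bedrooms

ones = np.ones_like(size)
X_raw = np.column_stack([ones, size, bedrooms])
X_scaled = np.column_stack([ones, mean_normalize(size), mean_normalize(bedrooms)])

# Same learning rate as in the article, only a few iterations.
_, cost_raw = run_gd(X_raw, price, alpha=0.01, iters=10)
_, cost_scaled = run_gd(X_scaled, price, alpha=0.01, iters=10)

print("unscaled:", cost_raw[0], "->", cost_raw[-1])
print("scaled:  ", cost_scaled[0], "->", cost_scaled[-1])
```

In this sketch the unscaled cost grows from one iteration to the next, because a step size of 0.01 is far too large for a feature whose values are in the thousands, while the scaled run decreases steadily with the same step size.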
Plot the cost function value against the number of iterations:
fig, ax = plt.subplots()
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
fig.savefig('p1.png')
Result:
Plot the linear fit for the feature-scaled data:
# First, draw a 3D scatter plot of the data
fig = plt.figure()
axes=plt.subplot(111,projection='3d')
axes.scatter(X[:,1],X[:,2],y)
# Then draw the linear fit for the data
h = X*g.T
axes.scatter(X[:,1],X[:,2],h,c='r')
axes.set(xlabel='Size',ylabel='Bedrooms',zlabel='Price')
fig.savefig('3dscatter.png')
Result:
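One practical consequence of fitting on scaled features is that any new input must be scaled with the same mean and range as the training data before the parameters are applied. The article's code overwrites the original columns, so these statistics would have to be saved beforehand. The sketch below shows the idea with made-up training values and placeholder parameters; `predict`, the column values, and `theta` are hypothetical and do not reproduce the article's g.

```python
import numpy as np

# Hypothetical training columns (made-up values, not the real dataset).
size = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
bedrooms = np.array([3.0, 3.0, 3.0, 2.0, 4.0])

# Keep the statistics used for scaling so they can be reused at prediction time.
size_mean, size_range = size.mean(), size.max() - size.min()
bed_mean, bed_range = bedrooms.mean(), bedrooms.max() - bedrooms.min()

def predict(theta, new_size, new_bedrooms):
    """Scale a raw input with the training statistics, then apply theta."""
    x = np.array([1.0,
                  (new_size - size_mean) / size_range,
                  (new_bedrooms - bed_mean) / bed_range])
    return float(x @ theta)

# Placeholder parameters for illustration only; with the article's code they
# could be obtained as np.asarray(g).ravel().
theta = np.array([300000.0, 400000.0, 10000.0])
print(predict(theta, 1650.0, 3.0))
```

With mean-normalized features, an input equal to the training means maps to zeros, so its prediction is simply the intercept term.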
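Note: the feature scaling method used in this article is mean normalization, (x - mean) / (max - min), as the max_min_scaler lambda shows. Other common feature scaling methods include standardization, Min-Max scaling, and unit-vector scaling. See https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e for details.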
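The four methods named in the note differ only in what is subtracted and what is divided by. Here is a minimal side-by-side sketch on one made-up feature vector; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # made-up feature values

# Mean normalization: same formula as the max_min_scaler lambda above.
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Standardization: zero mean and unit standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-Max scaling: values mapped onto the interval [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Unit vector: the whole vector is rescaled to Euclidean length 1.
unit_vector = x / np.linalg.norm(x)

for name, values in [("mean normalization", mean_normalized),
                     ("standardization", standardized),
                     ("min-max", min_max),
                     ("unit vector", unit_vector)]:
    print(name, values)
```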
Other reference: https://blog.csdn.net/tiancai13579/article/details/72781111
PS: This is an original article by the author; please credit the source when reposting.