Feature Scaling in Multivariate Linear Regression

Backcanhave7

Category: Machine Learning. Tags: feature scaling, multivariate linear regression

Article link: https://blog.csdn.net/qq_41080850/article/details/85629824

Why perform feature scaling?

"Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Euclidean distance between two data points in their computations, this is a problem.

If left alone, these algorithms only take in the magnitude of features neglecting the units. The results would vary greatly between different units, 5kg and 5000gms. The features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes.

To suppress this effect, we need to bring all features to the same level of magnitudes. This can be achieved by scaling."

The passage above is quoted from https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
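The effect described in the quotation can be seen in a small sketch. The three houses and the feature ranges below are made-up values chosen only for illustration, and `euclidean` is a hypothetical helper name. On raw values, the size difference dominates the distance; after dividing each feature by an assumed range, the bedroom difference matters again.

```python
import numpy as np

# Made-up houses: [size in square feet, number of bedrooms]
a = np.array([2000.0, 2.0])
b = np.array([2010.0, 5.0])   # nearly the same size, three more bedrooms
c = np.array([2100.0, 2.0])   # 100 square feet larger, same bedrooms


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# Raw features: the size column dominates, so c looks farther from a than b does.
d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Hypothetical feature ranges used to bring both columns to a similar magnitude.
ranges = np.array([4000.0, 4.0])
d_ab_scaled = euclidean(a / ranges, b / ranges)
d_ac_scaled = euclidean(a / ranges, c / ranges)

print(d_ab, d_ac)
print(d_ab_scaled, d_ac_scaled)
```

After scaling, the ordering of the two distances flips, which is the sense in which unscaled features with large magnitudes "weigh in a lot more".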

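Using linear regression as an example, this article briefly shows how feature scaling is applied when gradient descent is used to solve for the parameters of a multivariate linear regression equation:

Note: the data used in the example comes from Andrew Ng's machine learning open course.

The code: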

%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d

path =  'ex1data2.txt'
data = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])

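Part of the data in data is shown in the figure below:

# Normalize the features in data (mean normalization: subtract the mean, then divide by the range max - min):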

max_min_scaler = lambda x : (x-np.average(x))/(np.max(x)-np.min(x))
data[['Size']] = data[['Size']].apply(max_min_scaler)
data[['Bedrooms']] = data[['Bedrooms']].apply(max_min_scaler)
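Despite its name, the `max_min_scaler` lambda above subtracts the column mean and divides by the range, which is usually called mean normalization. As a quick sanity check, here is a minimal sketch on a small made-up frame (the values are not rows from ex1data2.txt, and `mean_normalize` is a hypothetical name): each scaled column should have mean 0 and range 1.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'Size': [1000.0, 1500.0, 2000.0, 3000.0],
    'Bedrooms': [2.0, 3.0, 3.0, 4.0],
})


def mean_normalize(col):
    """Subtract the column mean and divide by the column range (max - min)."""
    return (col - col.mean()) / (col.max() - col.min())


# DataFrame.apply passes each column to the function as a Series
scaled = toy.apply(mean_normalize)

col_means = scaled.mean()
col_ranges = scaled.max() - scaled.min()
print(scaled)
print(col_means)
print(col_ranges)
```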

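After feature scaling, part of the data in data is shown in the figure below:

The descriptive statistics of data are as follows:

Use gradient descent to solve for the unknown parameters of the multivariate regression equation:

# Add a helper column of ones to data so the intercept term fits into the matrix computation: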
data.insert(0, 'Ones', 1)


# Define the cost function (mean squared error divided by 2):
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))


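# Define the gradient descent function:
def gradientDescent(X, y, theta, alpha, iters):
    # temp holds the updated parameters so that all theta_j are updated simultaneously
    temp = np.matrix(np.zeros(theta.shape))
    parameters = int(theta.ravel().shape[1])  # number of parameters
    cost = np.zeros(iters)  # cost recorded after each iteration

    for i in range(iters):
        # prediction error for the current theta, shape (m, 1)
        error = (X * theta.T) - y

        for j in range(parameters):
            # error times feature j; its sum divided by m is the partial derivative for theta_j
            term = np.multiply(error, X[:,j])
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))

        theta = temp
        cost[i] = computeCost(X, y, theta)

    return theta, cost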


# Get the features and labels of the training set:
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]

# Convert the features and labels to matrix form:
X = np.matrix(X.values)
y = np.matrix(y.values)

# Initialize the parameters
alpha = 0.01
iters = 10000
theta = np.matrix(np.array([0,0,0]))

# Call the gradient descent function to solve for the unknown parameters of the multivariate linear regression equation:
g, cost = gradientDescent(X, y, theta, alpha, iters)

# The value of g is matrix([[340412.65957447, 468817.94834267,  10324.476191  ]])
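The inner loop over `j` above can also be written as a single matrix expression. The following is a minimal vectorized sketch using plain NumPy arrays instead of `np.matrix`; it is not the code from this article. The function name `gradient_descent`, the synthetic noise-free data, and the step size `alpha=0.5` are all assumptions made for illustration. The result is compared against `np.linalg.lstsq` as a cross-check.

```python
import numpy as np


def gradient_descent(X, y, alpha, iters):
    """Batch gradient descent for linear regression with 1-D theta and y."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        error = X @ theta - y                          # shape (m,)
        theta = theta - (alpha / m) * (X.T @ error)    # update all theta_j at once
    return theta


# Synthetic data: two features already in [-0.5, 0.5], plus a column of ones
rng = np.random.default_rng(0)
features = rng.uniform(-0.5, 0.5, size=(50, 2))
X_demo = np.column_stack([np.ones(50), features])
true_theta = np.array([3.0, 2.0, -1.0])   # hypothetical parameters
y_demo = X_demo @ true_theta              # noise-free targets

theta_gd = gradient_descent(X_demo, y_demo, alpha=0.5, iters=5000)
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(theta_gd)
print(theta_ls)
```

Because the synthetic features are already on a small, similar scale, a fairly large step size converges here; with unscaled columns of very different magnitudes, the same step size would typically diverge.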

Plot the value of the cost function against the number of iterations:

fig, ax = plt.subplots()
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
fig.savefig('p1.png')

Result (saved as p1.png):

Plot the linear fit for the feature-scaled data:

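# First draw a 3D scatter plot of the feature-scaled data
fig = plt.figure()
axes = plt.subplot(111, projection='3d')
# .A1 flattens each np.matrix column into a 1-D array before plotting
axes.scatter(X[:,1].A1, X[:,2].A1, y.A1)

# Then draw the values predicted by the fitted linear model
h = X*g.T
axes.scatter(X[:,1].A1, X[:,2].A1, h.A1, c='r')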
axes.set(xlabel='Size',ylabel='Bedrooms',zlabel='Price')
fig.savefig('3dscatter.png')

Result (saved as 3dscatter.png):
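Because the parameters were learned on scaled features, a new house has to be scaled with the mean and range of the training data before the parameters are applied. The sketch below illustrates this under stated assumptions: the training values and the exact linear pricing rule are made up, `predict` is a hypothetical helper, and a least-squares fit via `np.linalg.lstsq` stands in for gradient descent to keep the example short.

```python
import numpy as np

# Made-up training data following a hypothetical exact linear rule
size = np.array([1000.0, 1500.0, 2000.0, 3000.0, 2500.0])
bedrooms = np.array([2.0, 3.0, 3.0, 4.0, 4.0])
price = 50000.0 + 100.0 * size + 10000.0 * bedrooms

raw = np.column_stack([size, bedrooms])
mu = raw.mean(axis=0)                      # per-feature training mean
span = raw.max(axis=0) - raw.min(axis=0)   # per-feature training range

# Scale the features, then prepend the column of ones for the intercept
X_train = np.column_stack([np.ones(len(raw)), (raw - mu) / span])
theta, *_ = np.linalg.lstsq(X_train, price, rcond=None)


def predict(new_raw):
    """Scale a raw [size, bedrooms] input with the training statistics, then predict."""
    scaled = (np.asarray(new_raw, dtype=float) - mu) / span
    return float(np.concatenate([[1.0], scaled]) @ theta)


pred = predict([1650.0, 3.0])
print(pred)
```

Recomputing the mean and range from the new input alone would give a different scaling than the one the parameters were fitted with, so the training statistics need to be stored alongside the parameters.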