Machine Learning (2): Multivariate Linear Regression

Latest recommended article published on 2024-04-12 01:46:10

LeeeeeJay


Category: Machine Learning. Tags: machine learning, linear regression, gradient descent

Article link: https://blog.csdn.net/lijiecao0226/article/details/78199086


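I. Preface

This article continues the example from "Machine Learning (1): Univariate Linear Regression". It introduces linear regression with multiple features and uses matrix computation to make the learning algorithm more efficient.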

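II. Model Representation

We now add more features to the house-price prediction model, such as the number of rooms, the number of floors and the age of the house. This gives a multivariate model whose features are ( $x_1,x_2,...,x_n$ ).
(Note: real machine learning problems often involve models with hundreds or even tens of thousands of features.)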

[Figure: table of house-price training examples with several features per row; image not available]

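2.1 Variable Definitions

We introduce the following new variables (all others are the same as in univariate linear regression):
- $n$ is the number of features.
- $x^{(i)}$ is the $i$-th training example, i.e. the $i$-th row of the feature matrix, and is a vector. In the figure above, for example,
$x^{(2)} = \left[\begin{matrix}1416 \\ 3\\ 2 \\ 40\end{matrix}\right]$
- $x^{(i)}_j$ is the $j$-th feature in the $i$-th row of the feature matrix, i.e. the $j$-th feature of the $i$-th training example. In the figure above, $x^{(2)}_3 = 2$ .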
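The indexing convention can be tried out in NumPy. This is only a minimal sketch: the row for $x^{(2)}$ comes from the text, while the other rows, the column meanings and the name `X_raw` are made up for illustration.

```python
import numpy as np

# Hypothetical training set: only the second row is taken from the text;
# the other rows and the column meanings are assumptions for illustration.
# Assumed columns: size, number of rooms, number of floors, age of the house.
X_raw = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [852, 2, 1, 36],
])

m, n = X_raw.shape   # m training examples, n features

# The text uses 1-based indices; NumPy uses 0-based indices.
x_2 = X_raw[1]       # x^(2): the second training example
x_2_3 = X_raw[1, 2]  # x^(2)_3: third feature of the second example

print(m, n)
print(x_2)
print(x_2_3)
```

The only thing to watch is the shift by one: $x^{(2)}_3$ in the text is `X_raw[1, 2]` in code.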

2.2 Model Definition

2.2.1 Hypothesis Function

The multivariate hypothesis function $h_\theta(x)$ is:

$h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n \tag{1}$

This formula has $n+1$ parameters and $n$ variables. To simplify it, we introduce $x_0=1$ and define two vectors:

$x = \left[\begin{matrix}x_0 \\ x_1\\ x_2 \\ ...\\ x_n\end{matrix}\right] \ \theta = \left[\begin{matrix}\theta_0 \\ \theta_1\\ \theta_2 \\ ...\\ \theta_n\end{matrix}\right]$

The formula then becomes $h_\theta(x) = \theta^Tx$ , where the superscript $T$ denotes the matrix transpose.
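As an illustration, the hypothesis for a single example might be computed as below. The function name `hypothesis` and the parameter values are hypothetical; the feature values are those of $x^{(2)}$ from the text.

```python
import numpy as np

def hypothesis(theta, x_features):
    """Return h_theta(x) = theta^T x for one example, prepending x_0 = 1."""
    x = np.concatenate(([1.0], x_features))  # x now has n + 1 entries
    return float(theta @ x)

theta = np.array([80.0, 0.1, 10.0, 5.0, -1.0])   # hypothetical parameters
x_features = np.array([1416.0, 3.0, 2.0, 40.0])  # x^(2) from the text

# Term-by-term form of equation (1), for comparison with theta^T x.
expanded = theta[0] + sum(theta[j + 1] * x_features[j] for j in range(4))

print(hypothesis(theta, x_features), expanded)
```

Both values agree up to floating-point rounding, which is all that the rewrite $h_\theta(x) = \theta^Tx$ claims.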

2.2.2 Cost Function

As in univariate linear regression, the cost function for multivariate linear regression is:

$J(\theta_0, \theta_1, ... , \theta_n) = \frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \tag{2}$

where:

$h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n$
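A direct, loop-based reading of equation (2) could look like the following sketch. The function name `cost` and the toy data are assumptions chosen so that the result is easy to check by hand.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2, written with loops."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sum(t * x for t, x in zip(theta, x_i))  # theta^T x^(i)
        total += (h - y_i) ** 2
    return total / (2 * m)

# Hypothetical toy data; every row starts with x_0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]

print(cost([0.0, 1.0], X, y))  # theta fits the data exactly, so the cost is zero
print(cost([0.0, 0.0], X, y))  # all predictions are 0: (1 + 4 + 9) / 6
```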

III. Algorithm Training

3.1 Formula Derivation

As in univariate linear regression, we need to find the set of parameters that minimizes the cost function. Again we can use gradient descent to minimize it:

$\begin{align*} & \text{Repeat} \ \{ \\ & \quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1, ... , \theta_n) \\ & \} \end{align*}\tag{3}$

that is:

$\begin{align*} & \text{Repeat} \ \{ \\ & \quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\ & \} \end{align*} \tag{4}$

The partial derivative term in (4) is derived as follows (this part can be skipped):

$\begin{align*} & \alpha \frac{\partial}{\partial\theta_j} \frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\ & = \alpha\frac{1}{2m}\sum_{i=1}^m\frac{\partial} {\partial\theta_j} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\ & = \alpha\frac{1}{2m}\sum_{i=1}^m2\cdot\left(h_\theta(x^{(i)}) - y^{(i)} \right)\cdot \frac{\partial} {\partial\theta_j}h_\theta\left(x^{(i)}\right) \\ & = \alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)\cdot x^{(i)}_j \\ \end{align*} \tag{5}$

Note: in this derivation $x^{(i)}_j$ and $y^{(i)}$ can be treated as constants. The last step uses $h_\theta(x^{(i)}) = \sum_{k=0}^n \theta_k x^{(i)}_k$ , so $\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = x^{(i)}_j$ .
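One way to gain confidence in the result of (5) is to compare it with a finite-difference approximation of $\frac{\partial J}{\partial\theta_j}$. The sketch below does this on random data; the function names, the step size `eps` and the data are all assumptions, and the factor $\alpha$ is left out because it only scales the gradient.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) from equation (2); the matrix product is used only as shorthand."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def analytic_grad(theta, X, y):
    """dJ/dtheta_j = 1/m * sum_i (h(x^(i)) - y^(i)) * x^(i)_j, i.e. (5) without alpha."""
    m, n1 = X.shape
    grad = np.zeros(n1)
    for j in range(n1):
        for i in range(m):
            grad[j] += (X[i] @ theta - y[i]) * X[i, j]
    return grad / m

def numeric_grad(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # first column is x_0 = 1
y = rng.normal(size=5)
theta = rng.normal(size=3)

print(np.allclose(analytic_grad(theta, X, y), numeric_grad(theta, X, y), atol=1e-6))
```

Because $J$ is quadratic in $\theta$, the central difference matches the analytic gradient up to rounding error.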

Substituting this into (4) finally gives:

$\begin{align*} & \text{Repeat} \ \{ \\ & \quad \theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)\cdot x^{(i)}_j\\ & \quad (\text{simultaneously update } \theta_j \text{ for } j=0,1,...,n ) \\ & \} \end{align*} \tag{6}$

Note: (6) has the same form as the corresponding formula in univariate linear regression, and the same form appears again in the later topics of logistic regression and neural networks, so it is worth remembering.

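3.2 Algorithm Steps

The algorithm proceeds as follows:
1. Randomly initialize $\theta_0,\theta_1, \theta_2, ..., \theta_n$ .
2. Compute the value of $h_\theta(x)$ .
3. Compute the value of $J(\theta_0,\theta_1, \theta_2, ..., \theta_n)$ (the value of $J$ can be recorded at each iteration).
4. Check whether $J(\theta_0,\theta_1, \theta_2, ..., \theta_n)$ is smaller than a small quantity $\epsilon$ or whether the number of iterations exceeds a threshold. If either holds, end the loop and take the current $\theta_0,\theta_1, \theta_2, ..., \theta_n$ as the final result; otherwise go to step 5.
5. Update $\theta_0,\theta_1, \theta_2, ..., \theta_n$ simultaneously using formula (6).
6. Go back to step 2 and iterate.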
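The six steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the author's implementation: the function name, the default values of `alpha`, `epsilon` and `max_iters`, and the toy data are all hypothetical.

```python
import random

def gradient_descent(X, y, alpha=0.1, epsilon=1e-6, max_iters=5000, seed=0):
    """Loop-based gradient descent following steps 1-6; rows of X start with x_0 = 1."""
    m, n1 = len(X), len(X[0])
    rng = random.Random(seed)
    theta = [rng.uniform(-1.0, 1.0) for _ in range(n1)]  # step 1: random init
    history = []
    for _ in range(max_iters):                            # iteration threshold
        # Step 2: h_theta(x^(i)) for every training example.
        h = [sum(theta[j] * X[i][j] for j in range(n1)) for i in range(m)]
        # Step 3: cost J, recorded at each iteration.
        J = sum((h[i] - y[i]) ** 2 for i in range(m)) / (2 * m)
        history.append(J)
        if J < epsilon:                                   # step 4: stop test
            break
        # Step 5: simultaneous update of all theta_j with formula (6).
        theta = [
            theta[j] - alpha / m * sum((h[i] - y[i]) * X[i][j] for i in range(m))
            for j in range(n1)
        ]
        # Step 6: the loop returns to step 2.
    return theta, history

# Hypothetical noise-free data generated from y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta, history = gradient_descent(X, y)
print(theta, history[-1], len(history))
```

Building the new `theta` list from the old one in a single expression is what makes the update simultaneous. On noisy data $J$ usually does not approach zero, so in that case it is the iteration threshold that ends the loop.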

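3.3 Algorithm Complexity

Let us analyze the complexity of the algorithm above.
Consider a single iteration first. Computing $h_\theta(x^{(i)})$ costs $O(n)$ , computing $J(\theta_0,\theta_1, \theta_2, ..., \theta_n)$ costs $O(m*n)$ , computing a single $\theta_j$ costs $O(m*n)$ , and computing all of $\theta_0,\theta_1, \theta_2, ..., \theta_n$ costs $O(m*n^2)$ . (This count assumes that $h_\theta(x^{(i)})$ is recomputed for every $\theta_j$ ; computing it once per iteration and reusing it lowers the update cost to $O(m*n)$ .)
If the number of iterations is $s$ , the overall complexity is $O(s*m*n^2)$ . When $n$ and $m$ are large, the algorithm becomes very inefficient.

The next section therefore improves the computational performance of the algorithm through vectorization.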

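3.4 Vectorization

First, the advantages of vectorization:
1. It simplifies the code, reduces the coding effort and lowers the chance of programming errors. Places that would otherwise need loops can use matrix operations, often in a single line of code.
2. With a linear algebra library it can greatly improve performance by making better use of hardware parallelism; it can be tens or even hundreds of times faster than a non-vectorized implementation.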
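Both points can be illustrated with a dot product computed in two ways. The sketch below uses NumPy as the linear algebra library, which is an assumption since the text does not name one; the array size is arbitrary and the measured times depend on the machine, so no particular speed-up is claimed.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200_000)
b = rng.normal(size=200_000)

start = time.perf_counter()
loop_result = 0.0
for k in range(len(a)):          # explicit Python loop over all elements
    loop_result += a[k] * b[k]
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = a @ b               # a single call into the linear algebra library
vec_time = time.perf_counter() - start

print(np.isclose(loop_result, vec_result))
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")
```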

The author also tries to use vectorization wherever possible when implementing algorithms.
We first define the matrices:

$\Theta = \left[\begin{matrix} \theta_0 \\ \theta_1 \\ ... \\ \theta_n \end{matrix}\right] \ X = \left[\begin{matrix} x^{(1)}_0 & x^{(1)}_1 & ... & x^{(1)}_n \\ x^{(2)}_0 & x^{(2)}_1 & ... & x^{(2)}_n \\ ... & ... & ... & ... \\ x^{(m)}_0 & x^{(m)}_1 & ... & x^{(m)}_n \end{matrix}\right] \ Y = \left[\begin{matrix} y^{(1)} \\ y^{(2)} \\ ... \\ y^{(m)} \end{matrix}\right]$

$h_\theta(X) = \left[\begin{matrix} h_\theta(x^{(1)}) \\ h_\theta(x^{(2)}) \\ ... \\ h_\theta(x^{(m)}) \end{matrix}\right] = \left[\begin{matrix} \theta_0x_0^{(1)} + \theta_1x_1^{(1)} + ... + \theta_nx_n^{(1)} \\ \theta_0x_0^{(2)} + \theta_1x_1^{(2)} + ... + \theta_nx_n^{(2)} \\ ... \\ \theta_0x_0^{(m)} + \theta_1x_1^{(m)} + ... + \theta_nx_n^{(m)} \end{matrix}\right]$

Then $\Theta \in R^{n \times 1}, \ \ X \in R^{m \times n}, \ \ h_\theta(X) \in R^{m \times 1}$ . (For ease of writing, the $n+1$ dimensions are written as $n$ here.)

By the rules of matrix multiplication, we obtain:

$h_\theta(X) = X \Theta \tag{7}$

$\begin{align*} J(\theta_0, \theta_1, ... , \theta_n) &= \frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\ &=\frac{1}{2m} * sum\left((h_\theta(X) - Y) ^ 2\right) \end{align*} \tag{8}$

Here the square in $(h_\theta(X) - Y)^2$ is applied element-wise, and $sum$ adds up all elements of the resulting vector.
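Equations (7) and (8) map almost one-to-one onto NumPy. The following is a minimal sketch with hypothetical function names and toy data; it also computes the cost with a loop to confirm that both forms agree.

```python
import numpy as np

def predict(X, theta):
    """Equation (7): h_theta(X) = X @ theta, one prediction per row of X."""
    return X @ theta

def cost(X, y, theta):
    """Equation (8): element-wise square of the residuals, then a sum."""
    residual = predict(X, theta) - y
    return np.sum(residual ** 2) / (2 * len(y))

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # hypothetical data, x_0 = 1
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 0.5])

# Loop form of equation (2), for comparison with the vectorized form.
loop_cost = sum((X[i] @ theta - y[i]) ** 2 for i in range(len(y))) / (2 * len(y))

print(predict(X, theta), cost(X, y, theta), loop_cost)
```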

We have now vectorized the computation of $h_\theta(X)$ and $J(\theta_0, \theta_1, ... , \theta_n)$ . Next we look at the more complicated computation of $\theta_0,\theta_1, \theta_2, ..., \theta_n$ .

Suppose the update can be written as:

$\Theta = \Theta - \alpha \frac{1}{m} \delta \tag{9}$
where $\delta \in R^{n \times 1}$ is a vector (again writing $n$ for $n+1$ ):

$\delta = \left[\begin{matrix} \delta_0 \\ \delta_1 \\ ... \\ \delta_n \end{matrix}\right] = \left[\begin{matrix} \sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)\cdot x^{(i)}_0 \\ \sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)\cdot x^{(i)}_1 \\ ... \\ \sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)\cdot x^{(i)}_n \end{matrix}\right] = X^T \left(h_\theta(X) - Y\right) \tag{10}$

Note: the last step of (10) is not derived here; interested readers can work it out themselves. In brief, row $j$ of $X^T$ contains feature $j$ of all $m$ examples, so multiplying it by $h_\theta(X) - Y$ gives exactly $\delta_j$ . The result can also be checked by comparing the matrix dimensions on both sides of the equation.
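The last step of (10) can also be checked numerically. The sketch below builds random matrices with the shapes defined earlier (the sizes `m = 6` and `n = 3` and the random data are assumptions) and compares the element-by-element sums with the matrix form.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # m x (n+1), x_0 = 1
Y = rng.normal(size=(m, 1))
Theta = rng.normal(size=(n + 1, 1))

# Element-by-element definition: delta_j = sum_i (h(x^(i)) - y^(i)) * x^(i)_j
delta_loop = np.zeros((n + 1, 1))
for j in range(n + 1):
    for i in range(m):
        h_i = (X[i] @ Theta).item()
        delta_loop[j, 0] += (h_i - Y[i, 0]) * X[i, j]

# Matrix form from equation (10), using h_theta(X) = X @ Theta from (7).
delta_vec = X.T @ (X @ Theta - Y)

print(delta_vec.shape)
print(np.allclose(delta_loop, delta_vec))
```

The shape check mirrors the dimension argument in the note: an $(n+1) \times m$ matrix times an $m \times 1$ vector gives an $(n+1) \times 1$ vector.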
Substituting (10) into (9) gives: