Andrew Ng - Linear Regression [1]: Gradient Descent

Victor-Gun

Article link: https://blog.csdn.net/Victor_Gun/article/details/45176453

Andrew Ng - Linear Regression

This is the classic housing-price problem from Ng's course. The dataset is as follows:

$\begin{array}{c|c|c} \text{Floor area} & \text{Number of rooms} & \text{Price} \\ \hline 2014 & 3 & 400 \\ 1600 & 3 & 220 \\ 2400 & 3 & 369 \\ 1416 & 2 & 232 \\ 3000 & 4 & 540 \\ \vdots& \vdots & \vdots\\ \end{array}$

Here $x\in\mathbb{R}^2$: $x_1^{(i)}$ is the floor area and $x_2^{(i)}$ is the number of rooms of the $i$-th house. As a first step, we approximate $y$ as a linear function of $x$: $h_\theta(x) = \theta_0+\theta_1x_1+\theta_2x_2$, where the $\theta_i$ are the parameters (also called weights). To simplify the notation, we define $x_0=1$, so that:

$h_\theta(x)=\sum_{i=0}^n\theta_ix_i=\theta^Tx,$
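As a minimal sketch of the hypothesis above, the following pure-Python function computes $h_\theta(x)=\theta^Tx$ by prepending $x_0=1$ to the feature list. The function name `hypothesis` and the values in `theta` are illustrative assumptions; they are not fitted parameters and do not come from the course.

```python
def hypothesis(theta, features):
    """Return h_theta(x) = theta^T x, with the intercept term x_0 = 1 prepended."""
    x = [1.0] + list(features)
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameter values, chosen only to show the call.
theta = [50.0, 0.1, 20.0]

# First row of the table: floor area 2014, 3 rooms.
print(hypothesis(theta, [2014, 3]))
```

Folding the intercept into $\theta_0$ this way is what lets the prediction be written as a single dot product.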
where $n$ is the number of features in the input vector (not counting $x_0$). Given a training set, how do we choose, or learn, $\theta$? The approach taken here is to make $h_\theta(x)$ as close to $y$ as possible, which leads to the following least-squares cost function over the $m$ training examples:

$J(\theta)=\frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
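The cost function can be evaluated directly on the five rows listed in the table. This is only a sketch: the names `hypothesis`, `cost`, `X`, and `y` are assumptions made for illustration, and each row of `X` already includes $x_0=1$ as its first entry.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already contains x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/2 * sum over examples of (h_theta(x^(i)) - y^(i))^2."""
    return 0.5 * sum((hypothesis(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))


# The five rows shown in the table, each with x_0 = 1 prepended.
X = [
    [1.0, 2014, 3],
    [1.0, 1600, 3],
    [1.0, 2400, 3],
    [1.0, 1416, 2],
    [1.0, 3000, 4],
]
y = [400, 220, 369, 232, 540]

# Cost of the all-zero parameter vector, i.e. predicting 0 for every house.
print(cost([0.0, 0.0, 0.0], X, y))
```

With $\theta=0$ every prediction is 0, so the cost reduces to half the sum of the squared prices.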

The Least Mean Squares (LMS) Algorithm

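Our goal is to choose the $\theta$ that minimizes $J(\theta)$. To find it, we typically start from an initial guess for $\theta$ and apply an iterative algorithm that keeps updating $\theta$ until it converges. The algorithm used here is gradient descent, whose update rule is: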

$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta).$
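Here $\theta_0,...,\theta_n$ are all updated simultaneously, and $\alpha$ is the parameter that controls the learning rate. To turn this into code we still need to work out the partial derivative. To keep the calculation simple, suppose for now that there is only a single example $(x,y)$, so the sum in $J(\theta)$ can be dropped: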

$\begin{aligned} \frac{\partial}{\partial\theta_j}J(\theta)&=\frac{\partial}{\partial\theta_j}\frac{1}{2}(h_\theta(x)-y)^2\\ &=(h_\theta(x)-y)\frac{\partial}{\partial\theta_j}(h_\theta(x)-y)\\ &=(h_\theta(x)-y)\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^n\theta_ix_i-y\right)\\ &=(h_\theta(x)-y)x_j\\ \end{aligned}$
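One way to sanity-check this derivation is to compare the closed-form result $(h_\theta(x)-y)x_j$ with a central finite-difference estimate of the same derivative. The sketch below does this for one made-up example; the function names, the step size `eps`, and the numbers are all assumptions chosen for illustration.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already contains x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def single_cost(theta, x, y):
    """J(theta) for one example: 1/2 * (h_theta(x) - y)^2."""
    return 0.5 * (hypothesis(theta, x) - y) ** 2


def analytic_grad(theta, x, y):
    """Closed-form partial derivatives: (h_theta(x) - y) * x_j for every j."""
    err = hypothesis(theta, x) - y
    return [err * xj for xj in x]


def numeric_grad(theta, x, y, eps=1e-6):
    """Central finite-difference estimate of the same partial derivatives."""
    grads = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + eps] + theta[j + 1:]
        down = theta[:j] + [theta[j] - eps] + theta[j + 1:]
        grads.append((single_cost(up, x, y) - single_cost(down, x, y)) / (2 * eps))
    return grads


# Made-up example: h = 0.5 - 2.0 + 6.0 = 4.5, so the error h - y is 0.5.
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
y = 4.0

print(analytic_grad(theta, x, y))
print(numeric_grad(theta, x, y))
```

The two lists should agree up to floating-point rounding, because the cost is quadratic in $\theta$ and the central difference has no truncation error for a quadratic.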
So for a single training example, the update rule becomes:

$\theta_j:=\theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}.$
To extend this update rule to a training set of $m$ examples, there are two common modifications. The first is:

$\begin{aligned} &\text{Loop until convergence:} \quad\{\\ &\qquad \theta_j:=\theta_j+\alpha\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\qquad (\text{for every } j). \\ &\} \end{aligned}$

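Each iteration of this algorithm has to scan the entire training set, hence the name batch gradient descent. In general, with a suitably small $\alpha$, gradient descent is only guaranteed to reach a local optimum, so the result depends on the initial value. Here, however, $J(\theta)$ is a convex quadratic function, so its only local optimum is also the global optimum.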
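The batch update can be sketched in pure Python as follows. Several things here are assumptions rather than part of the original post: the function names, the fixed iteration count standing in for "until convergence", and the made-up toy data $y = 1 + 2x$. The toy data is used instead of the housing table because the raw floor areas would need feature scaling or a much smaller $\alpha$ to converge.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already contains x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def batch_gradient_descent(X, y, alpha, iterations):
    """Run a fixed number of batch updates, starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        # Residuals y^(i) - h_theta(x^(i)) over the whole training set.
        errors = [yi - hypothesis(theta, xi) for xi, yi in zip(X, y)]
        # Build the new theta from the old one so every theta_j is
        # updated simultaneously.
        theta = [
            theta[j] + alpha * sum(e * xi[j] for e, xi in zip(errors, X))
            for j in range(len(theta))
        ]
    return theta


# Made-up, noise-free toy data generated from y = 1 + 2x (x_0 = 1 prepended).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = batch_gradient_descent(X, y, alpha=0.05, iterations=1000)
print(theta)  # should be close to [1.0, 2.0]
```

Computing all residuals before touching `theta` is what makes the update simultaneous; updating each $\theta_j$ in place would let later components see already-changed values.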

[Figure: gradient descent trajectory]

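The figure shows the trajectory of gradient descent starting from the initial value (48, 30). In contrast to batch gradient descent, stochastic gradient descent is better suited to larger datasets: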

$\begin{aligned} &\text{Loop until convergence:} \quad\{\\ &\qquad \text{for } i=1 \text{ to } m\ \{\\ &\qquad\qquad \theta_j:=\theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\qquad (\text{for every } j). \\ &\qquad\}\\ &\} \end{aligned}$

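Whereas batch gradient descent has to scan the whole training set before each update of $\theta$, stochastic gradient descent takes effect immediately: every single example produces an adjustment to $\theta$. The trade-off is that, with a fixed $\alpha$, it generally ends up hovering close to the optimum rather than settling exactly on it.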
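A minimal sketch of the stochastic variant is given below, under the same kind of assumptions: the function names, the fixed number of passes (`epochs`) in place of a convergence test, and the made-up toy data $y = 1 + 2x$ are illustrative only. Examples are visited in their stored order, as in the loop above, rather than being shuffled.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already contains x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def stochastic_gradient_descent(X, y, alpha, epochs):
    """Update theta after every single example, for a fixed number of passes."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Residual for this one example, computed before theta changes.
            error = yi - hypothesis(theta, xi)
            # Simultaneous update of every theta_j from that single residual.
            theta = [t + alpha * error * xj for t, xj in zip(theta, xi)]
    return theta


# Made-up, noise-free toy data generated from y = 1 + 2x (x_0 = 1 prepended).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = stochastic_gradient_descent(X, y, alpha=0.05, epochs=1000)
print(theta)  # should be close to [1.0, 2.0]
```

On this noise-free data every example is fitted exactly by the same $\theta$, so the per-example updates shrink to zero and the iterates settle. On noisy data with a fixed $\alpha$, the individual updates keep pulling in slightly different directions, which is why the iterates then only hover near the minimum.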