Linear Regression
For $y = ax + b$, univariate linear regression fits a straight line to the data, as in the sketch below:
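A minimal fit of $a$ and $b$ with NumPy (the data points here are purely illustrative):

```python
import numpy as np

# Illustrative noisy samples from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.05, 0.1, -0.05])

a, b = np.polyfit(x, y, deg=1)  # least-squares fit of y = a*x + b
print(a, b)                     # approximately 2 and 1
```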
Consider the case of multiple variables:
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \\
h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x
$$

(taking $x_0 = 1$ so the intercept is absorbed into the sum)
We choose a sensible error function (the loss function):
$$
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$
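A direct NumPy transcription of the hypothesis and loss (the function names are illustrative; `X` stacks the samples $x^{(i)T}$ as rows, with a leading column of ones for $x_0$):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for every row of X at once."""
    return X @ theta

def loss(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = hypothesis(theta, X) - y
    return 0.5 * np.sum(residual ** 2)
```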
When the loss function attains a minimum, the resulting $\theta$ is a local optimum. (For the quadratic function $J(\theta)$, the $\theta$ obtained at the minimum is in fact the global optimum.)
The closed-form expression for $\theta$ can be derived as follows.
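A standard sketch of the derivation in matrix form, assuming $X$ is the $m \times (n+1)$ design matrix whose rows are the $x^{(i)T}$ and $y$ is the target vector:

$$
\begin{aligned}
J(\theta) &= \frac{1}{2}(X\theta - y)^T(X\theta - y) \\
\nabla_\theta J(\theta) &= X^T(X\theta - y) \\
\nabla_\theta J(\theta) = 0 &\implies X^T X \theta = X^T y
\end{aligned}
$$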
This gives the optimal parameters in the least-squares sense:

$$
\theta = (X^T X)^{-1} X^T y
$$
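A minimal NumPy sketch of this closed-form solution (solving the normal equations directly rather than forming the explicit inverse, which is the numerically preferable route):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for the least-squares optimum."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative check: recover theta = [1, 2, 3] from noisy samples.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=100)
print(normal_equation(X, y))  # approximately [1, 2, 3]
```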
In particular, when $X^TX$ has high order, inverting it becomes expensive, and we still need gradient descent to compute a numerical solution.
The Gradient Descent Algorithm
Steps:
1. Initialize $\theta$ (randomly).
2. Iterate to obtain a new $\theta$ that makes $J(\theta)$ smaller.
3. If $J(\theta)$ can still decrease, return to step 2.
The update rule ($\alpha$ is called the learning rate):

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
$$
The gradient direction (essentially the partial derivative with respect to $\theta$), computed here for a single sample:
$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta) &= \frac{\partial}{\partial\theta_j}\frac{1}{2}\left(h_\theta(x) - y\right)^2 \\
&= 2 \cdot \frac{1}{2}\left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial\theta_j}\left(h_\theta(x) - y\right) \\
&= \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i - y\right) \\
&= \left(h_\theta(x) - y\right)x_j
\end{aligned}
$$
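A quick finite-difference check of this per-sample gradient (the values here are illustrative):

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -2.0])  # x_0 = 1 carries the intercept
y = 1.5

# Analytic gradient from the derivation: (h_theta(x) - y) * x_j
grad = (x @ theta - y) * x

# Central finite differences on J = 0.5 * (h_theta(x) - y)^2
eps = 1e-6
for j in range(len(theta)):
    t_hi, t_lo = theta.copy(), theta.copy()
    t_hi[j] += eps
    t_lo[j] -= eps
    numeric = (0.5 * (x @ t_hi - y) ** 2 - 0.5 * (x @ t_lo - y) ** 2) / (2 * eps)
    assert abs(numeric - grad[j]) < 1e-5
```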
Batch Gradient Descent
$$
\begin{aligned}
&\text{Repeat until convergence \{} \\
&\qquad \theta_j := \theta_j + \alpha\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \\
&\text{\}}
\end{aligned}
$$
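A minimal sketch of the batch update in NumPy, applying the rule to all $\theta_j$ at once (the learning rate and iteration count are illustrative and need tuning for the data scale):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Each step uses the error summed over all m samples."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = y - X @ theta           # y^(i) - h_theta(x^(i)) for every i
        theta += alpha * (X.T @ error)  # theta_j += alpha * sum_i error_i * x_j^(i)
    return theta
```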
Illustration of batch gradient descent:
Stochastic Gradient Descent
$$
\begin{aligned}
&\text{Loop \{} \\
&\qquad \text{for } i = 1 \text{ to } m \text{ \{} \\
&\qquad\qquad \theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \\
&\qquad \text{\}} \\
&\text{\}}
\end{aligned}
$$
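The corresponding sketch, updating $\theta$ one sample at a time (the per-epoch shuffle is a common practical addition not present in the pseudocode above):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10):
    """One update per sample: theta_j += alpha * (y^(i) - h_theta(x^(i))) * x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in np.random.permutation(X.shape[0]):  # shuffled pass over samples
            error = y[i] - X[i] @ theta
            theta += alpha * error * X[i]
    return theta
```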
Mini-batch Gradient Descent
Each update uses a small batch of $b$ samples, sitting between the two extremes above:

$$
\begin{aligned}
&\text{Repeat until convergence \{} \\
&\qquad \text{for each batch } B \text{ of } b \text{ samples \{} \\
&\qquad\qquad \theta_j := \theta_j + \alpha\sum_{i \in B}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \\
&\qquad \text{\}} \\
&\text{\}}
\end{aligned}
$$
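A matching sketch (the batch size `b = 32` is an illustrative default):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=32, n_epochs=10):
    """Each update sums the gradient over a batch of b samples."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, b):
            batch = idx[start:start + b]
            error = y[batch] - X[batch] @ theta
            theta += alpha * (X[batch].T @ error)
    return theta
```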
Summary of Gradient Descent Algorithms
Batch gradient descent has to compute the error over every sample, so it is less efficient than stochastic and mini-batch gradient descent; its advantage is that it converges steadily toward the optimum. Stochastic gradient descent picks samples at random to compute the gradient, so its convergence speed fluctuates and it can even oscillate (an overly large gradient step may overshoot the global optimum), but it is more efficient. Mini-batch gradient descent combines the strengths of both.