The Gradient and the Descent in Gradient Descent


阔海星沉


Category: Deep Learning. Tags: logistic regression, machine learning, artificial intelligence

Article link: https://blog.csdn.net/ManWZD/article/details/104022976


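Gradient descent underlies many supervised learning algorithms, and its role is critical. While studying it, though, a question came up. A partial derivative describes the direction and rate at which a function's value changes at a given point along one variable: if the variable changes by one unit, the function value changes by roughly the partial derivative times that unit, and the smaller the unit, the more accurate the approximation. Why, then, can the partial derivative with respect to a variable serve as that variable's step size in each iteration when searching iteratively for a function's minimum? This article answers the question through an experiment.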
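The "smaller unit, better approximation" remark can be checked with a small numeric sketch. The function `f` and the point `(u0, v0)` below are hypothetical choices made only for illustration; they do not come from the experiment in this article.

```python
def f(u, v):
    """Hypothetical example function: f(u, v) = u**2 + 3*u*v."""
    return u ** 2 + 3 * u * v


def partial_u(u, v):
    """Analytic partial derivative of f with respect to u."""
    return 2 * u + 3 * v


u0, v0 = 1.0, 2.0          # assumed evaluation point
exact = partial_u(u0, v0)  # partial derivative at that point

errors = []
for step in (1.0, 0.1, 0.01, 0.001):
    # Actual change in f when only u moves by `step`.
    actual_change = f(u0 + step, v0) - f(u0, v0)
    # Change predicted by "partial derivative times the unit".
    predicted_change = exact * step
    errors.append(abs(actual_change - predicted_change))
    print(f"step={step:<6} actual={actual_change:.6f} "
          f"predicted={predicted_change:.6f}")
```

For this particular function the gap between the actual and the predicted change equals the square of the step, so it vanishes much faster than the step itself.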

Reviewing the Formulas

Take fitting a first-degree (linear) function as an example. The training set is $(x,y)\in( \reals ^m, \reals ^m)$ and the learning rate $\alpha$ is a scalar:

Hypothesis function: $h_{\theta}(x)=\theta_0 + \theta_1 * x$
Cost function: $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^i)-y^i)^2$
Gradient descent:
- $\theta_0 := \theta_0 - \alpha\frac{\partial }{\partial \theta_0}J(\theta_0,\theta_1)$
- $\theta_1 := \theta_1 - \alpha\frac{\partial }{\partial \theta_1}J(\theta_0,\theta_1)$
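Expanded partial derivatives (the superscript $i$ indexes the training examples):
- $\frac{\partial }{\partial \theta_0}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i) = a*\theta_0+b \space (a=1)$
- $\frac{\partial }{\partial \theta_1}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)x^i = c*\theta_1+d \space (c>0)$
Each partial derivative is therefore linear in its own parameter: $b$ does not depend on $\theta_0$, $d$ does not depend on $\theta_1$, and $c=\frac{1}{m}\sum_{i=1}^{m}(x^i)^2$, which is positive whenever some $x^i$ is nonzero.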
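The partial derivatives of the cost function can be sanity-checked numerically. The sketch below is written in Python rather than the Octave used later in the article, and the toy data `xs`, `ys` and the evaluation point are hypothetical. It compares the analytic partial derivatives with central finite differences of $J$, then measures how each partial derivative changes when its own parameter grows by 1, which should give the slopes $a=1$ and $c=\frac{1}{m}\sum (x^i)^2$.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared residuals."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta0, theta1, xs, ys):
    """Analytic partial derivatives of J w.r.t. theta0 and theta1."""
    m = len(xs)
    residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    g0 = sum(residuals) / m
    g1 = sum(r * x for r, x in zip(residuals, xs)) / m
    return g0, g1


# Hypothetical toy data, not the article's training set.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = 0.5, -1.0  # arbitrary evaluation point

g0, g1 = gradient(theta0, theta1, xs, ys)

# Central finite differences of the cost function.
eps = 1e-6
num_g0 = (cost(theta0 + eps, theta1, xs, ys)
          - cost(theta0 - eps, theta1, xs, ys)) / (2 * eps)
num_g1 = (cost(theta0, theta1 + eps, xs, ys)
          - cost(theta0, theta1 - eps, xs, ys)) / (2 * eps)

# Slope of each partial derivative with respect to its own parameter.
slope0 = gradient(theta0 + 1.0, theta1, xs, ys)[0] - g0
slope1 = gradient(theta0, theta1 + 1.0, xs, ys)[1] - g1
mean_x_sq = sum(x * x for x in xs) / len(xs)

print("analytic:", g0, g1)
print("numeric: ", num_g0, num_g1)
print("slopes:  ", slope0, slope1, "mean of x^2:", mean_x_sq)
```

Because $J$ is quadratic in both parameters, the finite differences agree with the analytic values up to rounding error.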

Experiment

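function theta = test_grad(theta, alpha, num_iters)
% Run batch gradient descent on a univariate linear fit and plot its history.

load('test_alpha_train.mat');  % expected to provide the column vectors x and y
figure(1)
subplot(3,2,1)
plot(x, y, 'ro');
title('Data Set And Regress Result')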

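m = length(y);        % number of training examples
X = [ones(m, 1) x];   % design matrix with a leading column of ones for theta0
J_history = [];
Theta_history = [];
Grad_history = [];
itv = max(1, floor(num_iters/100));  % sampling interval: about 100 recorded points
for iter = 1:num_iters
  grad = X'*(X*theta-y)/m;     % partial derivatives of J w.r.t. theta0 and theta1
  theta = theta - alpha*grad;  % simultaneous update of both parameters
  if mod(iter,itv) == 0
    delt = X*theta-y;
    J_history = [J_history;(delt'*delt)/2/m];  % cost after this update
    Theta_history = [Theta_history theta];
    Grad_history = [Grad_history grad];        % gradient used for this update
  end
end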

y_pre = X*theta;
hold on 
plot(x, y_pre)
hold off

t = 1:(length(J_history));
subplot(3,2,2)
plot(t, J_history)
title('History of Costfunction J')
subplot(3,2,3)
plot(t, Theta_history(1,:))
title('History of Theta0')
subplot(3,2,4)
plot(t, Theta_history(2,:), 'r')
title('History of Theta1')
subplot(3,2,5)
plot(t, Grad_history(1,:))
title('History of Grad on Theta0')
subplot(3,2,6)
plot(t, Grad_history(2,:), 'r')
title('History of Grad on Theta1')

end

test_grad([0;0], 0.00002,300000)

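After nearly 10 attempts I settled on $\alpha=0.00002$, with which the run converges after 300000 iterations.
The figure below shows the test results:
- Plot 1: the fit after convergence; plot 2: the value of the cost function during iteration; plot 3: the value of $\theta_0$ during iteration; plot 4: the value of $\theta_1$ during iteration; plot 5: the partial derivative with respect to $\theta_0$ during iteration; plot 6: the partial derivative with respect to $\theta_1$ during iteration;
- The absolute values of the partial derivatives keep approaching 0, which shows that the parameters are gradually approaching the minimum;
- The absolute changes in $\theta_0$ and $\theta_1$ differ by 3 orders of magnitude, yet their rates of change (relative proportions) are almost identical;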
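The last observation can be reproduced in a small Python sketch. Everything in it is an assumption made for illustration: the data are noise-free and generated from $y = 318 + 0.49x$ so that the minimum is known exactly, the `xs` values and `alpha` are invented, and nothing here uses the article's `test_alpha_train.mat`. The sketch runs plain batch gradient descent and compares, for each parameter, how much of the remaining distance to the minimum is left after a further 1000 iterations, alongside the absolute size of the latest step.

```python
# Hypothetical noise-free data (NOT the article's test_alpha_train.mat):
# y = 318 + 0.49 * x, so the minimum is known to be theta = (318, 0.49).
TRUE_THETA = (318.0, 0.49)
xs = [100.0, 200.0, 300.0, 400.0, 500.0]
ys = [TRUE_THETA[0] + TRUE_THETA[1] * x for x in xs]


def gradient(theta, xs, ys):
    """Partial derivatives of J with respect to theta0 and theta1."""
    m = len(xs)
    residuals = [theta[0] + theta[1] * x - y for x, y in zip(xs, ys)]
    return (sum(residuals) / m,
            sum(r * x for r, x in zip(residuals, xs)) / m)


def run(theta, alpha, num_iters):
    """Plain batch gradient descent; returns theta and the last step taken."""
    step = (0.0, 0.0)
    for _ in range(num_iters):
        g = gradient(theta, xs, ys)
        step = (alpha * g[0], alpha * g[1])
        theta = (theta[0] - step[0], theta[1] - step[1])
    return theta, step


alpha = 1e-5  # assumed value that keeps this particular data set stable
theta_a, _ = run((0.0, 0.0), alpha, 1000)      # state after 1000 iterations
theta_b, last_step = run(theta_a, alpha, 1000)  # state after 2000 iterations

# Remaining distance to the minimum for each parameter.
dist_a = [abs(t - s) for t, s in zip(theta_a, TRUE_THETA)]
dist_b = [abs(t - s) for t, s in zip(theta_b, TRUE_THETA)]

shrink = [b / a for a, b in zip(dist_a, dist_b)]    # relative progress
step_ratio = abs(last_step[0]) / abs(last_step[1])  # absolute step sizes

print("remaining-distance ratio per parameter:", shrink)
print("|step theta0| / |step theta1|:", step_ratio)
```

On this synthetic data the two remaining-distance ratios come out practically equal, while the absolute step for $\theta_0$ is hundreds of times larger than the step for $\theta_1$. The run is deliberately stopped long before convergence, since only the relative progress is being compared.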

Conclusion

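The learning rate $\alpha$ is a scalar; I think of it as a "constant" gradient within gradient descent. The partial derivative $\frac{\partial }{\partial \theta_j}J(\theta_0,\theta_1)$ represents the direction and magnitude of the descent. Because different variables lie at different "distances" from the minimum (318 for $\theta_0$ and 0.49 for $\theta_1$ in the experiment), reaching the minimum in the same number of iterations requires a different step size per iteration for each variable.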
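The conclusion can be made concrete with a simplified one-parameter view, offered as a sketch rather than a full analysis. Assume the other parameter is held fixed, so that the partial derivative has the linear form $a\theta+b$ with $a>0$, as in the expanded formulas, and write $\theta^{*}=-b/a$ for the point where it vanishes (a symbol introduced only for this illustration). One update then reads:

$$
\begin{aligned}
\theta_{\text{new}} &= \theta - \alpha(a\theta + b) = \theta - \alpha a(\theta - \theta^{*}) \\
\theta_{\text{new}} - \theta^{*} &= (1 - \alpha a)(\theta - \theta^{*})
\end{aligned}
$$

Under this assumption each step equals $\alpha a$ times the remaining distance, so a parameter that starts farther from its minimum takes proportionally larger steps, and the distance shrinks by the same factor $(1-\alpha a)$ in every iteration provided $|1-\alpha a|<1$.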
