Gradient Descent (Part 2): Adaptive Gradient (AdaGrad), Root Mean Square Propagation (RMSProp), AdaDelta, Adaptive Moment Estimation (Adam), and Nesterov-accelerated Adaptive Moment Estimation (Nadam)

Preface

In the previous post, for ease of demonstration we assigned different learning rates to $\theta = [\theta_0, \theta_1]$. But when there are many parameters (as in a neural network), hand-tuning a separate learning rate for each one is tedious and impractical, while a single shared learning rate does not necessarily suit every parameter; for example, in the previous post $\eta_0 = 0.1$ was far larger than $\eta_1 = 0.002/0.005$.

So for the parameters $\theta$:

  1. With a large global learning rate, $\theta_1$ fails to fit and may even diverge.
  2. With a small global learning rate, $\theta_0$ converges slowly and needs many more iterations.

Can we therefore adapt the learning rate automatically to different parts of the parameter space?

AdaGrad (Adaptive Gradient)

AdaGrad adjusts the learning rate dynamically during training, giving each parameter its own learning rate based on its accumulated sum of squared gradients. The update rule is:

$$s := s + \nabla_\theta J(\theta) \odot \nabla_\theta J(\theta)$$

$$\theta := \theta - \frac{\eta}{\sqrt{s+\epsilon}} \odot \nabla_\theta J(\theta)$$

where $\odot$ is the element-wise product, so $\nabla_\theta J(\theta) \odot \nabla_\theta J(\theta)$ is the element-wise square of the gradient, and $\epsilon$ is a tiny term that prevents division by zero and keeps the computation numerically stable; here we take $\epsilon = 10^{-6}$.

Because $s$ accumulates the squared gradients:

  1. Parameters whose gradients stay large see their learning rates shrink quickly, i.e., high-frequency features get smaller learning rates.
  2. Parameters whose gradients stay small see their learning rates shrink slowly, i.e., low-frequency features get larger learning rates.
  3. Because the accumulation never resets, the learning rate decays monotonically, which matches the intuition that smaller steps are needed near the optimum in later iterations.

Advantage: every parameter moves at its own pace.
Disadvantage: because the learning rate keeps decaying, shrinking too fast early in the iteration can leave too little drive to converge later, so AdaGrad may fail to reach a satisfactory result (see the sketch below).
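To make the decay concrete, here is a minimal MATLAB sketch (not part of the original scripts; the constant gradient g is an illustrative assumption) showing how the effective step size $\frac{\eta}{\sqrt{s+\epsilon}}$ shrinks as the squared gradients accumulate:

% Minimal sketch: AdaGrad step-size decay under an assumed constant gradient g.
eta = 8; epsv = 1e-6; g = 1;        % learning rate, stability term, constant gradient
s = 0;                              % accumulated squared gradients
steps = zeros(1,50);
for t = 1:50
    s = s + g^2;                    % s grows linearly: s = t*g^2
    steps(t) = eta/sqrt(s + epsv);  % effective step size ~ eta/(|g|*sqrt(t))
end
disp(steps([1 10 50]));             % roughly 8.00, 2.53, 1.13: monotone decay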

The figure below shows the iterations of mini-batch gradient descent with AdaGrad, using learning rate = 8, batchSize = 32, iterations = 50.

[Figure: AdaGrad iteration process (data fit, loss surface, contour path, and loss curve)]

The MATLAB code for AdaGrad is given below:

% Written by weichen GU, date 4/23/2020
clear, clf, clc;
data = linspace(-20,20,100);                    % x range
col = length(data);                             % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10];       % Generate dataset - y = 0.5 * x + wgn^2 + 10;
X = [ones(1, col); data(1,:)]';                 % X ->[1;X];

t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);

theta =[-30;-4];        % Initialize parameters
LRate = 8;              % Learning rate
thresh = 0.5;           % Loss threshold for stopping the iteration
iteration = 50;         % The number of iterations

lineX = linspace(-30,30,100);
[row, col] = size(data)                                     % Obtain the size of dataset
lineMy = [lineX; theta(1) + theta(2)*lineX];                % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2);    % draw fitting line

loss = getLoss(X,data(2,:)',col,theta);                     % Obtain current loss value

subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;

% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');

% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')

% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using AdaGrad');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');

set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;

batchSize = 32;

s = 0;
for iter = 1 : iteration
    delete(hLine) % set(hLine,'visible','off')

    
    %[thetaOut] = GD(X,data(2,:)',theta,LRate); % Gradient Descent algorithm
    [thetaOut,s] = MBGD(X,data(2,:)',theta,LRate,batchSize,s);
    subplot(2,2,3);
    line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')

    theta = thetaOut;
    loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
    
    
    lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
    subplot(2,2,1);
    hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
    %legend('training data','linear regression');
    
    subplot(2,2,2);
    scatter3(theta(1),theta(2),loss,'r*');
    
    subplot(2,2,3);
    plot(theta(1),theta(2),'r*')
    
    
    subplot(2,2,4)
    plot(iter,loss,'b*');
    
    drawnow();
    
    if(loss < thresh)
        break;
    end
end

hold off


function [Z] = getTotalCost(X,Y, num,meshX,meshY);
    [row,col] = size(meshX);
    Z = zeros(row, col);
    for i = 1 : row
        theta = [meshX(i,:); meshY(i,:)];
        Z(i,:) =  1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);   
    end

end


function [Z] = getLoss(X,Y, num,theta)
    Z= 1/(2*num)*sum((X*theta-Y).^2);
end

function [thetaOut] = GD(X,Y,theta,eta)
    dataSize = length(X);                       % Obtain the number of data
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    thetaOut = theta -eta.*dx;                  % Update parameters(theta)
end

function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = s + dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end


% @ Depscription: 
%       Mini-batch Gradient Descent (MBGD)
%       Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
%       X - [1 X_] X_ is actual X; Y - actual Y
%       theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
%       eta - learning rate;
%
function [thetaOut,s] = MBGD(X,Y,theta, eta,batchSize,s) 
    dataSize = length(X);           % obtain the number of data 
    k = fix(dataSize/batchSize);    % obtain the number of batch which has absolutely same size: k = batchNum-1;    
    batchIdx = randperm(dataSize);  % randomly sort for every epoch for achiving sample diversity
    
    batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize);   % batches which has absolutely same size
    batchIdx2 = batchIdx(k*batchSize+1:end);                    % ramained batch
    
    for i = 1 : k
        %thetaOut = GD(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta);
        [thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
    end
    if(~isempty(batchIdx2))
        %thetaOut = GD(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta);
        [thetaOut,s] = AdaGrad(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s);
    end
end

RMSProp (Root Mean Square Propagation)

To address AdaGrad's overly fast learning-rate decay, RMSProp replaces the accumulated sum of squared gradients with an exponentially weighted moving average (accumulating local gradient information), so that gradients far from the current point contribute little. The update rule is:

$$s := \beta \cdot s + (1-\beta)\cdot\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)$$

$$\theta := \theta - \frac{\eta}{\sqrt{s+\epsilon}}\odot\nabla_\theta J(\theta)$$

where $\beta$ is RMSProp's decay factor (below we use $\beta = 0.99$), $s$ is the exponentially weighted moving average of the squared gradients with initial value 0, and $\odot$ is the element-wise product. The hyperparameters are commonly set to $\beta = 0.9$, $\eta = 0.001$.

Advantage: by adding a decay factor on top of AdaGrad, it balances past and current gradient information when adapting the learning rate, mitigating the sharp learning-rate drop caused by unbounded accumulation and preventing learning from stalling too early.
Disadvantage: it introduces the hyperparameter $\beta$, adding model complexity, and it still depends on the global learning rate $\eta$. (A small comparison sketch follows.)
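To see the difference quantitatively, here is a minimal MATLAB sketch (not from the original scripts; the constant gradient is an illustrative assumption) comparing the effective step size $\frac{\eta}{\sqrt{s+\epsilon}}$ under AdaGrad and RMSProp:

% Minimal sketch: AdaGrad vs RMSProp effective step size under an assumed
% constant gradient g = 1.
eta = 0.5; epsv = 1e-6; g = 1; beta = 0.99;
sA = 0; sR = 0;
for t = 1:200
    sA = sA + g^2;                  % AdaGrad: unbounded accumulation
    sR = beta*sR + (1-beta)*g^2;    % RMSProp: EWMA, converges toward g^2
end
fprintf('AdaGrad step after 200 iterations: %.4f\n', eta/sqrt(sA+epsv));  % ~0.035
fprintf('RMSProp step after 200 iterations: %.4f\n', eta/sqrt(sR+epsv));  % ~0.54

The AdaGrad step keeps shrinking like $1/\sqrt{t}$, while the RMSProp step settles near $\eta/|g|$.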

Here we use a smaller learning rate η = 0.5 to compare the convergence of AdaGrad and RMSProp:

[Figure 1: Convergence of AdaGrad (η = 0.5)]

[Figure 2: Convergence of RMSProp (η = 0.5, β = 0.99)]

The comparison shows that RMSProp mitigates the sharp learning-rate drop caused by unbounded gradient accumulation and prevents learning from stalling too early.

Why does oscillation appear along the $\theta_1$ direction here? It is the combined effect of a learning rate that is still too large in the $\theta_1$ direction (the main factor) and insufficient damping from the decay factor (a secondary factor).

The MATLAB code for RMSProp is given below:

% Written by weichen GU, date 4/23/2020
clear, clf, clc;
data = linspace(-20,20,100);                    % x range
col = length(data);                             % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10];       % Generate dataset - y = 0.5 * x + wgn^2 + 10;
X = [ones(1, col); data(1,:)]';                 % X ->[1;X];

t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);

theta =[-30;-4];        % Initialize parameters
LRate = 0.5;  % Learning rate
thresh = 0.5;           % Threshold of loss for jumping iteration
iteration = 50;         % The number of iterations

lineX = linspace(-30,30,100);
[row, col] = size(data)                                     % Obtain the size of dataset
lineMy = [lineX; theta(1) + theta(2)*lineX];                % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2);    % draw fitting line

loss = getLoss(X,data(2,:)',col,theta);                     % Obtain current loss value

subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;

% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');

% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')

% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using RMSProp');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');

set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;

batchSize = 32;

s = 0;
beta = 0.99;

for iter = 1 : iteration
    delete(hLine) % set(hLine,'visible','off')
    
    [thetaOut,s] = MBGD(X,data(2,:)',theta,LRate,batchSize,s,beta);
    subplot(2,2,3);
    line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')

    theta = thetaOut;
    loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
    
    
    lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
    subplot(2,2,1);
    hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
    %legend('training data','linear regression');
    
    subplot(2,2,2);
    scatter3(theta(1),theta(2),loss,'r*');
    
    subplot(2,2,3);
    plot(theta(1),theta(2),'r*')
    
    
    subplot(2,2,4)
    plot(iter,loss,'b*');
    
    drawnow();
    
    if(loss < thresh)
        break;
    end
end

hold off


function [Z] = getTotalCost(X,Y, num,meshX,meshY);
    [row,col] = size(meshX);
    Z = zeros(row, col);
    for i = 1 : row
        theta = [meshX(i,:); meshY(i,:)];
        Z(i,:) =  1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);   
    end

end


function [Z] = getLoss(X,Y, num,theta)
    Z= 1/(2*num)*sum((X*theta-Y).^2);
end

function [thetaOut] = GD(X,Y,theta,eta)
    dataSize = length(X);                       % Obtain the number of data
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    thetaOut = theta -eta.*dx;                  % Update parameters(theta)
end

function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = s + dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,beta)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = beta*s + (1-beta)*dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end

% @ Depscription: 
%       Mini-batch Gradient Descent (MBGD)
%       Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
%       X - [1 X_] X_ is actual X; Y - actual Y
%       theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
%       eta - learning rate;
%
function [thetaOut,s] = MBGD(X,Y,theta, eta,batchSize,s,beta) 
    dataSize = length(X);           % obtain the number of data 
    k = fix(dataSize/batchSize);    % obtain the number of batch which has absolutely same size: k = batchNum-1;    
    batchIdx = randperm(dataSize);  % randomly sort for every epoch for achiving sample diversity
    
    batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize);   % batches which has absolutely same size
    batchIdx2 = batchIdx(k*batchSize+1:end);                    % ramained batch
    
    for i = 1 : k
        %[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
        [thetaOut,s] = RMSProp(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta);
    end
    if(~isempty(batchIdx2))
        %[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
        [thetaOut,s] = RMSProp(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta);
    end
end

AdaDelta

AdaDelta is another refinement of AdaGrad. Compared with RMSProp, it replaces the global learning rate $\eta$ with an exponentially weighted moving average of the squared parameter updates $\Delta\theta$. The idea is to approximate a second-order Newton step using only first-order information. The update rule is:

$$s_g := \beta\cdot s_g + (1-\beta)\cdot\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)$$

$$\Delta\theta := \frac{RMS[\Delta\theta]}{RMS[\nabla_\theta J(\theta)]}\cdot\nabla_\theta J(\theta) = \sqrt{\frac{s_{\Delta\theta}+\epsilon}{s_g+\epsilon}}\cdot\nabla_\theta J(\theta)$$

$$s_{\Delta\theta} := \beta\cdot s_{\Delta\theta} + (1-\beta)\cdot\Delta\theta\odot\Delta\theta$$

$$\theta := \theta - \Delta\theta$$

where $s_g$ is the exponentially weighted moving average of the squared gradients and $s_{\Delta\theta}$ is the exponentially weighted moving average of the squared parameter updates, both initialized to 0; $\epsilon$ is a constant that keeps the computation numerically stable, usually set to $10^{-6}$.

In AdaDelta, the numerator can be viewed as a momentum-like acceleration term that accumulates the previous update magnitudes with exponential weighting. The denominator is the same as in RMSProp, so RMSProp can be regarded as a special case of AdaDelta. (A short sketch of the Newton-step intuition follows.)
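The claim that AdaDelta approximates a second-order (Newton) step with first-order quantities can be made slightly more concrete. The following is a short sketch of the standard unit-matching argument, not from the original post: a Newton step satisfies

$$\Delta\theta = H^{-1}\nabla_\theta J(\theta) \quad\Longrightarrow\quad H^{-1} \approx \frac{\Delta\theta}{\nabla_\theta J(\theta)},$$

so replacing the numerator and denominator by their running RMS estimates gives

$$\Delta\theta \approx \frac{RMS[\Delta\theta]}{RMS[\nabla_\theta J(\theta)]}\cdot\nabla_\theta J(\theta),$$

which is exactly the AdaDelta step above; the ratio restores the units of $\theta$ that a plain gradient step lacks.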

Advantage: no learning rate needs to be set manually.

Here we observe the iterations of AdaDelta with decay factor β = 0.5. Because my data has low variance along the $\theta_1$ direction, I set ε = 10^-4 so that the effective learning rate does not become too small. The iteration process of AdaDelta is shown below:

[Figure: AdaDelta iteration process (β = 0.5, ε = 10^-4)]

The MATLAB code for AdaDelta follows; you can try adjusting the hyperparameters β and ε to observe how the convergence changes:

% Written by weichen GU, date 4/23/2020
clear, clf, clc;
data = linspace(-20,20,100);                    % x range
col = length(data);                             % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10];       % Generate dataset - y = 0.5 * x + wgn^2 + 10;
X = [ones(1, col); data(1,:)]';                 % X ->[1;X];

t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);

theta =[-30;-4];        % Initialize parameters
%LRate = 0.5;  % Learning rate
thresh = 0.5;           % Threshold of loss for jumping iteration
iteration = 300;        % The number of iterations

lineX = linspace(-30,30,100);
[row, col] = size(data)                                     % Obtain the size of dataset
lineMy = [lineX; theta(1) + theta(2)*lineX];                % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2);    % draw fitting line

loss = getLoss(X,data(2,:)',col,theta);                     % Obtain current loss value

subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;

% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');

% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')

% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using AdaDelta');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');

set(gca,'XLim',[0 iteration]);

hold on;

batchSize = 32;

s_g = 0; s_t = 0;
beta = 0.5;

for iter = 1 : iteration
    delete(hLine) % set(hLine,'visible','off')
    
    [thetaOut,s_g,s_t] = MBGD(X,data(2,:)',theta,batchSize,s_g,s_t,beta);
    subplot(2,2,3);
    line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')

    theta = thetaOut;
    loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
    
    
    lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
    subplot(2,2,1);
    hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
    %legend('training data','linear regression');
    
    subplot(2,2,2);
    scatter3(theta(1),theta(2),loss,'r*');
    
    subplot(2,2,3);
    plot(theta(1),theta(2),'r*')
    
    
    subplot(2,2,4)
    plot(iter,loss,'b*');
    
    drawnow();
    
    if(loss < thresh)
        break;
    end
end

hold off


function [Z] = getTotalCost(X,Y, num,meshX,meshY);
    [row,col] = size(meshX);
    Z = zeros(row, col);
    for i = 1 : row
        theta = [meshX(i,:); meshY(i,:)];
        Z(i,:) =  1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);   
    end

end


function [Z] = getLoss(X,Y, num,theta)
    Z= 1/(2*num)*sum((X*theta-Y).^2);
end

function [thetaOut] = GD(X,Y,theta,eta)
    dataSize = length(X);                       % Obtain the number of data
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    thetaOut = theta -eta.*dx;                  % Update parameters(theta)
end

function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = s + dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,beta)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = beta*s + (1-beta)*dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end
function [thetaOut,s_g, s_t] = AdaDelta(X,Y,theta,s_g,s_t,beta)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-4*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s_g = beta*s_g + (1-beta)*dx.*dx; 
    dt = sqrt((s_t+eps)./(s_g+eps)).*dx;
    s_t = beta*s_t + (1-beta)*dt.*dt;
    thetaOut = theta - dt; % Update parameters(theta)
end

% @ Depscription: 
%       Mini-batch Gradient Descent (MBGD)
%       Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
%       X - [1 X_] X_ is actual X; Y - actual Y
%       theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
%       etaK, etaB - learning rate;
%
function [thetaOut,s_g, s_t] = MBGD(X,Y,theta,batchSize,s_g,s_t,beta) 
    dataSize = length(X);           % obtain the number of data 
    k = fix(dataSize/batchSize);    % obtain the number of batch which has absolutely same size: k = batchNum-1;    
    batchIdx = randperm(dataSize);  % randomly sort for every epoch for achiving sample diversity
    
    batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize);   % batches which has absolutely same size
    batchIdx2 = batchIdx(k*batchSize+1:end);                    % ramained batch
    
    for i = 1 : k
        [thetaOut,s_g, s_t] = AdaDelta(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,s_g,s_t,beta);
    end
    if(~isempty(batchIdx2))
        [thetaOut,s_g, s_t] = AdaDelta(X(batchIdx2,:),Y(batchIdx2),thetaOut,s_g,s_t,beta);
    end
end

Adam (Adaptive Moment Estimation)

Adam combines the ideas of RMSProp and Momentum, achieving both an adaptive learning rate and momentum-accelerated convergence. For momentum gradient descent, see the previous post. The update rule is:
$$m := \gamma\cdot m + (1-\gamma)\cdot\nabla_\theta J(\theta)$$

$$s := \beta\cdot s + (1-\beta)\cdot\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)$$

$$\hat{m} := \frac{m}{1-\gamma^t}, \qquad \hat{s} := \frac{s}{1-\beta^t}$$

$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\hat{m}$$

where $\hat{s}$ and $\hat{m}$ are the bias-corrected values of $s$ and $m$; the correction makes the weights on past gradients sum to 1 and prevents the estimates from being too small in the first iterations.
The hyperparameters are typically set to β = 0.999, γ = 0.9, ε = 10^-8.
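Why dividing by $1-\beta^t$ removes the startup bias (a short sketch, not from the original post, assuming roughly stationary gradients): unrolling the recursion with $s$ initialized to 0 gives

$$s_t = (1-\beta)\sum_{i=1}^{t}\beta^{\,t-i}\,\nabla_\theta J(\theta_i)\odot\nabla_\theta J(\theta_i), \qquad \mathbb{E}[s_t] \approx \left(1-\beta^t\right)\mathbb{E}\big[\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)\big],$$

so $\hat{s} = s_t/(1-\beta^t)$ is an approximately unbiased estimate of the mean squared gradient; the same argument applies to $m$ with $\gamma$.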

The figure below shows the convergence of Adam. We use the same learning rate η = 0.5 and decay factor β = 0.99 as in the RMSProp example above, and set the momentum decay factor γ = 0.9. Compared with RMSProp, Adam's momentum accelerates convergence along the $\theta_0$ direction and suppresses the oscillation along the $\theta_1$ direction.

[Figure: Adam iteration process (η = 0.5, β = 0.99, γ = 0.9)]

The MATLAB implementation of Adam follows:

% Written by weichen GU, date 4/19/2020
clear, clf, clc;
data = linspace(-20,20,100);                    % x range
col = length(data);                             % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10];       % Generate dataset - y = 0.5 * x + wgn^2 + 10;
X = [ones(1, col); data(1,:)]';                 % X ->[1;X];

t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);

theta =[-30;-4];        % Initialize parameters
LRate = 0.5;  % Learning rate
thresh = 0.5;           % Threshold of loss for jumping iteration
iteration = 100;        % The number of iterations

lineX = linspace(-30,30,100);
[row, col] = size(data)                                     % Obtain the size of dataset
lineMy = [lineX; theta(1) + theta(2)*lineX];                % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2);    % draw fitting line

loss = getLoss(X,data(2,:)',col,theta);                     % Obtain current loss value

subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;

% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');

% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')

% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using Adam');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');

set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;

batchSize = 32;

s = 0;
beta = 0.99;
momentum = 0;
gamma = 0.9;
cnt = 0;
for iter = 1 : iteration
    cnt = cnt+1;
    delete(hLine) % set(hLine,'visible','off')
    
    [thetaOut,s,momentum] = MBGD(X,data(2,:)',theta,LRate,batchSize,s,beta,momentum,gamma,cnt);
    subplot(2,2,3);
    line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')

    theta = thetaOut;
    loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
    
    
    lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
    subplot(2,2,1);
    hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
    %legend('training data','linear regression');
    
    subplot(2,2,2);
    scatter3(theta(1),theta(2),loss,'r*');
    
    subplot(2,2,3);
    plot(theta(1),theta(2),'r*')
    
    
    subplot(2,2,4)
    plot(iter,loss,'b*');
    
    drawnow();
    
    if(loss < thresh)
        break;
    end
end

hold off


function [Z] = getTotalCost(X,Y, num,meshX,meshY);
    [row,col] = size(meshX);
    Z = zeros(row, col);
    for i = 1 : row
        theta = [meshX(i,:); meshY(i,:)];
        Z(i,:) =  1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);   
    end

end


function [Z] = getLoss(X,Y, num,theta)
    Z= 1/(2*num)*sum((X*theta-Y).^2);
end

function [thetaOut] = GD(X,Y,theta,eta)
    dataSize = length(X);                       % Obtain the number of data
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    thetaOut = theta -eta.*dx;                  % Update parameters(theta)
end

function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = s + dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,decay)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = decay*s + (1-decay)*dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end
function [thetaOut,s, momentum] = Adam(X,Y,theta,eta,s,beta,momentum,gamma,cnt)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));              % Obtain the gradient of Loss function
    s = beta*s + (1-beta)*dx.*dx;                   % Update s
    momentum = gamma*momentum + (1-gamma).*dx;       % Update momentum
    momentum_bar = momentum/(1-gamma^cnt);
    s_bar = s /(1-beta^cnt);
    thetaOut = theta - eta./sqrt(eps+s_bar).*momentum_bar;   % Update parameters(theta)
end


% @ Depscription: 
%       Mini-batch Gradient Descent (MBGD)
%       Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
%       X - [1 X_] X_ is actual X; Y - actual Y
%       theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
%       etaK, etaB - learning rate;
%
function [thetaOut,s,momentum] = MBGD(X,Y,theta, eta,batchSize,s,beta,momentum,gamma,cnt) 
    dataSize = length(X);           % obtain the number of data 
    k = fix(dataSize/batchSize);    % obtain the number of batch which has absolutely same size: k = batchNum-1;    
    batchIdx = randperm(dataSize);  % randomly sort for every epoch for achiving sample diversity
    
    batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize);   % batches which has absolutely same size
    batchIdx2 = batchIdx(k*batchSize+1:end);                    % ramained batch
    
    for i = 1 : k
        %[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
        [thetaOut,s,momentum] = Adam(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta,momentum,gamma,cnt);
    end
    if(~isempty(batchIdx2))
        %[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
        [thetaOut,s,momentum] = Adam(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta,momentum,gamma,cnt);
    end
end

Nadam (Nesterov-accelerated Adaptive Moment Estimation)

Nadam is essentially Adam with a Nesterov momentum term: it uses the look-ahead gradient to update the current step; see NAG in the previous post. The update rule is:

$$m := \gamma\cdot m + (1-\gamma)\cdot\nabla_\theta J(\theta)$$

$$s := \beta\cdot s + (1-\beta)\cdot\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)$$

$$\hat{g} := \frac{\nabla_\theta J(\theta)}{1-\gamma^t}, \qquad \hat{m} := \frac{m}{1-\gamma^t}, \qquad \hat{s} := \frac{s}{1-\beta^t}$$

$$\bar{m} := (1-\gamma)\cdot\hat{g} + \gamma\cdot\hat{m}$$

$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\bar{m}$$
Here is a proof from Zhihu user 溪亭日暮:
From Adam we have

$$m := \gamma\cdot\dot{m} + (1-\gamma)\cdot\nabla_\theta J(\theta), \qquad \hat{m} = \frac{m}{1-\gamma^t}, \qquad \theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\hat{m}$$

where $\dot{m}$ denotes the accumulated momentum from the previous step. Substituting the first two relations into the update gives

$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\left(\frac{\gamma\cdot\dot{m}}{1-\gamma^t} + \frac{(1-\gamma)\cdot\nabla_\theta J(\theta)}{1-\gamma^t}\right)$$

Since dividing by $1-\gamma^t$ is just the bias correction, this can be rewritten as

$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\left(\gamma\cdot\hat{\dot{m}} + \frac{(1-\gamma)\cdot\nabla_\theta J(\theta)}{1-\gamma^t}\right)$$

Compare this with the expanded update of classical momentum,

$$\theta := \theta - (\gamma\cdot\dot{m} + \eta\cdot g_t)$$

where $\dot{m}$ again denotes the current accumulated momentum. Replacing the bias-corrected previous momentum $\hat{\dot{m}}$ with the bias-corrected current momentum $\hat{m}$ (the Nesterov look-ahead), we obtain

$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\left(\gamma\cdot\hat{m} + \frac{(1-\gamma)\cdot\nabla_\theta J(\theta)}{1-\gamma^t}\right)$$

which completes the derivation. To stay consistent with the earlier formulas I did not introduce $m_{t-1}$ in place of $\dot{m}$; if anything is unclear, please refer to the linked original.
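As a minimal numeric check (a sketch with assumed scalar values; variable names roughly follow the Nadam function in the script below), here is one update step of Adam versus Nadam on the same gradient, showing that the only difference is the blended numerator $\bar{m}$:

% One Adam step vs one Nadam step on the same (assumed) scalar gradient.
eta = 0.5; gamma = 0.9; beta = 0.99; epsv = 1e-7; t = 1;
dx = 2;                                  % current gradient (assumed value)
m = gamma*0 + (1-gamma)*dx;              % momentum, zero-initialized
s = beta*0 + (1-beta)*dx*dx;             % squared-gradient EWMA, zero-initialized
m_hat = m/(1-gamma^t);  s_hat = s/(1-beta^t);  g_hat = dx/(1-gamma^t);
adam_step  = eta/sqrt(epsv+s_hat)*m_hat;                           % Adam uses m_hat
nadam_step = eta/sqrt(epsv+s_hat)*((1-gamma)*g_hat + gamma*m_hat); % Nadam uses m_bar
fprintf('Adam: %.4f  Nadam: %.4f\n', adam_step, nadam_step);       % 0.5000 vs 0.9500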

The convergence process of Nadam is shown below:

[Figure: Nadam iteration process (η = 0.5, β = 0.99, γ = 0.9)]
We can see that, compared with Adam, Nadam suppresses part of the oscillation.

The MATLAB code for Nadam is given below:

% Written by weichen GU, date 4/19/2020
clear, clf, clc;
data = linspace(-20,20,100);                    % x range
col = length(data);                             % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10];       % Generate dataset - y = 0.5 * x + wgn^2 + 10;
X = [ones(1, col); data(1,:)]';                 % X ->[1;X];

t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);

theta =[-30;-4];        % Initialize parameters
LRate = 0.5;  % Learning rate
thresh = 0.5;           % Threshold of loss for jumping iteration
iteration = 100;        % The number of iterations

lineX = linspace(-30,30,100);
[row, col] = size(data)                                     % Obtain the size of dataset
lineMy = [lineX; theta(1) + theta(2)*lineX];                % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2);    % draw fitting line

loss = getLoss(X,data(2,:)',col,theta);                     % Obtain current loss value

subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;

% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');

% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')

% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using Nadam');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');

set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;

batchSize = 32;

s = 0;
beta = 0.99;
momentum = 0;
gamma = 0.9;

cnt = 0;

for iter = 1 : iteration
    cnt = cnt+1;
    delete(hLine) % set(hLine,'visible','off')
    
    [thetaOut,s,momentum] = MBGD(X,data(2,:)',theta,LRate,batchSize,s,beta,momentum,gamma,cnt);
    subplot(2,2,3);
    line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')

    theta = thetaOut;
    loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
    
    
    lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
    subplot(2,2,1);
    hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
    %legend('training data','linear regression');
    
    subplot(2,2,2);
    scatter3(theta(1),theta(2),loss,'r*');
    
    subplot(2,2,3);
    plot(theta(1),theta(2),'r*')
    
    
    subplot(2,2,4)
    plot(iter,loss,'b*');
    
    drawnow();
    
    if(loss < thresh)
        break;
    end
end

hold off


function [Z] = getTotalCost(X,Y, num,meshX,meshY);
    [row,col] = size(meshX);
    Z = zeros(row, col);
    for i = 1 : row
        theta = [meshX(i,:); meshY(i,:)];
        Z(i,:) =  1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);   
    end

end


function [Z] = getLoss(X,Y, num,theta)
    Z= 1/(2*num)*sum((X*theta-Y).^2);
end

function [thetaOut] = GD(X,Y,theta,eta)
    dataSize = length(X);                       % Obtain the number of data
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    thetaOut = theta -eta.*dx;                  % Update parameters(theta)
end

function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = s + dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,decay)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));          % Obtain the gradient of Loss function
    s = decay*s + (1-decay)*dx.*dx;
    thetaOut = theta -eta./sqrt(eps+s).*dx;                  % Update parameters(theta)
end
function [thetaOut,s, momentum] = Adam(X,Y,theta,eta,s,beta,momentum,gamma,cnt)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));              % Obtain the gradient of Loss function
    s = beta*s + (1-beta)*dx.*dx;                   % Update s
    momentum = gamma*momentum + (1-gamma).*dx;       % Update momentum
    momentum_bias_correction = momentum/(1-gamma^cnt);
    s_bias_correction = s /(1-beta^cnt);
    thetaOut = theta - eta./sqrt(eps+s_bias_correction).*momentum_bias_correction;   % Update parameters(theta)
end


function [thetaOut,s, momentum] = Nadam(X,Y,theta,eta,s,beta,momentum,gamma,cnt)
    [dataSize,col] = size(X);                       % Obtain the number of data
    eps = 10^-7*ones(col,1);
    dx = 1/dataSize.*(X'*(X*theta-Y));              % Obtain the gradient of Loss function
    g_hat = dx/(1-gamma^cnt);
    s = beta*s + (1-beta)*dx.*dx;                   % Update s
    momentum = gamma*momentum + (1-gamma).*dx;       % Update momentum
    momentum_hat = momentum/(1-gamma^cnt);
    s_hat = s /(1-beta^cnt);
    m_bar = (1-gamma)*g_hat+gamma*momentum_hat;
    thetaOut = theta - eta./sqrt(eps+s_hat).*m_bar;   % Update parameters(theta)
end


% @ Depscription: 
%       Mini-batch Gradient Descent (MBGD)
%       Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
%       X - [1 X_] X_ is actual X; Y - actual Y
%       theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
%       etaK, etaB - learning rate;
%
function [thetaOut,s,momentum] = MBGD(X,Y,theta, eta,batchSize,s,beta,momentum,gamma,cnt) 
    dataSize = length(X);           % obtain the number of data 
    k = fix(dataSize/batchSize);    % obtain the number of batch which has absolutely same size: k = batchNum-1;    
    batchIdx = randperm(dataSize);  % randomly sort for every epoch for achiving sample diversity
    
    batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize);   % batches which has absolutely same size
    batchIdx2 = batchIdx(k*batchSize+1:end);                    % ramained batch

    for i = 1 : k
        %[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
        %[thetaOut,s,momentum] = Adam(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta,momentum,gamma,cnt);
        [thetaOut,s,momentum] = Nadam(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta,momentum,gamma,cnt);
    end
    if(~isempty(batchIdx2))
        %[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
        %[thetaOut,s,momentum] = Adam(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta,momentum,gamma,cnt);
        [thetaOut,s,momentum] = Nadam(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta,momentum,gamma,cnt);
    end
end