Cost Function
From the previous lecture we already have a basic picture of what a neural network is. To evaluate how well a network performs, we usually use a cost function.
In supervised learning, the target output for each training sample is known. The cost function is typically some distance between the network's actual output and the target output: the smaller the gap, the better the network is considered to perform; a large gap indicates poor performance.
Let the target output be $y^L$ and the network output be $a^L$ (both $n_L \times 1$ column vectors), and let $e = y^L - a^L$. Then

cost function: $J = \frac{1}{2} \sum_{j=1}^{n_L} e_j^2$
The cost function is not unique. Another common choice is the cross-entropy cost

$J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(y^L, a^L)$, where $\mathcal{L}(y^L, a^L) = -\left[ y^L \log a^L + (1 - y^L) \log(1 - a^L) \right]$

(here $m$ is the number of training samples).
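As a quick numerical illustration, here is a minimal MATLAB sketch of both costs for a single sample; the vectors y and a below are made-up placeholder values, not data from the lab.

% made-up target and output for one sample (n_L = 3)
y = [1; 0; 0];         % target output y^L
a = [0.8; 0.1; 0.2];   % network output a^L
% quadratic cost: half the sum of squared errors
J_quad = 0.5 * sum((y - a).^2);
% cross-entropy cost for the same sample
J_xent = -sum(y .* log(a) + (1 - y) .* log(1 - a));
fprintf('quadratic: %.4f, cross-entropy: %.4f\n', J_quad, J_xent);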
A network with good performance is one for which we have found the weights ($w^1, w^2, \dots, w^{L-1}$) that minimize the cost function $J$. This search is exactly what we mean by the network learning, and it can be carried out with the steepest gradient method.
Steepest Gradient Method
In the earlier assignments the connection weights $w$ were given directly, and in practice they are sometimes set from experience, but far more often the network learns them itself until the optimal $w$ is found.
Gradient descent keeps this search moving along the derivative $\frac{\partial J}{\partial w}$, i.e., the direction in which $J$ changes fastest, and a learning rate $\alpha$ controls how large each step is.

Update rule: $w_{ji}^l \leftarrow w_{ji}^l - \alpha \frac{\partial J}{\partial w_{ji}^l}$
A simple example follows (keep in mind that in practice the relationship between $J$ and $w$ is usually far more complicated):
(Figure: a U-shaped curve of $J$ versus $w$, with the minimum at point C and the initial points A and B on the left and right slopes.)
Point C in the figure is the optimal solution we are looking for. To apply the steepest gradient method, set $\alpha = 1$ for illustration.

When the starting point is A, $\frac{\partial J}{\partial w}$ is negative (the slope at A). Subtracting a negative value adds a positive one, so by the update rule $w_{ji}^l$ increases: A moves to the right until it reaches C, or a neighborhood of C whose size depends on the choice of $\alpha$.

When the starting point is B, $\frac{\partial J}{\partial w}$ is positive (the slope at B), so $w_{ji}^l$ decreases: B moves to the left, again until it reaches C or a neighborhood of C. A runnable sketch of this process is given below.
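Here is a minimal MATLAB sketch of the idea on a made-up one-dimensional cost $J(w) = (w - 3)^2$; the cost function, starting point, and iteration count are all illustrative assumptions, not part of the lab.

% toy cost J(w) = (w - 3)^2, minimum ("point C") at w = 3
dJdw  = @(w) 2 * (w - 3);  % derivative dJ/dw of the toy cost
alpha = 0.1;               % learning rate
w     = -2;                % starting point on the negative-slope side ("point A")
for k = 1:100
    w = w - alpha * dJdw(w);  % steepest gradient update
end
fprintf('w converged to %.4f (the optimum is 3)\n', w);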
Back Propagation
- Forward computation

$z^{l+1} = w^l a^l$ (1)
$a^{l+1} = f(z^{l+1})$ (2)

See the previous post for the details.

- Compute the cost

$J = \frac{1}{2} \left\| a^L - y^L \right\|^2$ (3)

Apply step 1 layer by layer until the last layer of the network is reached and $a^L$ is obtained, then use this formula to measure the gap between the network output and the target output. As before, this choice of cost is not unique.

- Backward computation

We need $\frac{\partial J}{\partial w}$, updating the parameters by $w_{ji}^l \leftarrow w_{ji}^l - \alpha \frac{\partial J}{\partial w_{ji}^l}$ so that the adjustment always follows the direction in which $J$ changes fastest.

By the chain rule, going (3) → (2) → (1):

$\frac{\partial J}{\partial w^{L-1}} = \frac{\partial J}{\partial a^L} \cdot \frac{\partial a^L}{\partial z^L} \cdot \frac{\partial z^L}{\partial w^{L-1}} = (a^L - y^L) \odot f'(z^L) \cdot (a^{L-1})^T$ (4)

The careful reader will notice that (4) only gives the gradient for the last layer of weights. To obtain $\frac{\partial J}{\partial w^l}$ for every layer, we introduce a new variable:

let $\delta^L = (a^L - y^L) \odot f'(z^L)$ (i.e., the right-hand side of (4) without the factor $(a^{L-1})^T$).

$\delta^{l+1}$ and $\delta^l$ are related by the recursion

$\delta^l = \left( (w^l)^T \delta^{l+1} \right) \odot f'(z^l)$

where $\odot$ denotes the elementwise product. Putting everything together,

$\frac{\partial J}{\partial w^l} = \delta^{l+1} (a^l)^T$ (5)

If the derivation above feels involved, you can skip it: remembering formulas (1), (2), (3), and (5) is enough to implement the algorithm in code. A numerical check of the gradient formula follows.
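Before wiring these formulas into the lab files, it can help to sanity-check (4)/(5) numerically on a tiny network. The sketch below is my own illustration, with made-up weights, input, and target; it compares the analytic gradient against a central finite-difference estimate.

f  = @(s) 1 ./ (1 + exp(-s));   % sigmoid activation
df = @(s) f(s) .* (1 - f(s));   % its derivative
w  = randn(1, 2);               % one-layer toy network, no bias
a1 = [0.5; -0.3]; y = 1;        % made-up input and target
% analytic gradient via (4)/(5): dJ/dw = delta * a1'
z2 = w * a1; a2 = f(z2);
delta2 = (a2 - y) * df(z2);
dw_analytic = delta2 * a1';
% central finite-difference estimate of dJ/dw(1,1)
h = 1e-6;
wp = w; wp(1) = wp(1) + h;
wm = w; wm(1) = wm(1) - h;
Jp = 0.5 * (f(wp * a1) - y)^2;
Jm = 0.5 * (f(wm * a1) - y)^2;
dw_numeric = (Jp - Jm) / (2 * h);
fprintf('analytic %.6f vs numeric %.6f\n', dw_analytic(1), dw_numeric);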
Assignment: download link
Instructions
Task 0: implement feedforward and backward computation
- in fc.m, implement the forward computing (in either component or vector form), and return both the activation and the net input
- in bc.m, implement the backward computing (in either component or vector form)
Task 1: implement online BP algorithm
in bp_online.m:
1. calculate activations a1, a2, a3, and net inputs z2, z3
2. calculate cost function J
3. calculate sensitivities delta3, delta2
4. calculate gradient with respect to weights dw1, dw2
5. update weights w1, w2
Task 2: implement batch BP algorithm
in bp_batch.m:
1. calculate activations a1, a2, a3, and net inputs z2, z3
2. calculate cost function J
3. calculate sensitivities delta3, delta2
4. cumulate gradient with respect to weights dw1, dw2
5. update weights w1, w2
fc.m
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Course: Understanding Deep Neural Networks
%
% Lab 3 - BP algorithms
%
% Task 0: implement feedforward and backward computation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [a_next, z_next] = fc(w, a)
% define the activation function
f = @(s) 1 ./ (1 + exp(-s));
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code BELOW
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% forward computing (in either component or vector form)
% append the constant bias unit to the input activation
a = [a; 1];
% net input and activation of the next layer
z_next = w * a;
a_next = f(z_next);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code ABOVE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
bc.m
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Course: Understanding Deep Neural Networks
%
% Lab 3 - BP algorithms
%
% Task 0: implement feedforward and backward computation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function delta = bc(w, z, delta_next)
% define the activation function
f = @(s) 1 ./ (1 + exp(-s));
% define the derivative of activation function
df = @(s) f(s) .* (1 - f(s));
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code BELOW
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% backward computing (in either component or vector form)
% propagate the sensitivity back through w', drop the bias row
% (fc appended a bias unit, so w has one extra column), then
% multiply elementwise by the derivative of the activation
s = w' * delta_next;
delta = df(z) .* s(1:end-1);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code ABOVE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
end
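A quick way to exercise fc and bc together is a dimension check on the lab's 2-2-1 network; the weight values and the output sensitivity below are made up purely for illustration.

w1 = randn(2, 3); w2 = randn(1, 3);  % 2-2-1 network, bias column included
a1 = [1; 0];                         % one input sample
[a2, z2] = fc(w1, a1);               % a2, z2 are 2x1
[a3, z3] = fc(w2, a2);               % a3, z3 are 1x1
delta3 = a3 - 0.5;                   % made-up output sensitivity, 1x1
delta2 = bc(w2, z2, delta3)          % should come back as 2x1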
bp_online.m
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Course: Understanding Deep Neural Networks
%
% Lab 3 - BP algorithms
%
% Task 1: implement online BP algorithm
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% clear the workspace
clear
% define the activation function
f = @(s) 1 ./ (1 + exp(-s));
% define the derivative of activation function
df = @(s) f(s) .* (1 - f(s));
% prepare the training data set
data = [1 0 0 1
0 1 0 1]; % samples
labels = [1 1 0 0]; % labels
m = size(data, 2);
% choose parameters, initialize the weights
alpha = 0.15;
epochs = 50000;
w1 = randn(2,3);
w2 = randn(1,3);
J = zeros(1,epochs);
% loop until weights converge
for t = 1:epochs
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code BELOW
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% for each sample
for i = 1:m
% forward calculation (invoke fc)
a1 = data(:, i);
[a2, z2] = fc(w1, a1);
[a3, z3] = fc(w2, a2);
% accumulate the cost function over the epoch
J(t) = J(t) + 0.5 * (a3 - labels(i))^2;
% backward calculation (invoke bc)
delta3 = (a3 - labels(i)) * df(z3);
delta2 = bc(w2, z2, delta3);
% calculate the gradients
dw1 = delta2 * ([a1;1])';
dw2 = delta3 * ([a2;1])';
% update weights
w1 = w1 - alpha * dw1;
w2 = w2 - alpha * dw2;
% end for each sample
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code ABOVE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% end loop
if mod(t,100) == 0
fprintf('%i/%i epochs: J=%.4f\n', t, epochs, J(t));
end
end
% display the result
for i = 1:4
a1 = data(:,i);
[a2, z2] = fc(w1, a1);
[a3, z3] = fc(w2, a2);
fprintf('Sample [%i %i] (%i) is classified as %i.\n', data(1,i), data(2,i), labels(i), a3>0.5);
end
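One optional refinement to the online version, an assumption on my part rather than part of the lab: visiting the samples in the same fixed order every epoch can make the updates cyclic, so a common stochastic variant shuffles the order each epoch by replacing the inner loop header with

for i = randperm(m) % visit the samples in a random order each epoch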
bp_batch.m
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Course: Understanding Deep Neural Networks
%
% Lab 3 - BP algorithms
%
% Task 2: implement batch BP algorithm
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% clear the workspace
clear
% define the activation function
f = @(s) 1 ./ (1 + exp(-s));
% define the derivative of activation function
df = @(s) f(s) .* (1 - f(s));
% prepare the training data set
data = [1 0 0 1
0 1 0 1]; % samples
labels = [1 1 0 0]; % labels
m = size(data, 2);
% choose parameters, initialize the weights
alpha = 0.15;
epochs = 50000;
w1 = randn(2,3);
w2 = randn(1,3);
J = zeros(1,epochs);
% loop until weights converge
for t = 1:epochs
% reset the total gradients
dw1 = 0;
dw2 = 0;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code BELOW
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% for all samples
for i = 1:m
% forward calculation (invoke fc)
a1 = data(:, i);
[a2, z2] = fc(w1, a1);
[a3, z3] = fc(w2, a2);
% accumulate the cost function, averaged over the m samples
J(t) = J(t) + 0.5 / m * (a3 - labels(i))^2;
% backward calculation (invoke bc)
delta3 = (a3 - labels(i)) * df(z3);
delta2 = bc(w2, z2, delta3);
% cumulate the total gradients
dw1 = dw1 + delta2 * ([a1;1])';
dw2 = dw2 + delta3 * ([a2;1])';
% end for all samples
end
% update weights
w1 = w1 - alpha * dw1;
w2 = w2 - alpha * dw2;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Your code ABOVE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% end loop
if mod(t,100) == 0
fprintf('%i/%i epochs: J=%.4f\n', t, epochs, J(t));
end
end
% display the result
for i = 1:4
a1 = data(:,i);
[a2, z2] = fc(w1, a1);
[a3, z3] = fc(w2, a2);
fprintf('Sample [%i %i] (%i) is classified as %i.\n', data(1,i), data(2,i), labels(i), a3>0.5);
end
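To compare how the online and batch versions converge, one simple option (my own suggestion, not part of the lab hand-out) is to plot the cost curve recorded during training:

% plot the cost per epoch recorded in J
plot(1:epochs, J);
xlabel('epoch'); ylabel('J');
title('cost vs. epoch');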