【吴恩达机器学习】第五周课程精简笔记——代价函数和反向传播

辰阳星宇

已于 2023-01-06 21:42:43 修改

阅读量883

点赞数

分类专栏：机器学习文章标签：概率论线性代数

于 2022-02-10 19:12:12 首次发布

本文链接：https://blog.csdn.net/qq_41094332/article/details/122855633

版权

机器学习专栏收录该内容

17 篇文章 0 订阅

订阅专栏

Cost Function and Backpropagation（代价函数和反向传播）

1. Cost Function

Let’s first define a few variables that we will need to use:

L = total number of layers in the network
s_l = number of units (not counting bias unit) in layer l
K = number of output units/classes
$h_{\Theta}(x)_k$ is being as a hypothesis that results in the k^th output.

Recall that the cost function for regularized logistic regression was:
$J(\theta) = - \frac{1}{m} \sum^m_{i=1} [y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum^n_{j=1} \theta^2_j$
在这里插入图片描述

For neural networks, it is going to be slightly more complicated:
$J(\theta) = - \frac{1}{m} \sum^m_{i=1}\sum^K_{k=1} [y^{(i)}_klog(h_\theta(x^{(i)})_k)+(1-y^{(i)}_k)log(1-(h_\theta(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum^{L-1}_{l=1}\sum^{sl}_{i=1}\sum^{sl+1}_{j=1} (\Theta^{(l)}_{j,i})^2$

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes. K here means there are K units of output. That is the final layer of neural network.

In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.

Note:

the double sum simply adds up the logistic regression costs calculated for each cell in the output layer.
the triple sum simply adds up the squares of all the individual Θs in the entire network.
the i in the triple sum does not refer to training example i.

让我们首先定义一些我们需要用到的变量：

L = 网络中全部层数
s_L= 第L层的神经元个数（偏置单元不算在内）
输出单元/类别个数
$h_{\Theta}(x)_k$ 作为第k个输出的假设结果

回想一下正则化对数几率回归的代价函数：
$J(\theta) = - \frac{1}{m} \sum^m_{i=1} [y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum^n_{j=1} \theta^2_j$
在这里插入图片描述

对于神经网络，它会稍微复杂一些:
$J(\theta) = - \frac{1}{m} \sum^m_{i=1}\sum^K_{k=1} [y^{(i)}_klog(h_\theta(x^{(i)})_k)+(1-y^{(i)}_k)log(1-(h_\theta(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum^{L-1}_{l=1}\sum^{sl}_{i=1}\sum^{sl+1}_{j=1} (\Theta^{(l)}_{j,i})^2$
我们添加了一些嵌套的求和来说明我们的多个输出节点。在方程的第一部分，方括号之前，我们有一个额外的嵌套求和，它循环遍历输出节点的数量。这里的K意味着有K个输出单元，也就是神经网络的最后一层。

在正则化部分，方括号之后，我们必须考虑多个矩阵。当前矩阵中的列数等于当前层(包括偏置单元)中的节点数。当前矩阵的行数等于下一层的节点数(不包括偏置单元)。和以前使用逻辑回归时一样，我们将每一项平方。

注意：

双求和只是将输出层中每个单元计算出的对数几率回归的代价相加
三重求和只是将整个网络中各个 $\theta$ 取平方后相加。
三重求和中的i并不指训练样例i.

2. Backpropagation Algorithm

“Backpropagation” is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:
$min_{\Theta} J(\Theta)$

That is, we want to minimize our cost function J using an optimal set of parameters in theta. In this section we’ll look at the equations we use to compute the partial derivative of J(Θ):
$\frac{\partial J(\Theta)}{\partial \Theta^{(l)}_{i,j}}$

To do so, we use the following algorithm:
在这里插入图片描述
Backpropagation Algorithm
Given training set $\{(x^{(1)},y^{(1)}) \cdots (x^{(m)},y^{(m)}) \}$

set $\Delta^{(l)}_{i,j}:=0$ for all (l,i,j), (hence you end up having a matrix full of zeros)

For training example t = 1 to m:

Set $a^{(1)}:=x^{(t)}$
Perform forward propagation to compute a^(l)for l = 2,3,…,L
Using $y^{(t)}$ , compute $\delta^{(L)}=a^{(L)}-y^{(t)}$
Where L is our total number of layers and $a^{(L)}$ is the vector of outputs of the activation units for the last layer. So our “error values” for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y.To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:
Compute $\delta^{(L-1)},\delta^{(L-2)},...,\delta^{2},$ using $\delta^{(l)}=((\Theta^{(l)})^T\delta^{(l+1)}).*a^{(l)}.*(1-a^{(l)})$
The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g’, or g-prime, which is the derivative of the activation function g evaluated with the input values given by $z^{(l)}$ .

The g-prime derivative terms can also be written out as:
$g'(z^{(l)}) = a^{(l)}.*(1-a^{(l)}) \\ (\frac{1}{1+e^{-x}})' = \frac{1}{1+e^{-x}}.*(1-\frac{1}{1+e^{-x}})$
$\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j}+a^{(l)}_j\delta^{(l+1)}_i$ or with vectorization, $\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$
Hence we update our new $\Delta$ matrix.
- $D^{(l)}_{i,j} := \frac{1}{m}(\Delta^{(l)}_{i,j}+\lambda\Theta^{(l)}_{i,j})$ , if j ≠ 0
- $D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}$ , if j ＝ 0
  The capital-delta matrix D is used as an “accumulator” to add up our values as we go along and eventually compute our partial derivative. Thus we get $\frac{\partial J(\Theta)}{\partial\Theta^{(l)}_{ij}}=D^{(l)}_{ij}$

“反向传播”是神经网络术语，用于最小化我们的代价函数，就像在对数几率回归和线性回归中做的梯度下降一样。我们的目标是计算：
$min_{\Theta} J(\Theta)$

也就是说，我们想要使用一组 $\Theta$ 的最优参数集来最小化我们的代价函数J。在这一部分，我们将会看到我们使用计算J(Θ) 的偏导数的方程：
$\frac{\partial J(\Theta)}{\partial \Theta^{(l)}_{i,j}}$

为此，我们使用一下算法：
在这里插入图片描述
反向传播算法
对于给定的训练集 $\{(x^{(1)},y^{(1)}) \cdots (x^{(m)},y^{(m)}) \}$

对所有的 (l,i,j)，设置 $\Delta^{(l)}_{i,j}:=0$ , （因此你最后得到一个全部为0的矩阵），其中 $\Delta$ 是大写的 $\delta$ 。

对于训练集 t = 1 to m:

Set $a^{(1)}:=x^{(t)}$
执行前向传播计算 a^(l) for l = 2,3,…,L
使用 $y^{(t)}$ , 计算 $\delta^{(L)}=a^{(L)}-y^{(t)}$
这里的L为总层数， $a^{L}$ 是最后一层激活单元的输出向量。因此最后一层的实际结果与y中正确输出（结果）之差是我们最后一层的“ 误差值 $\delta$ ”。 $\delta^{(L)}$ 表示第L层的误差。

为了得到最后一层之前各层里的 $\delta$ 值，我们可以使用一个从右至左的方程：
通过使用 $\delta^{(l)}=((\Theta^{(l)})^T\delta^{(l+1)}).*a^{(l)}.*(1-a^{(l)})$ ，计算 $\delta^{(L-1)},\delta^{(L-2)},...,\delta^{2}$
通过将第 l+1层的 $\delta$ 值与第 l 层的 $\Theta$ 矩阵相乘，得到第 l 层的 $\delta$ 值。然后，用一个叫做 g’ 的函数乘以它，其中g’是激活函数g的导数，由 $z^{(l)}$ 给出的输入值求解。

g '导数项也可以写成:
$g'(z^{(l)}) = a^{(l)}.*(1-a^{(l)}) \\ (\frac{1}{1+e^{-x}})' = \frac{1}{1+e^{-x}}.*(1-\frac{1}{1+e^{-x}})$
$\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j}+a^{(l)}_j\delta^{(l+1)}_i$ 或者向量化后的 $\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$
因此我们更新我们的 $\Delta$ 矩阵。
- $D^{(l)}_{i,j} := \frac{1}{m}(\Delta^{(l)}_{i,j}+\lambda\Theta^{(l)}_{i,j})$ , if j ≠ 0
- $D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}$ , if j ＝ 0
  大写的矩阵D在计算过程中被作为“累加器”使用，累加我们的值，并最终计算出我们的偏导数。因此，我们可以得到 $\frac{\partial J(\Theta)}{\partial\Theta^{(l)}_{ij}}=D^{(l)}_{ij}$

3. Backpropagation Intution

The following figure shows the process diagram of forward propagation algorithm.
在这里插入图片描述
Recall that the cost function for a neural network is:
$J(\theta) = - \frac{1}{m} \sum^m_{i=1}\sum^K_{k=1} [y^{(i)}_klog(h_\theta(x^{(i)})_k)+(1-y^{(i)}_k)log(1-(h_\theta(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum^{L-1}_{l=1}\sum^{sl}_{i=1}\sum^{sl+1}_{j=1} (\Theta^{(l)}_{j,i})^2$
If we consider simple non-multiclass classification(k = 1) and disregard regularization, the cost is computed with:
$y^{(t)}log(h_\Theta(x^{(t)}))+(1-y^{(t)})log(1-h_\Theta(x^{(t)}))$

Intuitively, $\delta^{(l)}_j$ is the “error” for $a^{(l)}_j$ (unit j in layer l). More formally, the delta values are actually the derivate of the cost funcion:
$\delta^{(l)}_j = \frac{\partial cost(t)}{\partial z^{(l)}_j}$
Recall that our derivative is the slope of a line tangent to the cost funcion, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some $\delta^{(l)}_j$ :
在这里插入图片描述

In the image above, to calculate $\delta^{(2)}_2$ , we multiply the weights $\Theta^{(2)}_{12}$ and $\Theta^{(2)}_{22}$ by their respective $\delta$ values found to the right of each edge. So we get $\delta_2^{(2)} = \Theta^{(2)}_{12} * \delta^{(3)}_1 + \Theta^{(2)}_{22} * \delta^{(3)}_2$ . To calculate every single possible $\delta^{(l)}_j$ , we could start from the right of our diagram. We can think of our edges as our $\Theta_{ij}$ . Going from right to left, to calculate the values of $\delta^{(l)}_j$ , you can just take the over all sum of each weight times the $\delta$ it is coming from. Hence, another example would be $\delta^{(3)}_2 = \Theta^{(2)}_{12}*\delta^{(4)}_1$

下图为前向传播算法过程图
在这里插入图片描述

回想一下神经网络的代价函数：
$J(\theta) = - \frac{1}{m} \sum^m_{i=1}\sum^K_{k=1} [y^{(i)}_klog(h_\theta(x^{(i)})_k)+(1-y^{(i)}_k)log(1-(h_\theta(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum^{L-1}_{l=1}\sum^{sl}_{i=1}\sum^{sl+1}_{j=1} (\Theta^{(l)}_{j,i})^2$
如果我们只考虑非多类别分类（k=1）并且忽视正则化，代价函数便为：
$y^{(t)}log(h_\Theta(x^{(t)}))+(1-y^{(t)})log(1-h_\Theta(x^{(t)}))$

凭直觉可发现, $\delta^{(l)}_j$ 是 $a^{(l)}_j$ (l层的单元j)的误差。更正式的说法， $\delta$ 实际上是代价函数的导数：
$\delta^{(l)}_j = \frac{\partial cost(t)}{\partial z^{(l)}_j}$
回想一下，我们的导数是代价函数切线的斜率，所以斜率越陡，我们的错误就越大。让我们考虑下面的神经网络并看一看如何计算一些 $\delta^{(l)}_j$ :
在这里插入图片描述

在上图中, 为了计算 $\delta^{(2)}_2$ , 我们将 $\Theta^{(2)}_{12}$ 和 $\Theta^{(2)}_{22}$ 乘上它们在每条边各自右边的 $\delta$ 值。因此，我们可以得到 $\delta_2^{(2)} = \Theta^{(2)}_{12} * \delta^{(3)}_1 + \Theta^{(2)}_{22} * \delta^{(3)}_2$ . 为了计算每一个可能的 $\delta^{(l)}_j$ , 我们可以从图的右边开始。我们可以把边看作 $\Theta_{ij}$ 。从右到左，为了计算 $\delta^{(l)}_j$ 值, 你可以使用每个权值乘以来自它的 $\delta$ ，再将其求和。因此，另一个例子是 $\delta^{(3)}_2 = \Theta^{(2)}_{12}*\delta^{(4)}_1$ 。

Backpropagation in Practice

1. Implementation Note: Unrolling Parameters

With neural networks, we are working with sets of matrices:
在这里插入图片描述
In order to use optimizaing functions such as “fminuc()”, we will want to “unroll” all the elements and put them into one long vector:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back our original matrices from “unrolled” versions as follows:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)

在这里插入图片描述
To summarization：

在神经网络中，我们使用的是一系列的矩阵:
在这里插入图片描述
为了使用像"fminuc()"这样的优化函数，我们需要"展开"所有元素，并将它们放入一个长向量中:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]

如果Theta1的维数是10x11, Theta2的维数是10x11, Theta3的维数是1x11，那么我们可以从“展开”的版本中得到我们的原始矩阵，如下所示:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)

在这里插入图片描述
总结如下：

2. Gradient Checking

在这里插入图片描述

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:
$\frac{\partial J(\Theta)}{\partial\Theta} ≈ \frac{J(\Theta + \epsilon)-J(\Theta - \epsilon)}{2\epsilon}$
With multiple theta matrices, we can approximate the derivative with respect to $Θ_j$ as follows:
$\frac{\partial J(\Theta)}{\partial\Theta_j} ≈ \frac{J(\Theta_1,...,\Theta_j + \epsilon,... ,\Theta_n)-J(\Theta_1, ...,\Theta_j - \epsilon,...,\Theta_n)}{2\epsilon}$
A small value for ${\epsilon}$ such as $\epsilon = 10^{-4}$ , guarantees that the math works out properly. If the value for $\epsilon$ is too small, we can end up with numerical problems.

Hence, we are only adding or subtracting epsilon to the $\Theta_j$ matrix. In octave we can do it as follows:

epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.

Once you have verified once that your backpropagation algorithm is correct, you don’t need to compute gradApprox again. The code to compute gradApprox can be very slow.
在这里插入图片描述

在这里插入图片描述

梯度检查将确保我们的反向传播工作如预期。我们可以用以下方法近似代价函数的导数:
$\frac{\partial J(\Theta)}{\partial\Theta} ≈ \frac{J(\Theta + \epsilon)-J(\Theta - \epsilon)}{2\epsilon}$
${\epsilon}$ 的一个小值，例如 $\epsilon = 10^{-4}$ ，可以保证数学计算正确。如果 $\epsilon$ 的值太小，我们就会遇到数值问题。

因此，我们只是对 $\Theta_j$ 矩阵加或减。在Octave我们可以这样做：

epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;

我们之前看到了如何计算deltaVector。一旦我们计算了我们的gradApprox向量，我们可以检查gradApprox≈deltaVector。

一旦你验证了你的反向传播算法是正确的，你就不需要再计算gradApprox了。计算gradApprox的代码可能非常慢。
在这里插入图片描述

3. Random Initialization

在这里插入图片描述

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our $\Theta$ matrices using the following method:
在这里插入图片描述
Hence, we initialize each $\Theta^{(l)}_{ij}$ to a random value between $[-\epsilon,\epsilon]$ . Using the above formula guarantees that we get the desired bound. The same procedure applies to all the $\Theta$ ’s. Below is some working code you could use to experiment.

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

rand(x,y) is just a function in octave that will initialize a matrix of random real numbers between 0 and 1.

(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)

在这里插入图片描述

将所有的权值初始化为零并不适用于神经网络。当我们反向传播时，所有节点将重复更新为相同的值。相反，我们可以使用以下方法随机初始化 $\Theta$ 矩阵的权值:
在这里插入图片描述
因此，我们可以将每一个 $\Theta^{(l)}_{ij}$ 初始化为 $[-\epsilon,\epsilon]$ 之间的一个随机值。使用上面的公式，可以保证我们得到想要的边界。同样的过程适用于所有的 $\Theta$ 。下面是一些可以用于实验的工作代码。

If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

rand(x, y)仅是一个在Octave中的函数，它将初始化一个由0到1之间的随机实数组成的矩阵。

(注:上面使用的与梯度检查中的无关)

4. Putting it Together

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

Number of input units = dimension of features $x^{(i)}$
Number of output units = number of classes
Number of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

Training a Neural Network

Randomly initialize the weights
Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
Implement the cost function
Implement backpropagation to compute partial derivatives
Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
Use gradient descent or a built-in optmization function to minimize the cost function with the weghts in delta.

When we perform forward and back propagation, we loop on every training example:

for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i),y(i))
   (Get activations a(l) and delta terms d(l) for l = 2,...,L

The following image gives us an intuition of what is happening as we are implementing our neural network:
在这里插入图片描述
Ideally, you want $h_\Theta(x^{(i)})≈y^{(i)}$ . This will minimize our cost function. However, keep in mind that $J(\theta)$ is not convex and thus we can end up in a local minimum instead.

首先，选择一个网络架构; 选择你的神经网络的布局，包括每一层有多少隐藏单元，以及你想要总共有多少层。

输入单元个数 = 特征 $x^{(i)}$ 的维数
输出单元个数 = 类别个数
每个隐藏层单元的数量 = 通常越多越好（因为计算成本会随着隐藏单元的个数而增加，因此必须与计算成本相平衡）
默认值： 1个隐藏层。如果你设置的隐藏层的个数超过一，建议把各个隐藏层中单元个数都设置为相同个数。

训练神经网络

随机地初始化权重
实现前向传播，对于任意 $x^{(i)}$ 得到 $h_\ (x^{(i)})$
实现代价函数
实现反向传播计算偏导数
使用梯度检查来确认反向传播是有否效的。然后禁用梯度检查。
使用梯度下降或内置的优化函数来最小化权值为 $\delta$ 的代价函数。

当我们执行前向和后向传播时，我们对每个训练示例进行循环:

for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i),y(i))
   (Get activations a(l) and delta terms d(l) for l = 2,...,L

下图让我们直观地了解了在我们实现神经网络的过程中发生了什么：
在这里插入图片描述
理想情况下，你需要让 $h_\ (x^{(i)})≈y^{(i)}$ 。这样做会使代价函数最小化。然而，请记住 $J (θ)$ 不是凸（函数）的，因此最终我们得到了一个局部最小值。

Exercise 4：反向传播算法

【吴恩达机器学习】Week5 编程作业ex4——神经网络学习

拓展资料

一文搞懂反向传播算法

反向传播算法（过程及公式推导）

深度学习——以图读懂反向传播

深度学习笔记三：反向传播（backpropagation）算法

辰阳星宇

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
打赏
0
评论
【吴恩达机器学习】第五周课程精简笔记——代价函数和反向传播

Cost Function and Backpropagation（代价函数和反向传播）1. Cost FunctionLet’s first define a few variables that we will need to use:L = total number of layers in the networksl = number of units (not counting bias unit) in layer lK = number of output units/classe
复制链接

扫一扫