Andrew Ng's Machine Learning (10): ex4 Neural Networks Learning (MATLAB + Python)

Index for this Machine Learning series → Machine Learning series content summary

Background notes for this exercise → Neural Networks: Learning

Assignment handout and provided MATLAB code for this exercise → extraction code: 12eo

Complete code for this exercise (MATLAB + Python versions) → GitHub link

1. Neural Networks

In exercise ex3, we implemented feedforward propagation for a neural network and used it, with weights that were provided, to predict handwritten digits. In this exercise, we will implement the backpropagation algorithm to learn the parameters of the neural network.
The provided script ex4.m will step us through this exercise.

1.1 Visualizing the data

In the first part of ex4.m, the code loads the data and displays it on a 2-dimensional plot by calling the function displayData, as shown in Figure 1.

Figure 1. Examples from the dataset

This is the same dataset we used in exercise ex3. There are 5000 training examples in ex4data1.mat, where each example is a 20×20 pixel grayscale image of a digit. Each pixel is represented by a floating-point number indicating the grayscale intensity at that location. The 20×20 grid of pixels is "unrolled" into a 400-dimensional vector, and each of these training examples becomes a single row in the data matrix X. This gives us a 5000×400 matrix X whose every row is a training example for a handwritten digit image.
$$X=\begin{bmatrix} -\left(x^{(1)}\right)^{T}- \\ -\left(x^{(2)}\right)^{T}- \\ \vdots \\ -\left(x^{(m)}\right)^{T}- \end{bmatrix}$$

The second part of the training set is a 5000-dimensional vector y that contains the labels for the training set. Because there is no zero index in MATLAB, the digit zero is mapped to the value 10: the digit "0" is labeled "10", while the digits "1" through "9" are labeled "1" through "9" in their natural order.
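
As a quick sanity check, the snippet below (a minimal sketch assuming ex4data1.mat is in the current MATLAB folder, as it is in the exercise materials) loads the data and confirms the dimensions and label mapping described above:

% Sketch: load the training data and confirm its shape and label range
load('ex4data1.mat');        % creates X (5000 x 400) and y (5000 x 1)
fprintf('size(X) = %d x %d\n', size(X, 1), size(X, 2));
fprintf('size(y) = %d x %d\n', size(y, 1), size(y, 2));
fprintf('labels run from %d to %d (digit 0 is stored as 10)\n', min(y), max(y));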

1.2 Model representation

Our neural network is shown in Figure 2. It has 3 layers: an input layer, a hidden layer, and an output layer. The inputs are pixel values of digit images. Since the images are 20×20 in size, this gives us 400 input layer units (not counting the extra bias unit that always outputs +1). The training data will be loaded into the variables X and y by the script ex4.m.


Figure 2. Neural network model

The exercise materials provide us with a set of already-trained network parameters $(\Theta^{(1)},\Theta^{(2)})$. They are stored in ex4weights.mat and are loaded by ex4.m into Theta1 and Theta2, as shown in the code below. The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes).

% Load saved matrices from file
load('ex4weights.mat');
% The matrices Theta1 and Theta2 will now be in your workspace
% Theta1 has size 25 x 401
% Theta2 has size 10 x 26

1.3 Feedforward and cost function

Now we will implement the cost function and gradient for the neural network. First, complete the code in nnCostFunction.m to return the cost. Recall that the cost function for the neural network (without regularization) is
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_{k}^{(i)}\log\left(h_{\theta}(x^{(i)})\right)_{k}-\left(1-y_{k}^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)_{k}\right]$$

where $h_{\theta}(x^{(i)})$ is computed as shown in Figure 2 and K = 10 is the total number of possible labels. Note that $h_{\theta}(x^{(i)})_{k}=a_{k}^{(3)}$ is the activation (output value) of the k-th output unit. Also recall that, although the original labels (in the variable y) are 1, 2, ..., 10, for the purpose of training the neural network we need to recode the labels as vectors containing only the values 0 or 1:
$$y=\begin{bmatrix}1\\0\\0\\\vdots\\0\end{bmatrix},\;\begin{bmatrix}0\\1\\0\\\vdots\\0\end{bmatrix},\;\cdots,\;\begin{bmatrix}0\\0\\0\\\vdots\\1\end{bmatrix}$$

For example, if $x^{(i)}$ is an image of the digit 5, then the corresponding $y^{(i)}$ should be a 10-dimensional vector with $y_{5}=1$ and all other elements equal to 0.
We should implement the feedforward computation that computes $h_{\theta}(x^{(i)})$ for every example i and sums the cost over all examples. Our code should also work for a dataset of any size with any number of labels (we can assume there are always at least K ≥ 3 labels).
To complete this part of nnCostFunction.m, fill in the following code:

%Forward propagation to compute the activations
X = [ones(m, 1) X];%X becomes 5000x401
temp = sigmoid(X * Theta1');%temp is 5000x25
temp = [ones(m, 1) temp];%temp becomes 5000x26
h = sigmoid(temp * Theta2');%h is 5000x10
temp_y = zeros(size(h));%temp_y is a 5000x10 zero matrix
for i = 1:m
    temp_y(i, y(i)) = 1; %one-hot encode the labels
end
J = (1 / m) * sum(sum(-temp_y .* log(h) - (1 - temp_y) .* log(1 - h))); %unregularized part of the cost
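
As an aside, the one-hot loop above can also be written in a vectorized form. The sketch below uses the same variables as the code above plus num_labels (which equals 10 in this exercise); it is an equivalent alternative, not part of the required solution.

% Vectorized alternative to the one-hot loop (sketch)
% Row y(i) of the identity matrix is exactly the one-hot vector for label y(i).
I = eye(num_labels);
temp_y = I(y, :);  % 5000 x 10 one-hot label matrix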

Once you are done, ex4.m will call nnCostFunction using the loaded set of parameters Theta1 and Theta2. You should see that the cost is about 0.287629.

Feedforward Using Neural Network ...
Cost at parameters (loaded from ex4weights): 0.287629 
(this value should be about 0.287629)

1.4 Regularized cost function

The regularized cost function for the neural network is
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_{k}^{(i)}\log\left(h_{\theta}(x^{(i)})\right)_{k}-\left(1-y_{k}^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)_{k}\right]+\frac{\lambda}{2m}\left[\sum_{j=1}^{25}\sum_{k=1}^{400}\left(\Theta_{j,k}^{(1)}\right)^{2}+\sum_{j=1}^{10}\sum_{k=1}^{25}\left(\Theta_{j,k}^{(2)}\right)^{2}\right]$$

We can assume that the neural network has only 3 layers (an input layer, a hidden layer, and an output layer), but the code we write should work for any number of input units, hidden units, and output units. For clarity, the formula above explicitly lists the indices over $\Theta^{(1)}$ and $\Theta^{(2)}$, but note that our code should in general work with $\Theta^{(1)}$ and $\Theta^{(2)}$ of any size.
Also note that we should not regularize the bias terms. In the matrices Theta1 and Theta2, the bias terms correspond to the first column of each matrix. We should now add regularization to the cost function. We can first compute the unregularized cost J using the existing nnCostFunction.m and then add the cost of the regularization terms.
To complete this part of nnCostFunction.m, add the following code on top of the previous part:

%Regularization term (do not regularize the bias terms, i.e. the first column of each matrix)
Theta1_temp = Theta1(:, 2:end);%Theta1_temp is 25x400
Theta2_temp = Theta2(:, 2:end);%Theta2_temp is 10x25
J = J + (lambda / 2 / m) * (sum(sum(Theta1_temp .* Theta1_temp)) + sum(sum(Theta2_temp .* Theta2_temp)));

Once you are done, ex4.m will call nnCostFunction using the loaded Theta1 and Theta2 parameters and λ = 1. You should see that the cost is about 0.383770.

Checking Cost Function (w/ Regularization) ... 
Cost at parameters (loaded from ex4weights): 0.383770 
(this value should be about 0.383770)

2. Backpropagation

In this part of the exercise, we will implement the backpropagation algorithm to compute the gradient of the neural network cost function. We will need to complete nnCostFunction.m so that it also returns an appropriate value for grad. Once we have computed the gradient, we will be able to train the neural network by minimizing the cost function $J(\Theta)$ with an advanced optimizer such as fmincg.
We will first implement backpropagation to compute the gradients for the parameters of the (unregularized) neural network. After we have verified that the gradients computed in the unregularized case are correct, we will implement the gradient for the regularized neural network.

2.1 Sigmoid gradient

To help us get started with this part of the exercise, we first implement the sigmoid gradient function. The gradient of the sigmoid function can be computed as
$$g'(z)=\frac{d}{dz}g(z)=g(z)\left(1-g(z)\right)$$

where $\mathrm{sigmoid}(z)=g(z)=\frac{1}{1+e^{-z}}$.
When you are done, try testing a few values by calling sigmoidGradient(z) at the MATLAB command line. For large values of z (both positive and negative), the gradient should be close to 0; when z = 0, the gradient should be exactly 0.25.
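
The sigmoid(z) call used in the code below refers to the sigmoid.m helper shipped with the exercise; as a minimal sketch (assuming the standard logistic definition above), it amounts to:

% sigmoid.m (sketch): element-wise logistic sigmoid
function g = sigmoid(z)
    g = 1 ./ (1 + exp(-z));
end
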
When completing sigmoidGradient.m, we need to fill in the following code:

g = sigmoid(z) .* (1 - sigmoid(z));

After you are done, ex4.m calls the sigmoidGradient function to obtain the sigmoid gradients at -1, -0.5, 0, 0.5, and 1.

Evaluating sigmoid gradient...
Sigmoid gradient evaluated at [-1 -0.5 0 0.5 1]:
  0.196612 0.235004 0.250000 0.235004 0.196612 

2.2 Random initialization

When training neural networks, randomly initializing the parameters is important for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{(l)}$ uniformly in the range $[-\epsilon_{init},\epsilon_{init}]$. Here we should use $\epsilon_{init}=0.12$. This range of values ensures that the parameters are kept small and makes learning more efficient.
Our job is to complete randInitializeWeights.m to initialize the weights for Θ. Modify the file and fill in the following code:

epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * (2 * epsilon_init) - epsilon_init;%entries of W lie in (-epsilon_init, epsilon_init)
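
As an aside, rather than hard-coding 0.12, $\epsilon_{init}$ can also be chosen from the sizes of the adjacent layers. The sketch below uses the heuristic $\epsilon_{init}=\sqrt{6}/\sqrt{L_{in}+L_{out}}$, which gives roughly 0.12 for the 400-unit input layer and 25-unit hidden layer of this network; it is an optional variation, not the code the exercise requires.

% Sketch: derive epsilon_init from the layer sizes instead of hard-coding it
epsilon_init = sqrt(6) / sqrt(L_in + L_out);  % about 0.12 for L_in = 400, L_out = 25
W = rand(L_out, 1 + L_in) * (2 * epsilon_init) - epsilon_init;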

2.3 Backpropagation


Figure 3. Backpropagation updates

Now we will implement the backpropagation algorithm. The intuition behind backpropagation is as follows: given a training example $(x^{(t)},y^{(t)})$, we first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis $h_{\Theta}(x)$. Then, for each node j in layer l, we compute an "error term" $\delta^{(l)}_{j}$ that measures how much that node was responsible for errors in the output.
For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta^{(3)}_{j}$ (since layer 3 is the output layer). For the hidden units, we compute $\delta^{(l)}_{j}$ based on a weighted average of the error terms of the nodes in layer (l + 1).
We should implement steps 1 to 4 in a loop that processes one example at a time. Concretely, we should implement a for loop for t = 1:m, place steps 1 to 4 below inside the for loop, and have the t-th iteration perform the computation on the t-th example $(x^{(t)},y^{(t)})$; a loop-based sketch is given after the step list below. Step 5 then divides the accumulated gradients by m to obtain the gradient of the neural network cost function.

  • Step 1: Set the input layer's values $(a^{(1)})$ to the t-th training example $x^{(t)}$. Perform a feedforward pass (Figure 2), computing the activations $(z^{(2)},a^{(2)},z^{(3)},a^{(3)})$ for layers 2 and 3. Note that we need to add a +1 term to ensure that the activation vectors $a^{(1)}$ and $a^{(2)}$ also include the bias unit. In MATLAB, if a_1 is a column vector, adding one corresponds to a_1 = [1; a_1].
  • Step 2: For each output unit k in layer 3 (the output layer), set $\delta_{k}^{(3)}=(a_{k}^{(3)}-y_{k})$, where $y_{k}\in\{0,1\}$ indicates whether the current training example belongs to class k ($y_k = 1$) or to a different class ($y_k = 0$).
  • Step 3: For the hidden layer l = 2, set $\delta^{(2)}=(\Theta^{(2)})^{T}\delta^{(3)}.*g'(z^{(2)})$.
  • Step 4: Accumulate the gradient from this example using the formula below. Note that we should skip or remove $\delta_{0}^{(2)}$; in MATLAB, removing $\delta_{0}^{(2)}$ corresponds to delta_2 = delta_2(2:end). $$\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}\left(a^{(l)}\right)^{T}$$
  • Step 5: Obtain the (unregularized) gradient of the neural network cost function by dividing the accumulated gradients by m: $$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=D_{ij}^{(l)}=\frac{1}{m}\Delta_{ij}^{(l)}$$
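
The provided solution later in this post accumulates these gradients in a fully vectorized way. For readers who prefer to follow the five steps literally, the sketch below processes one example per iteration; it is a minimal sketch that assumes the same variables as nnCostFunction.m (X with the bias column already appended, the one-hot label matrix temp_y, Theta1, Theta2, and the helpers sigmoid and sigmoidGradient), not the code the script requires.

% Loop-based backpropagation sketch (one example per loop iteration)
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
    a1 = X(t, :)';                                            % 401 x 1 (bias already included)
    z2 = Theta1 * a1;                                         % 25 x 1
    a2 = [1; sigmoid(z2)];                                    % 26 x 1, add the bias unit
    z3 = Theta2 * a2;                                         % 10 x 1
    a3 = sigmoid(z3);                                         % Step 1: forward pass
    delta3 = a3 - temp_y(t, :)';                              % Step 2: output-layer error
    delta2 = (Theta2' * delta3) .* [1; sigmoidGradient(z2)];  % Step 3: hidden-layer error
    delta2 = delta2(2:end);                                   % Step 4: drop the bias error term
    Delta2 = Delta2 + delta3 * a2';                           %         and accumulate the gradients
    Delta1 = Delta1 + delta2 * a1';
end
Theta1_grad = Delta1 / m;                                     % Step 5: unregularized gradient
Theta2_grad = Delta2 / m;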

After you have implemented the backpropagation algorithm, the script ex4.m will proceed to run gradient checking on your implementation. Gradient checking will let you be more confident that your code is computing the gradients correctly.

2.4 Gradient checking

In our neural network we are minimizing the cost function $J(\Theta)$. To perform gradient checking on the parameters, we can imagine "unrolling" the parameters $\Theta^{(1)}$ and $\Theta^{(2)}$ into one long vector θ. We can then think of the cost function as $J(\theta)$ and use the following gradient-checking procedure.
Suppose we have a function $f_{i}(\theta)$ that purportedly computes $\frac{\partial}{\partial\theta_{i}}J(\theta)$; we would like to check whether $f_{i}$ is outputting correct derivative values.
$$\theta^{(i+)}=\theta+\begin{bmatrix}0\\0\\\vdots\\\epsilon\\\vdots\\0\end{bmatrix},\qquad\theta^{(i-)}=\theta-\begin{bmatrix}0\\0\\\vdots\\\epsilon\\\vdots\\0\end{bmatrix}$$

So $\theta^{(i+)}$ is the same as θ, except that its i-th element has been incremented by ε. Similarly, $\theta^{(i-)}$ is the corresponding vector with its i-th element decreased by ε. We can now numerically verify the correctness of $f_i(\theta)$ by checking, for each i, that
$$f_i(\theta)\approx\frac{J(\theta^{(i+)})-J(\theta^{(i-)})}{2\epsilon}$$

How close the two values are to each other depends on the details of J. Assuming $\epsilon=10^{-4}$, we will usually find that the left- and right-hand sides of the above agree to at least 4 significant digits (and often many more).
The exercise materials already implement the numerical gradient computation for us in computeNumericalGradient.m. In the next step of ex4.m, the provided function checkNNGradients.m is run; it creates a small neural network and dataset that are used to check our gradients. If our backpropagation implementation is correct, the relative difference should be less than $10^{-9}$.
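
For reference, the central-difference idea behind computeNumericalGradient.m can be sketched as below. This is only an illustration of the formula above (J is assumed to be a function handle that maps an unrolled parameter vector to a scalar cost); the file provided with the exercise is what the script actually uses.

% Central-difference numerical gradient (sketch of the idea)
function numgrad = numericalGradientSketch(J, theta)
    numgrad = zeros(size(theta));
    perturb = zeros(size(theta));
    e = 1e-4;
    for i = 1:numel(theta)
        perturb(i) = e;
        numgrad(i) = (J(theta + perturb) - J(theta - perturb)) / (2 * e); % two-sided difference
        perturb(i) = 0;
    end
end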

To complete this part of nnCostFunction.m, we need to add the following code on top of the forward-propagation part:

%Backpropagation: compute the error terms
delta_3 = h - temp_y; 
delta_2 = delta_3 * Theta2 .* sigmoidGradient([ones(m, 1) X * Theta1']);
delta_2 = delta_2(:, 2:end);%drop the error term for the bias unit
D2 = delta_3' * temp;
D1 = delta_2' * X;
Theta2_grad = 1/m * D2;
Theta1_grad = 1/m * D1;

Once you are done, running ex4.m gives the following result:

Checking Backpropagation... 
   -0.0093   -0.0093
    0.0089    0.0089
   -0.0084   -0.0084
    0.0076    0.0076
   -0.0067   -0.0067
   -0.0000   -0.0000
    0.0000    0.0000
   -0.0000   -0.0000
    0.0000    0.0000
   -0.0000   -0.0000
   -0.0002   -0.0002
    0.0002    0.0002
   -0.0003   -0.0003
    0.0003    0.0003
   -0.0004   -0.0004
   -0.0001   -0.0001
    0.0001    0.0001
   -0.0001   -0.0001
    0.0002    0.0002
   -0.0002   -0.0002
    0.3145    0.3145
    0.1111    0.1111
    0.0974    0.0974
    0.1641    0.1641
    0.0576    0.0576
    0.0505    0.0505
    0.1646    0.1646
    0.0578    0.0578
    0.0508    0.0508
    0.1583    0.1583
    0.0559    0.0559
    0.0492    0.0492
    0.1511    0.1511
    0.0537    0.0537
    0.0471    0.0471
    0.1496    0.1496
    0.0532    0.0532
    0.0466    0.0466
    
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)

If your backpropagation implementation is correct, then 
the relative difference will be small (less than 1e-9). 

Relative Difference: 2.33553e-11

2.5 Regularized neural networks

After we have successfully implemented the backpropagation algorithm, we will add regularization to the gradient. To account for regularization, it turns out that we can add it as an additional term after computing the gradients using backpropagation. Specifically, after computing $\Delta_{ij}^{(l)}$ with backpropagation, we should add the regularization using

$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=D_{ij}^{(l)}=\frac{1}{m}\Delta_{ij}^{(l)}\qquad\text{for } j=0$$
$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=D_{ij}^{(l)}=\frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)}\qquad\text{for } j\geq 1$$

Note that we should not regularize the first column of $\Theta^{(l)}$, which is used for the bias terms. Furthermore, in the parameters $\Theta_{ij}^{(l)}$, i is indexed starting from 1 and j is indexed starting from 0. Thus,
$$\Theta^{(l)}=\begin{bmatrix}\Theta_{1,0}^{(l)} & \Theta_{1,1}^{(l)} & \cdots\\ \Theta_{2,0}^{(l)} & \Theta_{2,1}^{(l)} & \\ \vdots & & \ddots\end{bmatrix}$$

Indexing in MATLAB starts from 1 (for both i and j), so Theta1(2, 1) actually corresponds to $\Theta_{2,0}^{(l)}$. Now modify the code that computes grad in nnCostFunction to account for regularization.
To complete this part of nnCostFunction.m, we need to add the following code on top of the previous parts:

%Regularize the neural network gradients
Theta1_temp = Theta1;
Theta1_temp(:, 1) = 0;
Theta2_temp = Theta2;
Theta2_temp(:, 1) = 0;
Theta1_grad = Theta1_grad + lambda / m * Theta1_temp; %for the bias column (j = 1 in MATLAB) the added term is 0, so Theta1_grad is unchanged;
                                                      %for the other columns, lambda / m * Theta1_temp is added to Theta1_grad
Theta2_grad = Theta2_grad + lambda / m * Theta2_temp;

Once you are done, the ex4.m script will proceed to run gradient checking on your implementation. If your code is correct, you should expect to see a relative difference of less than $10^{-9}$.

Checking Backpropagation (w/ Regularization) ... 
   -0.0093   -0.0093
    0.0089    0.0089
   -0.0084   -0.0084
    0.0076    0.0076
   -0.0067   -0.0067
   -0.0168   -0.0168
    0.0394    0.0394
    0.0593    0.0593
    0.0248    0.0248
   -0.0327   -0.0327
   -0.0602   -0.0602
   -0.0320   -0.0320
    0.0249    0.0249
    0.0598    0.0598
    0.0386    0.0386
   -0.0174   -0.0174
   -0.0576   -0.0576
   -0.0452   -0.0452
    0.0091    0.0091
    0.0546    0.0546
    0.3145    0.3145
    0.1111    0.1111
    0.0974    0.0974
    0.1187    0.1187
    0.0000    0.0000
    0.0337    0.0337
    0.2040    0.2040
    0.1171    0.1171
    0.0755    0.0755
    0.1257    0.1257
   -0.0041   -0.0041
    0.0170    0.0170
    0.1763    0.1763
    0.1131    0.1131
    0.0862    0.0862
    0.1323    0.1323
   -0.0045   -0.0045
    0.0015    0.0015

The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)

If your backpropagation implementation is correct, then 
the relative difference will be small (less than 1e-9). 

Relative Difference: 2.25401e-11

All the code we need to fill in to complete nnCostFunction.m is listed below:

%Part 1: Forward propagation through the network and compute the cost
%Forward propagation to compute the activations
X = [ones(m, 1) X];%X becomes 5000x401
temp = sigmoid(X * Theta1');%temp is 5000x25
temp = [ones(m, 1) temp];%temp becomes 5000x26
h = sigmoid(temp * Theta2');%h is 5000x10
temp_y = zeros(size(h));%temp_y is a 5000x10 zero matrix
for i = 1:m
    temp_y(i, y(i)) = 1; %one-hot encode the labels
end
J = (1 / m) * sum(sum(-temp_y .* log(h) - (1 - temp_y) .* log(1 - h))); %unregularized part of the cost
%Regularization term (do not regularize the bias terms, i.e. the first column of each matrix)
Theta1_temp = Theta1(:, 2:end);%Theta1_temp is 25x400
Theta2_temp = Theta2(:, 2:end);%Theta2_temp is 10x25
J = J + (lambda / 2 / m) * (sum(sum(Theta1_temp .* Theta1_temp)) + sum(sum(Theta2_temp .* Theta2_temp)));

%Part 2: Backpropagation to compute the gradients
%Backpropagation: compute the error terms
delta_3 = h - temp_y; 
delta_2 = delta_3 * Theta2 .* sigmoidGradient([ones(m, 1) X * Theta1']);
delta_2 = delta_2(:, 2:end);%drop the error term for the bias unit
D2 = delta_3' * temp;
D1 = delta_2' * X;
Theta2_grad = 1/m * D2;
Theta1_grad = 1/m * D1;

%Part 3: Regularize the neural network gradients
Theta1_temp = Theta1;
Theta1_temp(:, 1) = 0;
Theta2_temp = Theta2;
Theta2_temp(:, 1) = 0;
Theta1_grad = Theta1_grad + lambda / m * Theta1_temp; %for the bias column (j = 1 in MATLAB) the added term is 0, so Theta1_grad is unchanged;
                                                      %for the other columns, lambda / m * Theta1_temp is added to Theta1_grad
Theta2_grad = Theta2_grad + lambda / m * Theta2_temp;

2.6 Learning parameters using fmincg

After we have successfully implemented the neural network cost function and gradient computation, the next step of the ex4.m script will use fmincg to learn a good set of parameters.
After the training completes, the ex4.m script reports the training accuracy of the classifier by computing the percentage of examples it classifies correctly. If your implementation is correct, you should see a reported training accuracy of about 95.3% (this may vary by about 1% due to the random initialization). It is possible to get higher training accuracy by training the neural network for more iterations; we can try training for more iterations (for example, set MaxIter to 400) and also vary the regularization parameter λ. With the right learning settings, it is possible to get the neural network to fit the training set perfectly.
Running the ex4.m script gives results like the following:

Training Neural Network... 
Iteration     1 | Cost: 3.306937e+00
Iteration     2 | Cost: 3.260973e+00
Iteration     3 | Cost: 3.215675e+00
Iteration     4 | Cost: 3.189654e+00
Iteration     5 | Cost: 3.151840e+00
Iteration     6 | Cost: 2.501444e+00
Iteration     7 | Cost: 2.029860e+00
Iteration     8 | Cost: 1.915830e+00
Iteration     9 | Cost: 1.766400e+00
Iteration    10 | Cost: 1.637602e+00
Iteration    11 | Cost: 1.503083e+00
Iteration    12 | Cost: 1.226819e+00
Iteration    13 | Cost: 1.125884e+00
Iteration    14 | Cost: 1.048719e+00
Iteration    15 | Cost: 9.816318e-01
Iteration    16 | Cost: 9.214108e-01
Iteration    17 | Cost: 8.817893e-01
Iteration    18 | Cost: 8.600574e-01
Iteration    19 | Cost: 8.270909e-01
Iteration    20 | Cost: 7.972135e-01
Iteration    21 | Cost: 7.763120e-01
Iteration    22 | Cost: 7.475535e-01
Iteration    23 | Cost: 7.246727e-01
Iteration    24 | Cost: 7.041597e-01
Iteration    25 | Cost: 6.757799e-01
Iteration    26 | Cost: 6.616298e-01
Iteration    27 | Cost: 6.573272e-01
Iteration    28 | Cost: 6.417071e-01
Iteration    29 | Cost: 6.365653e-01
Iteration    30 | Cost: 6.323066e-01
Iteration    31 | Cost: 6.146402e-01
Iteration    32 | Cost: 5.994475e-01
Iteration    33 | Cost: 5.805295e-01
Iteration    34 | Cost: 5.646673e-01
Iteration    35 | Cost: 5.558670e-01
Iteration    36 | Cost: 5.486594e-01
Iteration    37 | Cost: 5.316996e-01
Iteration    38 | Cost: 5.236084e-01
Iteration    39 | Cost: 5.147914e-01
Iteration    40 | Cost: 5.112560e-01
Iteration    41 | Cost: 5.064823e-01
Iteration    42 | Cost: 5.024433e-01
Iteration    43 | Cost: 4.975403e-01
Iteration    44 | Cost: 4.930847e-01
Iteration    45 | Cost: 4.858497e-01
Iteration    46 | Cost: 4.784534e-01
Iteration    47 | Cost: 4.602653e-01
Iteration    48 | Cost: 4.532307e-01
Iteration    49 | Cost: 4.449817e-01
Iteration    50 | Cost: 4.414897e-01
Training Set Accuracy: 96.520000

3. Visualizing the hidden layer

One way to understand what our neural network is learning is to visualize the representations captured by the hidden units. Given a particular hidden unit, one way to visualize what it computes is to find an input x that causes it to activate (that is, to have an activation value $a_{i}^{(l)}$ close to 1). For the neural network we trained, note that the i-th row of $\Theta^{(1)}$ is a 401-dimensional vector that represents the parameters for the i-th hidden unit. If we discard the bias term, we get a 400-dimensional vector that represents the weights from each input pixel to that hidden unit.
Thus, one way to visualize the "representation" captured by a hidden unit is to reshape this 400-dimensional vector into a 20×20 image and display it. The next step of ex4.m does this by using the displayData function, which shows an image with 25 units (similar to Figure 4), each corresponding to one hidden unit in the network.
In our trained network, we should find that the hidden units correspond roughly to detectors that look for strokes and other patterns in the input.

Figure 4. Visualization of the hidden units
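
To look at a single hidden unit rather than all 25 at once, a minimal sketch (assuming Theta1 is the trained 25×401 weight matrix produced by ex4.m) is:

% Sketch: visualize the weights of one hidden unit as a 20x20 image
unit = 1;                                      % index of the hidden unit to inspect
w = Theta1(unit, 2:end);                       % 1 x 400 pixel weights (bias column dropped)
imagesc(reshape(w, 20, 20)); colormap(gray);   % reshape to 20 x 20 and display in grayscale
axis square; axis off;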

3.1 Optional exercise

In this part of the exercise, we will try out different learning settings for the neural network to see how its performance varies with the regularization parameter λ and the number of training steps (the MaxIter option when using fmincg).
Neural networks are very powerful models that can form highly complex decision boundaries. Without regularization, a neural network may "overfit" a training set so that it obtains close to 100% accuracy on the training set but does not do as well on new examples it has never seen before. You can set the regularization parameter λ to a smaller value and the MaxIter parameter to a higher number of iterations to see this for yourself.
You can also see for yourself how the visualization of the hidden units changes as you change the learning parameters λ and MaxIter; a sketch of such an experiment is given below.
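
A simple way to run this comparison is to wrap the training and prediction steps of ex4.m in a loop over candidate values. The sketch below reuses the variables and helpers already defined by ex4.m (nnCostFunction, fmincg, predict, initial_nn_params, and so on) and is only an illustration of the experiment, with one possible set of λ values.

% Sketch: compare training accuracy for several regularization strengths
options = optimset('MaxIter', 400);
for lambda = [0 0.1 1 10]
    costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                       num_labels, X, y, lambda);
    [nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
    Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                     hidden_layer_size, (input_layer_size + 1));
    Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                     num_labels, (hidden_layer_size + 1));
    pred = predict(Theta1, Theta2, X);
    fprintf('lambda = %.1f, training accuracy = %.2f%%\n', lambda, mean(double(pred == y)) * 100);
end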

4. MATLAB implementation

4.1 ex4.m

%% Machine Learning Online Class - Exercise 4 Neural Network Learning

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  linear exercise. You will need to complete the following functions 
%  in this exercise:
%
%     sigmoidGradient.m
%     randInitializeWeights.m
%     nnCostFunction.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

%% Initialization
clear ; close all; clc

%% Setup the parameters you will use for this exercise
input_layer_size  = 400;  % 20x20 Input Images of Digits
hidden_layer_size = 25;   % 25 hidden units
num_labels = 10;          % 10 labels, from 1 to 10   
                          % (note that we have mapped "0" to label 10)

%% =========== Part 1: Loading and Visualizing Data =============
%  We start the exercise by first loading and visualizing the dataset. 
%  You will be working with a dataset that contains handwritten digits.
%

% Load Training Data
fprintf('Loading and Visualizing Data ...\n')

load('ex4data1.mat');
m = size(X, 1);%m = size(X, 1) = 5000, the number of rows of X

% Randomly select 100 data points to display
sel = randperm(size(X, 1));%randomly permute the example indices (100 of them are selected below)
sel = sel(1:100);

displayData(X(sel, :));

fprintf('Program paused. Press enter to continue.\n');
pause;


%% ================ Part 2: Loading Parameters ================
% In this part of the exercise, we load some pre-initialized 
% neural network parameters.

fprintf('\nLoading Saved Neural Network Parameters ...\n')

% Load the weights into variables Theta1 and Theta2
load('ex4weights.mat');

% Unroll parameters 
nn_params = [Theta1(:) ; Theta2(:)];%nn_params is a 25x401 + 10x26 = 10285-dimensional vector

%% ================ Part 3: Compute Cost (Feedforward) ================
%  To the neural network, you should first start by implementing the
%  feedforward part of the neural network that returns the cost only. You
%  should complete the code in nnCostFunction.m to return cost. After
%  implementing the feedforward to compute the cost, you can verify that
%  your implementation is correct by verifying that you get the same cost
%  as us for the fixed debugging parameters.
%
%  We suggest implementing the feedforward cost *without* regularization
%  first so that it will be easier for you to debug. Later, in part 4, you
%  will get to implement the regularized cost.
%
fprintf('\nFeedforward Using Neural Network ...\n')

% Weight regularization parameter (we set this to 0 here).
lambda = 0;%no regularization yet: set lambda = 0 and compute the cost

J = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);

fprintf(['Cost at parameters (loaded from ex4weights): %f '...
         '\n(this value should be about 0.287629)\n'], J);

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% =============== Part 4: Implement Regularization ===============
%  Once your cost function implementation is correct, you should now
%  continue to implement the regularization with the cost.
%

fprintf('\nChecking Cost Function (w/ Regularization) ... \n')

% Weight regularization parameter (we set this to 1 here).
lambda = 1;%with regularization: set lambda = 1 and compute the cost

J = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);

fprintf(['Cost at parameters (loaded from ex4weights): %f '...
         '\n(this value should be about 0.383770)\n'], J);

fprintf('Program paused. Press enter to continue.\n');
pause;


%% ================ Part 5: Sigmoid Gradient  ================
%  Before you start implementing the neural network, you will first
%  implement the gradient for the sigmoid function. You should complete the
%  code in the sigmoidGradient.m file.
%

fprintf('\nEvaluating sigmoid gradient...\n')

g = sigmoidGradient([-1 -0.5 0 0.5 1]);
fprintf('Sigmoid gradient evaluated at [-1 -0.5 0 0.5 1]:\n  ');
fprintf('%f ', g);
fprintf('\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;


%% ================ Part 6: Initializing Parameters ================
%  In this part of the exercise, you will be starting to implement a two
%  layer neural network that classifies digits. You will start by
%  implementing a function to initialize the weights of the neural network
%  (randInitializeWeights.m)

fprintf('\nInitializing Neural Network Parameters ...\n')

initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);%initial_Theta1 is 25x401
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);%initial_Theta2 is 10x26

% Unroll parameters
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];%initial_nn_params is a 25x401 + 10x26 = 10285-dimensional vector


%% =============== Part 7: Implement Backpropagation ===============
%  Once your cost matches up with ours, you should proceed to implement the
%  backpropagation algorithm for the neural network. You should add to the
%  code you've written in nnCostFunction.m to return the partial
%  derivatives of the parameters.
%
fprintf('\nChecking Backpropagation... \n');

%  Check gradients by running checkNNGradients
checkNNGradients;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;


%% =============== Part 8: Implement Regularization ===============
%  Once your backpropagation implementation is correct, you should now
%  continue to implement the regularization with the cost and gradient.
%

fprintf('\nChecking Backpropagation (w/ Regularization) ... \n')

%  Check gradients by running checkNNGradients
lambda = 3;
checkNNGradients(lambda);

% Also output the costFunction debugging values
debug_J  = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);

fprintf(['\n\nCost at (fixed) debugging parameters (w/ lambda = %f): %f ' ...
         '\n(for lambda = 3, this value should be about 0.576051)\n\n'], lambda, debug_J);

fprintf('Program paused. Press enter to continue.\n');
pause;


%% =================== Part 9: Training NN ===================
%  You have now implemented all the code necessary to train a neural 
%  network. To train your neural network, we will now use "fmincg", which
%  is a function which works similarly to "fminunc". Recall that these
%  advanced optimizers are able to train our cost functions efficiently as
%  long as we provide them with the gradient computations.
%
fprintf('\nTraining Neural Network... \n')

%  After you have completed the assignment, change the MaxIter to a larger
%  value to see how more training helps.
options = optimset('MaxIter', 50);  %MaxIter: maximum number of iterations

%  You should also try different values of lambda
lambda = 1;

% Create "short hand" for the cost function to be minimized
%costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size,num_labels, X, y, lambda);%@ creates an anonymous function with p as its argument

% Now, costFunction is a function that takes in only one argument (the neural network parameters)
%[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

[nn_params, cost] = fmincg (@(p) nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda), initial_nn_params, options);

% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), num_labels, (hidden_layer_size + 1));

fprintf('Program paused. Press enter to continue.\n');
pause;


%% ================= Part 10: Visualize Weights =================
%  You can now "visualize" what the neural network is learning by 
%  displaying the hidden units to see what features they are capturing in 
%  the data.

fprintf('\nVisualizing Neural Network... \n')

displayData(Theta1(:, 2:end));

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%Visualize the weights of the output layer:
% figure;
% displayData(Theta2(:, 2:end));
%% ================= Part 11: Implement Predict =================
%  After training the neural network, we would like to use it to predict
%  the labels. You will now implement the "predict" function to use the
%  neural network to predict the labels of the training set. This lets
%  you compute the training set accuracy.

pred = predict(Theta1, Theta2, X);

fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);

5. Python implementation

5.1 ex4.py

import numpy as np
import matplotlib.pylab as plt
import scipy.io as sio
import math
import scipy.linalg as slin
import scipy.optimize as op

input_layer_size = 400
hidden_layer_size = 25
num_labels = 10

# =========== Part 1: Loading and Visualizing Data =============
# Display 100 random images. Note: the array needs to be transposed at the end so the images appear upright.
def displayData(x):
    width = round(math.sqrt(np.size(x, 1)))
    m, n = np.shape(x)
    height = int(n/width)
    # grid layout for the displayed images
    drows = math.floor(math.sqrt(m))
    dcols = math.ceil(m/drows)

    pad = 1
    # create a blank background canvas
    darray = -1*np.ones((pad+drows*(height+pad), pad+dcols*(width+pad)))

    curr_ex = 0
    for j in range(drows):
        for i in range(dcols):
            if curr_ex >= m:
                break
            max_val = np.max(np.abs(x[curr_ex, :]))  # normalize by this example's maximum (use the argument x, not the global X)
            darray[pad+j*(height+pad):pad+j*(height+pad)+height, pad+i*(width+pad):pad+i*(width+pad)+width]\
                = x[curr_ex, :].reshape((height, width))/max_val
            curr_ex += 1
        if curr_ex >= m:
            break

    plt.imshow(darray.T, cmap='gray')
    plt.show()


print('Loading and Visualizing Data ...')
datainfo = sio.loadmat('ex4data1.mat')
X = datainfo['X']
Y = datainfo['y'][:, 0]
m = np.size(X, 0)
rand_indices = np.random.permutation(m)
sel = X[rand_indices[0:100], :]
displayData(sel)
_ = input('Press [Enter] to continue.')

# ================ Part 2: Loading Parameters ================
print('Loading Saved Neural Network Parameters ...')
thetainfo = sio.loadmat('ex4weights.mat')
theta1 = thetainfo['Theta1']
theta2 = thetainfo['Theta2']
nn_params = np.concatenate((theta1.flatten(), theta2.flatten()))  # nn_params is a 25x401 + 10x26 vector


# ================ Part 3: Compute Cost (Feedforward) ================
# Sigmoid function
def sigmoid(z):
    g = 1/(1+np.exp(-1*z))
    return g

# Derivative of the sigmoid function
def sigmoidGradient(z):
    g = sigmoid(z)*(1-sigmoid(z))
    return g

# Cost function
def nnCostFun(params, input_layer_size, hidden_layer_size, num_labels, x, y, lamb):
    theta1 = params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)  # theta1 = params[0:25*401].reshape((25, 401)), i.e. 25x401
    theta2 = params[hidden_layer_size * (input_layer_size + 1):].reshape(num_labels, hidden_layer_size + 1)  # theta2 = params[25*401:].reshape((10, 26)), i.e. 10x26
    m = np.size(x, 0)

    # Forward propagation --- index 0 corresponds to label 1, index 9 to label 10
    a1 = np.concatenate((np.ones((m, 1)), x), axis=1)
    z2 = a1.dot(theta1.T)
    l2 = np.size(z2, 0)
    a2 = np.concatenate((np.ones((l2, 1)), sigmoid(z2)), axis=1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    yt = np.zeros((m, num_labels))
    yt[np.arange(m), y - 1] = 1
    j = np.sum(-yt * np.log(a3) - (1 - yt) * np.log(1 - a3))
    j = j / m
    reg_cost = np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2))
    j = j + 1 / (2 * m) * lamb * reg_cost
    return j

# Gradient function
def nnGradFun(params, input_layer_size, hidden_layer_size, num_labels, x, y, lamb):
    theta1 = params[0:hidden_layer_size * (input_layer_size + 1)].reshape(hidden_layer_size, input_layer_size + 1)
    theta2 = params[(hidden_layer_size * (input_layer_size + 1)):].reshape(num_labels, hidden_layer_size + 1)
    m = np.size(x, 0)
    # Forward propagation --- index 0 corresponds to label 1, index 9 to label 10
    a1 = np.concatenate((np.ones((m, 1)), x), axis=1)
    z2 = a1.dot(theta1.T)
    l2 = np.size(z2, 0)
    a2 = np.concatenate((np.ones((l2, 1)), sigmoid(z2)), axis=1)
    z3 = a2.dot(theta2.T)
    a3 = sigmoid(z3)
    yt = np.zeros((m, num_labels))
    yt[np.arange(m), y - 1] = 1
    # Backpropagation
    delta3 = a3 - yt
    delta2 = delta3.dot(theta2) * sigmoidGradient(np.concatenate((np.ones((l2, 1)), z2), axis=1))
    theta2_grad = delta3.T.dot(a2)
    theta1_grad = delta2[:, 1:].T.dot(a1)

    theta2_grad = theta2_grad / m
    theta2_grad[:, 1:] = theta2_grad[:, 1:] + lamb / m * theta2[:, 1:]
    theta1_grad = theta1_grad / m
    theta1_grad[:, 1:] = theta1_grad[:, 1:] + lamb / m * theta1[:, 1:]

    grad = np.concatenate((theta1_grad.flatten(), theta2_grad.flatten()))
    return grad


print('Feedforward Using Neural Network ...')
lamb = 0  # no regularization yet: set lamb = 0 and compute the cost
j= nnCostFun(nn_params, input_layer_size, hidden_layer_size, num_labels, X, Y, lamb)
print('Cost at parameters (loaded from ex4weights): %f \n(this value should be about 0.287629)' % j)
_ = input('Press [Enter] to continue.')

# =============== Part 4: Implement Regularization ===============
print('Checking Cost Function (w/ Regularization) ...')
lamb = 1  # with regularization: set lamb = 1 and compute the cost

j = nnCostFun(nn_params, input_layer_size, hidden_layer_size, num_labels, X, Y, lamb)
print('Cost at parameters (loaded from ex4weights): %f \n(this value should be about 0.383770)' % j)
_ = input('Press [Enter] to continue.')

# ================ Part 5: Sigmoid Gradient  ================
print('Evaluating sigmoid gradient...')
g = sigmoidGradient(np.array([-1, -0.5, 0, 0.5, 1]))
print(g)
_ = input('Press [Enter] to continue.')

# ================ Part 6: Initializing Pameters ================
# Randomly initialize the theta parameters
def randInitializeWeight(lin, lout):
    epsilon_init = 0.12
    w = np.random.rand(lout, lin+1)*2*epsilon_init-epsilon_init  # entries of w lie in (-epsilon_init, epsilon_init)
    return w

print('Initializing Neural Network Parameters ...')
init_theta1 = randInitializeWeight(input_layer_size, hidden_layer_size)  # init_theta1 is 25x401
init_theta2 = randInitializeWeight(hidden_layer_size, num_labels)  # init_theta2 is 10x26

init_nn_params = np.concatenate((init_theta1.flatten(), init_theta2.flatten()))  # init_nn_params is a 25x401 + 10x26 vector

# =============== Part 7: Implement Backpropagation ===============
# Parameter initialization used for debugging
def debugInitWeights(fout, fin):
    w = np.sin(np.arange(fout*(fin+1))+1).reshape(fout, fin+1)/10  # initialize w using "sin" so it always takes the same values, which is useful for debugging
    return w

# Compute the gradient numerically
def computeNumericalGradient(J, theta, args):
    numgrad = np.zeros(np.size(theta))
    perturb = np.zeros(np.size(theta))
    epsilon = 1e-4
    for i in range(np.size(theta)):
        perturb[i] = epsilon
        loss1 = J(theta-perturb, *args)
        loss2= J(theta+perturb, *args)
        numgrad[i] = (loss2-loss1)/(2*epsilon)
        perturb[i] = 0
    return numgrad


# Check the gradients of the neural network
def checkNNGradients(lamb):
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5

    # Create a small training set (using the deterministic debug initialization)
    theta1 = debugInitWeights(hidden_layer_size, input_layer_size)
    theta2 = debugInitWeights(num_labels, hidden_layer_size)

    x = debugInitWeights(m, input_layer_size-1)  # reuse debugInitWeights to generate the training set x
    y = 1+(np.arange(m)+1) % num_labels  # y is a vector of positive integers no larger than num_labels

    nn_params = np.concatenate((theta1.flatten(), theta2.flatten()))

    cost = nnCostFun(nn_params, input_layer_size, hidden_layer_size, num_labels, x, y, lamb)
    grad = nnGradFun(nn_params, input_layer_size, hidden_layer_size, num_labels, x, y, lamb)
    numgrad = computeNumericalGradient(nnCostFun, nn_params,(input_layer_size, hidden_layer_size, num_labels, x, y, lamb))
    print(numgrad, '\n', grad)
    print('The above two columns you get should be very similar.\n \
    (Left-Your Numerical Gradient, Right-Analytical Gradient)')
    diff = slin.norm(numgrad-grad)/slin.norm(numgrad+grad)  # norm(A) of a vector A is the square root of the sum of its squared elements
    print('If your backpropagation implementation is correct, then \n\
         the relative difference will be small (less than 1e-9). \n\
         \nRelative Difference: ', diff)

print('Checking Backpropagation...')
checkNNGradients(0)  # without regularization
_ = input('Press [Enter] to continue.')

# =============== Part 8: Implement Regularization ===============
print('Checking Backpropagation (w/ Regularization) ...')

lamb = 3
checkNNGradients(lamb)  # with regularization

debug_j = nnCostFun(nn_params, input_layer_size, hidden_layer_size, num_labels, X, Y, lamb)
print('Cost at (fixed) debugging parameters (w/ lambda = %d): %f \n(this value should be about 0.576051)' % (lamb, debug_j))
_ = input('Press [Enter] to continue.')

# =================== Part 9: Training NN ===================
print('Training Neural Network...')
lamb = 1
param = op.fmin_cg(nnCostFun, init_nn_params, fprime=nnGradFun, args=(input_layer_size, hidden_layer_size, num_labels, X, Y, lamb), maxiter=50)

theta1 = param[0: hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size, input_layer_size+1)
theta2 = param[hidden_layer_size*(input_layer_size+1):].reshape(num_labels, hidden_layer_size+1)
_ = input('Press [Enter] to continue.')

# ================= Part 10: Visualize Weights =================
print('Visualizing Neural Network...')
displayData(theta1[:, 1:])
_ = input('Press [Enter] to continue.')

# ================= Part 11: Implement Predict =================
# Prediction function
def predict(t1, t2, x):
    m = np.size(x, 0)
    x = np.concatenate((np.ones((m, 1)), x), axis=1)
    temp1 = sigmoid(x.dot(t1.T))
    temp = np.concatenate((np.ones((m, 1)), temp1), axis=1)
    temp2 = sigmoid(temp.dot(t2.T))
    p = np.argmax(temp2, axis=1)+1
    return p

pred = predict(theta1, theta2, X)
print('Training Set Accuracy: ', np.mean(pred == Y) * 100)  # report accuracy as a percentage, matching the MATLAB script