UFLDL Tutorial: Exercise: Implement deep networks for digit classification

Latest recommended article published on 2018-12-05 20:59:06

帅气的弟八哥

Categories: Machine Learning, UFLDL Tutorial, matlab

Article link: https://blog.csdn.net/jiandanjinxin/article/details/73204330

Deep networks

Deep Learning and Unsupervised Feature Learning Tutorial Solutions

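Advantages of deep networks

They can learn more complex representations than a single-layer neural network.
The features learned by successive layers rise gradually from low level to high level. In image learning, for example, the first hidden layer might learn edge features, the second hidden layer might learn contour features, and later layers learn still higher-level features, possibly a part of the target object in the image. In other words, lower hidden layers learn low-level features and higher hidden layers learn high-level features.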

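Difficulties of training deep networks

1. Availability of data

Training relies on labeled data. Labeled data is often scarce, however, so for many problems it is hard to obtain enough examples to fit the parameters of a complex model. Given the high expressive power of deep networks, training on insufficient data leads to overfitting.

2. Local optima

Training a shallow network (one hidden layer) with supervised learning usually lets the parameters converge to reasonable values. Training a neural network with supervised learning generally involves solving a highly non-convex optimization problem (for example, minimizing the training error as a function of the network parameters). For a deep network, the search space of this non-convex problem is full of "bad" local optima, so gradient descent (or methods such as conjugate gradient and L-BFGS) does not work well.

3. Diffusion of gradients

The technical reason gradient descent (and related algorithms such as L-BFGS) works poorly on deep networks with randomly initialized weights is that the gradients become very small. Specifically, when backpropagation is used to compute the derivatives, the magnitude of the backpropagated gradients (from the output layer toward the earliest layers) decreases rapidly as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earliest layers is very small. Under gradient descent, the weights of those layers therefore change so slowly that they cannot learn effectively from the examples. This problem is usually called "diffusion of gradients".

A closely related problem: when the last few layers of a neural network contain enough neurons, those layers alone may be able to model the labeled data without help from the earliest layers, so training gives the earlier layers little to learn.
Consequently, a whole network trained with all layers randomly initialized performs about as well as a shallow network made up of only the last few layers of the deep network.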
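
The shrinking of backpropagated gradients can be seen in a small experiment. The sketch below is only an illustration under assumed settings: the depth, the layer width and the fixed cosine-based weights are arbitrary choices made so that the run is repeatable, not values from the exercise.

```matlab
% Toy illustration: norm of the backpropagated error in a deep sigmoid network.
% All sizes and weight values are arbitrary assumptions for this sketch.
depth = 8;                          % number of sigmoid layers
width = 10;                         % units per layer
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Forward pass with small, fixed (non-random) weights.
a = cell(depth + 1, 1);
W = cell(depth, 1);
a{1} = sin((1:width)');
for l = 1:depth
    W{l} = 0.3 * reshape(cos((1:width^2) + l), width, width);
    a{l + 1} = sigmoid(W{l} * a{l});
end

% Backward pass: start from a unit error signal at the top layer.
delta = ones(width, 1) .* a{depth + 1} .* (1 - a{depth + 1});
deltaNorm = zeros(depth, 1);
deltaNorm(depth) = norm(delta);
for l = (depth - 1):-1:1
    delta = (W{l + 1}' * delta) .* a{l + 1} .* (1 - a{l + 1});
    deltaNorm(l) = norm(delta);
end

% deltaNorm(l) is the error norm at the output of layer l.
fprintf('layer %d: %.3e\n', [1:depth; deltaNorm']);
```

Because the sigmoid derivative never exceeds 0.25 and the weights here are small, each step back toward the input multiplies the error norm by a factor below one, so the earliest layers receive the smallest signal.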

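Greedy layer-wise training

Main idea of the greedy layer-wise algorithm

Train only one layer of the network at a time: first train a network with a single hidden layer, and only after that layer has been trained start training a network with two hidden layers, and so on.
At each step, the first k-1 trained layers are held fixed and the k-th layer is added (that is, the output of the first k-1 trained layers becomes the input of the new layer). Each layer can be trained in a supervised way (for example, using the classification error at each step as the objective), but unsupervised methods (such as an autoencoder) are more common.
The weights obtained from training the layers individually are used to initialize the weights of the final (complete) deep network, and the whole network is then "fine-tuned" (all layers are optimized together to reduce the training error on the labeled training set).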

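Advantages of greedy layer-wise training

1. Availability of data

Labeled data is expensive to obtain, but large amounts of unlabeled data are easy to obtain.
The promise of self-taught learning is that it can learn a better model by exploiting large amounts of unlabeled data.
Specifically, the method uses unlabeled data to learn good initial weights for all layers (except the final classification layer that predicts the labels). Compared with purely supervised learning, self-taught learning can use far more data and can learn and discover patterns present in the data.

2. Better local optima

After the network has been trained on unlabeled data, the initial weights of each layer lie in a better region of parameter space than random initialization would give. The weights can then be fine-tuned starting from these positions.
Empirically, gradient descent started from these positions is more likely to converge to a good local optimum, because the unlabeled data has already provided a large amount of prior information about the patterns in the input data.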

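Remarks

When training a deep network, every hidden layer should use a nonlinear activation function f(x). This is because a composition of several linear layers has only the expressive power of a single linear function (for example, composing several linear equations just yields another linear equation). With linear activations, therefore, a deep network with several hidden layers is no more expressive than a network with one hidden layer.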
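
The remark about linear activations can be checked numerically. The sketch below uses made-up weights and an arbitrary input; it shows that two stacked linear layers give the same output as one linear layer whose weights are the product of the two.

```matlab
% Two linear layers (no nonlinearity) with arbitrary example weights.
W1 = [1 2; 3 4; 5 6];   b1 = [0.5; -1; 2];   % layer 1: 2 inputs -> 3 units
W2 = [1 0 -1; 2 1 0];   b2 = [1; -2];        % layer 2: 3 units  -> 2 outputs
x  = [0.3; -0.7];                            % arbitrary input

% Output of the two-layer linear network.
deepOut = W2 * (W1 * x + b1) + b2;

% Equivalent single linear layer: combined weight and bias.
Wc = W2 * W1;
bc = W2 * b1 + b2;
shallowOut = Wc * x + bc;

fprintf('max difference: %g\n', max(abs(deepOut - shallowOut)));
```

Inserting a sigmoid between the two layers breaks this equivalence, which is what gives the extra layers their additional expressive power.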

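From self-taught learning to deep networks

Pre-training and fine-tuning

Pre-training: obtain the model's initial parameters through training (the first layer is trained with an autoencoder and the second layer with logistic/softmax regression).

Fine-tuning: the model parameters can then be adjusted further to reduce the training error.

When should fine-tuning be used?

It is usually used only when a large amount of labeled training data is available. In that setting, fine-tuning can improve classifier performance significantly.
If, however, there is a large unlabeled dataset (for unsupervised feature learning/pre-training) but only a relatively small labeled training set, fine-tuning helps very little. In that case, the method described in Self-Taught Learning_Exercise (Stanford deep learning tutorial, UFLDL) can be used.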

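Experiment

Exercise: Implement deep networks for digit classification. A deep network is used to recognize the handwritten digits of the MNIST database.

The 60,000 labeled examples (60,000 images of 28x28 pixels) form the training set, which is fed into a stacked autoencoder. The first autoencoder extracts first-order features from the training set; these are fed into the second autoencoder, which extracts second-order features; the second-order features are then fed into a softmax classifier, which is trained on them together with the labels of the original data. Finally, backpropagation fine-tunes the weights of the whole network so that it fits the data better.
The 10,000 labeled examples (10,000 images of 28x28 pixels) form the test set: the trained network with its softmax classifier classifies the test set and the classification accuracy is computed. The network structure used in this section is as follows:

[Figure: network structure of the stacked autoencoder (image not available)]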

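Experiment steps

1. Initialize the parameters and load the MNIST handwritten digit database.
2. Train the first sparse autoencoder on the training set to obtain its weights sae1OptTheta; sae1OptTheta gives the first-order features sae1Features of the original data.
3. Train the second autoencoder on the first-order features sae1Features to obtain its weights sae2OptTheta; sae2OptTheta gives the second-order features sae2Features of the original data.
4. Train the softmax classifier on the second-order features sae2Features and the labels of the original data to obtain its weights saeSoftmaxOptTheta.
5. Fine-tune with error backpropagation. The weights obtained so far (sae1OptTheta, sae2OptTheta, saeSoftmaxOptTheta) are combined into stackedAETheta, the weights of the whole network before fine-tuning. Backpropagation on the original data and labels then fine-tunes stackedAETheta and yields stackedAEOptTheta, the weights of the whole network after fine-tuning.
6. Measure the accuracy of the classifier on the test set. The test data is classified with both stackedAETheta (before fine-tuning) and stackedAEOptTheta (after fine-tuning), giving the classification accuracy of each.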
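
To make step 5 concrete, the sketch below counts how many parameters end up in stackedAETheta for the sizes used in this exercise. It assumes, as the index arithmetic in the script suggests, that each autoencoder vector is laid out as encoder weights, decoder weights, encoder biases, decoder biases, and that only the encoder half is kept for the stack.

```matlab
% Sizes taken from the exercise script.
inputSize    = 28 * 28;
hiddenSizeL1 = 200;
hiddenSizeL2 = 200;
numClasses   = 10;

% A full sparse autoencoder holds encoder and decoder weights and biases.
nSae1Full = 2 * hiddenSizeL1 * inputSize + hiddenSizeL1 + inputSize;

% Only the encoder part (weights and biases) of each autoencoder is stacked.
nLayer1  = hiddenSizeL1 * inputSize    + hiddenSizeL1;
nLayer2  = hiddenSizeL2 * hiddenSizeL1 + hiddenSizeL2;
nSoftmax = numClasses * hiddenSizeL2;   % softmax weights, no bias term

% stackedAETheta = [softmax parameters ; stack parameters]
nTotal = nSoftmax + nLayer1 + nLayer2;

fprintf('sae1 full: %d, layer 1: %d, layer 2: %d, softmax: %d, total: %d\n', ...
        nSae1Full, nLayer1, nLayer2, nSoftmax, nTotal);
```

The decoder halves of both autoencoders are discarded after pre-training, which is why the fine-tuned vector is much shorter than the two autoencoder vectors combined.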

stackedAEExercise.m

%% CS294A/CS294W Stacked Autoencoder Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  stacked autoencoder exercise. You will need to complete code in
%  stackedAECost.m
%  You will also need to have implemented sparseAutoencoderCost.m and 
%  softmaxCost.m from previous exercises. You will need the initializeParameters.m
%  loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%  
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file. 
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.
% Parameters of the stacked autoencoder
% Input and output sizes of the whole network
inputSize = 28 * 28;
numClasses = 10;

% Sparse autoencoder architecture

hiddenSizeL1 = 200;    % Layer 1 Hidden Size
hiddenSizeL2 = 200;    % Layer 2 Hidden Size
sparsityParam = 0.1;   % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                       %  in the lecture notes).
% Regularization weights
lambda = 3e-3;         % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Load data from the MNIST database
% Load the MNIST images and labels
%  This loads our training data from the MNIST database files.

% Load MNIST database files
DISPLAY = true;

addpath mnist/
trainData = loadMNISTImages('mnist/train-images.idx3-ubyte');
trainLabels = loadMNISTLabels('mnist/train-labels.idx1-ubyte');

trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1

%%======================================================================
%% STEP 2: Train the first sparse autoencoder
%  This trains the first sparse autoencoder on the MNIST training images
%  in trainData, which are treated here as an unlabelled training set.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

%  Randomly initialize the parameters
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the first layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL1"
%                You should store the optimal parameters in sae1OptTheta

%  Train the sparse autoencoder on the unlabelled examples; the learned
%  parameters are stored in the vector sae1OptTheta.
%  Options for the optimizer

addpath minFunc/;
options = struct;
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';

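% Run the optimizer to obtain sae1OptTheta, the parameters of the first layer
% (visible size inputSize, hidden size hiddenSizeL1)
[sae1OptTheta, cost] =  minFunc(@(p)sparseAutoencoderCost(p,...
    inputSize,hiddenSizeL1,lambda,sparsityParam,beta,trainData),sae1Theta,options);
if ~exist('saves', 'dir'), mkdir('saves'); end  % save() fails if the folder is missing
save('saves/step2.mat', 'sae1OptTheta');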

if DISPLAY
  W1 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
  display_network(W1');
end

% -------------------------------------------------------------------------



%%======================================================================
%% STEP 3: Train the second sparse autoencoder (its training data are the features extracted by the first autoencoder)
%  This trains the second sparse autoencoder on the first autoencoder
%  features.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.

%  Use the weights of the first sparse autoencoder, sae1OptTheta, to compute
%  the first-order features of the input data: sae1Features is the output of
%  the first autoencoder (dimension hiddenSizeL1)
[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
                                        inputSize, trainData);

%  Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the second layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL2" and an inputsize of
%                "hiddenSizeL1"
%
%                You should store the optimal parameters in sae2OptTheta


% Train the second autoencoder (input size hiddenSizeL1, hidden size hiddenSizeL2);
% the optimized parameter vector is stored in sae2OptTheta
[sae2OptTheta, cost] =  minFunc(@(p)sparseAutoencoderCost(p,...
    hiddenSizeL1,hiddenSizeL2,lambda,sparsityParam,beta,sae1Features),sae2Theta,options);
save('saves/step3.mat', 'sae2OptTheta');

figure;
if DISPLAY
  W11 = reshape(sae1OptTheta(1:hiddenSizeL1 * inputSize), hiddenSizeL1, inputSize);
  W12 = reshape(sae2OptTheta(1:hiddenSizeL2 * hiddenSizeL1), hiddenSizeL2, hiddenSizeL1);
  % TODO(zellyn): figure out how to display a 2-level network
%  display_network(log(W11' ./ (1-W11')) * W12');
%   W12_temp = W12(1:196,1:196);
%   display_network(W12_temp');
%   figure;
%   display_network(W12_temp');
end









% -------------------------------------------------------------------------


%%======================================================================
%% STEP 4: Train the softmax classifier on the second-order features
%  This trains the softmax classifier on the second autoencoder features
%  (its input is sae2Features).
%  If you've correctly implemented softmaxCost.m, you don't need
%  to change anything here.


%  Use the weights of the second sparse autoencoder, sae2OptTheta, to compute
%  the second-order features of the input data: sae2Features is the output of
%  the second autoencoder (dimension hiddenSizeL2)
[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
                                        hiddenSizeL1, sae1Features);

%  Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);


%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the softmax classifier, the classifier takes in
%                input of dimension "hiddenSizeL2" corresponding to the
%                hidden layer size of the 2nd layer.
%
%                You should store the optimal parameters in saeSoftmaxOptTheta 
%
%  NOTE: If you used softmaxTrain to complete this part of the exercise,
%        set saeSoftmaxOptTheta = softmaxModel.optTheta(:);

% Train the softmax classifier to obtain its optimized parameter vector

softmaxLambda = 1e-4;
numClasses = 10;
softoptions = struct;
softoptions.maxIter = 400;
softmaxModel = softmaxTrain(hiddenSizeL2,numClasses,softmaxLambda,...
                            sae2Features,trainLabels,softoptions);
saeSoftmaxOptTheta = softmaxModel.optTheta(:); % weights of the softmax classifier

save('saves/step4.mat', 'saeSoftmaxOptTheta');








% -------------------------------------------------------------------------



%%======================================================================
%% STEP 5: Finetune softmax model (fine-tune the stacked autoencoder)

% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.


% The parameters learned by the sparse autoencoders (stack) and by the softmax
% classifier (saeSoftmaxOptTheta) are the initial values for fine-tuning

% Initialize the stack using the parameters learned
stack = cell(2,1); % cell array holding the sparse autoencoder parameters
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
                     hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
                     hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);

% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack); % flatten the stack into a vector and record the network structure

% Parameters of the whole model (saeSoftmaxOptTheta followed by the stack)
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ];

%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the deep network, hidden size here refers to the
%                dimension of the input to the classifier, which corresponds 
%                to "hiddenSizeL2".
%
%
%  Fine-tune with backpropagation to obtain the parameters of the whole
%  network, stackedAEOptTheta

[stackedAEOptTheta, cost] =  minFunc(@(p)stackedAECost(p,inputSize,hiddenSizeL2,...
                         numClasses, netconfig,lambda, trainData, trainLabels),...
                        stackedAETheta,options); % fine-tunes all layers jointly
save('saves/step5.mat', 'stackedAEOptTheta');

figure;
if DISPLAY
  optStack = params2stack(stackedAEOptTheta(hiddenSizeL2*numClasses+1:end), netconfig);
  W11 = optStack{1}.w;
  W12 = optStack{2}.w;
  % TODO(zellyn): figure out how to display a 2-level network
  % display_network(log(1 ./ (1-W11')) * W12');
end


% -------------------------------------------------------------------------



%%======================================================================
%% STEP 6: Test 
%  Instructions: You will need to complete the code in stackedAEPredict.m
%                before running this part of the code
%

% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set

% Load the labelled test set
testData = loadMNISTImages('mnist/t10k-images-idx3-ubyte');
testLabels = loadMNISTLabels('mnist/t10k-labels-idx1-ubyte');

testLabels(testLabels == 0) = 10; % Remap 0 to 10


% Predict with the parameters before fine-tuning
[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:)); % classification accuracy
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

% Predict with the parameters after fine-tuning
[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
                          numClasses, netconfig, testData);

acc = mean(testLabels(:) == pred(:)); % classification accuracy
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);

% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy:  97.6%
%
% If your values are too low (accuracy less than 95%), you should check 
% your code for errors, and make sure you are training on the 
% entire data set of 60000 28x28 training images 
% (unless you modified the loading code, this should be the case)
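
The accuracy computation in the test step depends on labels and predictions sharing the same 1..10 coding. The following small sketch, with made-up labels and predictions, shows the remapping of digit 0 to class 10 and the accuracy formula used by the script.

```matlab
% Hypothetical labels as stored in MNIST (digits 0-9) and hypothetical predictions.
rawLabels = [0 1 2 9 0 5]';
pred      = [10 1 3 9 10 5]';   % predictions use classes 1..10

% Same remap as the script: digit 0 becomes class 10 so labels start at 1.
labels = rawLabels;
labels(labels == 0) = 10;

% Accuracy is the fraction of examples whose predicted class equals the label.
acc = mean(labels(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);
```

Skipping the remap would count every correctly classified zero as an error, because the classifier never outputs class 0.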

stackedAECost.m

function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
                                              numClasses, netconfig, ...
                                              lambda, data, labels)

% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.

% Computes the cost of the whole model and its gradient.
% Note: once this function is complete, it is a good idea to verify the
% gradient with checkStackedAECost.

% theta: trained weights from the autoencoder (parameter vector of the whole network)
% inputSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer* (hidden size of the last sparse autoencoder)
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example.
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example


%% Unroll softmaxTheta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));
end

cost = 0; % You need to compute this

% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));


%% --------------------------- YOUR CODE HERE -----------------------------
%  Instructions: Compute the cost function and gradient vector for 
%                the stacked autoencoder.
%
%                You are given a stack variable which is a cell-array of
%                the weights and biases for every layer. In particular, you
%                can refer to the weights of Layer d, using stack{d}.w and
%                the biases using stack{d}.b . To get the total number of
%                layers, you can use numel(stack).
%
%                The last layer of the network is connected to the softmax
%                classification layer, softmaxTheta.
%
%                You should compute the gradients for the softmaxTheta,
%                storing that in softmaxThetaGrad. Similarly, you should
%                compute the gradients for each layer in the stack, storing
%                the gradients in stackgrad{d}.w and stackgrad{d}.b
%                Note that the size of the matrices in stackgrad should
%                match exactly that of the size of the matrices in stack.
%

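depth = numel(stack);    % number of hidden layers
a = cell(depth+1, 1);    % a{1} is the input; a{i} (i > 1) holds the activations of hidden layer i-1
a{1} = data;             % the input layer simply outputs the data
Jweight = 0;             % weight decay accumulator
m = size(data, 2);       % number of examples

% Forward pass: activations of the hidden layers
for i = 2:numel(a)
    a{i} = sigmoid(stack{i-1}.w*a{i-1} + repmat(stack{i-1}.b, [1 size(a{i-1}, 2)]));
end

% Softmax probabilities; M now holds the class scores, reusing the name defined above
M = softmaxTheta*a{depth+1};
M = bsxfun(@minus, M, max(M, [], 1));  % subtract the column max so exp() cannot overflow
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));

Jweight = Jweight + sum(softmaxTheta(:).^2);

% Cost of the softmax classifier. It is the cost of the whole model because
% the softmax output is the only output of the stacked network.
% cost = cross-entropy term + weight decay (regularization) term
cost = -1/m .* groundTruth(:)'*log(p(:)) + lambda/2*Jweight;

% Gradient of the cost with respect to the softmax parameters
softmaxThetaGrad = -1/m .* (groundTruth - p)*a{depth+1}' + lambda*softmaxTheta;

delta = cell(depth+1, 1);  % delta{i+1} is the error term of hidden layer i

% Error term of the last hidden layer, propagated back from the softmax layer
delta{depth+1} = -softmaxTheta' * (groundTruth - p) .* a{depth+1} .* (1-a{depth+1});

% Error terms of the remaining hidden layers
for i = depth:-1:2
    delta{i} = stack{i}.w'*delta{i+1}.*a{i}.*(1-a{i});
end

% Gradients of the hidden-layer parameters from the error terms
% (weight decay is applied to softmaxTheta only)
for i = depth:-1:1
    stackgrad{i}.w = 1/m .* delta{i+1}*a{i}';
    stackgrad{i}.b = 1/m .* sum(delta{i+1}, 2);
end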

% -------------------------------------------------------------------------

%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
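
The comments in stackedAECost.m recommend checking the gradient numerically. The sketch below is not the checkStackedAECost function from the exercise; it is a reduced, self-contained version of the same idea that checks only the softmax part of the gradient formula on made-up data. The sizes, the fake activations and the variable names are assumptions for illustration.

```matlab
% Toy problem: 3 classes, 4 features, 5 examples (all values are made up).
numClasses = 3;  hiddenSize = 4;  m = 5;  lambda = 1e-4;
a = 1 ./ (1 + exp(-reshape(sin(1:hiddenSize*m), hiddenSize, m)));  % fake top-layer activations
labels = [1 2 3 1 2];
groundTruth = full(sparse(labels, 1:m, 1, numClasses, m));
theta = 0.1 * reshape(cos(1:numClasses*hiddenSize), numClasses, hiddenSize);

% Softmax probabilities and cost, written as anonymous functions of theta.
colNormalize = @(E) bsxfun(@rdivide, E, sum(E, 1));
probs  = @(T) colNormalize(exp(bsxfun(@minus, T * a, max(T * a, [], 1))));
costFn = @(T) -1/m * sum(sum(groundTruth .* log(probs(T)))) + lambda/2 * sum(T(:).^2);

% Analytic gradient, same formula as softmaxThetaGrad in stackedAECost.m.
analyticGrad = -1/m * (groundTruth - probs(theta)) * a' + lambda * theta;

% Numerical gradient by central differences, one parameter at a time.
epsilon = 1e-4;
numGrad = zeros(size(theta));
for k = 1:numel(theta)
    step = zeros(size(theta));
    step(k) = epsilon;
    numGrad(k) = (costFn(theta + step) - costFn(theta - step)) / (2 * epsilon);
end

relDiff = norm(numGrad(:) - analyticGrad(:)) / norm(numGrad(:) + analyticGrad(:));
fprintf('Relative difference: %g\n', relDiff);
```

A relative difference far below 1e-6 indicates that the analytic formula and the cost agree; the same comparison applied to the full parameter vector would also cover the stack gradients.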

stackedAEPredict.m

function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)

% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.

% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% data: Our matrix containing the data to classify as columns.  So, data(:,i) is the i-th example.

% Your code should produce the prediction matrix 
% pred, where pred(i) is argmax_c P(y(c) | x(i)).

%% Unroll theta parameter

% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);

% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start 
%                from 1.

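%% Forward propagation
depth = numel(stack);
a = cell(depth+1, 1);   % a{1} is the input, a{i+1} the activation of hidden layer i
a{1} = data;
m = size(data, 2);

for i = 2:depth+1
    a{i} = sigmoid(stack{i-1}.w*a{i-1} + repmat(stack{i-1}.b, [1 m]));
end


% % %% Output of the softmax model, Htheta
% % softmaxData=a{depth+1};% the softmax input is the output of the last autoencoder layer in the stack
% % M=softmaxTheta*softmaxData;% score matrix M
% % M=bsxfun(@minus,M,max(M));% subtract each column's maximum to prevent overflow
% % Htheta=bsxfun(@rdivide,exp(M),sum(exp(M)));% hypothesis output of the softmax model
% % %% The row index of the largest entry in each column of Htheta is that example's class
% % [~,pred]=max(Htheta);

% The softmax is monotonic in the scores, so the largest score already
% identifies the most probable class; no normalization is needed here.
[~, pred] = max(softmaxTheta*a{depth+1}, [], 1);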









% -----------------------------------------------------------

end


% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end
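
Two details of the softmax step are worth a small demonstration: subtracting the column maximum before exponentiating avoids overflow, and the predicted class can be read from the raw scores without normalizing. The scores below are made-up values chosen to trigger the overflow.

```matlab
% Hypothetical class scores: rows are classes, columns are examples.
scores = [1000 2; 1001 -3; 999 0.5];

% Naive softmax: exp(1000) overflows to Inf, and Inf/Inf gives NaN.
naive = bsxfun(@rdivide, exp(scores), sum(exp(scores), 1));

% Stable softmax: shift each column so its largest score is 0.
shifted = bsxfun(@minus, scores, max(scores, [], 1));
stable  = bsxfun(@rdivide, exp(shifted), sum(exp(shifted), 1));

% The arg-max of the scores equals the arg-max of the probabilities.
[~, predFromScores] = max(scores, [], 1);
[~, predFromProbs]  = max(stable, [], 1);

disp(naive);
disp(stable);
disp([predFromScores; predFromProbs]);
```

The shift does not change the probabilities because the same constant is removed from every score in a column and cancels in the ratio.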

display_network.m

function [h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)
% This function visualizes filters in matrix A. Each column of A is a
% filter. We will reshape each column into a square image and visualizes
% on each cell of the visualization panel. 
% All other parameters are optional, usually you do not need to worry
% about it.
% opt_normalize: whether we need to normalize the filter so that all of
% them can have similar contrast. Default value is true.

% true: each patch is normalized on its own (its values are divided by the
% largest absolute value in that patch); false: the whole image is normalized
% together (values are divided by the largest absolute value in A).

% opt_graycolor: whether we use gray as the heat map. Default is true.
% true: use the gray colormap; false: keep the current colormap.

% cols: how many columns are there in the display. Default value is the
% squareroot of the number of columns in A. It sets how many patches are
% shown in each row of the displayed image.

% opt_colmajor: you can switch convention to row major for A. In that
% case, each row of A is a filter. Default value is false.
% In this implementation the flag sets the order in which patches are placed:
% true: patches fill the image column by column, top to bottom;
% false: patches fill the image row by row, left to right (the default).



warning off all % suppress warnings

% Default values of the optional arguments
% exist(name, 'var') checks only whether a variable with that name exists
if ~exist('opt_normalize', 'var') || isempty(opt_normalize)
    opt_normalize= true;
end

if ~exist('opt_graycolor', 'var') || isempty(opt_graycolor)
    opt_graycolor= true;
end

if ~exist('opt_colmajor', 'var') || isempty(opt_colmajor)
    opt_colmajor = false;
end

% rescale: subtract the global mean so the data are zero-mean
A = A - mean(A(:));

if opt_graycolor, colormap(gray); end % use the gray colormap when a grayscale display is requested


% compute rows, cols: the number of patches per row (n columns) and per column (m rows)
[L M]=size(A); % M is the total number of patches
sz=sqrt(L); % height and width of each patch in pixels
buf=1; % separator width: patches are kept apart by one row and one column of background pixels (value -1)
if ~exist('cols', 'var') % no column count was given
    if floor(sqrt(M))^2 ~= M % M is not a perfect square: start from the square root of M rounded up
        n=ceil(sqrt(M));
        while mod(M, n)~=0 && n<1.2*sqrt(M), n=n+1; end % increase n while it does not divide M and stays below 1.2*sqrt(M)
        m=ceil(M/n); % number of rows: M divided by the patches per row, rounded up
    else
        n=sqrt(M); % M is a perfect square: m and n both equal sqrt(M)
        m=n;
    end
else
    n = cols; % cols was given: use it as n, and set m to M/n rounded up
    m = ceil(M/n);
end

array=-ones(buf+m*(sz+buf),buf+n*(sz+buf)); % canvas filled with -1 so every patch is bordered by separator pixels

if ~opt_graycolor % without the gray colormap, use -0.1 instead of -1 for the separator pixels
    array = 0.1.* array;
end


if ~opt_colmajor   % opt_colmajor is false: place patches row by row, left to right
    k=1;            % index of the current patch
    for i=1:m       % row of the grid
        for j=1:n   % column of the grid
            if k>M, 
                continue; 
            end
            clim=max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim; % i indexes rows (m of them), j indexes columns (n of them)
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/max(abs(A(:)));
            end
            k=k+1;
        end
    end
else            % opt_colmajor is true: place patches column by column, top to bottom
    k=1;
    for j=1:n    % column of the grid
        for i=1:m   % row of the grid
            if k>M, 
                continue; 
            end
            clim=max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
            else
                % divide by the global maximum, as in the row-by-row branch
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/max(abs(A(:)));
            end
            k=k+1;
        end
    end
end

% The colour limits are fixed to [-1 1]. The original call also passed
% 'EraseMode','none', which MATLAB releases from R2014b onward no longer
% support, so the option is dropped here. Both branches of the original
% if/else made the same call, so a single call is enough.
h=imagesc(array,[-1 1]);
axis image off   % hide the axes

drawnow;   % flush the graphics queue so the image appears immediately

warning on all  % re-enable warnings
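
The grid layout chosen by display_network can be traced by hand. The sketch below repeats only the layout arithmetic of the function for the first-layer filters of this exercise (200 filters of 28x28 pixels); it draws nothing and is meant as a worked example rather than a replacement for the function.

```matlab
M   = 200;   % number of filters (columns of A)
sz  = 28;    % side length of each filter in pixels
buf = 1;     % width of the separator between patches

if floor(sqrt(M))^2 ~= M
    % M is not a perfect square: start at ceil(sqrt(M)) and widen the grid
    % while n does not divide M and stays below 1.2*sqrt(M).
    n = ceil(sqrt(M));
    while mod(M, n) ~= 0 && n < 1.2 * sqrt(M), n = n + 1; end
    m = ceil(M / n);
else
    n = sqrt(M);
    m = n;
end

% Size of the canvas that holds all patches plus their separators.
canvasSize = [buf + m * (sz + buf), buf + n * (sz + buf)];
fprintf('%d filters -> %d rows x %d cols, canvas %d x %d\n', M, m, n, canvasSize);
```

For 200 filters the search stops at 17 columns and 12 rows, so the last row of the grid is only partly filled.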