# 【Deep Learning】1. Deep Learning: Sparse Autoencoders

1. Introduction

Supervised learning is one of the most powerful tools in artificial intelligence, powering applications such as object recognition and self-driving cars. Yet supervised learning is still severely limited today. In particular, most applications of supervised learning still require the input features to be specified by hand. Once a good feature representation is available, supervised learning usually works well; but in fields such as computer vision, audio processing, and natural language processing, hundreds of researchers have spent many years hand-engineering the best features. Despite the many ingenious methods devised for finding features, one cannot help but ask whether we can do better. Ideally, we would like a method that learns useful features automatically.

Here we introduce the sparse autoencoder, a learning algorithm that can automatically learn features from unlabeled data. There are many more sophisticated variants of sparse-autoencoder learning algorithms beyond it, which we will not cover one by one.

We first introduce the feedforward neural network and the backpropagation algorithm used in supervised learning, then show how these two ingredients can be combined to build an autoencoder, an unsupervised learning algorithm. Finally, we extend it further to obtain the sparse autoencoder.

2. Neural Networks

Let's first review the simplest possible neuron.

From the figure we can see the difference between the two activation functions: the sigmoid saturates at 0 on one side and at 1 on the other, whereas tanh saturates at -1 and 1 respectively. The tanh function was not mentioned in Andrew Ng's Machine Learning course, so this is probably the first time it appears; the material that follows still uses the sigmoid function, so there is no significant departure from what the Machine Learning course covered.
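
For reference, the two activation functions are defined as

$$f(z) = \operatorname{sigmoid}(z) = \frac{1}{1 + e^{-z}}, \qquad f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}.$$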

2.1 Neural Network Formulation

As before, a figure makes this clearest; once the picture is in front of you, the structure is obvious at a glance.
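
In the notation of the UFLDL notes (which the code later in this post also follows, with W1, b1 as the hidden-layer parameters and W2, b2 as the output-layer parameters), the forward pass of a network with a single hidden layer is

$$z^{(2)} = W^{(1)} x + b^{(1)}, \quad a^{(2)} = f(z^{(2)}), \quad z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \quad h_{W,b}(x) = a^{(3)} = f(z^{(3)}).$$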

2.2 The Backpropagation Algorithm
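
For a single training example $(x, y)$, backpropagation computes the error terms of the output and hidden layers and then the gradients (for the sigmoid, $f'(z) = f(z)(1 - f(z))$):

$$\delta^{(3)} = -(y - a^{(3)}) \odot f'(z^{(3)}), \qquad \delta^{(2)} = \big((W^{(2)})^{T} \delta^{(3)}\big) \odot f'(z^{(2)}),$$

$$\nabla_{W^{(l)}} J(W,b;x,y) = \delta^{(l+1)} (a^{(l)})^{T}, \qquad \nabla_{b^{(l)}} J(W,b;x,y) = \delta^{(l+1)}.$$

Averaging these over the $m$ training examples and adding the weight-decay term $\lambda W^{(l)}$ gives the batch gradient, e.g. $\frac{1}{m}\Delta W^{(1)} + \lambda W^{(1)}$ for $W^{(1)}$, which is the quantity the cost-function code below is asked to compute.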

2.3 Gradient Checking
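
Gradient checking compares the analytic gradient against a two-sided finite-difference approximation, one parameter at a time:

$$\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta + \varepsilon\, e_i) - J(\theta - \varepsilon\, e_i)}{2\varepsilon}, \qquad \varepsilon \approx 10^{-4},$$

where $e_i$ is the $i$-th unit vector. This is exactly what the computeNumericalGradient code below implements.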

3. Autoencoders and Sparsity

So far we have only discussed neural networks for supervised learning. Now suppose we have only an unlabeled training set. If we make the targets equal to the inputs, i.e. set $y^{(i)} = x^{(i)}$, we obtain the autoencoder neural network, an unsupervised learning algorithm (it still uses the supervised-learning machinery; the only difference is that we have inputs but no labels, so we simply use the inputs themselves as the targets).

Below we introduce some new notation.

The average activation of hidden unit $j$ is $\hat{\rho}_j = \frac{1}{m}\sum_{i=1}^{m} a_j^{(2)}(x^{(i)})$, and $\rho$ is the sparsity parameter. Ideally, we would like $\hat{\rho}_j = \rho$. To achieve this, we introduce a new penalty term (which looks very much like the cost function of a classification problem in supervised learning), written as

$$\sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s_2}\left[\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right].$$

When $\rho = 0.2$, this penalty, viewed as a function of $\hat{\rho}_j$, looks like the curve plotted below.
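
Here is a minimal sketch that plots the penalty for $\rho = 0.2$ (the variable names are mine, not from the exercise code):

```matlab
% KL-divergence sparsity penalty KL(rho || rho_hat) as a function of rho_hat
rho = 0.2;                               % desired average activation
rho_hat = linspace(0.001, 0.999, 500);   % possible average activations of a hidden unit
kl = rho .* log(rho ./ rho_hat) + (1 - rho) .* log((1 - rho) ./ (1 - rho_hat));
plot(rho_hat, kl);
xlabel('rho hat'); ylabel('KL(rho || rho hat)');
```

The penalty is zero when $\hat{\rho}_j = \rho$ and grows without bound as $\hat{\rho}_j$ approaches 0 or 1, which is what pushes each hidden unit's average activation toward $\rho$.

The rest of this post is the code from the accompanying programming exercise. First, computeNumericalGradient.m, the numerical gradient check described in Section 2.3: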

```matlab
function numgrad = computeNumericalGradient(J, theta)
% theta: a vector of parameters
% J: a function that outputs a real-number. Calling y = J(theta) will return the
% function value at theta.

%% ---------- YOUR CODE HERE --------------------------------------
% Instructions:
% (See Section 2.3 of the lecture notes.)
% You should write code so that numgrad(i) is (the numerical approximation to) the
% partial derivative of J with respect to the i-th input argument, evaluated at theta.
% I.e., numgrad(i) should be the (approximately) the partial derivative of J with
% respect to theta(i).
%
% Hint: You will probably want to compute the elements of numgrad one at a time.

numgrad = zeros(size(theta));
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
    % Perturb only the p-th parameter
    perturb(p) = e;
    loss1 = J(theta - perturb);
    loss2 = J(theta + perturb);
    % Two-sided difference approximation to dJ/dtheta(p)
    numgrad(p) = (loss2 - loss1) / (2*e);
    perturb(p) = 0;
end

%% ---------------------------------------------------------------
end
```
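
Next, sampleIMAGES.m, which draws 10,000 random 8×8 patches from the whitened natural images and rescales them so the sigmoid output layer can reconstruct them: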

```matlab
function patches = sampleIMAGES()
% sampleIMAGES
% Returns 10000 patches for training

patchsize = 8;  % we'll use 8x8 patches
numpatches = 10000;

load IMAGES;    % load the whitened natural images (IMAGES.mat, assumed to be on the path)

% Initialize patches with zeros.  Your code will fill in this matrix--one
% column per patch, 10000 columns.
patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data
%  from IMAGES.
%
%  IMAGES is a 3D array containing 10 images
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
%  it. (The contrast on these images looks a bit off because they have
%  been preprocessed using "whitening."  See the lecture notes for
%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image
%  patch corresponding to the pixels in the block (21,21) to (30,30) of
%  Image 1

X = round(1+rand(numpatches,1)*504);
Y = round(1+rand(numpatches,1)*504);
Z = round(1+rand(numpatches,1)*9);

m = patchsize*patchsize;
for i = 1:numpatches
    patches(:, i) = reshape(IMAGES(X(i):(X(i)+7), Y(i):(Y(i)+7), Z(i)), m, 1);
end

%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure
% the range of pixel values is also bounded between [0,1]
patches = normalizeData(patches);

end

%% ---------------------------------------------------------------
function patches = normalizeData(patches)

% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images).
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end
```
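
Finally, sparseAutoencoderCost.m, which computes the sparse-autoencoder cost J_sparse(W,b) and its gradient in the unrolled vector format expected by minFunc: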

```matlab
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example.

% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize the cost to zero; the gradient matrices are computed below.
cost = 0;

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%  and the corresponding gradients W1grad, W2grad, b1grad and b2grad.
%
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.
%

m = size(data, 2);
Z2 = W1*data+repmat(b1, 1, m);
A2 = sigmoid(Z2);
Z3 = W2*A2+repmat(b2, 1, m);
A3 = sigmoid(Z3);

tmp = A3-data;

sparsityParamTmp = sum(A2, 2)/m;
KLSum = sum(sparsityParam.*log(sparsityParam./sparsityParamTmp)+...
(1-sparsityParam).*log((1-sparsityParam)./(1-sparsityParamTmp)));

cost = sum(sum(tmp.^2))/(2*m);
cost = cost+lambda*sum([W1(:); W2(:)].^2)/2+beta*KLSum;

% Backpropagation: output-layer and hidden-layer error terms
T3 = tmp .* A3 .* (1-A3);
T2 = (W2'*T3 + repmat(beta.*(-sparsityParam./sparsityParamTmp + (1-sparsityParam)./...
    (1-sparsityParamTmp)), 1, m)) .* A2 .* (1-A2);

% Gradients, including the weight-decay term
W1grad = T2*data'/m + lambda*W1;
W2grad = T3*A2'/m + lambda*W2;
b1grad = sum(T2, 2)/m;
b2grad = sum(T3, 2)/m;

%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).

function sigm = sigmoid(x)

sigm = 1 ./ (1 + exp(-x));
end
%----------------------------------

```
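
To tie the pieces together, here is a minimal sketch of how the cost function can be checked against the numerical gradient on a small subset of patches. It assumes the remaining starter files from the exercise (in particular initializeParameters.m) are on the path, and the weight-decay, sparsity and penalty-weight values are only illustrative:

```matlab
visibleSize = 64;        % 8x8 input patches
hiddenSize  = 25;        % number of hidden units
lambda = 1e-4;           % weight decay (illustrative value)
sparsityParam = 0.01;    % desired average activation rho (illustrative value)
beta = 3;                % weight of the sparsity penalty (illustrative value)

patches = sampleIMAGES();
theta = initializeParameters(hiddenSize, visibleSize);   % random (W1, W2, b1, b2), unrolled

% Analytic cost/gradient on a small subset, then the numerical check
data = patches(:, 1:10);
[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                     lambda, sparsityParam, beta, data);
numgrad = computeNumericalGradient(@(t) sparseAutoencoderCost(t, visibleSize, ...
                                     hiddenSize, lambda, sparsityParam, beta, data), theta);

% Relative difference between the two gradients; it should be tiny
% (around 1e-9) if the analytic gradient is implemented correctly.
disp(norm(numgrad - grad) / norm(numgrad + grad));
```

If the check passes, the same cost function can then be handed to an L-BFGS optimizer such as minFunc to train the autoencoder on all 10,000 patches.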