引言
推导过神经网络和BP算法后,对UFLDL中的Sparse Autoencoder(稀疏编码器)进行学习和实现。稀疏编码器(Sparse Autoencoder)是一种比较特殊的神经网络,其支持非监督学习;输出设置的与输入相同,并且加上稀疏限制。训练也采用BP算法,唯一比较特殊的是稀疏性限制部分加入了KL对比散度的能量部分,推导时需对其求导。这里理论部分不再推导,请参考UFLDL里面的内容。另外,练习代码(Matlab实现)也可以在网页上找到,请自行下载,如需本文代码的完整实现请下载。
代码实现
下载代码并解压后,打开文件夹中的文件,发现该代码框架和文件头尾都已经帮你写好了,只需把关键部分填上即可(在“YOUR CODE HERE”下面写代码)。文件夹中的文件解释如下,其中斜体部分中有自己实现的部分:
- checkNumericalGradient.m 检查函数的梯度函数,里面用句柄回调实现了一个简单的函数(可以认为是一个梯度检查的示例)
- computeNumericalGradient.m 梯度检查,使用梯度定义求梯度值用于和BP计算的值进行对比
- display_network.m 显示编码结果
- IMAGES.mat 输入数据,实际上是10张做过预处理的图像组成的数组
- initializeParameters.m 初始化参数
- sampleIMAGES.m 生成随机样本
sparseAutoencoderCost.m 本练习的核心函数,实现对Sparse Autoencoder利用BP算法求梯度
train.m 训练主函数,修改其中的一些参数加快代码调试
样本生成,代码位于sampleIMAGES.m中,在 HXWXC 的3D数组中存放了10张输入图片,取样思路是随机生成1000组二维向量;循环取样,每次取样都以二维向量为中心点,取样patchsize x patchsize x 10的子图,抽成向量并保存。
function patches = sampleIMAGES()
% sampleIMAGES
% Returns 10000 patches for training
load IMAGES; % load images from disk
patchsize = 8; % we'll use 8x8 patches
numpatches = 10000;
% Initialize patches with zeros. Your code will fill in this matrix--one
% column per patch, 10000 columns.
patches = zeros(patchsize*patchsize, numpatches);
%% ---------- YOUR CODE HERE --------------------------------------
% Instructions: Fill in the variable called "patches" using data
% from IMAGES.
%
% IMAGES is a 3D array containing 10 images
% For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
% and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
% it. (The contrast on these images look a bit off because they have
% been preprocessed using using "whitening." See the lecture notes for
% more details.) As a second example, IMAGES(21:30,21:30,1) is an image
% patch corresponding to the pixels in the block (21,21) to (30,30) of
% Image 1
[H,~,C] = size(IMAGES);
R = rand(1000,2); %random numbers,for centers of sub-images
rand_pts = uint16(R*(H-patchsize)+patchsize/2); %bounding limits
for i=1:size(R,1)
indx = rand_pts(i,:); %random select center points
sub_img = IMAGES(indx(1)-patchsize/2+1:indx(1)+patchsize/2,indx(1)-patchsize/2+1:indx(1)+patchsize/2,:);
sub_arr = reshape(sub_img,[],C); %reshape to vectors
patches(:,(i-1)*C+1:i*C) = sub_arr;
end
%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure
% the range of pixel values is also bounded between [0,1]
patches = normalizeData(patches);
end
%% ---------------------------------------------------------------
function patches = normalizeData(patches)
% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer
% Remove DC (mean of images).
patches = bsxfun(@minus, patches, mean(patches));
% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;
% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;
end
梯度检查,位于computeNumericalGradient.m中,这个比较简单按照文档中的说明实现即可。定义epsilon是一个小的常量,改变参数中的每一维 ± epsilon,并通过句柄调用函数,最后根据定义求出这一维梯度的大小。checkNumericalGradient.m是通过一个简单的二次函数来确认你这个函数实现的是否正确。如果两个误差很小 10−9 以下,说明实现正确。
function numgrad = computeNumericalGradient(J, theta)
% numgrad = computeNumericalGradient(J, theta)
% theta: a vector of parameters
% J: a function that outputs a real-number. Calling y = J(theta) will return the
% function value at theta.
% Initialize numgrad with zeros
numgrad = zeros(size(theta));
%% ---------- YOUR CODE HERE --------------------------------------
% Instructions:
% Implement numerical gradient checking, and return the result in numgrad.
% (See Section 2.3 of the lecture notes.)
% You should write code so that numgrad(i) is (the numerical approximation to) the
% partial derivative of J with respect to the i-th input argument, evaluated at theta.
% I.e., numgrad(i) should be the (approximately) the partial derivative of J with
% respect to theta(i).
%
% Hint: You will probably want to compute the elements of numgrad one at a time.
epsilon = 1e-4;
epsilon_inv = 1/epsilon;
for i = 1:size(theta,1)
theta_up = theta;
theta_up(i)= theta_up(i)+epsilon;
theta_dw = theta;
theta_dw(i)= theta_dw(i)-epsilon;
numgrad(i) = .5*(J(theta_up)-J(theta_dw))*epsilon_inv;
end
%% ---------------------------------------------------------------
end
BP算法的实现在sparseAutoencoderCost.m,这部分最为重要,只有完全理解了文档中的BP算法还有Sparse Autoencoder公式才能实现出来。这里最好是按照他的说明一步一步实现梯度计算(参考train.m中的说明),然后每步做完后对代码进行梯度检查,如果误差很小则进行下一步。实现中的难点是为保证效率需要使用向量和矩阵来进行运算,而实现稀疏编码需要统计这一批(N个)样本的稀疏率rho_hat。最简单的办法是写两个循环,也可把所有的中间变量都保存下来,然后在循环结束后再计算残差和梯度中涉及到的稀疏项。本文只保留了必要的变量sum_a2和da2,然后利用矩阵和向量的线性性质在循环结束后对残差和梯度进行修正。
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data)
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
% notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example.
% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));
%% ---------- YOUR CODE HERE --------------------------------------
% Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
% and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.
%
N = size(data,2); %number of samples
N_inv = 1/N;
sum_a2 = zeros(size(b1));
da2 = zeros(size(b1,1),N); %store sum of f'(a2) for all samples, to add sparse terms
for j=1:N
a1=data(:,j); % get input vector,a1 = x
%forward propagation
z2 = W1*a1+b1;
a2 = sigmoid(z2);
sum_a2 = sum_a2+a2; %sum over a2 for calculating rho_hat
z3 = W2*a2+b2;
a3 = sigmoid(z3);
% calculate the cost
cost = cost+(a3-a1)'*(a3-a1)*.5; %cost of one sample
%back propagation
d3 = (a3-a1).*a3.*(1-a3); % the output is same as input so y=a1,f'=f(1-f)|x=a3
b2grad = b2grad+d3;
W2grad = W2grad+d3*a2';
% d3 propagation to d2
da2(:,j) = a2.*(1-a2); %save f'(a2) to matrix to be used after the cycle
d2 = (W2'*d3).*da2(:,j);
b1grad = b1grad+d2;
W1grad = W1grad+d2*a1';
end
rho_hat = sum_a2*N_inv;
KL = kl_diverge(sparsityParam,rho_hat);% KL divergence vector
W_all = [W1(:);W2(:)]; % for calculate sum of all W^2
cost = cost*N_inv +beta*sum(KL)+ lambda*(W_all'*W_all)*.5;
% calculate grads
sparsity_vec = beta*((1-sparsityParam)./(1-rho_hat)-sparsityParam./rho_hat);
d2_hat = sparsity_vec.*sum(da2,2);
b1grad = b1grad + d2_hat;
W1grad = W1grad+repmat(sparsity_vec,1,N).*da2*data';
b1grad = b1grad.*N_inv;
W1grad = W1grad.*N_inv+lambda*W1;
b2grad = b2grad.*N_inv;
W2grad = W2grad.*N_inv+lambda*W2;
%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc). Specifically, we will unroll
% your gradient matrices into a vector.
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end
%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients. This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
%KL divergence calculateion
function kl_vec = kl_diverge(rho,rho_hat)
kl_vec = rho*log(rho./rho_hat)+(1-rho)*log((1-rho)./(1-rho_hat));% KL_divergence
end
最后再啰嗦一句,代码完成后运行前请屏蔽train.m中的第二部分(梯度检查),否则会运行的很慢。本文代码的完整实现请自己下载运行!