Over the past two or three months I have been studying deep learning through Andrew Ng's courses, learning the theory from scratch and doing the programming assignments. I have now reached the convolutional neural networks course, and I want to write up the first week's programming assignment and its key points so I can review them later. (Note: the course assignments are written in Python, but I implemented everything in MATLAB from scratch, which teaches a lot more. This implementation follows the course's Python code.)
Week 1 assignment:
(This task is the same as one of the programming assignments from Course 1 or Course 2; the difference is that this time a convolutional neural network is used.)
Data and original Python code: https://github.com/stormstone/deeplearning.ai
(Note: in the Python code the parameter gradients are not divided by the number of images; dW and db should be divided by m_train.)
This week's task is to recognize hand signs 0-5 from images (as shown above). The pipeline is: data preprocessing ---> parameter initialization ---> forward propagation ---> cost computation ---> backward propagation ---> parameter updates with Adam, with gradient checking used to verify the backward pass.
The hard part of this assignment is the long running time; the theory has to be understood thoroughly, especially gradient descent.
The cost vs. iteration curve after training is shown below:
(x-axis: iteration count itra; y-axis: cost)
train accuracy = 98.98%
test accuracy = 89.17% (probably overfitting, since no regularization was used)
For comparison, the gradient-checking result:
difference = 4.7906e-09
Because the program contains several nested loops it runs very slowly, so it is best to compile the loop-heavy parts to MEX files: without MEX the program can take over ten hours, while with MEX it took three to four hours. Also, before a full training run, always use gradient checking to verify that the backward pass is correct.
1. Data preprocessing:
data_X_train=hdf5read('train_signs.h5','/train_set_x'); % load the training images
data_Y_train=hdf5read('train_signs.h5','/train_set_y'); % load the training labels
data_X_test=hdf5read('test_signs.h5','/test_set_x'); % load the test images
data_Y_test=hdf5read('test_signs.h5','/test_set_y'); % load the test labels
train_X_orig=permute(data_X_train,[3,2,1,4]); % reorient the training images
test_X_orig=permute(data_X_test,[3,2,1,4]); % reorient the test images
m_train=size(train_X_orig,4); % number of training examples
m_test=size(test_X_orig,4); % number of test examples
train_X=double(train_X_orig)/255; % normalize the training images to [0,1]
test_X=double(test_X_orig)/255; % normalize the test images to [0,1]
train_Y=[];
test_Y=[];
Class=6; % six classes: signs 0-5
for i=1:m_train
a=zeros(Class,1);
a(data_Y_train(i,1)+1,1)=1;
train_Y=[train_Y,a];
end % one-hot encode the training labels 0-5, e.g. 4 -> [0;0;0;0;1;0]
for i=1:m_test
a=zeros(Class,1);
a(data_Y_test(i,1)+1,1)=1;
test_Y=[test_Y,a];
end % one-hot encode the test labels in the same way
train_Y=double(train_Y);
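The one-hot loop above can be sketched in NumPy (the language of the original course code); this is an illustrative sketch, not the author's code, and `one_hot` is a hypothetical helper:

```python
import numpy as np

def one_hot(labels, num_classes=6):
    # One column per example, with a 1 in the row given by the label,
    # mirroring the MATLAB loop above (labels 0..5 map to rows 1..6 there).
    m = labels.shape[0]
    Y = np.zeros((num_classes, m))
    Y[labels, np.arange(m)] = 1
    return Y

Y = one_hot(np.array([0, 4, 5]))  # second column is [0;0;0;0;1;0]
```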
2. Parameter initialization:
(1) Weight initialization:
function [W1,b1,W2,b2,W3,b3,n_H_Pool2,n_W_Pool2,num_filter2]=CNN_Initial_Parameters(X,filter1_row,filter1_col,num_filter1,filter2_row,filter2_col,num_filter2,pool1_f,pool1_s,pool2_f,pool2_s)
% Function name, inputs, and outputs.
% Inputs:
% X: the training examples;
% filter1_row,filter1_col,num_filter1: rows, columns, and number of the first-layer kernels;
% filter2_row,filter2_col,num_filter2: rows, columns, and number of the second-layer kernels;
% pool1_f,pool1_s,pool2_f,pool2_s: window size and stride of the first and second pooling layers;
% Outputs:
% n_H_Pool2,n_W_Pool2: output height and width of the second pooling layer;
randn('seed',5); % fix the random seed (legacy syntax; use rng(5) in newer MATLAB)
[n_H0,n_W0,num_input_cha,~]=size(X); % size of the training examples; num_input_cha is the number of channels
W1=0.1*randn(filter1_row,filter1_col,num_input_cha,num_filter1); % first-layer kernels
b1=zeros(1,1,1,num_filter1); % first-layer biases
n_H_Pool1=(n_H0-pool1_f)/pool1_s+1; % output height of the first pooling layer
n_W_Pool1=(n_W0-pool1_f)/pool1_s+1; % output width of the first pooling layer
W2=0.1*randn(filter2_row,filter2_col,num_filter1,num_filter2); % second-layer kernels
b2=zeros(1,1,1,num_filter2); % second-layer biases
n_H_Pool2=(n_H_Pool1-pool2_f)/pool2_s+1; % output height of the second pooling layer
n_W_Pool2=(n_W_Pool1-pool2_f)/pool2_s+1; % output width of the second pooling layer
num_FC1_unites=n_H_Pool2*n_W_Pool2*num_filter2; % number of units in the fully connected layer
W3=sqrt(1/num_FC1_unites)*randn(6,num_FC1_unites); % weights from the fully connected layer to the output
b3=zeros(6,1);
end
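The pooling output sizes above follow the standard valid-pooling formula n_out = (n_in - f)/s + 1. A quick Python check with hypothetical example values (a 64x64 input, an 8x8/stride-8 pool, then a 4x4/stride-4 pool, and 16 second-layer filters; these numbers are illustrative, not taken from the code above):

```python
def pool_out(n_in, f, s):
    # valid pooling output size: (n_in - f) // s + 1
    return (n_in - f) // s + 1

n1 = pool_out(64, 8, 8)   # first pooling layer
n2 = pool_out(n1, 4, 4)   # second pooling layer
fc_units = n2 * n2 * 16   # flattened size, assuming 16 second-layer filters
```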
(2) Momentum initialization:
function [V_dW1,V_db1,V_dW2,V_db2,V_dW3,V_db3]=Momentum_Initialization(X,filter1_row,filter1_col,num_filter1,filter2_row,filter2_col,num_filter2,pool1_f,pool1_s,pool2_f,pool2_s)
% Initialize the Momentum velocity terms to zero.
[n_H0,n_W0,num_input_cha,m]=size(X);
V_dW1=zeros(filter1_row,filter1_col,num_input_cha,num_filter1);
V_db1=zeros(1,1,1,num_filter1);
n_H_Pool1=(n_H0-pool1_f)/pool1_s+1;
n_W_Pool1=(n_W0-pool1_f)/pool1_s+1;
V_dW2=zeros(filter2_row,filter2_col,num_filter1,num_filter2);
V_db2=zeros(1,1,1,num_filter2);
n_H_Pool2=(n_H_Pool1-pool2_f)/pool2_s+1;
n_W_Pool2=(n_W_Pool1-pool2_f)/pool2_s+1;
num_FC1_unites=n_H_Pool2*n_W_Pool2*num_filter2;
V_dW3=zeros(6,num_FC1_unites);
V_db3=zeros(6,1);
end
(3) RMSprop initialization:
function [S_dW1,S_db1,S_dW2,S_db2,S_dW3,S_db3]=CNN_RMSprop_Initialization(X,filter1_row,filter1_col,num_filter1,filter2_row,filter2_col,num_filter2,pool1_f,pool1_s,pool2_f,pool2_s)
% Initialize the RMSprop accumulators to zero.
[n_H0,n_W0,num_input_cha,m]=size(X);
S_dW1=zeros(filter1_row,filter1_col,num_input_cha,num_filter1);
S_db1=zeros(1,1,1,num_filter1);
n_H_Pool1=(n_H0-pool1_f)/pool1_s+1;
n_W_Pool1=(n_W0-pool1_f)/pool1_s+1;
S_dW2=zeros(filter2_row,filter2_col,num_filter1,num_filter2);
S_db2=zeros(1,1,1,num_filter2);
n_H_Pool2=(n_H_Pool1-pool2_f)/pool2_s+1;
n_W_Pool2=(n_W_Pool1-pool2_f)/pool2_s+1;
num_FC1_unites=n_H_Pool2*n_W_Pool2*num_filter2;
S_dW3=zeros(6,num_FC1_unites);
S_db3=zeros(6,1);
end
3. Forward propagation:
function [Z1,A1,P1,Z2,A2,P2,Z3,A3,Z4,A4]=CNN_foward(X,W1,W2,W3,b1,b2,b3,pool1_f,pool1_s,pool2_f,pool2_s)
% Inputs: described in the initialization section.
% Outputs:
% Z1: output of the first convolution; A1: its activation; P1: output after pooling;
% Z2: output of the second convolution; A2: its activation; P2: output after pooling;
% Z3: the last pooling output flattened for the fully connected layer; A3=Z3;
% Z4: linear output of the final layer; A4: its activation (the network output).
m_train=size(X,4); % number of examples in this batch
Z1=conv_forward(X,W1,b1,1); % first convolution layer
A1=active_function(3,Z1); % first activation (ReLU)
P1=Pool_mex(A1,pool1_f,pool1_s,'max'); % first pooling layer
Z2=conv_forward(P1,W2,b2,1); % second convolution layer
A2=active_function(3,Z2); % second activation (ReLU)
P2=Pool_mex(A2,pool2_f,pool2_s,'max'); % second pooling layer
Z3=reshape(P2,[],m_train); % flatten into the fully connected layer
A3=Z3;
Z4=W3*A3+repmat(b3,[1,m_train]);
A4=active_function(4,Z4); % output layer: softmax classification
end
(1) Forward convolution:
function Z=conv_forward(A_prev,W,b,s)
% Convolution (strictly, cross-correlation): function name, inputs, and outputs.
% Inputs: A_prev: input images; W: kernels; b: biases; s: stride;
% Outputs: Z: output images, the same height and width as the input (same padding).
[n_H_prev,n_W_prev,n_C_prev,m]=size(A_prev); % size of the input
[~,f,~,n_C]=size(W); % size of the kernels
n_H=ceil(n_H_prev/s);
n_W=ceil(n_W_prev/s); % output size
Z=zeros(n_H,n_W,n_C,m);
pad=ceil((f-1)/2); % padding needed for a same-size output
A_prev_pad=padarray(A_prev,[pad,pad]); % zero-pad the input
for i=1:m
a_prev_pad=A_prev_pad(:,:,:,i); % take one image at a time
for h=1:n_H
for w=1:n_W
vert_start=(h-1)*s+1;
vert_end=vert_start+f-1;
horiz_start=(w-1)*s+1;
horiz_end=horiz_start+f-1;
a_slice_prev=a_prev_pad(vert_start:vert_end,horiz_start:horiz_end,:);
for c=1:n_C
Z(h,w,c,i)=conv_single_step(a_slice_prev,W(:,:,:,c),b(1,1,1,c)); % one convolution step
end
end
end
end
end
(2) The conv_single_step function:
function Z=conv_single_step(a_slice_prev,W,b)
% Element-wise product of one input slice and a kernel, summed, plus the bias.
s=a_slice_prev.*W;
Z=sum(sum(sum(s)));
Z=Z+b;
end
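conv_single_step is just an element-wise multiply, a sum, and a bias; the same step in NumPy (an illustrative sketch, not the author's code):

```python
import numpy as np

def conv_single_step(a_slice_prev, W, b):
    # element-wise product of the input slice and the kernel, summed, plus bias
    return np.sum(a_slice_prev * W) + float(b)

a = np.ones((3, 3, 2))           # a 3x3 slice with 2 channels
W = np.full((3, 3, 2), 0.5)      # a matching kernel
z = conv_single_step(a, W, 1.0)  # 18 * 0.5 + 1 = 10.0
```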
(3) Pooling:
function A=Pool(A_prev,f,s,mode)
% Pooling layer: function name, inputs, and outputs.
% Inputs: A_prev: input images; f,s: pooling window size and stride; mode: 'max' or 'mean'.
% Outputs: A: output images.
[n_h_prev,n_w_prev,n_c_prev,m]=size(A_prev); % size of the input
n_h=floor((n_h_prev-f)/s)+1;
n_w=floor((n_w_prev-f)/s)+1;
n_c=n_c_prev; % output size
A_nopad=zeros(n_h,n_w,n_c,m);
for i=1:m
for h=1:n_h
for w=1:n_w
vert_start=(h-1)*s+1;
vert_end=vert_start+f-1;
horiz_start=(w-1)*s+1;
horiz_end=horiz_start+f-1;
for c=1:n_c
a_prev_slice=A_prev(vert_start:vert_end,horiz_start:horiz_end,c,i);
switch mode
case 'max'
A_nopad(h,w,c,i)=max(max(a_prev_slice)); % max pooling
case 'mean'
A_nopad(h,w,c,i)=mean(a_prev_slice(:)); % average pooling
end
end
end
end
end
A=A_nopad;
end
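The 'max' branch of the pooling loop, reduced to a single channel in NumPy (an illustrative sketch; `max_pool_2d` is a hypothetical helper, not part of the MATLAB code):

```python
import numpy as np

def max_pool_2d(A, f, s):
    # valid max pooling over one channel, mirroring the MATLAB loops
    n_h = (A.shape[0] - f) // s + 1
    n_w = (A.shape[1] - f) // s + 1
    out = np.zeros((n_h, n_w))
    for h in range(n_h):
        for w in range(n_w):
            out[h, w] = A[h*s:h*s+f, w*s:w*s+f].max()
    return out

P = max_pool_2d(np.arange(16.0).reshape(4, 4), 2, 2)  # [[5, 7], [13, 15]]
```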
4. Cost computation:
cost_function=sum(sum(-(log(A4).*train_Y_mini_batch)))/m_train;
% A4: the softmax output for the current mini-batch.
% train_Y_mini_batch: the one-hot labels of the mini-batch; m_train is the number of examples in the batch.
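The cost is the softmax cross-entropy averaged over the batch; the same computation in NumPy (a sketch under the same one-hot, column-per-example convention):

```python
import numpy as np

def softmax_cost(A4, Y):
    # -sum(Y .* log(A4)) / m, averaged over the m examples (columns)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A4)) / m

Y  = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot labels, one column per example
A4 = np.array([[0.8, 0.3], [0.2, 0.7]])  # softmax outputs
cost = softmax_cost(A4, Y)               # -(ln 0.8 + ln 0.7) / 2
```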
5. Backward propagation:
function [dW1,db1,dW2,db2,dW3,db3,dZ1,dA1,dP1,dZ2,dA2,dP2,dZ3,dZ4]=CNN_backward(A4,A3,Y,W3,Z2,P1,W2,Z1,X,A2,A1,W1,pool1_f,pool1_s,pool2_f,pool2_s,n_H_Pool2,n_W_Pool2,num_filter2)
% Inputs: described in the code above.
% Outputs:
% dW1,db1,dW2,db2,dW3,db3: the parameter gradients.
m_train=size(Y,2); % number of training examples
dZ4=A4-Y;
dW3=dZ4*A3'/m_train;
db3=sum(dZ4,2)/m_train;
dZ3=W3'*dZ4; % same as gradient descent in an ordinary neural network
dP2=reshape(dZ3,n_H_Pool2,n_W_Pool2,num_filter2,m_train); % reshape the fully connected gradient into the second pooling layer's output gradient
dA2=Pool_backward_mex(dP2,'max',A2,pool2_s,pool2_f); % input gradient of the second pooling layer
dZ2=dA2.*active_derivation(3,Z2); % output gradient of the second convolution layer
[dW2,db2,dP1]=conv_backward_mex(P1,W2,dZ2,1); % second-layer parameter gradients and input gradient
dA1=Pool_backward_mex(dP1,'max',A1,pool1_s,pool1_f); % input gradient of the first pooling layer
dZ1=dA1.*active_derivation(3,Z1); % output gradient of the first convolution layer
[dW1,db1,~]=conv_backward_mex(X,W1,dZ1,1); % first-layer parameter gradients; ~ would be dX, which is not needed
end
(1) Pooling-layer backward pass:
function dA_prev=Pool_backward(dA,mode,A_prev,s,f)
% Pooling-layer backward pass: function name, inputs, and outputs.
% Inputs: dA: gradient of the pooling layer's output; mode: 'max' or 'mean';
% A_prev: the pooling layer's input images; s,f: pooling stride and window size.
[n_H_prev,n_W_prev,n_C_prev,m_train]=size(A_prev); % size of the pooling layer's input
[n_H,n_W,n_C,~]=size(dA); % size of the pooling layer's output
dA_prev=zeros(n_H_prev,n_W_prev,n_C_prev,m_train);
for i=1:m_train
a_prev=A_prev(:,:,:,i);
for h=1:n_H
for w=1:n_W
vert_start=(h-1)*s+1;
vert_end=vert_start+f-1;
horiz_start=(w-1)*s+1;
horiz_end=horiz_start+f-1;
for c=1:n_C
switch mode
case 'max'
a_slice_prev=a_prev(vert_start:vert_end,horiz_start:horiz_end,c);
mask=create_mask_from_window(a_slice_prev); % mark the position of the maximum
dA_prev(vert_start:vert_end,horiz_start:horiz_end,c,i)=dA_prev(vert_start:vert_end,horiz_start:horiz_end,c,i)+mask*dA(h,w,c,i);
case 'mean'
da=dA(h,w,c,i);
dA_prev(vert_start:vert_end,horiz_start:horiz_end,c,i)=dA_prev(vert_start:vert_end,horiz_start:horiz_end,c,i)+distribute_value(da,f,f);
end
end
end
end
end
end
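distribute_value, called in the 'mean' branch above, is not listed in this post; its job (as in the course assignment) is to spread a scalar gradient evenly over the pooling window. A NumPy sketch of what it presumably does:

```python
import numpy as np

def distribute_value(dz, n_h, n_w):
    # spread a scalar gradient evenly over an n_h x n_w window
    return np.full((n_h, n_w), dz / (n_h * n_w))

d = distribute_value(8.0, 2, 2)  # every entry is 8 / 4 = 2.0
```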
(2) create_mask_from_window (marks the position of the maximum):
function mask=create_mask_from_window(x)
mask=zeros(size(x));
mask(x==max(max(x)))=1;
end
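The same mask in NumPy; note that ties all receive a 1, exactly as in the MATLAB version (a sketch, not the author's code):

```python
import numpy as np

def create_mask_from_window(x):
    # 1 at the position(s) of the maximum, 0 elsewhere
    return (x == x.max()).astype(float)

mask = create_mask_from_window(np.array([[1.0, 3.0], [2.0, 3.0]]))
# both 3.0 entries are marked
```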
(3) Convolution-layer backward pass:
function [dW,db,dA_prev]=conv_backward(A_prev,W,dZ,s)
% Convolution-layer backward pass.
% Inputs: A_prev: the layer's input images; W: kernels; s: stride; dZ: gradient of the layer's output.
% Outputs: dW,db: parameter gradients; dA_prev: gradient of the layer's input.
[n_H_prev,n_W_prev,n_C_prev,m_train]=size(A_prev); % size of the layer's input
[f,~,~,n_C]=size(W); % kernel size (assumed square)
[n_H,n_W,~,~]=size(dZ); % size of the layer's output
pad=ceil(((n_H-1)*s+f-n_H_prev)/2); % padding used in the forward pass
dA_prev_pad=zeros(n_H_prev+2*pad,n_W_prev+2*pad,n_C_prev,m_train);
dA_prev=zeros(n_H_prev,n_W_prev,n_C_prev,m_train);
A_prev_pad=padarray(A_prev,[pad,pad]); % zero-pad the input
dW=zeros(f,f,n_C_prev,n_C);
db=zeros(1,1,1,n_C);
for i=1:m_train
a_prev_pad=A_prev_pad(:,:,:,i);
da_prev_pad=dA_prev_pad(:,:,:,i);
for h=1:n_H
for w=1:n_W
vert_start=(h-1)*s+1;
vert_end=vert_start+f-1;
horiz_start=(w-1)*s+1;
horiz_end=horiz_start+f-1;
a_slice= a_prev_pad(vert_start:vert_end,horiz_start:horiz_end,:);
for c=1:n_C
da_prev_pad(vert_start:vert_end,horiz_start:horiz_end,:)=da_prev_pad(vert_start:vert_end,horiz_start:horiz_end,:)+W(:,:,:,c)*dZ(h,w,c,i);
dW(:,:,:,c)=dW(:,:,:,c)+a_slice*dZ(h,w,c,i);
db(:,:,:,c)=db(:,:,:,c)+dZ(h,w,c,i);
end
end
end
dA_prev(:,:,:,i)=da_prev_pad(pad+1:end-pad,pad+1:end-pad,:);
end
dW=dW/m_train;
db=db/m_train;
end
6. Updating parameters with Adam:
function [V_dW1,V_db1,S_dW1,S_db1,W1,b1,V_dW2,V_db2,S_dW2,S_db2,W2,b2,V_dW3,V_db3,S_dW3,S_db3,W3,b3]=CNN_Adm_Optimization(V_dW1,V_db1,S_dW1,S_db1,W1,b1,dW1,db1,V_dW2,V_db2,S_dW2,S_db2,W2,b2,dW2,db2,V_dW3,V_db3,S_dW3,S_db3,W3,b3,dW3,db3,learning_rate,beta1,beta2,i,sigma)
% Adam optimization: function name, inputs, and outputs.
% The update itself is simple once the formulas are understood; i is the 1-based
% iteration count used for bias correction, and sigma is the small stabilizing constant.
V_dW1=beta1*V_dW1+(1-beta1)*dW1;
V_db1=beta1*V_db1+(1-beta1)*db1;
S_dW1=beta2*S_dW1+(1-beta2)*(dW1.^2);
S_db1=beta2*S_db1+(1-beta2)*(db1.^2);
V_dW1_corrected=V_dW1/(1-beta1^i);
V_db1_corrected=V_db1/(1-beta1^i);
S_dW1_corrected=S_dW1/(1-beta2^i);
S_db1_corrected=S_db1/(1-beta2^i);
W1=W1-learning_rate*(V_dW1_corrected./(sqrt(S_dW1_corrected)+sigma));
b1=b1-learning_rate*(V_db1_corrected./(sqrt(S_db1_corrected)+sigma));
V_dW2=beta1*V_dW2+(1-beta1)*dW2;
V_db2=beta1*V_db2+(1-beta1)*db2;
S_dW2=beta2*S_dW2+(1-beta2)*(dW2.^2);
S_db2=beta2*S_db2+(1-beta2)*(db2.^2);
V_dW2_corrected=V_dW2/(1-beta1^i);
V_db2_corrected=V_db2/(1-beta1^i);
S_dW2_corrected=S_dW2/(1-beta2^i);
S_db2_corrected=S_db2/(1-beta2^i);
W2=W2-learning_rate*(V_dW2_corrected./(sqrt(S_dW2_corrected)+sigma));
b2=b2-learning_rate*(V_db2_corrected./(sqrt(S_db2_corrected)+sigma));
V_dW3=beta1*V_dW3+(1-beta1)*dW3;
V_db3=beta1*V_db3+(1-beta1)*db3;
S_dW3=beta2*S_dW3+(1-beta2)*(dW3.^2);
S_db3=beta2*S_db3+(1-beta2)*(db3.^2);
V_dW3_corrected=V_dW3/(1-beta1^i);
V_db3_corrected=V_db3/(1-beta1^i);
S_dW3_corrected=S_dW3/(1-beta2^i);
S_db3_corrected=S_db3/(1-beta2^i);
W3=W3-learning_rate*(V_dW3_corrected./(sqrt(S_dW3_corrected)+sigma));
b3=b3-learning_rate*(V_db3_corrected./(sqrt(S_db3_corrected)+sigma));
end
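Since the three parameter groups repeat the same five lines, the update is easier to see for a single parameter; a NumPy sketch (adam_step is a hypothetical helper; i in the MATLAB code corresponds to t here):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # one Adam update: momentum term v, RMSprop term s, bias correction by t
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw**2
    v_hat = v / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = adam_step(np.array([1.0]), np.array([0.5]),
                    np.array([0.0]), np.array([0.0]), t=1)
# the first step moves w by almost exactly lr, regardless of gradient scale
```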
7. Gradient checking:
X=randn(64,64,3,3);
m_train=size(X,4);
Y(:,1)=[0;0;0;0;0;1];
Y(:,2)=[1;0;0;0;0;0];
Y(:,3)=[0;1;0;0;0;0];
[W1,b1,W2,b2,W3,b3,n_H_Pool2,n_W_Pool2,num_filter2]=CNN_Initial_Parameters(X,filter1_row,filter1_col,num_filter1,filter2_row,filter2_col,num_filter2,pool1_f,pool1_s,pool2_f,pool2_s); % initialize the parameters
[Z1,A1,P1,Z2,A2,P2,Z3,A3,Z4,A4]=CNN_foward(X,W1,W2,W3,b1,b2,b3,pool1_f,pool1_s,pool2_f,pool2_s); % forward propagation
[dW1,db1,dW2,db2,dW3,db3,dZ1,dA1,dP1,dZ2,dA2,dP2,dZ3,dZ4]=CNN_backward(A4,A3,Y,W3,Z2,P1,W2,Z1,X,A2,A1,W1,pool1_f,pool1_s,pool2_f,pool2_s,n_H_Pool2,n_W_Pool2,num_filter2); % backward propagation
grad=reshape(dW1,1,[]);
grad=[grad,reshape(db1,1,[])];
grad=[grad,reshape(dW2,1,[])];
grad=[grad,reshape(db2,1,[])];
grad=[grad,reshape(dW3,1,[])];
grad=[grad,reshape(db3,1,[])]; % flatten all parameter gradients into one vector
A=reshape(W1,1,[]);
A=[A,reshape(b1,1,[])];
A=[A,reshape(W2,1,[])];
A=[A,reshape(b2,1,[])];
A=[A,reshape(W3,1,[])];
A=[A,reshape(b3,1,[])]; % flatten all parameters into one vector
num_parameters=size(A,2); % total number of parameters
cost_function_plus=zeros(1,num_parameters);
cost_function_minus=zeros(1,num_parameters);
gradapprox=zeros(1,num_parameters);
epsilon=1e-7;
for i=1:num_parameters
thetaplus=A;
thetaplus(1,i)=thetaplus(1,i)+epsilon;
W1_plus=reshape(thetaplus(1,1:length(W1(:))),size(W1));
b1_plus=reshape(thetaplus(1,length(W1(:))+1:length(W1(:))+length(b1(:))),size(b1));
W2_plus=reshape(thetaplus(1,length(W1(:))+length(b1(:))+1:length(W1(:))+length(b1(:))+length(W2(:))),size(W2));
b2_plus=reshape(thetaplus(1,length(W1(:))+length(b1(:))+length(W2(:))+1:length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))),size(b2));
W3_plus=reshape(thetaplus(1,length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+1:length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+length(W3(:))),size(W3));
b3_plus=reshape(thetaplus(1,length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+length(W3(:))+1:length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+length(W3(:))+length(b3(:))),size(b3)); % the lines above reshape the perturbed vector back into the original parameter shapes
[Z1_plus,A1_plus,P1_plus,Z2_plus,A2_plus,P2_plus,Z3_plus,A3_plus,Z4_plus,A4_plus]=CNN_foward(X,W1_plus,W2_plus,W3_plus,b1_plus,b2_plus,b3_plus,pool1_f,pool1_s,pool2_f,pool2_s);
cost_function_plus(1,i)=sum(sum(-(log(A4_plus).*Y)))/m_train; % compute J(theta+epsilon)
thetaminus=A;
thetaminus(1,i)=thetaminus(1,i)-epsilon;
W1_minus=reshape(thetaminus(1,1:length(W1(:))),size(W1));
b1_minus=reshape(thetaminus(1,length(W1(:))+1:length(W1(:))+length(b1(:))),size(b1));
W2_minus=reshape(thetaminus(1,length(W1(:))+length(b1(:))+1:length(W1(:))+length(b1(:))+length(W2(:))),size(W2));
b2_minus=reshape(thetaminus(1,length(W1(:))+length(b1(:))+length(W2(:))+1:length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))),size(b2));
W3_minus=reshape(thetaminus(1,length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+1:length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+length(W3(:))),size(W3));
b3_minus=reshape(thetaminus(1,length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+length(W3(:))+1:length(W1(:))+length(b1(:))+length(W2(:))+length(b2(:))+length(W3(:))+length(b3(:))),size(b3));
[Z1_minus,A1_minus,P1_minus,Z2_minus,A2_minus,P2_minus,Z3_minus,A3_minus,Z4_minus,A4_minus]=CNN_foward(X,W1_minus,W2_minus,W3_minus,b1_minus,b2_minus,b3_minus,pool1_f,pool1_s,pool2_f,pool2_s);
cost_function_minus(1,i)=sum(sum(-(log(A4_minus).*Y)))/m_train; % compute J(theta-epsilon)
gradapprox(1,i)=(cost_function_plus(1,i)-cost_function_minus(1,i))/(2*epsilon);
end
numerator=norm(gradapprox-grad);
denominator=norm(grad)+norm(gradapprox);
difference=numerator/denominator % relative difference; values around 1e-7 or smaller indicate a correct backward pass
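The final check computes the relative difference between the analytic and numerical gradients; in NumPy (a sketch of the same formula; a small value like the 4.7906e-09 reported above indicates a correct backward pass):

```python
import numpy as np

def grad_check_difference(grad, gradapprox):
    # ||gradapprox - grad|| / (||grad|| + ||gradapprox||)
    num = np.linalg.norm(gradapprox - grad)
    den = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return num / den

d = grad_check_difference(np.array([1.0, 2.0]),
                          np.array([1.0, 2.0 + 1e-9]))  # nearly identical gradients
```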