Deep Learning in Practice (2): Multi-Layer Neural Networks

# 1. Preparation
To gain a deeper understanding of neural networks, the implementation here is written essentially from scratch in C++; matrix operations are delegated to OpenCV, and the data come from the public a1a dataset.
Experimental environment:

This article follows directly from the previous post, Deep Learning in Practice (1): Logistic Regression.
# 2. Neural Network Basics
A standard neural network is shown in the figure below; it is essentially an enhanced version of the logistic regression from the previous article (with a few hidden layers added), and the basic idea is unchanged. For a more detailed introduction to the theory, Andrew Ng's deep learning course series is again recommended.
(Figure: a standard three-layer fully connected network)

Using the three-layer network above together with the a1a dataset, the general steps for building it are:

  1. Initialize the parameters w1, w2, w3 and b1, b2, b3. The a1a dataset has 123 features, so the input_layer in the figure has shape (123, m), where m is the number of samples (1605 for the training set). The three layers of the network have 64, 16 and 1 neurons respectively, so, to be consistent with Z = WA below, the parameter matrices are initialized as w1 (64, 123), w2 (16, 64), w3 (1, 16), together with the scalar biases b1, b2, b3.
  2. Multiply W by X (matrix multiplication; X is the output of the previous layer, initially the sample input itself) and add the bias b (a scalar) to obtain Z.
  3. Apply an activation to Z: ReLU in the hidden layers (it helps against vanishing gradients and works well in practice) and sigmoid in the output layer to bound the output. Their curves are shown below: (Figure: ReLU and sigmoid activation functions)
  4. After the forward propagation above is complete, define the loss function; the cross-entropy loss is used here.
  5. Back-propagate and update the parameters.

Basic forward-propagation formulas:
Here the bracketed superscript $[l]$ denotes the layer and the superscript $(i)$ denotes the $i$-th sample (corresponding to the $i$-th row of the a1a dataset); for example, $A^{[0]}$ denotes the layer-0 input, i.e., the sample input.

$$Z^{[1]} = W^{[1]}A^{[0]} + b^{[1]}\tag{1}$$
$$A^{[1]} = Relu(Z^{[1]})\tag{2}$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}\tag{3}$$
$$A^{[2]} = Relu(Z^{[2]})\tag{4}$$
$$Z^{[3]} = W^{[3]}A^{[2]} + b^{[3]}\tag{5}$$
$$A^{[3]} = Sigmoid(Z^{[3]})\tag{6}$$
$$\mathcal{L}(A^{[3]}, \hat Y) = -\hat Y \log(A^{[3]}) - (1-\hat Y)\log(1-A^{[3]})\tag{7}$$

The cost is then computed by averaging the loss over all training examples:
$$J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(A^{[3](i)}, Y^{(i)})\tag{8}$$
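
To make equations (7) and (8) concrete, here is a minimal sketch of how the cost could be computed with OpenCV. The helper name compute_cost and the assumption that a3 and y are 1 x m row matrices of type CV_64FC1 are mine, not the author's; for simplicity no epsilon clipping is applied, so log(0) is not guarded against.

```cpp
//Hypothetical helper (not among the author's listed functions): cross-entropy cost (8)
//from the output activations a3 (1 x m) and the labels y (1 x m).
double compute_cost(const cv::Mat &a3, const cv::Mat &y, int m) {
	cv::Mat log_a3, log_one_minus_a3;
	cv::Mat one_minus_a3 = 1.0 - a3;
	cv::log(a3, log_a3);                     //log(A[3]), element-wise
	cv::log(one_minus_a3, log_one_minus_a3); //log(1 - A[3]), element-wise
	cv::Mat losses = -y.mul(log_a3) - (1.0 - y).mul(log_one_minus_a3); //equation (7)
	return cv::sum(losses)[0] / m;           //equation (8): average over the m examples
}
```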

Basic back-propagation formulas:
$$dA^{[3]} = \frac{\partial \mathcal{L}}{\partial A^{[3]}} = \frac{1-\hat Y}{1-A^{[3]}} - \frac{\hat Y}{A^{[3]}}\tag{1}$$
$$dZ^{[3]} = \frac{\partial \mathcal{L}}{\partial A^{[3]}} * \frac{\partial A^{[3]}}{\partial Z^{[3]}} = dA^{[3]} * A^{[3]} * (1-A^{[3]})\tag{2}$$
$$dW^{[3]} = \frac{\partial \mathcal{L}}{\partial Z^{[3]}} * \frac{\partial Z^{[3]}}{\partial W^{[3]}} = \frac{1}{m}\, dZ^{[3]} A^{[2]T}\tag{3}$$
$$db^{[3]} = \frac{\partial \mathcal{L}}{\partial Z^{[3]}} * \frac{\partial Z^{[3]}}{\partial b^{[3]}} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[3](i)}\tag{4}$$
$$dA^{[2]} = \frac{\partial \mathcal{L}}{\partial Z^{[3]}} * \frac{\partial Z^{[3]}}{\partial A^{[2]}} = W^{[3]T} dZ^{[3]}\tag{5}$$
$$dZ^{[2]} = \frac{\partial \mathcal{L}}{\partial A^{[2]}} * \frac{\partial A^{[2]}}{\partial Z^{[2]}} = dA^{[2]} * (A^{[2]} > 0)\tag{6}$$
$$dW^{[2]} = \frac{\partial \mathcal{L}}{\partial Z^{[2]}} * \frac{\partial Z^{[2]}}{\partial W^{[2]}} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}\tag{7}$$
$$db^{[2]} = \frac{\partial \mathcal{L}}{\partial Z^{[2]}} * \frac{\partial Z^{[2]}}{\partial b^{[2]}} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)}\tag{8}$$
$$dA^{[1]} = \frac{\partial \mathcal{L}}{\partial Z^{[2]}} * \frac{\partial Z^{[2]}}{\partial A^{[1]}} = W^{[2]T} dZ^{[2]}\tag{9}$$
$$dZ^{[1]} = \frac{\partial \mathcal{L}}{\partial A^{[1]}} * \frac{\partial A^{[1]}}{\partial Z^{[1]}} = dA^{[1]} * (A^{[1]} > 0)\tag{10}$$
$$dW^{[1]} = \frac{\partial \mathcal{L}}{\partial Z^{[1]}} * \frac{\partial Z^{[1]}}{\partial W^{[1]}} = \frac{1}{m}\, dZ^{[1]} A^{[0]T}\tag{11}$$
$$db^{[1]} = \frac{\partial \mathcal{L}}{\partial Z^{[1]}} * \frac{\partial Z^{[1]}}{\partial b^{[1]}} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)}\tag{12}$$

# 3. Implementation
The dataset description, its preprocessing, and some shared helper functions were covered in the previous article of this series and are not repeated here (only the function declarations are given).

Creating the matrices from the data file:

void creatMat(Mat &x, Mat &y, String fileName);
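
creatMat was implemented in the previous article. For readers who do not have it at hand, the following is a minimal sketch of the idea under a few assumptions of mine: the a1a file is in the LIBSVM sparse format (label index:value ...), the ±1 labels are mapped to 0/1, and the outputs are x of shape (123, m) and y of shape (1, m). The version from the previous article may differ in its details.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

//Sketch only: parse a LIBSVM-formatted file such as a1a into an OpenCV
//feature matrix x (123 x m) and label row vector y (1 x m).
void creatMat(Mat &x, Mat &y, String fileName) {
	std::vector<std::vector<std::pair<int, double> > > samples; //(feature index, value) pairs per sample
	std::vector<double> labels;
	std::ifstream in(fileName.c_str());
	std::string line;
	while (std::getline(in, line)) {
		std::istringstream ss(line);
		double label;
		ss >> label;
		labels.push_back(label > 0 ? 1.0 : 0.0); //map {-1,+1} to {0,1}
		samples.push_back(std::vector<std::pair<int, double> >());
		int idx; char colon; double val;
		while (ss >> idx >> colon >> val)
			samples.back().push_back(std::make_pair(idx, val));
	}
	int m = (int)labels.size();
	x = Mat::zeros(123, m, CV_64FC1); //one column per sample
	y = Mat::zeros(1, m, CV_64FC1);
	for (int j = 0; j < m; j++) {
		y.at<double>(0, j) = labels[j];
		for (size_t k = 0; k < samples[j].size(); k++)
			x.at<double>(samples[j][k].first - 1, j) = samples[j][k].second; //indices are 1-based
	}
}
```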

Initializing the parameters (Xavier initialization is used here):

void initial_parermaters(Mat &w, double &b, int n1, int n0) {
	//n1: number of neurons in this layer, n0: number of inputs to the layer
	w = Mat::zeros(n1, n0, CV_64FC1);
	b = 0.0;

	//double temp = 2 / (sqrt(n1));
	double temp = sqrt(6.0 / (double)(n1 + n0));
	RNG rng;
	for (int i = 0; i < w.rows; i++) {
		for (int j = 0; j < w.cols; j++) {
			w.at<double>(i, j) = rng.uniform(-temp, temp);//Xavier initialization: uniform in [-sqrt(6/(n0+n1)), +sqrt(6/(n0+n1))]
		}
	}
}
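
As a usage example (my own sketch, not the author's driver code), the parameters of the three layers described in section 2 could be created like this:

```cpp
//Shapes follow Z = W * A_prev: W1 (64 x 123), W2 (16 x 64), W3 (1 x 16).
Mat w1, w2, w3;
double b1, b2, b3;
initial_parermaters(w1, b1, 64, 123);
initial_parermaters(w2, b2, 16, 64);
initial_parermaters(w3, b3, 1, 16);
```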

Implementation of the ReLU function:

void relu(const Mat &original, Mat &response) {
	response = original.clone();//clone so the output has the same size and type as the input
	for (int i = 0; i < original.rows; i++) {
		for (int j = 0; j < original.cols; j++) {
			if (original.at<double>(i, j) < 0) {
				response.at<double>(i, j) = 0.0;
			}
		}
	}
}
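
The sigmoid function belongs to the shared helpers from the previous article; for completeness, here is a minimal sketch of what it computes, assuming the same signature as relu above (the original implementation may differ in details):

```cpp
//Element-wise sigmoid: response = 1 / (1 + exp(-original)).
void sigmoid(const Mat &original, Mat &response) {
	cv::Mat neg = -original;
	cv::Mat exp_neg;
	cv::exp(neg, exp_neg);         //exp(-Z), element-wise
	cv::Mat denom = 1.0 + exp_neg;
	response = 1.0 / denom;        //per-element division
}
```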

Forward propagation:

void linear_activation_forward(Mat &a_prev, Mat &a, Mat &w, double &b, string activation) {
	cv::Mat z = (w*a_prev) + b;	//Z = W * A_prev + b
	if (activation == "sigmoid") {
		sigmoid(z, a);
	}
	else if (activation == "relu") {
		relu(z, a);
	}
}
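
Chaining the three layers of the forward pass, equations (1)-(6), could then look like the following sketch (x, w1...w3 and b1...b3 are assumed to have been set up as above):

```cpp
Mat a1, a2, a3;
linear_activation_forward(x,  a1, w1, b1, "relu");    //Z1 = W1*X  + b1, A1 = Relu(Z1)
linear_activation_forward(a1, a2, w2, b2, "relu");    //Z2 = W2*A1 + b2, A2 = Relu(Z2)
linear_activation_forward(a2, a3, w3, b3, "sigmoid"); //Z3 = W3*A2 + b3, A3 = Sigmoid(Z3)
```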

Backward propagation:

void activation_backward(const Mat &a, const Mat &da, Mat &dz, string activation) {
	if (activation == "sigmoid") {

		dz = da.mul(a.mul(1 - a));
	}
	else if (activation == "relu") {
		dz = da.clone();//clone so dz has the same dimensions as da
		for (int i = 0; i < a.rows; i++) {
			for (int j = 0; j < a.cols; j++) {
				if (a.at<double>(i, j) <= 0) {
					dz.at<double>(i, j) = 0.0;
				}
			}
		}
	}

}

void linear_backward(const Mat &da, const Mat &a, const Mat &a_prev, Mat &w, double &b, Mat &dw, double &db, Mat &da_prev, const int m, const double learning_rate, string activation) {
	cv::Mat dz;
	activation_backward(a, da, dz, activation);//backward step through the activation function

	dw = (1.0 / m)*dz*a_prev.t();	//dW = (1/m) dZ * A_prev^T
	db = (1.0 / m)*sum(dz)[0];	//db = (1/m) * sum(dZ)
	da_prev = w.t()*dz;	//dA_prev = W^T * dZ
	w = w - (learning_rate * dw);	//gradient-descent update, applied inside the backward pass
	b = b - (learning_rate * db);
}
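
Putting the pieces together, one gradient-descent iteration could look like the sketch below. This is my own outline, not the author's actual driver: x and y come from creatMat, the parameters from initial_parermaters, and compute_cost is the hypothetical helper sketched after equation (8); the learning rate shown is only an example value.

```cpp
int m = x.cols;              //number of training samples
double learning_rate = 0.1;  //example value; the article compares several

for (int iter = 0; iter < 8000; iter++) {
	//forward pass, equations (1)-(6)
	Mat a1, a2, a3;
	linear_activation_forward(x, a1, w1, b1, "relu");
	linear_activation_forward(a1, a2, w2, b2, "relu");
	linear_activation_forward(a2, a3, w3, b3, "sigmoid");

	double cost = compute_cost(a3, y, m);
	if (iter % 100 == 0)
		cout << "iteration " << iter << ", cost = " << cost << endl;

	//dA[3], equation (1) of the backward pass
	Mat da3 = (1.0 - y) / (1.0 - a3) - y / a3;

	//backward pass; each call also applies the gradient-descent update to w and b
	Mat dw3, dw2, dw1, da2, da1, da0;
	double db3, db2, db1;
	linear_backward(da3, a3, a2, w3, b3, dw3, db3, da2, m, learning_rate, "sigmoid");
	linear_backward(da2, a2, a1, w2, b2, dw2, db2, da1, m, learning_rate, "relu");
	linear_backward(da1, a1, x,  w1, b1, dw1, db1, da0, m, learning_rate, "relu");
}
```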

# 4. Analysis of Experimental Results
Cost over 8000 iterations:
(Figure: cost curves over 8000 iterations for different learning rates)
It is easy to see that a higher learning rate reaches a lower cost, but after a certain number of iterations the cost starts to fluctuate.

Accuracy over 8000 iterations:
(Figure: training-set and test-set accuracy over 8000 iterations)
From the figure, after a certain number of iterations both the training and test accuracy fluctuate; moreover, while the training accuracy keeps rising, the test accuracy stops improving and even declines, i.e., the model ends up overfitting.

# 5. Closing Remarks

Deep neural networks have many layers and parameters, so they are hard to train, and 8000 training iterations take a long time. The next article will cover some optimization methods (such as Adam) and ways to deal with overfitting (such as dropout).

Code for the experiments: 码云 (Gitee)
