A simple neural network with numpy (Python 3)

Perceptron

A simple perceptron structure is shown in the figure below. It consists of three input nodes and one output node: the input nodes x1, x2, x3 hold the three feature values of an input sample x; w1, w2, w3 are the weights of the corresponding features; b is the bias term; and z and o in the output node are the value after the linear transformation and the value after the non-linear transformation, respectively.

$$\begin{cases}z = x_1*w_1+x_2*w_2+x_3*w_3+b\\o=f(z)\end{cases}\tag{1}$$
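A minimal numpy sketch of Equation 1 (the feature values, weights, bias, and the choice of sigmoid below are illustrative placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])   # one sample: x1, x2, x3
w = np.array([0.1, 0.4, -0.3])   # weights:    w1, w2, w3
b = 0.2                          # bias

z = x @ w + b                    # linear part of Equation 1
o = sigmoid(z)                   # non-linear part of Equation 1
print(z, o)
```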

where the mapping $f$ is the activation function. A few common activation functions are listed below; a quick numerical check of the derivative column follows the table:

| Function | Expression | Derivative |
| --- | --- | --- |
| sigmoid | $f(z)=\dfrac{1}{1+e^{-z}}$ | $f(z)[1-f(z)]$ |
| tanh | $f(z)=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}$ | $1-f(z)^2$ |
| softmax | $f(z_i)=\dfrac{e^{z_i}}{\sum_{j=0}^n e^{z_j}}$ | derivative of the cross-entropy loss it commonly forms: $f(z_i)-t_i$ [^1] |
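A quick numerical check of the sigmoid and tanh rows of the derivative column, comparing each closed form against a central finite difference (the test points and step size are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 7)
eps = 1e-6

# sigmoid: f'(z) = f(z)[1 - f(z)]
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric, sigmoid(z) * (1 - sigmoid(z))))   # True

# tanh: f'(z) = 1 - f(z)^2
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
print(np.allclose(numeric, 1 - np.tanh(z) ** 2))             # True
```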

Neural Network

Basic structure of a neural network

A neural network is similar to a perceptron, but its nodes are more complex. The figure below shows a network with one hidden layer, the simplest kind of neural network: the input layer has 2 nodes, the hidden layer has 3 nodes, and the output layer has 1 node. A neural network can be viewed as a composition of multiple perceptrons. Taking the structure in the figure as an example, we will implement a neural network that can classify data.

Suppose we have N samples, each with two feature values. Every such sample $\textbf{\textit{x}}(x_1,x_2)$ satisfies Equation 2, where a parenthesized superscript denotes the layer of the network and $w_{ij}$ is the weight between two nodes in adjacent layers, with $i$ indexing the node in the previous layer and $j$ the node in the next layer.

$$\begin{cases}z^{(1)}_{1} = x_1*w^{(1)}_{11}+x_2*w^{(1)}_{21}+b^{(1)}_1,\ h_1=f(z^{(1)}_{1})\\z^{(1)}_2 = x_1*w^{(1)}_{12}+x_2*w^{(1)}_{22}+b^{(1)}_2,\ h_2=f(z^{(1)}_2)\\z^{(1)}_3 = x_1*w^{(1)}_{13}+x_2*w^{(1)}_{23}+b^{(1)}_3,\ h_3=f(z^{(1)}_3)\\z^{(2)} = h_1*w^{(2)}_{1}+h_2*w^{(2)}_{2}+h_3*w^{(2)}_{3}+b^{(2)},\ o=f(z^{(2)})\end{cases}\tag{2}$$
Equation 2 can be rewritten in matrix form:
$$\begin{cases}Z_1=X\cdot W_1+B_1\\ H=f(Z_1)\\ Z_2=H\cdot W_2+B_2\\ \hat Y=f(Z_2)\end{cases}\tag{3}$$

In Equation 3, $X_{[N\times2]}$ is the input matrix, $B_{1\,[N\times3]}$ is the hidden-layer bias matrix, $W_{1\,[2\times3]}$ is the weight matrix from the input layer to the hidden layer, $W_{2\,[3\times1]}$ is the weight matrix from the hidden layer to the output layer, $B_{2\,[N\times1]}$ is the output-layer bias matrix, and $\hat{Y}_{[N\times1]}$ is the output matrix (the predictions). $Z_1$ and $H$ have shape $N\times3$, and $Z_2$ has shape $N\times1$.
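A minimal sketch of the matrix-form forward pass in Equation 3, using the shapes listed above (all values are random placeholders and sigmoid is just an example activation):

```python
import numpy as np

N = 5                                # number of samples (placeholder)
X  = np.random.random((N, 2))        # input matrix          N x 2
W1 = np.random.random((2, 3))        # input -> hidden       2 x 3
B1 = np.random.random((N, 3))        # hidden-layer biases   N x 3
W2 = np.random.random((3, 1))        # hidden -> output      3 x 1
B2 = np.random.random((N, 1))        # output-layer biases   N x 1

def f(z):                            # example activation (sigmoid)
    return 1 / (1 + np.exp(-z))

Z1 = X @ W1 + B1                     # N x 3
H  = f(Z1)                           # N x 3
Z2 = H @ W2 + B2                     # N x 1
Y_hat = f(Z2)                        # N x 1
print(Z1.shape, H.shape, Z2.shape, Y_hat.shape)
```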

Loss function of the neural network

Here we use a rescaled squared-error formula as the loss function for the network's classification predictions; the matrix of correct labels is written $Y_{[N\times1]}$:
$$func = \dfrac{1}{2N}*\sum_{i=1}^N{(\hat Y_i-Y_i)^2}\tag{4}$$
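Equation 4 as a small self-contained sketch (the predictions and labels below are made-up column vectors):

```python
import numpy as np

Y_hat = np.array([[0.9], [0.1], [0.8], [0.3]])   # example predictions
Y     = np.array([[1.0], [0.0], [1.0], [0.0]])   # example labels
N = Y.shape[0]

func = np.sum((Y_hat - Y) ** 2) / (2 * N)        # Equation 4
print(func)
```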
Following gradient descent, we need the gradient of the loss $func$ (an implementation of gradient descent can be found here). The loss can be written as $func=f(X,W_1,W_2,B_1,B_2)$ (similarly, $Z_1=f(X,W_1,B_1)$ and $Z_2=f(Z_1,W_2,B_2)$). Since $W_1,W_2,B_1,B_2$ are the parameters we want to train, we take the gradient of $func$ with respect to each of them (this involves matrix derivatives, see the appendix).

$$\begin{cases}
\dfrac{\partial func}{\partial W_2}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial W_2} = \left(\dfrac{1}{N}*\sum_{i=1}^N{(\hat Y-Y)}\right)*\left(Z_1^T\cdot f'(Z_1,W_2,B_2)\right)\\[2ex]
\dfrac{\partial func}{\partial B_2}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial B_2} = \left(\dfrac{1}{N}*\sum_{i=1}^N{(\hat Y-Y)}\right)*f'(Z_1,W_2,B_2)\\[2ex]
\dfrac{\partial func}{\partial W_1}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial Z_1}*\dfrac{\partial Z_1}{\partial W_1} = \left(\dfrac{1}{N}*\sum_{i=1}^N{(\hat Y-Y)}\right)*\left(f'(Z_1,W_2,B_2)\cdot W_2^T\right)*\left(X^T\cdot f'(X,W_1,B_1)\right)\\[2ex]
\dfrac{\partial func}{\partial B_1}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial Z_1}*\dfrac{\partial Z_1}{\partial B_1} = \left(\dfrac{1}{N}*\sum_{i=1}^N{(\hat Y-Y)}\right)*\left(f'(Z_1,W_2,B_2)\cdot W_2^T\right)*f'(X,W_1,B_1)
\end{cases}\tag{5}$$

After the gradients are computed, gradient descent updates the parameters; taking $W_1$ as an example:
$$W_1 = W_1 - \eta*\dfrac{\partial func}{\partial W_1}\tag{6}$$
Equation 6 shows that the gradient matrix of $W_1$ must have the same shape as $W_1$, i.e. $\frac{\partial func}{\partial Z_2}_{[1\times1]}*\frac{\partial Z_2}{\partial Z_1}_{[N\times1]\cdot[3\times1]}*\frac{\partial Z_1}{\partial W_1}_{[2\times N]\cdot[N\times3]}$ must have the same shape as $W_{1\,[2\times3]}$, which requires $N=1$. So in the implementation we train one sample at a time rather than all $N$ samples at once. With $N=1$, Equation 5 simplifies to:
$$\begin{cases}
\dfrac{\partial func}{\partial W_2}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial W_2} = (\hat Y-Y)*\left(Z_1^T* f'(Z_1,W_2,B_2)\right)\\[2ex]
\dfrac{\partial func}{\partial B_2}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial B_2} = (\hat Y-Y)* f'(Z_1,W_2,B_2)\\[2ex]
\dfrac{\partial func}{\partial W_1}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial Z_1}*\dfrac{\partial Z_1}{\partial W_1} = (\hat Y-Y)*\left(f'(Z_1,W_2,B_2)* W_2^T\right)*\left(X^T\cdot f'(X,W_1,B_1)\right)\\[2ex]
\dfrac{\partial func}{\partial B_1}=\dfrac{\partial func}{\partial Z_2}*\dfrac{\partial Z_2}{\partial Z_1}*\dfrac{\partial Z_1}{\partial B_1} = (\hat Y-Y)*\left(f'(Z_1,W_2,B_2)* W_2^T\right)*f'(X,W_1,B_1)
\end{cases}\tag{7}$$
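Below is a minimal single-sample (N = 1) sketch of the gradients in Equation 7, mirroring the backward pass used in the appendix code; sigmoid is used for both layers and all values are random placeholders. As in the appendix implementation, the transposed activated hidden output plays the role of the $Z_1^T$ factor in the $\partial func/\partial W_2$ term:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    return sigmoid(z) * (1 - sigmoid(z))

np.random.seed(0)
x  = np.random.random((1, 2))                    # one sample, two features
y  = np.array([[1.0]])                           # its label
W1 = np.random.random((2, 3)); B1 = np.random.random((1, 3))
W2 = np.random.random((3, 1)); B2 = np.random.random((1, 1))

# forward pass (Equation 3 with N = 1)
Z1 = x @ W1 + B1
H  = sigmoid(Z1)
Z2 = H @ W2 + B2
Y_hat = sigmoid(Z2)

# gradients of Equation 7
dW2 = (Y_hat - y) * (H.T * sigmoid_grad(Z2))                               # 3 x 1
dB2 = (Y_hat - y) * sigmoid_grad(Z2)                                       # 1 x 1
dW1 = (Y_hat - y) * (sigmoid_grad(Z2) * W2.T) * (x.T @ sigmoid_grad(Z1))   # 2 x 3
dB1 = (Y_hat - y) * (sigmoid_grad(Z2) * W2.T) * sigmoid_grad(Z1)           # 1 x 3
print(dW2.shape, dB2.shape, dW1.shape, dB1.shape)
```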
Implementing the above, with sigmoid as the hidden-layer activation and tanh as the output-layer activation, the classification error rate comes out around 0.01~0.06; with tanh for both the hidden and output layers the error rate is even lower. The figure below shows the classification result at an error rate of 0.025, where 5 data points are misclassified.

Limitations

Because we train one sample at a time, the learned bias rows are tied one-to-one to those training samples, so this model cannot draw a decision boundary or predict genuinely new data. Predicting new data would seemingly require interpolating the trained parameters, but I have not seen anyone do that, so it is probably not a good approach.

Appendix

Neural network code

```python
# -*- encoding=utf-8 -*-
__Author__ = "stubborn vegeta"

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from matplotlib.colors import ListedColormap

class neuralNetwork(object):
	def __init__(self, X, Y, inputLayer, outputLayer, hiddenLayer=3,learningRate=0.01, epochs=10):
		"""
		learningRate: learning rate
		epochs: number of training iterations
		inputLayer: number of input-layer nodes
		hiddenLayer: number of hidden-layer nodes
		outputLayer: number of output-layer nodes
		"""
		self.learningRate = learningRate
		self.epochs = epochs
		self.inputLayer = inputLayer
		self.hiddenLayer = hiddenLayer
		self.outputLayer = outputLayer
		self.X = X
		self.Y = Y
		self.lenX,_ = np.shape(self.X)
		np.random.seed(0)
		# W1: weights between input and hidden layer; W2: weights between hidden and output layer
		# B1: biases of the hidden-layer nodes; B2: biases of the output-layer nodes
		self.W1 = np.array(np.random.random([self.inputLayer, self.hiddenLayer])*0.5)   	 #2*3
		self.B1 = np.array(np.random.random([self.lenX,self.hiddenLayer])*0.5)               #200*3
		self.W2 = np.array(np.random.random([self.hiddenLayer, self.outputLayer])*0.5)  	 #3*1
		self.B2 = np.array(np.random.random([self.lenX,self.outputLayer])*0.5)               #200*1

	def activationFunction(self, funcName:str, X):
		"""
		Activation functions
		sigmoid: 1/(1+e^(-z))
		tanh: [e^z-e^(-z)]/[e^z+e^(-z)]
		softmax: e^zi/sum(e^zj)
		"""
		switch = {
				"sigmoid": 1/(1+np.exp(-X)),
				"tanh": np.tanh(X), 
				# "softmax": np.exp(X-np.max(X))/np.sum(np.exp(X-np.max(X)), axis=0)
				}
		return switch[funcName]

	def activationFunctionGrad(self, funcName:str, X):
		"""
		Derivatives of the activation functions
		"""
		switch = {
				"sigmoid": np.exp(-X)/(1+np.exp(-X))**2,
				"tanh": 1-(np.tanh(X)**2),
				# "softmax": np.exp(X-np.max(X))/np.sum(np.exp(X-np.max(X)), axis=0)
				}
		return switch[funcName]

	def train(self, funcNameH:str, funcNameO:str):
		"""
		funcNameH: activation function of the hidden layer
		funcNameO: activation function of the output layer
		"""
		for i in range(0,self.epochs):
			j = np.random.randint(self.lenX)
			x = np.array([self.X[j]])
			y = np.array([self.Y[j]])
			b1 = np.array([self.B1[j]])
			b2 = np.array([self.B2[j]])
			# forward pass (Equation 3, one sample)
			zHidden = x.dot(self.W1)+b1
			z1 = self.activationFunction(funcNameH, zHidden)  #1*3
			zOutput = z1.dot(self.W2)+b2
			z2 = self.activationFunction(funcNameO, zOutput)  #1*1 

			# backward pass: gradients from Equation 7
			dW2 = (z2-y)*(z1.T*self.activationFunctionGrad(funcNameO,zOutput))
			db2 = (z2-y)*self.activationFunctionGrad(funcNameO,zOutput)
			dW1 = (z2-y)*(self.activationFunctionGrad(funcNameO,zOutput)*self.W2.T)*(x.T.dot(self.activationFunctionGrad(funcNameH,zHidden)))
			db1 = (z2-y)*(self.activationFunctionGrad(funcNameO,zOutput)*self.W2.T)*self.activationFunctionGrad(funcNameH,zHidden)

			# update the parameters (Equation 6)
			self.W2 -= self.learningRate*dW2
			self.B2[j] -= self.learningRate*db2[0]
			self.W1 -= self.learningRate*dW1
			self.B1[j] -= self.learningRate*db1[0]
		return 0

	def predict(self, xNewData, funcNameH:str, funcNameO:str):
		X = xNewData										 #200*2
		N,_ = np.shape(X)
		yPredict = []
		for j in range(0,N):	
			x = np.array([X[j]])
			b1 = np.array([self.B1[j]])
			b2 = np.array([self.B2[j]])
			# forward pass
			zHidden = x.dot(self.W1)+b1
			z1 = self.activationFunction(funcNameH, zHidden)  #1*3
			zOutput = z1.dot(self.W2)+b2
			z2 = self.activationFunction(funcNameO, zOutput)  #1*1 
			z2 = 1 if z2>0.5 else 0
			yPredict.append(z2)
		return yPredict,N


if __name__ == "__main__":
	X,Y = datasets.make_moons(200, noise=0.15)
	neural_network = neuralNetwork (X=X, Y=Y, learningRate=0.2, epochs=1000, inputLayer=2, hiddenLayer=3, outputLayer=1)
	funcNameH = "sigmoid"
	funcNameO = "tanh"
	neural_network.train(funcNameH=funcNameH,funcNameO=funcNameO)		
	yPredict,N = neural_network.predict(xNewData=X,funcNameH=funcNameH,funcNameO=funcNameO)
	print("错误率:", sum((Y-yPredict)**2)/N)
	colormap = ListedColormap(['royalblue','forestgreen'])				# colors in the colormap indicate the class labels
	plt.subplot(1,2,1)
	plt.scatter(X[:,0],X[:,1],s=40, c=Y, cmap=colormap)
	plt.xlabel("x")
	plt.ylabel("y")
	plt.title("Standard data")
	plt.subplot(1,2,2)
	plt.scatter(X[:,0],X[:,1],s=40, c=yPredict, cmap=colormap)
	plt.xlabel("x")
	plt.ylabel("y")
	plt.title("Predicted data")
	plt.show()
```

Perceptron diagram code (Graphviz)

```dot
digraph network{
edge[fontname="Monaco"]
node[fontname="Monaco"]
rankdir=LR
b[shape=plaintext] 
x1->"z|o"[label=w1]
x2->"z|o"[label=w2]
x3->"z|o"[label=w3]
b->"z|o"
{rank=same;b;"z|o"}
}
```

Neural network diagram code (Graphviz)

```dot
digraph network{
	edge[fontname="Monaco"]
	node[fontname="Monaco",shape=circle]
	rankdir=LR

	subgraph cluster_1{
		color = white
		fontname="Monaco"
		x1,x2;
		label = "Input Layer";
	}
	subgraph cluster_2{
		color = white
		fontname="Monaco"
		h3,h1,h2;
		label = "Hidden Layer";
	}
	subgraph cluster_3{
		// rank=same
		color = white
		fontname="Monaco"
		o;
		label = "Output Layer";
	}
	x1->h1
	x1->h2
	x1->h3
	x2->h1
	x2->h2
	x2->h3
	rank=same;h1;h2;h3
	h1->o
	h2->o
	h3->o		
}
```

Matrix derivative formulas

$$Y=A\cdot X \Longrightarrow \dfrac{dY}{dX}=A^T \qquad\qquad Y=X\cdot A \Longrightarrow \dfrac{dY}{dX}=A^T$$
$$Y=X^T\cdot A \Longrightarrow \dfrac{dY}{dX}=A \qquad\qquad Y=A\cdot X \Longrightarrow \dfrac{dY}{dX^T}=A$$
$$\dfrac{dX^T}{dX}=I \qquad\qquad \dfrac{dX}{dX^T}=I$$
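These identities follow the denominator-layout convention. As a rough numerical sanity check of the first one (with random placeholder matrices), the finite-difference Jacobian with entries $\partial Y_i/\partial X_j$ should equal $A$, which is what $dY/dX=A^T$ states in that layout:

```python
import numpy as np

np.random.seed(0)
A = np.random.random((3, 2))
X = np.random.random((2, 1))
eps = 1e-6

# finite-difference Jacobian of Y = A @ X: Jac[i, j] = dY_i / dX_j
Jac = np.zeros((3, 2))
for j in range(2):
    dX = np.zeros((2, 1)); dX[j] = eps
    Jac[:, j] = ((A @ (X + dX) - A @ X) / eps).ravel()

# dY/dX = A^T in denominator layout <=> the Jacobian dY_i/dX_j equals A
print(np.allclose(Jac, A))   # True
```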

[^1]: Softmax函数与交叉熵 (Softmax and cross-entropy)
