Build a neural network with one hidden layer
版权声明:本文为博主原创文章,未经博主允许不得转载。
上一篇讲的是如何实现一个Logistic Regression分类器,Neural network其实和LR是很相似的,可以把Neural Network看作是有多个LR对叠起来实现的.只要理解了Logistic Regression,就不难理解Neural Network.
本文的主要内容
- 实现一个2分类的,单个隐藏层的神经网络模型
- 神经元的非线性激活,使用tanh函数
- 计算交叉熵损失
- 实现正向和反向传播
- 更新参数
- 超参数的选择
1 - Packages
- numpy
- sklearn:scikit-learn
- matplotlib
# Package imports
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import sklearn.linear_model
%matplotlib inline
np.random.seed(1) # set a seed so that the results are consistent
2.Helper function
# 生成训练数据
def create_dataset(m = 400, D = 2):
"""
m : number of example
D : number of features
N : number of class
X : data matrix each row is a single example
Y : label vector
"""
np.random.seed(1)
N = int(m/2)
X = np.zeros((m,D))
Y = np.zeros((m,1), dtype='uint8') # (0 for red, 1 for blue)
a = 4 # maximum ray of the flower
for j in range(2):
ix = range(N*j,N*(j+1))
t = np.linspace(j*3.12,(j+1)*3.12,N) + np.random.randn(N)*0.2 # theta
r = a*np.sin(4*t) + np.random.randn(N)*0.2 # radius
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
Y[ix] = j
# X:shape(m, D)
# Y:shape(m, 1)
return X, Y
# 画出模型的分类决策边界
def plot_decision_boundary(model, X, y):
# Set min and max values and give it some padding
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the function value for the whole grid
Z = model(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(X[:,0], X[:,1], c=y[:,0], cmap=plt.cm.Spectral)
#3. Create and overview dataset ##
- create data function : create_dataset()
- 随即生成一些,两个类别的训练数据
3.1 generate data
X, Y = create_dataset(400, 2)
print ('The shape of X is: ' + str(X.shape))
print ('The shape of Y is: ' + str(Y.shape))
print ('We have m = %d training examples!' % (X.shape[0]))
The shape of X is: (400, 2)
The shape of Y is: (400, 1)
We have m = 400 training examples!
3.2 visualize dataset
- 目标:build a model 拟合这些数据
# Visualize the data:
plt.scatter(X[:, 0], X[:, 1], c=Y[:,0], s=30, cmap=plt.cm.Spectral);
training dataset:
- a numpy-array (matrix) X,features (x1, x2)
- a numpy-array (vector) Y,labels (red:0, blue:1).
#4. Simple Logistic Regression
在实现全连接网络之前,先使用Logistic Regression 分类器来fit数据,看看LR在这个问题上的表现如何,通过sklearn来实现Logistic Regression非常简单,两行代码搞定.
##4.1 train logistic regression classifier ###
# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, Y.reshape(X.shape[0],))
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
fit_intercept=True, intercept_scaling=1.0, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
##4.2 plot decision boundary
# Plot the decision boundary for logistic regression
plot_decision_boundary(lambda x: clf.predict(x), X, Y)
plt.title("Logistic Regression")
# Print accuracy
LR_predictions = clf.predict(X)
print ('Accuracy of logistic regression: %d ' % float((np.dot(Y[:,0],LR_predictions) + np.dot(1-Y[:,0],1-LR_predictions))/float(Y.size)*100) +
'% ' + "(percentage of correctly labelled datapoints)")
Accuracy of logistic regression: 47 % (percentage of correctly labelled datapoints)
Output:
Accuracy | 47% |
分类的准确率只有47%,logistic regression对数据拟合不是很好.下面使用neural network来对数据进行分类. Let’s try this now!
5 - Neural Network model
a Neural Network with a single hidden layer.
Here is our model:
Mathematically:
For one example
x
(
i
)
x^{(i)}
x(i):
z
[
1
]
(
i
)
=
W
[
1
]
x
(
i
)
+
b
[
1
]
(
i
)
(1)
z^{[1] (i)} = W^{[1]} x^{(i)} + b^{[1] (i)}\tag{1}
z[1](i)=W[1]x(i)+b[1](i)(1)
a
[
1
]
(
i
)
=
tanh
(
z
[
1
]
(
i
)
)
(2)
a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}
a[1](i)=tanh(z[1](i))(2)
z
[
2
]
(
i
)
=
W
[
2
]
a
[
1
]
(
i
)
+
b
[
2
]
(
i
)
(3)
z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2] (i)}\tag{3}
z[2](i)=W[2]a[1](i)+b[2](i)(3)
y
^
(
i
)
=
a
[
2
]
(
i
)
=
σ
(
z
[
2
]
(
i
)
)
(4)
\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}
y^(i)=a[2](i)=σ(z[2](i))(4)
KaTeX parse error: Undefined control sequence: \mbox at position 42: …gin{cases} 1 & \̲m̲b̲o̲x̲{if } a^{[2](i)…
计算 m 个样本时的 cost
J
J
J as follows:
J
=
−
1
m
∑
i
=
0
m
(
y
(
i
)
log
(
a
[
2
]
(
i
)
)
+
(
1
−
y
(
i
)
)
log
(
1
−
a
[
2
]
(
i
)
)
)
(6)
J = - \frac{1}{m} \sum\limits_{i = 0}^{m} \large\left(\small y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \large \right) \small \tag{6}
J=−m1i=0∑m(y(i)log(a[2](i))+(1−y(i))log(1−a[2](i)))(6)
Reminder: Bulid neural network 的步骤:
-
确定网络的结构 ( # of input units, # of hidden units, etc).
-
初始化模型的参数
-
Loop:
- 正向传播
- 计算loss
- 反向传播,计算梯度
- 更新参数 (gradient descent)
需要实现一些辅助函数来实现1-3,然后在将辅助函数集中到'nn_model()'中,最后训练模型,学习参数.在新的数据上进行预测.
##5.1 - Defining the neural network structure ###
Exercise: Define three variables:
- n_x: the size of the input layer
- n_h: the size of the hidden layer (set this to 4)
- n_y: the size of the output layer
# GRADED FUNCTION: layer_sizes
def layer_sizes(X, Y):
"""
Arguments:
X -- input dataset of shape ( number of examples,inputs/features)
Y -- labels of shape (number of examples, output)
Returns:
n_x -- the size of the input layer
n_h -- the size of the hidden layer
n_y -- the size of the output layer
"""
n_x = X.shape[1]
n_h = 4 # hard code
n_y = Y.shape[1]
return (n_x, n_h, n_y)
test function
(n_x, n_h, n_y) = layer_sizes(X, Y)
print("The size of the input layer is: n_x = " + str(n_x))
print("The size of the hidden layer is: n_h = " + str(n_h))
print("The size of the output layer is: n_y = " + str(n_y))
The size of the input layer is: n_x = 2
The size of the hidden layer is: n_h = 4
The size of the output layer is: n_y = 1
##5.2 - Initialize the model’s parameters
初始化参数: function initialize_parameters()
.
初始化方法:
- 随即初始化.
- Use:
np.random.randn(a,b) * 0.01
to randomly initialize a matrix of shape (a,b).
- Use:
- 全0初始化
- Use:
np.zeros((a,b))
to initialize a matrix of shape (a,b) with 0.
- Use:
- 使用不同的初始化方法,观察对模型的影响
# GRADED FUNCTION: initialize_parameters
# 提供两种初始化方案,增加标志参数,flag
def initialize_parameters(n_x, n_h, n_y, flag=0):
"""
Argument:
n_x -- size of the input layer
n_h -- size of the hidden layer
n_y -- size of the output layer
flag = 0 ,random initial
flag = 1 , zeros initial
Returns:
params -- python dictionary containing your parameters:
W1 -- weight matrix of shape (n_h, n_x)
b1 -- bias vector of shape (n_h, 1)
W2 -- weight matrix of shape (n_y, n_h)
b2 -- bias vector of shape (n_y, 1)
"""
np.random.seed(2) #set up a seed although the initialization is random.
if flag:
W1 = np.zeros((n_h, n_x))
b1 = np.zeros((n_h, 1))
W2 = np.zeros((n_y, n_h))
b2 = np.zeros((n_y, 1))
else :
W1 = np.random.randn(n_h, n_x)*0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h)*0.01
b2 = np.zeros((n_y, 1))
assert (W1.shape == (n_h, n_x))
assert (b1.shape == (n_h, 1))
assert (W2.shape == (n_y, n_h))
assert (b2.shape == (n_y, 1))
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
return parameters
test function:initialize_parameters()
- 随即初始化
- 0 初始化
parameters = initialize_parameters(n_x, n_h, n_y)
print("W1 : shape " + str(parameters["W1"].shape))
print("b1 : shape " + str(parameters["b1"].shape))
print("W2 : shape " + str(parameters["W2"].shape))
print("b2 : shape " + str(parameters["b2"].shape))
print('------------')
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
W1 : shape (4, 2)
b1 : shape (4, 1)
W2 : shape (1, 4)
b2 : shape (1, 1)
------------
W1 = [[-0.00416758 -0.00056267]
[-0.02136196 0.01640271]
[-0.01793436 -0.00841747]
[ 0.00502881 -0.01245288]]
b1 = [[ 0.]
[ 0.]
[ 0.]
[ 0.]]
W2 = [[-0.01057952 -0.00909008 0.00551454 0.02292208]]
b2 = [[ 0.]]
parameters_0 = initialize_parameters(n_x, n_h, n_y, 1)
print("W1 = " + str(parameters_0["W1"]))
print("b1 = " + str(parameters_0["b1"]))
print("W2 = " + str(parameters_0["W2"]))
print("b2 = " + str(parameters_0["b2"]))
W1 = [[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]]
b1 = [[ 0.]
[ 0.]
[ 0.]
[ 0.]]
W2 = [[ 0. 0. 0. 0.]]
b2 = [[ 0.]]
##5.3 - The Loop ####
正向传播:forward_propagation()
用到的激活函数和要计算的值:
- sigmoid(),需要实现
- np.tanh(),numpy提供
- Z [ 1 ] , A [ 1 ] , Z [ 2 ] Z^{[1]}, A^{[1]}, Z^{[2]} Z[1],A[1],Z[2] and A [ 2 ] A^{[2]} A[2] ( A [ 2 ] A^{[2]} A[2]包含对所有样本的预测输出).
- 以上计算结果在,反向传播时需要用到
5.3.1 forward propagation
# Function: sigmoid()
def sigmoid(z):
return 1./(1 + np.exp(-z))
# GRADED FUNCTION: forward_propagation
def forward_propagation(X, parameters):
"""
Argument:
X -- input data of size (m, n_x)
parameters -- python dictionary containing parameters
Returns:
A2 -- The sigmoid output of the second activation
cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
"""
W1 = parameters['W1'] # (4,2)
b1 = parameters['b1'] # (4,1)
W2 = parameters['W2'] # (1,4)
b2 = parameters['b2'] # (1,1)
Z1 = np.dot(W1, X.T) + b1 #(n_h, X.shape[0])
A1 = np.tanh(Z1)
Z2 = np.dot(W2, A1) + b2 #(n_y, X.shape[0])
A2 = sigmoid(Z2)
assert(A2.shape == (1, X.shape[0]))
cache = {"Z1": Z1,
"A1": A1,
"Z2": Z2,
"A2": A2}
return A2, cache
用测试数据,测试forward_propagation()
# 测试数据
X_assess = np.random.randn(3, 2)
parameters = {'W1': np.array([[-0.00416758, -0.00056267],
[-0.02136196, 0.01640271],
[-0.01793436, -0.00841747],
[ 0.00502881, -0.01245288]]),
'W2': np.array([[-0.01057952, -0.00909008, 0.00551454, 0.02292208]]),
'b1': np.array([[ 0.],
[ 0.],
[ 0.],
[ 0.]]),
'b2': np.array([[ 0.]])}
A2, cache = forward_propagation(X_assess, parameters)
print('Z1 ,shpae = '+str(cache['Z1'].shape))
print('A1 ,shape = '+str(cache['A1'].shape))
print('Z2 ,shape = '+str(cache['Z2'].shape))
print('A2 ,shape = '+str(cache['A2'].shape))
Z1 ,shpae = (4, 3)
A1 ,shape = (4, 3)
Z2 ,shape = (1, 3)
A2 ,shape = (1, 3)
5.3.2 cost function
A
[
2
]
A^{[2]}
A[2] (in the Python variable “A2
”),矩阵
A
[
2
]
A^{[2]}
A[2] 中的每一个元素
a
[
2
]
(
i
)
a^{[2](i)}
a[2](i) 为模型对样本的d的预测输出.
- cost function as follows:
J = − 1 m ∑ i = 0 m ( y ( i ) log ( a [ 2 ] ( i ) ) + ( 1 − y ( i ) ) log ( 1 − a [ 2 ] ( i ) ) ) J = - \frac{1}{m} \sum\limits_{i = 0}^{m} \large{(} \small y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \large{)} \small J=−m1i=0∑m(y(i)log(a[2](i))+(1−y(i))log(1−a[2](i)))
-
compute_cost(): 计算cost J J J.
-
交叉熵计算numpy: − ∑ i = 0 m y ( i ) log ( a [ 2 ] ( i ) ) - \sum\limits_{i=0}^{m} y^{(i)}\log(a^{[2](i)}) −i=0∑my(i)log(a[2](i)):
logprobs = np.multiply(np.log(A2),Y)
cost = - np.sum(logprobs) # no need to use a for loop!
- 也可以np.dot(A2,Y)
# GRADED FUNCTION: compute_cost
def compute_cost(A2, Y, parameters):
"""
Computes the cross-entropy cost given in equation (13)
Arguments:
A2 -- The sigmoid output of shape (1, number of examples)
Y -- "true" labels vector of shape (number of examples,1)
parameters -- python dictionary containing your parameters W1, b1, W2 and b2
Returns:
cost -- cross-entropy cost given equation (13)
"""
m = Y.shape[0] # number of example
# Compute the cross-entropy cost
# logprobs = np.multiply(np.log(A2), Y.T) + np.multiply((1-Y.T),np.log(1-A2))
# cost = -1*np.sum(logprobs)/m
cost = -1*(np.dot(np.log(A2), Y) + np.dot(np.log(1-A2), (1-Y)))/m
cost = np.squeeze(cost)
# makes sure cost is the dimension we expect.
# E.g., turns [[17]] into 17
#assert(isinstance(cost, float))
return cost
test cost function
- 测试数据:
- A2, Y_assess, parameters
Y_assess = np.random.randn(3, 1)
parameters = {'W1': np.array([[-0.00416758, -0.00056267],
[-0.02136196, 0.01640271],
[-0.01793436, -0.00841747],
[ 0.00502881, -0.01245288]]),
'W2': np.array([[-0.01057952, -0.00909008, 0.00551454, 0.02292208]]),
'b1': np.array([[ 0.],[ 0.],[ 0.],[ 0.]]),
'b2': np.array([[ 0.]])}
A2 = (np.array([[ 0.5002307 , 0.49985831, 0.50023963]]))
print("cost = " + str(compute_cost(A2, Y_assess, parameters)))
cost = 0.6934522895013014
5.3.3 backward propagation.
反向传播: backward_propagation()
实现反向传播的六个方程
- 上标
(
i
)
(i)
(i),表示第
i
i
i个样本
∂ J ∂ z 2 ( i ) = ( a [ 2 ] ( i ) − y ( i ) ) (1) \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } = (a^{[2](i)} - y^{(i)})\tag{1} ∂z2(i)∂J=(a[2](i)−y(i))(1)
∂ J ∂ W 2 = 1 m ∑ i = 1 m ∂ J ∂ z 2 ( i ) a [ 1 ] ( i ) T (2) \frac{\partial \mathcal{J} }{ \partial W_2 } = \frac{1}{m}\sum_{i=1}^m\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } a^{[1] (i) T}\tag{2} ∂W2∂J=m1i=1∑m∂z2(i)∂Ja[1](i)T(2)
∂ J ∂ b 2 = 1 m ∑ i = 1 m ∂ J ∂ z 2 ( i ) (3) \frac{\partial \mathcal{J} }{ \partial b_2 } = \frac{1}{m}\sum_{i=1}^m{\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)}}}\tag{3} ∂b2∂J=m1i=1∑m∂z2(i)∂J(3)
- ⊙ \odot ⊙ : 两向量对应元素相乘,返回等大的向量
- t a n h ( ) tanh() tanh() : t a n h ( z ) = a , t a n h ′ ( z ) = 1 − a 2 tanh(z) = a, tanh'(z) = 1 - a^2 tanh(z)=a,tanh′(z)=1−a2
∂ J ∂ z 1 ( i ) = W 2 T ∂ J ∂ z 2 ( i ) ⊙ ( 1 − a [ 1 ] ( i ) 2 ) (4) \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } = W_2^T \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } \odot ( 1 - a^{[1] (i) 2}) \tag{4} ∂z1(i)∂J=W2T∂z2(i)∂J⊙(1−a[1](i)2)(4)
∂ J ∂ W 1 = 1 m ∑ i = 1 m ∂ J ∂ z 1 ( i ) X T (5) \frac{\partial \mathcal{J} }{ \partial W_1 } =\frac{1}{m}\sum_{i=1}^m \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } X^T \tag{5} ∂W1∂J=m1i=1∑m∂z1(i)∂JXT(5)
∂ J i ∂ b 1 = 1 m ∑ i = 1 m ∂ J ∂ z 1 ( i ) (6) \frac{\partial \mathcal{J} _i }{ \partial b_1 } =\frac{1}{m}\sum_{i=1}^m {\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)}}}\tag{6} ∂b1∂Ji=m1i=1∑m∂z1(i)∂J(6)
- 下面是矩阵乘法版本的六个方程
d Z [ 2 ] = A [ 2 ] − Y (1) dZ^{[2]} = A^{[2]} - Y \tag{1} dZ[2]=A[2]−Y(1)
d W [ 2 ] = 1 m d Z [ 2 ] A [ 1 ] T (2) dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]^T}\tag{2} dW[2]=m1dZ[2]A[1]T(2)
d b [ 2 ] = 1 m ∑ i = 1 m ( d Z [ 2 ] ) (3) db^{[2]} = \frac{1}{m}\sum_{i=1}^m(dZ^{[2]})\tag{3} db[2]=m1i=1∑m(dZ[2])(3)
d Z [ 1 ] = W [ 2 ] T d Z [ 2 ] ⊙ ( 1 − A [ 1 ] 2 ) (4) dZ^{[1]} = W^{[2]^T}dZ^{[2]}\odot(1- A^{[1]^2})\tag{4} dZ[1]=W[2]TdZ[2]⊙(1−A[1]2)(4)
d W [ 1 ] = 1 m d Z [ 1 ] X (5) dW^{[1]} = \frac{1}{m}dZ^{[1]}X \tag{5} dW[1]=m1dZ[1]X(5)
d b [ 1 ] = 1 m ∑ i = 1 m d Z [ 1 ] (6) db^{[1]} = \frac{1}{m}\sum_{i=1}^m dZ^{[1]} \tag{6} db[1]=m1i=1∑mdZ[1](6)
- The notation you will use is common in deep learning coding:
- dW1 = ∂ J ∂ W 1 \frac{\partial \mathcal{J} }{ \partial W_1 } ∂W1∂J
- db1 = ∂ J ∂ b 1 \frac{\partial \mathcal{J} }{ \partial b_1 } ∂b1∂J
- dW2 = ∂ J ∂ W 2 \frac{\partial \mathcal{J} }{ \partial W_2 } ∂W2∂J
- db2 = ∂ J ∂ b 2 \frac{\partial \mathcal{J} }{ \partial b_2 } ∂b2∂J
# GRADED FUNCTION: backward_propagation
def backward_propagation(parameters, cache, X, Y):
"""
Arguments:
parameters -- python dictionary containing our parameters
cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
X -- input data of shape (2, number of examples)
Y -- "true" labels vector of shape (1, number of examples)
Returns:
grads -- python dictionary containing your gradients with respect to different parameters
"""
# X.shape = (400, 2)
# Y.shape = (400, 1)
m = X.shape[0]
W1 = parameters['W1'] # W1.shape = (4, n_x)
W2 = parameters['W2'] # W2.shape = (n_y, 4)
A1 = cache['A1'] # A1.shape = (4, m)
A2 = cache['A2'] # A2.shape = (n_y, m)
# Backward propagation: calculate dW1, db1, dW2, db2.
dZ2 = A2 - Y.T # dZ2.shape = (n_y, m)
dW2 = np.dot(dZ2, A1.T)/m # dW2.shape = (n_y, n_h)
db2 = np.sum(dZ2, axis=1, keepdims=True)/m # db2.shape = (n_y,)
dZ1 = np.multiply(np.dot(W2.T, dZ2), (1 - np.power(A1, 2))) # dZ1.shape = (4, m)
dW1 = np.dot(dZ1, X) # dW1.shape = (n_h, n_x)
db1 = np.sum(dZ1, axis=1, keepdims=True)/m # db1.shape = (n_h, 1)
grads = {"dW1": dW1,
"db1": db1,
"dW2": dW2,
"db2": db2}
return grads
test backward_propagation function
测试数据:
- parameters(同上)
- cache
- X_assess
- Y_assess
X_assess = np.random.randn(3, 2)
Y_assess = np.random.randn(3, 1)
cache = {'A1': np.array([[-0.00616578, 0.0020626 , 0.00349619],
[-0.05225116, 0.02725659, -0.02646251],
[-0.02009721, 0.0036869 , 0.02883756],
[ 0.02152675, -0.01385234, 0.02599885]]),
'A2': np.array([[ 0.5002307 , 0.49985831, 0.50023963]]),
'Z1': np.array([[-0.00616586, 0.0020626 , 0.0034962 ],
[-0.05229879, 0.02726335, -0.02646869],
[-0.02009991, 0.00368692, 0.02884556],
[ 0.02153007, -0.01385322, 0.02600471]]),
'Z2': np.array([[ 0.00092281, -0.00056678, 0.00095853]])}
grads = backward_propagation(parameters, cache, X_assess, Y_assess)
print ("dW1.shape = "+ str(grads["dW1"].shape))
print ("db1.shape = "+ str(grads["db1"].shape))
print ("dW2.shape = "+ str(grads["dW2"].shape))
print ("db2.shape = "+ str(grads["db2"].shape))
dW1.shape = (4, 2)
db1.shape = (4, 1)
dW2.shape = (1, 4)
db2.shape = (1, 1)
5.3.4 update parameters
Question:use (dW1, db1, dW2, db2) update (W1, b1, W2, b2).
梯度下降:
- $ \theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$
-
α
\alpha
α is the learning rate and
θ
\theta
θ 超参数
α \alpha α选择很重要,好的参数,可以让模型更快的学习到最优的权重
# GRADED FUNCTION: update_parameters
def update_parameters(parameters, grads, lr = 1.2):
# lr : learning_rate
"""
Arguments:
parameters -- python dictionary containing your parameters
grads -- python dictionary containing your gradients
Returns:
parameters -- python dictionary containing your updated parameters
"""
W1 = parameters['W1']
b1 = parameters['b1']
W2 = parameters['W2']
b2 = parameters['b2']
dW1 = grads['dW1']
db1 = grads['db1']
dW2 = grads['dW2']
db2 = grads['db2']
W1 -= lr*dW1
b1 -= lr*db1
W2 -= lr*dW2
b2 -= lr*db2
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
return parameters
##5.4 - Integrate parts 4.1, 4.2 and 4.3 in nn_model()
# GRADED FUNCTION: nn_model
def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=500, flag=0, lr=1.2):
"""
Arguments:
X -- dataset of shape (2, number of examples)
Y -- labels of shape (1, number of examples)
n_h -- size of the hidden layer
num_iterations -- Number of iterations in gradient descent loop
print_cost -- if True, print the cost every 100 iterations
flag -- parameters初始化方式,0:随即,1:全0
lr -- learning rate
Returns:
parameters -- parameters learnt by the model. They can then be used to predict.
"""
np.random.seed(3)
n_x = layer_sizes(X, Y)[0]
n_y = layer_sizes(X, Y)[2]
parameters = initialize_parameters(n_x, n_h, n_y, flag)
W1 = parameters['W1']
b1 = parameters['b1']
W2 = parameters['W2']
b2 = parameters['b2']
costs = []
# Loop (gradient descent)
for i in range(0, num_iterations):
# Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
A2, cache = forward_propagation(X, parameters)
# Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
cost = compute_cost(A2, Y, parameters)
# Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
grads = backward_propagation(parameters, cache, X, Y)
# Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
parameters = update_parameters(parameters, grads, lr)
costs.append(cost)
# Print the cost every 1000 iterations
if print_cost and i % print_cost == 0:
print ("Cost after iteration %i: %f" %(i, cost))
return parameters,costs
##4.5 Predictions
Predictions:
-
用forward propagation来预测结果
-
predictions = y p r e d i c t i o n = 1 activation > 0.5 = { 1 if a c t i v a t i o n > 0.5 0 otherwise y_{prediction} = \mathbb 1 \text{{activation > 0.5}} = \begin{cases} 1 & \text{if}\ activation > 0.5 \\ 0 & \text{otherwise} \end{cases} yprediction=1activation > 0.5={10if activation>0.5otherwise
# GRADED FUNCTION: predict
def predict(parameters, X):
"""
parameters -- python dictionary containing your parameters
X -- input data of size (n_x, m)
Returns
predictions -- vector of predictions of our model (red: 0 / blue: 1)
"""
A2, cache = forward_propagation(X, parameters)
predictions = (A2 > 0.5)*1.
return predictions
#6.Training model(全0初始化)
# Build a model with a n_h-dimensional hidden layer
parameters,cost = nn_model(X, Y, n_h = 4, num_iterations=20000, print_cost=1000,lr=1.0,flag=1)
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.693147
Cost after iteration 2000: 0.693147
Cost after iteration 3000: 0.693147
Cost after iteration 4000: 0.693147
Cost after iteration 5000: 0.693147
Cost after iteration 6000: 0.693147
Cost after iteration 7000: 0.693147
Cost after iteration 8000: 0.693147
Cost after iteration 9000: 0.693147
Cost after iteration 10000: 0.693147
Cost after iteration 11000: 0.693147
Cost after iteration 12000: 0.693147
Cost after iteration 13000: 0.693147
Cost after iteration 14000: 0.693147
Cost after iteration 15000: 0.693147
Cost after iteration 16000: 0.693147
Cost after iteration 17000: 0.693147
Cost after iteration 18000: 0.693147
Cost after iteration 19000: 0.693147
Text(0.5,1,'Decision Boundary for hidden layer size 4')
plt.figure(figsize=(14,6))
plt.title('Lost curve')
plt.grid()
plt.plot(cost)
全0初始化模型,梯度不会下降
6.1Training model(随即初始化)
parameters,cost = nn_model(X, Y, n_h = 4, num_iterations=20000, print_cost=1000,lr=1.0)
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693048
Cost after iteration 1000: 0.513391
Cost after iteration 2000: 0.517645
Cost after iteration 3000: 0.516130
Cost after iteration 4000: 0.515024
Cost after iteration 5000: 0.514154
Cost after iteration 6000: 0.513449
Cost after iteration 7000: 0.512871
Cost after iteration 8000: 0.512389
Cost after iteration 9000: 0.511984
Cost after iteration 10000: 0.511640
Cost after iteration 11000: 0.511343
Cost after iteration 12000: 0.511086
Cost after iteration 13000: 0.510861
Cost after iteration 14000: 0.510661
Cost after iteration 15000: 0.510484
Cost after iteration 16000: 0.510325
Cost after iteration 17000: 0.510182
Cost after iteration 18000: 0.510052
Cost after iteration 19000: 0.509933
Text(0.5,1,'Decision Boundary for hidden layer size 4')
plt.figure(figsize=(14,6))
plt.title('Lost curve')
plt.grid()
plt.plot(cost)
# Print accuracy
predictions = predict(parameters, X)
accuracy = float((np.dot(predictions,Y) + np.dot(1-predictions,1-Y))/float(Y.shape[0])*100)
print ('Accuracy: %d '%accuracy+'%')
Accuracy: 68 %
模型的准确率似乎不是很高,有三个参数可以进行调整:
- 增加训练的次数
- 增加隐藏层神经元的数量
- 寻找合适的学习效率
# 49种组合
n_hs = [3, 5, 7, 9, 10, 20 , 30]
lrs = [1.5, 2.0, 2.5, 3.0, 3.3, 3.5, 4.0]
params = []
for n_h in n_hs:
for lr in lrs:
params.append((n_h, lr))
训练模型,并画出分类决策边界
# This may take about 2 minutes to run
plt.figure(figsize=(70,70))
Costs = []
for i, (n_h,lr) in enumerate(params):
plt.subplot(7, 7, i+1)
plt.title('n_b : %d,lr : %f' % (n_h,lr))
parameters,costs = nn_model(X, Y, n_h, num_iterations = 10000, lr=lr,print_cost=0)
Costs.append(costs)
plot_decision_boundary(lambda x: predict(parameters, x), X, Y)
predictions = predict(parameters, X)
accuracy = float((np.dot(predictions,Y) + np.dot(1-predictions,1-Y))/float(Y.shape[0])*100)
print ("Accuracy for {} hidden units, learning rate: {},{}%".format(n_h,lr, accuracy))
Accuracy for 3 hidden units, learning rate: 1.5,68.5%
Accuracy for 3 hidden units, learning rate: 2.0,67.0%
Accuracy for 3 hidden units, learning rate: 2.5,67.0%
Accuracy for 3 hidden units, learning rate: 3.0,67.0%
Accuracy for 3 hidden units, learning rate: 3.3,68.25%
Accuracy for 3 hidden units, learning rate: 3.5,67.25%
Accuracy for 3 hidden units, learning rate: 4.0,78.0%
Accuracy for 5 hidden units, learning rate: 1.5,67.25%
Accuracy for 5 hidden units, learning rate: 2.0,74.0%
Accuracy for 5 hidden units, learning rate: 2.5,68.75%
Accuracy for 5 hidden units, learning rate: 3.0,92.25%
Accuracy for 5 hidden units, learning rate: 3.3,91.75%
Accuracy for 5 hidden units, learning rate: 3.5,92.5%
Accuracy for 5 hidden units, learning rate: 4.0,92.5%
Accuracy for 7 hidden units, learning rate: 1.5,92.0%
Accuracy for 7 hidden units, learning rate: 2.0,92.75%
Accuracy for 7 hidden units, learning rate: 2.5,92.75%
Accuracy for 7 hidden units, learning rate: 3.0,92.75%
Accuracy for 7 hidden units, learning rate: 3.3,92.5%
Accuracy for 7 hidden units, learning rate: 3.5,92.25%
Accuracy for 7 hidden units, learning rate: 4.0,92.0%
Accuracy for 9 hidden units, learning rate: 1.5,92.0%
Accuracy for 9 hidden units, learning rate: 2.0,92.75%
Accuracy for 9 hidden units, learning rate: 2.5,92.75%
Accuracy for 9 hidden units, learning rate: 3.0,92.75%
Accuracy for 9 hidden units, learning rate: 3.3,92.5%
Accuracy for 9 hidden units, learning rate: 3.5,92.5%
Accuracy for 9 hidden units, learning rate: 4.0,92.75%
Accuracy for 10 hidden units, learning rate: 1.5,92.75%
Accuracy for 10 hidden units, learning rate: 2.0,92.75%
Accuracy for 10 hidden units, learning rate: 2.5,92.75%
Accuracy for 10 hidden units, learning rate: 3.0,92.75%
Accuracy for 10 hidden units, learning rate: 3.3,92.75%
Accuracy for 10 hidden units, learning rate: 3.5,92.75%
Accuracy for 10 hidden units, learning rate: 4.0,93.0%
Accuracy for 20 hidden units, learning rate: 1.5,93.75%
Accuracy for 20 hidden units, learning rate: 2.0,92.75%
Accuracy for 20 hidden units, learning rate: 2.5,93.0%
Accuracy for 20 hidden units, learning rate: 3.0,92.75%
Accuracy for 20 hidden units, learning rate: 3.3,91.0%
Accuracy for 20 hidden units, learning rate: 3.5,91.25%
Accuracy for 20 hidden units, learning rate: 4.0,89.5%
Accuracy for 30 hidden units, learning rate: 1.5,92.75%
Accuracy for 30 hidden units, learning rate: 2.0,93.75%
Accuracy for 30 hidden units, learning rate: 2.5,91.75%
Accuracy for 30 hidden units, learning rate: 3.0,91.5%
Accuracy for 30 hidden units, learning rate: 3.3,92.75%
Accuracy for 30 hidden units, learning rate: 3.5,92.0%
Accuracy for 30 hidden units, learning rate: 4.0,92.5%
下图,纵轴为n_hs,横轴为lrs:
- n_hs = [3, 5, 7, 9, 10, 20, 30]
- lrs = [1.5, 2.0, 2.5. 3.0, 3.3, 3.5, 4.0]
cost curve
plt.figure(figsize=(140,140))
for i, (n_h,lr) in enumerate(params):
plt.subplot(7, 7, i+1)
plt.grid()
plt.ylim(0,1)
plt.title('n_b : %d,lr : %f' % (n_h,lr))
plt.plot(Costs[i],c='green')
上图,为49种组合的,cost curve,纵轴为n_hs,横轴为lrs:
- n_hs = [3, 5, 7, 9, 10, 20, 30]
- lrs = [1.5, 2.0, 2.5. 3.0, 3.3, 3.5, 4.0]
plt.figure(figsize=(14,56))
for i,n_h,n in zip(np.arange(0,49,7), n_hs, range(7)):
plt.subplot(7,1, n+1)
plt.ylim(0,0.7)
plt.grid()
plt.title('Hidden Layer of size %d' % n_h)
for lr,cost in zip(lrs,Costs[i: i+7]):
plt.plot(cost, label='lr=%.2f'%lr)
plt.legend()
通过对上面数据的观察找到几组不错的参数(n_h, lr)
- (7, 2.5)
- (9, 2.5)
- (10, 3.5)
- (20, 2.0)
##7.Find best combination
对选出的四组参数对应的cost curve进行可视化
plt.figure(figsize=(14,8))
plt.grid()
plt.ylim(0.1,0.25)
plt.title('Find best combination')
for n_h, lr in [(7,2.5),(9,2.5),(10,3.5),(20,2.0)]:
i = n_hs.index(n_h)
j = lrs.index(lr)
plt.plot(Costs[i*7:(i+1)*7][j], label='n_h:%d,lr=%.2f'%(n_h,lr))
plt.legend()
黄色的曲线表现不错,在迭代5500次之后,很平滑,下降的也比较快.下面继续增加迭代的次数,看看,黄色的曲线是否还会出现震荡.
plt.figure(figsize=(14,8))
plt.grid()
plt.ylim(0.1,0.25)
for i, (n_h,lr) in enumerate([(7,2.0),(10,4.0),(20,1.5),(30,2.0)]):
parameters,costs = nn_model(X, Y, n_h, num_iterations = 30000, lr=lr,print_cost=0)
plt.plot(costs, label='n_h=%d,lr=%.2f'%(n_h,lr))
plt.legend()
增加了训练次数后,除了红色的曲线依然有震荡,且周期也在变大.其他的曲线15000轮的迭代后基本趋于稳定,所以(9, 2.5)可能是最佳组合,对数据拟合比较好,但是可能出现过拟合下降.不过这次的内容先不涉及过拟合的问题.
- n_h = 9
- lr = 2.5
- num_iteration:15000