前言:
1. 模型搭建-训练-测试过程
训练集、验证集和测试集
- 工作流程:
想法idea(论文中的想法)--->选择初始的参数值----->构建神经网络模型结构----->写代码(实现神经网络)------->实验验证参数对应神经网络的表现性能------->根据验证结果,对参数进行调整优化
- 重复以上过程,然后选定最佳参数,使得神经网络最优
- 加速训练过程
- 设置Train/Dev/Test 集合 :
- Train(训练集合): 训练算法模型
- Dev(验证集合): 验证不同算法的表现情况
- Test(测试集合): 测试最好算法的实际表现,作为该算法的无偏估计
选取 合适 的Train/Dev/Test 集合:
- 大样本 : Train:Dev:Test = 98% : 1% : 1% 或者 Train:Dev:Test = 99.5% : 0.5% : 0.5%
- 选取样本尽量保证来自同一分布
- Dev 和 Test集合 尽量保证来自同一分布
- 不够样本如何
"增加"
样本数量-----即使样本来自不同分布,也能提高模型性能
- 进行图片的翻转、裁切和增加随机噪声,扩大样本数量
2. 偏差/方差
2.1 Definition 偏差(bias)和方差(Variance)
- 偏差(bias)代表
欠拟合
- 偏差指的是算法的期望预测与真实预测之间的偏差程度, 反应了模型本身的
拟合
能力- 方差(Variance)代表
过拟合
- 方差度量了同等大小的
训练集
的变动导致学习性能的变化, 刻画了数据扰动所导致的影响。- 模型越
复杂
,拟合程度越高
,偏差越小
,但方差越大,进而该模型过拟合
- 模型越
简单
,拟合程度越低
,偏差越大
,但方差越小,进而该模型欠拟合
2.2 偏差和方差 分析
train error 、Dev error 、bias 、variance、过拟合、欠拟合
概念
train error: 训练集合 模型预测的值和真实值之间的误差(一次测量准确性)
Dev error: 验证集合中, 模型预测的值和真实值之间的误差(多次测量准确性)
- bias: 偏差,高偏差体现欠拟合
- variance: 方差,高方差体现过拟合
- 某个点上的平方误差期望值:
Err ( x ) = E [ ( Y − f ^ ( x ) ) 2 ] \operatorname{Err}(x)=E\left[(Y-\hat{f}(x))^{2}\right] Err(x)=E[(Y−f^(x))2]
- 误差分解:
Err ( x ) = ( E [ f ^ ( x ) ] − f ( x ) ) 2 + E [ ( f ^ ( x ) − E [ f ^ ( x ) ] ) 2 ] + σ e 2 Err ( x ) = Bias 2 + Variance + Irreducible Error \begin{array}{c} \operatorname{Err}(x)=(E[\hat{f}(x)]-f(x))^{2}+E\left[(\hat{f}(x)-E[\hat{f}(x)])^{2}\right]+\sigma_{e}^{2} \\ \operatorname{Err}(x)=\operatorname{Bias}^{2}+\text { Variance + Irreducible Error } \end{array} Err(x)=(E[f^(x)]−f(x))2+E[(f^(x)−E[f^(x)])2]+σe2Err(x)=Bias2+ Variance + Irreducible Error
- error体现bias和varriance
Train集合error偏小(1%),Dev集合error偏大(15%)
---->模型对训练样本
有过拟合
,泛化能力不强------->高方差
Train集error偏大(15%),Dev集合error偏大(16%),两者相近
------>模型对训练样本和验证样本识别都不好,存在欠拟合
------->高偏差
Train集合error偏大(15%),Dev集合error偏大(30%)
----->模型存在高偏差和高方差
----->最坏结果Train集合error偏小(0.5%),Dev集合error偏小(0.5%)
----->模型低偏差和低方差
------>最好结果
该图代表
部分区域存在过拟合
情况,但部分区域存在欠拟合
情况
- 如何 评价该模型好坏
- 传统算法中,
bias和variance是对立的,减少bias会增加variance,增加bias会增加bias
- 现代深度学习中,
bias和variance是 鱼和熊掌可兼得
- 我们
优化模型目标: 低偏差和低方差
1. 深度神经网络的初始化
1.1 梯度消失和梯度爆炸现象
参考这位知乎大佬的手推
https://zhuanlan.zhihu.com/p/76772734
总结:
可以推导出 ∂ C ∂ b 1 = ∂ C ∂ y 4 ∂ y 4 ∂ z 4 ∂ z 4 ∂ x 4 ∂ x 4 ∂ z 3 ∂ z 3 ∂ x 3 ∂ x 3 ∂ z 2 ∂ z 2 ∂ x 2 ∂ x 2 ∂ z 1 ∂ z 1 ∂ b 1 = ∂ C ∂ y 4 σ ′ ( z 4 ) w 4 σ ′ ( z 3 ) w 3 σ ′ ( z 2 ) w 2 σ ′ ( z 1 ) 可以推导出 \begin{array}{l} \frac{\partial C}{\partial b_{1}}=\frac{\partial C}{\partial y_{4}} \frac{\partial y_{4}}{\partial z_{4}} \frac{\partial z_{4}}{\partial x_{4}} \frac{\partial x_{4}}{\partial z_{3}} \frac{\partial z_{3}}{\partial x_{3}} \frac{\partial x_{3}}{\partial z_{2}} \frac{\partial z_{2}}{\partial x_{2}} \frac{\partial x_{2}}{\partial z_{1}} \frac{\partial z_{1}}{\partial b_{1}} \\ =\frac{\partial C}{\partial y_{4}} \sigma^{\prime}\left(z_{4}\right) w_{4} \sigma^{\prime}\left(z_{3}\right) w_{3} \sigma^{\prime}\left(z_{2}\right) w_{2} \sigma^{\prime}\left(z_{1}\right) \end{array} 可以推导出∂b1∂C=∂y4∂C∂z4∂y4∂x4∂z4∂z3∂x4∂x3∂z3∂z2∂x3∂x2∂z2∂z1∂x2∂b1∂z1=∂y4∂Cσ′(z4)w4σ′(z3)w3σ′(z2)w2σ′(z1)
- sigmoid手动推导
S ( x ) = Sigmoid ( x ) = 1 1 + e − x S(x)=\operatorname{Sigmoid}(x)=\frac{1}{1+e^{-x}} S(x)=Sigmoid(x)=1+e−x1
求导过程 : S ′ ( x ) = − 1 ( 1 + e − x ) 2 × ( 1 + e − x ) ′ = − 1 ( 1 + e − x ) 2 × ( − e − x ) = 1 1 + e − x × e − x 1 + e − x = 1 1 + e − x × 1 + e − x − 1 1 + e − x = S ( x ) ( 1 − S ( x ) ) 求导过程: \begin{array}{l} S^{\prime}(x)=-\frac{1}{\left(1+e^{-x}\right)^{2}} \times\left(1+e^{-x}\right)^{\prime}=-\frac{1}{\left(1+e^{-x}\right)^{2}} \times\left(-e^{-x}\right)=\frac{1}{1+e^{-x}} \times \frac{e^{-x}}{1+e^{-x}}=\frac{1}{1+e^{-x}} \\ \times \frac{1+e^{-x}-1}{1+e^{-x}}=S(x)(1-S(x)) \end{array} 求导过程:S′(x)=−(1+e−x)21×(1+e−x)′=−(1+e−x)21×(−e−x)=1+e−x1×1+e−xe−x=1+e−x1×1+e−x1+e−x−1=S(x)(1−S(x))
- 由函数和图像得到
s i g m o i d < = 0.25 { 当 w 矩阵参数都 小于 1 的时候 , ∣ σ ′ ( z ) w ∣ ≤ 1 4 ⇒ ∂ C ∂ y 4 σ ′ ( z 4 ) w 4 σ ′ ( z 3 ) w 3 σ ′ ( z 2 ) w 2 σ ′ ( z 1 ) ↓ 当 w 矩阵参数都 大于 1 的时候 , ∣ σ ′ ( z ) w ∣ > 1 ⇒ ∂ C ∂ y 4 σ ′ ( z 4 ) w 4 σ ′ ( z 3 ) w 3 σ ′ ( z 2 ) w 2 σ ′ ( z 1 ) ↑ sigmoid <= 0.25\begin{cases} 当w矩阵参数都 {\color{Red} 小于} 1的时候,\left|\sigma^{\prime}(z) w\right| \leq \frac{1}{4}\Rightarrow\frac{\partial C}{\partial y_{4}} \sigma^{\prime}\left(z_{4}\right) w_{4} \sigma^{\prime}\left(z_{3}\right) w_{3} \sigma^{\prime}\left(z_{2}\right) w_{2} \sigma^{\prime}\left(z_{1}\right){\color{Red} \downarrow} \\ 当w矩阵参数都{\color{Red} 大于} 1的时候,\left|\sigma^{\prime}(z) w\right| > 1\Rightarrow\frac{\partial C}{\partial y_{4}} \sigma^{\prime}\left(z_{4}\right) w_{4} \sigma^{\prime}\left(z_{3}\right) w_{3} \sigma^{\prime}\left(z_{2}\right) w_{2} \sigma^{\prime}\left(z_{1}\right){\color{Red} \uparrow } \end{cases} sigmoid<=0.25{当w矩阵参数都小于1的时候,∣σ′(z)w∣≤41⇒∂y4∂Cσ′(z4)w4σ′(z3)w3σ′(z2)w2σ′(z1)↓当w矩阵参数都大于1的时候,∣σ′(z)w∣>1⇒∂y4∂Cσ′(z4)w4σ′(z3)w3σ′(z2)w2σ′(z1)↑
w [ l ] = w [ l ] − α d w [ l ] ⇒ { d w [ l ] ↓ , w [ l ] 更新过慢 d w [ l ] ↑ , w [ l ] 更新过快 w^{[l]} = w^{[l]} - \alpha dw^{[l]}\Rightarrow \begin{cases} dw^{[l]}\downarrow ,w^{[l]}更新过慢 \\ dw^{[l]}\uparrow ,w^{[l]} 更新过快 \end{cases} w[l]=w[l]−αdw[l]⇒{dw[l]↓,w[l]更新过慢dw[l]↑,w[l]更新过快
- 显而易见,
梯度爆炸跟层数成正比,由于层数的增加,容易导致梯度消失或梯度下降
且初始矩阵w过大时候,容易导致梯度爆炸现象
1.2 W和b参数初始化
- 初始化方式
- 使用0矩阵进行初始化
- 使用随机数进行初始化
- 使用抑制梯度异常初始化参数
- 初始化目的
- 加快模型下降、模型收敛
- 减少梯度下降收敛过程出现误差的几率
1.3 零矩阵初始化代码
- 初始化代码
'''
Author: Jean_Leung
Date: 2024-04-15 23:03:15
LastEditors: Jean_Leung
LastEditTime: 2024-04-16 10:08:41
FilePath: \Chap2\chap5exercise.py
Description: 深度神经网络的初始化测试
一: 我们来说一下步骤:
1.初始化网络参数
1.1 w和b是零矩阵
1.2 随机初始化w和b矩阵
1.3 He的论文中方法
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import init_utils
from init_utils import load_dataset
# # 如果使用的是Jupyter Notebook,加上 %matplotlib inline
# plt.rcParams['figure.figsize'] =(7.0,4.0)
# plt.rcParams['image.interpolation'] = 'nearest'
# plt.rcParams['image.cmap'] = 'gray'
# # 加载图像数据集:蓝色/红色的圆点
train_X, train_Y, test_X, test_Y = load_dataset(is_plot=False)
# plt.show()
#三种初始化方法
# 初始化为零
'''
description: 第一种w和b矩阵初始化方法,需要初始化为零
param {*} layers_dims
return {*}
'''
def initialize_parameters_zeros(layers_dims):
"""
参数:
layers_dims - 包含我们网络中每个图层的节点数量的列表
返回:
parameters - 包含参数“W1”,“b1”,...,“WL”,“bL”的字典:
"""
# np.random.seed(1)
parameters = {}
L = len(layers_dims)# 层数
for l in range(1, L):
parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
assert(parameters['W' + str(l)].shape == (layers_dims[l],layers_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layers_dims[l],1))
return parameters
# # 测试
# print("=====初始化为零测试=====")
# parameters = initialize_parameters_zeros([3,2,1])
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
1.4 随机矩阵初始化代码
- 随机矩阵初始化代码
'''
Author: Jean_Leung
Date: 2024-04-15 23:03:15
LastEditors: Jean_Leung
LastEditTime: 2024-04-16 10:08:41
FilePath: \Chap2\chap5exercise.py
Description: 深度神经网络的初始化测试
一: 我们来说一下步骤:
1.初始化网络参数
1.1 w和b是零矩阵
1.2 随机初始化w和b矩阵
1.3 He的论文中方法
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import init_utils
from init_utils import load_dataset
# # 如果使用的是Jupyter Notebook,加上 %matplotlib inline
# plt.rcParams['figure.figsize'] =(7.0,4.0)
# plt.rcParams['image.interpolation'] = 'nearest'
# plt.rcParams['image.cmap'] = 'gray'
# # 加载图像数据集:蓝色/红色的圆点
train_X, train_Y, test_X, test_Y = load_dataset(is_plot=False)
# plt.show()
'''
description: 随机数初始化
param {*} layers_dims
return {*}
'''
def initialize_parameters_random(layers_dims):
"""
参数:
layers_dims - 包含我们网络中每个图层的节点数量的列表
返回:
parameters - 包含参数“W1”,“b1”,...,“WL”,“bL”的字典:
"""
np.random.seed(3)
parameters = {}
L = len(layers_dims)# 层数
for l in range(1, L):
parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
assert(parameters['W' + str(l)].shape == (layers_dims[l],layers_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layers_dims[l],1))
return parameters
# # 测试
# print("=====随机初始化测试=====")
# parameters = initialize_parameters_random([3,2,1])
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
1.5 "He"论文中的初始化代码
'''
Author: Jean_Leung
Date: 2024-04-15 23:03:15
LastEditors: Jean_Leung
LastEditTime: 2024-04-16 10:08:41
FilePath: \Chap2\chap5exercise.py
Description: 深度神经网络的初始化测试
一: 我们来说一下步骤:
1.初始化网络参数
1.1 w和b是零矩阵
1.2 随机初始化w和b矩阵
1.3 He的论文中方法
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import init_utils
from init_utils import load_dataset
# # 如果使用的是Jupyter Notebook,加上 %matplotlib inline
# plt.rcParams['figure.figsize'] =(7.0,4.0)
# plt.rcParams['image.interpolation'] = 'nearest'
# plt.rcParams['image.cmap'] = 'gray'
# # 加载图像数据集:蓝色/红色的圆点
train_X, train_Y, test_X, test_Y = load_dataset(is_plot=False)
# plt.show()
# 论文中的"He"初始化
'''
description: He初始化
param {*} layers_dims
return {*}
'''
def initialize_parameters_he(layers_dims):
"""
参数:
layers_dims - 包含我们网络中每个图层的节点数量的列表
返回:
parameters - 包含参数“W1”,“b1”,...,“WL”,“bL”的字典:
"""
np.random.seed(3)
parameters = {}
L = len(layers_dims)# 层数
for l in range(1, L):
parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2/(layers_dims[l-1]))
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
assert(parameters['W' + str(l)].shape == (layers_dims[l],layers_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layers_dims[l],1))
return parameters
# # 测试
# print("=====He始化测试=====")
# parameters = initialize_parameters_he([3,2,1])
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
1.6 总代码
- 主代码
'''
Author: Jean_Leung
Date: 2024-04-15 23:03:15
LastEditors: Jean_Leung
LastEditTime: 2024-04-18 00:25:27
FilePath: \Chap2\chap5exercise.py
Description: 深度神经网络的初始化测试
一: 我们来说一下步骤:
1.初始化网络参数
1.1 w和b是零矩阵
1.2 随机初始化w和b矩阵
1.3 He的论文中方法
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import init_utils
from init_utils import load_dataset
plt.rcParams['figure.figsize'] = (10.0, 10.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# # 如果使用的是Jupyter Notebook,加上 %matplotlib inline
# plt.rcParams['figure.figsize'] =(7.0,4.0)
# plt.rcParams['image.interpolation'] = 'nearest'
# plt.rcParams['image.cmap'] = 'gray'
# # 加载图像数据集:蓝色/红色的圆点
train_X, train_Y, test_X, test_Y = load_dataset(is_plot=False)
# plt.show()
#三种初始化方法
# 初始化为零
'''
description: 第一种w和b矩阵初始化方法,需要初始化为零
param {*} layers_dims
return {*}
'''
def initialize_parameters_zeros(layers_dims):
"""
参数:
layers_dims - 包含我们网络中每个图层的节点数量的列表
返回:
parameters - 包含参数“W1”,“b1”,...,“WL”,“bL”的字典:
"""
# np.random.seed(1)
parameters = {}
L = len(layers_dims)# 层数
for l in range(1, L):
parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
assert(parameters['W' + str(l)].shape == (layers_dims[l],layers_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layers_dims[l],1))
return parameters
# # 测试
# print("=====初始化为零测试=====")
# parameters = initialize_parameters_zeros([3,2,1])
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
'''
description: 随机数初始化
param {*} layers_dims
return {*}
'''
def initialize_parameters_random(layers_dims):
"""
参数:
layers_dims - 包含我们网络中每个图层的节点数量的列表
返回:
parameters - 包含参数“W1”,“b1”,...,“WL”,“bL”的字典:
"""
np.random.seed(3)
parameters = {}
L = len(layers_dims)# 层数
for l in range(1, L):
parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
assert(parameters['W' + str(l)].shape == (layers_dims[l],layers_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layers_dims[l],1))
return parameters
# # 测试
# print("=====随机初始化测试=====")
# parameters = initialize_parameters_random([3,2,1])
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
# 论文中的"He"初始化
'''
description: He初始化
param {*} layers_dims
return {*}
'''
def initialize_parameters_he(layers_dims):
"""
参数:
layers_dims - 包含我们网络中每个图层的节点数量的列表
返回:
parameters - 包含参数“W1”,“b1”,...,“WL”,“bL”的字典:
"""
np.random.seed(3)
parameters = {}
L = len(layers_dims)# 层数
for l in range(1, L):
parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2/(layers_dims[l-1]))
parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
assert(parameters['W' + str(l)].shape == (layers_dims[l],layers_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layers_dims[l],1))
return parameters
# # 测试
# print("=====He始化测试=====")
# parameters = initialize_parameters_he([3,2,1])
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
'''
description: 总测试代码,包括模型的正向、反向和更新,最后输出代码
param {*} X
param {*} Y
param {*} learning_rate
param {*} num_iterations
param {*} print_cost
param {*} initialization
param {*} is_polt
return {*}
'''
def model(X,Y,learning_rate = 0.01, num_iterations = 15000,print_cost = True,initialization="he",is_polt=True,k=1):
"""
参数:
X - 数据集,维度为(2,示例数)
Y - 标签,维度为(1,示例数)
num_iterations - 表示用于优化参数的迭代次数的超参数
print_cost - 如果为True,则每1000次迭代打印一次成本数值
initialization - 表示初始化方法的超参数
is_plot - 设置为true以每100次迭代打印成本
返回:
d - 包含有关模型信息的字典。
"""
graads = {}
costs = []
m = X.shape[1] # 样本数
layers_dims = [X.shape[0],10,5,1] # 深度神经网络层数数组
# n = X.shape[0] # 特征数
# 初始化参数
if initialization == 'zeros':
parameters = initialize_parameters_zeros(layers_dims)
elif initialization == 'random':
parameters = initialize_parameters_random(layers_dims)
elif initialization == 'he':
parameters = initialize_parameters_he(layers_dims)
else :
print("格式不正确")
exit
# 开始迭代
for i in range(0,num_iterations):
#前向传播
a3 , cache = init_utils.forward_propagation(X,parameters)
#计算成本
cost = init_utils.compute_loss(a3,Y)
#反向传播
grads = init_utils.backward_propagation(X,Y,cache)
#更新参数
parameters = init_utils.update_parameters(parameters,grads,learning_rate)
#记录成本
if i % 1000 == 0:
costs.append(cost)
#打印成本
if print_cost:
print("第" + str(i) + "次迭代,成本值为:" + str(cost))
#学习完毕,绘制成本曲线
if is_polt:
plt.subplot(2,2,k)
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate = " + str(learning_rate) +"with" + initialization)
# plt.show()
#返回学习完毕后的参数
return parameters
# 测试,测试模型
initialization_list = ["random","zeros","he"]
for k in range(0,len(initialization_list)):
parameters = model(train_X,train_Y,learning_rate = 0.01, num_iterations = 15000,print_cost = True,initialization=initialization_list[k],is_polt=True,k=k+1)
print("训练集合")
predictions_train = init_utils.predict(train_X,train_Y,parameters)
print("测试集合")
predictions_test = init_utils.predict(test_X,test_Y,parameters)
# 绘制决策边界
print("predicitions_train = " + str(predictions_train))
print("predictions_test = " + str(predictions_test))
# plt.subplot(3,2,i+1)
# plt.title("Model with " + initialization_list[i])
# axes = plt.gca()
# axes.set_xlim([-1.5,1.5])
# axes.set_ylim([-1.5,1.5])
# # lambda表达式
# init_utils.plot_decision_boundary(lambda x: init_utils.predict_dec(parameters,x.T),train_X,train_Y)
plt.show()
- init_utils.py 代码
'''
Author: Jean_Leung
Date: 2024-04-15 23:05:58
LastEditors: Jean_Leung
LastEditTime: 2024-04-16 10:21:50
FilePath: \Chap2\init_utils.py
Description: 参数矩阵初始化工具包
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
# -*- coding: utf-8 -*-
#init_utils.py
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
def sigmoid(x):
"""
Compute the sigmoid of x
Arguments:
x -- A scalar or numpy array of any size.
Return:
s -- sigmoid(x)
"""
s = 1/(1+np.exp(-x))
return s
def relu(x):
"""
Compute the relu of x
Arguments:
x -- A scalar or numpy array of any size.
Return:
s -- relu(x)
"""
s = np.maximum(0,x)
return s
def compute_loss(a3, Y):
"""
Implement the loss function
Arguments:
a3 -- post-activation, output of forward propagation
Y -- "true" labels vector, same shape as a3
Returns:
loss - value of the loss function
"""
m = Y.shape[1]
logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
loss = 1./m * np.nansum(logprobs)
return loss
def forward_propagation(X, parameters):
"""
Implements the forward propagation (and computes the loss) presented in Figure 2.
Arguments:
X -- input dataset, of shape (input size, number of examples)
Y -- true "label" vector (containing 0 if cat, 1 if non-cat)
parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
W1 -- weight matrix of shape ()
b1 -- bias vector of shape ()
W2 -- weight matrix of shape ()
b2 -- bias vector of shape ()
W3 -- weight matrix of shape ()
b3 -- bias vector of shape ()
Returns:
loss -- the loss function (vanilla logistic loss)
"""
# retrieve parameters
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
W3 = parameters["W3"]
b3 = parameters["b3"]
# LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
z1 = np.dot(W1, X) + b1
a1 = relu(z1)
z2 = np.dot(W2, a1) + b2
a2 = relu(z2)
z3 = np.dot(W3, a2) + b3
a3 = sigmoid(z3)
cache = (z1, a1, W1, b1, z2, a2, W2, b2, z3, a3, W3, b3)
return a3, cache
def backward_propagation(X, Y, cache):
"""
Implement the backward propagation presented in figure 2.
Arguments:
X -- input dataset, of shape (input size, number of examples)
Y -- true "label" vector (containing 0 if cat, 1 if non-cat)
cache -- cache output from forward_propagation()
Returns:
gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
"""
m = X.shape[1]
(z1, a1, W1, b1, z2, a2, W2, b2, z3, a3, W3, b3) = cache
dz3 = 1./m * (a3 - Y)
dW3 = np.dot(dz3, a2.T)
db3 = np.sum(dz3, axis=1, keepdims = True)
da2 = np.dot(W3.T, dz3)
dz2 = np.multiply(da2, np.int64(a2 > 0))
dW2 = np.dot(dz2, a1.T)
db2 = np.sum(dz2, axis=1, keepdims = True)
da1 = np.dot(W2.T, dz2)
dz1 = np.multiply(da1, np.int64(a1 > 0))
dW1 = np.dot(dz1, X.T)
db1 = np.sum(dz1, axis=1, keepdims = True)
gradients = {"dz3": dz3, "dW3": dW3, "db3": db3,
"da2": da2, "dz2": dz2, "dW2": dW2, "db2": db2,
"da1": da1, "dz1": dz1, "dW1": dW1, "db1": db1}
return gradients
def update_parameters(parameters, grads, learning_rate):
"""
Update parameters using gradient descent
Arguments:
parameters -- python dictionary containing your parameters
grads -- python dictionary containing your gradients, output of n_model_backward
Returns:
parameters -- python dictionary containing your updated parameters
parameters['W' + str(i)] = ...
parameters['b' + str(i)] = ...
"""
L = len(parameters) // 2 # number of layers in the neural networks
# Update rule for each parameter
for k in range(L):
parameters["W" + str(k+1)] = parameters["W" + str(k+1)] - learning_rate * grads["dW" + str(k+1)]
parameters["b" + str(k+1)] = parameters["b" + str(k+1)] - learning_rate * grads["db" + str(k+1)]
return parameters
def predict(X, y, parameters):
"""
This function is used to predict the results of a n-layer neural network.
Arguments:
X -- data set of examples you would like to label
parameters -- parameters of the trained model
Returns:
p -- predictions for the given dataset X
"""
m = X.shape[1]
p = np.zeros((1,m), dtype = np.int)
# Forward propagation
a3, caches = forward_propagation(X, parameters)
# convert probas to 0/1 predictions
for i in range(0, a3.shape[1]):
if a3[0,i] > 0.5:
p[0,i] = 1
else:
p[0,i] = 0
# print results
print("Accuracy: " + str(np.mean((p[0,:] == y[0,:]))))
return p
def load_dataset(is_plot=True):
np.random.seed(1)
train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)
np.random.seed(2)
test_X, test_Y = sklearn.datasets.make_circles(n_samples=100, noise=.05)
# Visualize the data
if is_plot:
plt.scatter(train_X[:, 0], train_X[:, 1], c=train_Y, s=40, cmap=plt.cm.Spectral);
train_X = train_X.T
train_Y = train_Y.reshape((1, train_Y.shape[0]))
test_X = test_X.T
test_Y = test_Y.reshape((1, test_Y.shape[0]))
return train_X, train_Y, test_X, test_Y
def plot_decision_boundary(model, X, y):
# Set min and max values and give it some padding
x_min, x_max = X[0, :].min() - 1, X[0, :].max() + 1
y_min, y_max = X[1, :].min() - 1, X[1, :].max() + 1
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the function value for the whole grid
Z = model(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(X[0, :], X[1, :], c=y, cmap=plt.cm.Spectral)
plt.show()
def predict_dec(parameters, X):
"""
Used for plotting decision boundary.
Arguments:
parameters -- python dictionary containing your parameters
X -- input data of size (m, K)
Returns
predictions -- vector of predictions of our model (red: 0 / blue: 1)
"""
# Predict using forward propagation and a classification threshold of 0.5
a3, cache = forward_propagation(X, parameters)
predictions = (a3>0.5)
return predictions
1.7 运行结果
2. 深度神经网络的正则化和Dropout处理
2.1 背景
前文说过,如果一个模型出现
过拟合
现象,high variance现象,应该怎么解决,这时我们该如何解决?解决方法:
- 扩大训练样本-------难以实现
- L2正则化处理和L1正则化
- Dropout处理
2.2 L2正则化
- 公式推导
表达公式
( w [ 1 ] , b [ 1 ] , ⋯ , w [ L ] , b [ L ] ) = 1 m ∑ i = 1 m L ( y ^ ( i ) , y ( i ) ) + λ 2 m ∑ l = 1 L ∥ w [ l ] ∥ 2 ∥ w [ l ] ∥ 2 = ∑ i = 1 n [ l ] ∑ j = 1 n [ l − 1 ] ( w i j [ l ] ) 2 ∥ w [ l ] ∥ 2 称为 F r o b e n i u s 范数 \begin{array}{c} \left(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}\right)=\frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)+\frac{\lambda}{2 m} \sum_{l=1}^{L}\left\|w^{[l]}\right\|^{2} \\ \left\|w^{[l]}\right\|^{2}=\sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}}\left(w_{i j}^{[l]}\right)^{2} \\ \left\|w^{[l]}\right\|^{2}称为Frobenius范数 \end{array} (w[1],b[1],⋯,w[L],b[L])=m1∑i=1mL(y^(i),y(i))+2mλ∑l=1L w[l] 2 w[l] 2=∑i=1n[l]∑j=1n[l−1](wij[l])2 w[l] 2称为Frobenius范数
矩阵的范数计算
∥ A ∥ F = ∑ i = 1 m ∑ j = 1 n ∣ a i j ∣ 2 \|A\|_{F}=\sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\left|a_{i j}\right|^{2}} ∥A∥F=i=1∑mj=1∑n∣aij∣2
所以由于我们Lost函数加了正则化项,所以反向传播和更新w参数步骤需要做出改变
反向传播公式出现变化
由 J ( w [ 1 ] , b [ 1 ] , ⋯ , w [ L ] , b [ L ] ) a f t e r = 1 m ∑ i = 1 m L ( y ^ ( i ) , y ( i ) ) + λ 2 m ∑ l = 1 L ∥ w [ l ] ∥ 2 = 1 m ∑ i = 1 m L ( y ^ ( i ) , y ( i ) ) + ( w 11 [ l ] ) 2 + ⋯ + ( w i j [ l ] ) 2 得 且 ∂ J b e f o r e ∂ w [ l ] = d w b e f o r e [ l ] , 令 Q = ( w i j [ l ] ) 2 得 得 d Q d w = 2 ( w i j [ l ] ) 最终得 ∂ J a f t e r ∂ w [ l ] = d w b e f o r e [ l ] + λ 2 m ⋅ 2 ( w i j [ l ] ) = d w b e f o r e [ l ] + λ ⋅ ( w [ l ] ) m \begin{align} &由J\left(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}\right)_{after}=\frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)+\frac{\lambda}{2 m} \sum_{l=1}^{L}\left\|w^{[l]}\right\|^{2} =\frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)+ \left(w_{1 1}^{[l]}\right)^{2}+\cdots+\left(w_{i j}^{[l]}\right)^{2} 得\\ & 且\frac{\partial J_{before}}{\partial w^{[l]}} = dw^{[l]}_{before},令Q = \left(w_{i j}^{[l]}\right)^{2}得\\ & 得\frac{dQ}{dw} = 2 \left(w_{i j}^{[l]}\right)\\ &最终得 \frac{\partial J_{after}}{\partial w^{[l]}} = dw^{[l]}_{before}+\frac{\lambda}{2m} \cdot2\left(w_{i j}^{[l]}\right)=dw^{[l]}_{before}+{\color{Red} \frac{\lambda \cdot \left(w^{[l]}\right)}{m}} \end{align} 由J(w[1],b[1],⋯,w[L],b[L])after=m1i=1∑mL(y^(i),y(i))+2mλl=1∑L w[l] 2=m1i=1∑mL(y^(i),y(i))+(w11[l])2+⋯+(wij[l])2得且∂w[l]∂Jbefore=dwbefore[l],令Q=(wij[l])2得得dwdQ=2(wij[l])最终得∂w[l]∂Jafter=dwbefore[l]+2mλ⋅2(wij[l])=dwbefore[l]+mλ⋅(w[l])
更新参数公式发生变化
w [ l ] : = w [ l ] − α ⋅ d w [ l ] = w [ l ] − α ⋅ ( d w before [ l ] + λ m w [ l ] ) = ( 1 − α λ m ) w [ l ] − α ⋅ d w before [ l ] 其中, ( 1 − α λ m ) < 1 。 \begin{aligned} w^{[l]} & :=w^{[l]}-\alpha \cdot d w^{[l]} \\ & =w^{[l]}-\alpha \cdot\left(d w_{\text {before }}^{[l]}+\frac{\lambda}{m} w^{[l]}\right) \\ & =\left(1-\alpha \frac{\lambda}{m}\right) w^{[l]}-\alpha \cdot d w_{\text {before }}^{[l]} \\ &其中, \left(1-\alpha \frac{\lambda}{m}\right)<1 。 \end{aligned} w[l]:=w[l]−α⋅dw[l]=w[l]−α⋅(dwbefore [l]+mλw[l])=(1−αmλ)w[l]−α⋅dwbefore [l]其中,(1−αmλ)<1。
- 总结:
- 看到w更新参数公式需要多减去正则项,所以w衰减的更快,那么也就是迭代过程中,w不断进行衰减
- 原理说明
- 由前向传播公式得
Z [ 1 ] = W [ 1 ] X + b [ 1 ] A [ 1 ] = g [ 1 ] ( Z [ 1 ] ) Z [ 2 ] = W [ 2 ] A [ 1 ] + b [ 2 ] A [ 2 ] = g [ 2 ] ( Z [ 2 ] ) ⋮ A [ L ] = g [ L ] ( Z [ L ] ) = Y ^ \begin{aligned} Z^{[1]} & =W^{[1]} X+b^{[1]} \\ A^{[1]} & =g^{[1]}\left(Z^{[1]}\right) \\ Z^{[2]} & =W^{[2]} A^{[1]}+b^{[2]} \\ A^{[2]} & =g^{[2]}\left(Z^{[2]}\right) \\ \vdots & \\ A^{[L]} & =g^{[L]}\left(Z^{[L]}\right)=\hat{Y} \end{aligned} Z[1]A[1]Z[2]A[2]⋮A[L]=W[1]X+b[1]=g[1](Z[1])=W[2]A[1]+b[2]=g[2](Z[2])=g[L](Z[L])=Y^
由前面看到
由 w [ l ] = ( 1 − α λ m ) w [ l ] − α ⋅ d w before [ l ] ↓ 则 Z [ 2 ] = W [ 2 ] A [ 1 ] + b [ 2 ] ↓ 也就是说 Z 不断衰减 ⇒ 网络中某些神经元不起作用 ⇒ 变相提高方差,也就是减少过拟合 \begin{align} &由w^{[l]} = \left(1-\alpha \frac{\lambda}{m}\right) w^{[l]}-\alpha \cdot d w_{\text {before }}^{[l]}\downarrow \\ &则Z^{[2]} =W^{[2]} A^{[1]}+b^{[2]}\downarrow \\ &也就是说{\color{Red} Z不断衰减} \Rightarrow {\color{Red} 网络中某些神经元不起作用} \Rightarrow {\color{Red} 变相提高方差,也就是减少过拟合} \end{align} 由w[l]=(1−αmλ)w[l]−α⋅dwbefore [l]↓则Z[2]=W[2]A[1]+b[2]↓也就是说Z不断衰减⇒网络中某些神经元不起作用⇒变相提高方差,也就是减少过拟合
- 第二种解释:
2.3 Dropout调整
- 原理说明
Dropout是指在深度学习网络的训练过程中,对于每层的神经元,按照
一定的概率将其暂时从网络中丢弃
。也就是说,每次训练时,每一层都有部分神经元不工作,起到简化复杂网络模型的效果,从而避免发生过拟合
。
不同隐藏层的dropout系数keep_prob可以不同
。一般来说,神经元越多的隐藏层,keep_out可以设置得小一些
,例如0.5;神经元越少的隐藏层,keep_out可以设置的大一些
,例如0.8,设置是1。实际应用中,不建议对输入层进行dropout,如果输入层维度很大,例如图片,那么可以设置dropout,但keep_out应设置的大一些,例如0.8,0.9
- 值得注意的是,使用dropout训练结束后,在测试和实际应用模型时,不需要进行dropout和随机删减神经元,所有的神经元都在工作
2.4 数据增强
其中一种方法是增加训练样本数量。但是通常成本较高,难以获得额外的训练样本。但是,我们可以对已有的训练样本进行一些处理来“制造”出更多的样本,称为data augmentation。
图片识别问题中,可以对已有的图片进行水平翻转、垂直翻转、任意角度旋转、缩放或扩大
2.5 L2正则化代码
- L2正则化代码
'''
Author: Jean_Leung
Date: 2024-04-16 10:11:58
LastEditors: Jean_Leung
LastEditTime: 2024-04-18 14:21:34
FilePath: \Chap2\chap6exercise.py
Description: 深度神经网络的正则化
具体实现:
1.不使用正则化
2.使用正则化
2.1 使用L2正则化
2.2 使用随机节点删除
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import *
import sklearn
import sklearn.datasets
from testCase import *
# # 设置 plot 的默认大小
# plt.rcParams['figure.figsize'] = (7.0, 4.0)
# plt.rcParams['image.interpolation'] = 'nearest'
# plt.rcParams['image.cmap'] = 'gray'
plt.rcParams['figure.figsize'] = (20.0, 30.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
train_X, train_Y, test_X, test_Y = load_2D_dataset(is_plot=False)
# plt.show()
# 使用正则化正向传播
def compute_cost_with_regularization(A3, Y, parameters, lambd):
"""
实现公式2的L2正则化计算成本
参数:
A3 - 正向传播的输出结果,维度为(输出节点数量,训练/测试的数量)
Y - 标签向量,与数据一一对应,维度为(输出节点数量,训练/测试的数量)
parameters - 包含模型学习后的参数的字典
lambda - 进行正则化因子
返回:
cost - 使用公式2计算出来的正则化损失的值
"""
m = Y.shape[1]
W1 = parameters["W1"]
W2 = parameters["W2"]
W3 = parameters["W3"]
# 先计算原先的compute_cost值
cross_entropy_cost = compute_cost(A3,Y)
# 这就是正则化公式,额外多出来的部分正则式
L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)
# 初始+正则化公式 = 新的costJ函数
cost = cross_entropy_cost + L2_regularization_cost
return cost
# 测试compute_cost_with_regularization函数
# print("============测试正则化后的cost函数==========")
# A3 , Y_assess,parameters = compute_cost_with_regularization_test_case()
# print("cost = " + str(compute_cost_with_regularization(A3,Y_assess,parameters,lambd=0.1)))
# 当然,因为改变了成本函数,我们也必须改变向后传播的函数,
# 所有的梯度都必须根据这个新的成本值来计算。
'''
description: 有正则表达式的反向传输
param {*} X
param {*} Y
param {*} cache
param {*} lambd
return {*}
'''
def backward_propagation_with_regularization(X,Y,cache,lambd):
"""
实现我们添加了L2正则化的模型的后向传播。
参数:
X - 输入数据集,维度为(输入节点数量,数据集里面的数量)
Y - 标签,维度为(输出节点数量,数据集里面的数量)
cache - 来自forward_propagation()的cache输出
lambda - regularization超参数,实数
返回:
gradients - 一个包含了每个参数、激活值和预激活值变量的梯度的字典
"""
m = X.shape[1]
# 首先,从字典 之前正向计算搜索出对应各类参数
(Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
# 正向传播的梯度
dZ3 = A3 - Y
# 加上正则表达式,正则部分的导数为 lambd * W3 / m
dW3 = 1/m * np.dot(dZ3,A2.T) + ((lambd * W3) / m)
db3 = 1/m * np.sum(dZ3,axis=1,keepdims=True)
dA2 = np.dot(W3.T,dZ3)
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1/m * np.dot(dZ2,A1.T) + ((lambd * W2) / m)
db2 = 1/m * np.sum(dZ2,axis=1,keepdims=True)
dA1 = np.dot(W2.T,dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))
dW1 = 1/m * np.dot(dZ1,X.T) + ((lambd * W1) / m)
db1 = 1/m * np.sum(dZ1,axis=1,keepdims=True)
gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
"dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
"dZ1": dZ1, "dW1": dW1, "db1": db1}
return gradients
# X_assess,Y_assess,cache = backward_propagation_with_regularization_test_case()
# # 测试backward_propagation_with_regularization函数
# grads = backward_propagation_with_regularization(X_assess,Y_assess,cache,lambd=0.7)
# print("dW1 = " + str(grads["dW1"]))
# print("dW2 = " + str(grads["dW2"]))
# print("dW3 = " + str(grads["dW3"]))
'''
description: 总模型
param {*} X
param {*} Y
param {*} learning_rate
param {*} num_iterations
param {*} print_cost
param {*} is_plot
param {*} lambd
param {*} keep_prob
param {*} k 用来表示显示图片的图层
return {*}
'''
def model(X,Y,learning_rate=0.3,num_iterations=30000,print_cost=True,is_plot=True,lambd=0,keep_prob=1,k=1):
"""
实现一个三层的神经网络:LINEAR ->RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
参数:
X - 输入的数据,维度为(2, 要训练/测试的数量)
Y - 标签,【0(蓝色) | 1(红色)】,维度为(1,对应的是输入的数据的标签)
learning_rate - 学习速率
num_iterations - 迭代的次数
print_cost - 是否打印成本值,每迭代10000次打印一次,但是每1000次记录一个成本值
is_polt - 是否绘制梯度下降的曲线图
lambd - 正则化的超参数,实数
keep_prob - 随机删除节点的概率
返回
parameters - 学习后的参数
"""
grads = {}
costs = []
m = X.shape[1]
print("X 输入维度" + str(X.shape[0]))
# print("m = ", m)
layers_dims = [X.shape[0],20,3,1]
# 初始化参数
parameters = initialize_parameters(layers_dims)
# 开始学习
for i in range(0,num_iterations):
# 前向传播
## 是否随机删除节点
if keep_prob == 1:
### 不随机删除节点
a3 , cache = forward_propagation(X,parameters)
elif keep_prob < 1:
### 随机删除节点
a3 , cache = forward_propagation_with_dropout(X,parameters,keep_prob)
else:
print("keep_prob参数错误!程序退出。")
exit
# 计算成本
## 是否使用二范数
if lambd == 0:
### 不使用 L2 正则化
cost = compute_cost(a3,Y)
else:
### 使用 L2 正则化
cost = compute_cost_with_regularization(a3,Y,parameters,lambd)
## 可以同时使用 L2 正则化和随机删除节点,但是本次实验不同时使用。
assert(lambd == 0 or keep_prob == 1)
# 反向传播
## 两个参数的使用情况
if (lambd == 0 and keep_prob == 1):
### 不使用 L2 正则化和不使用随机删除节点
grads = backward_propagation(X,Y,cache)
elif lambd != 0:
### 使用 L2 正则化,不使用随机删除节点
grads = backward_propagation_with_regularization(X, Y, cache, lambd)
elif keep_prob < 1:
### 使用随机删除节点,不使用 L2 正则化
grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
#更新参数
parameters = update_parameters(parameters, grads, learning_rate)
# 记录并打印成本
if i % 1000 == 0:
## 记录成本
costs.append(cost)
if (print_cost and i % 10000 == 0):
# 打印成本
print("第" + str(i) + "次迭代,成本值为:" + str(cost))
# 是否绘制成本曲线图
if is_plot:
plt.subplot(3,2,k)
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (x1,000)')
# plt.title("Learning rate =" + str(learning_rate) +"\ndropout with keep_prob = " + str(keep_prob))
plt.title("Learning rate =" + str(learning_rate) +"\nregression with lambd = " + str(lambd))
# plt.show()
# 返回学习后的参数
return parameters
# # 测试使用L2正则化,但是不使用删除节点
# parameters = model(train_X,train_Y,lambd=0.7,is_plot=True)
# print("使用正则化,训练集:")
# predictions_train = predict(train_X, train_Y, parameters)
# print("使用正则化,测试集:")
# predictions_test = predict(test_X, test_Y, parameters)
# plt.title("Model with L2-regularization")
# axes = plt.gca()
# axes.set_xlim([-0.75,0.40])
# axes.set_ylim([-0.75,0.65])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
# plt.show()
'''
description: 反向drop_out技术实现
param {*} X
param {*} parameters
param {*} keep_prob
return {*}
'''
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
"""
Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
Arguments:
X -- input dataset, of shape (2, number of examples)
parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
W1 -- weight matrix of shape (20, 2)
b1 -- bias vector of shape (20, 1)
W2 -- weight matrix of shape (3, 20)
b2 -- bias vector of shape (3, 1)
W3 -- weight matrix of shape (1, 3)
b3 -- bias vector of shape (1, 1)
keep_prob - probability of keeping a neuron active during drop-out, scalar
Returns:
A3 -- last activation value, output of the forward propagation, of shape (1,1)
cache -- tuple, information stored for computing the backward propagation
"""
np.random.seed(1)
# Retrieve each parameter from the dictionary "parameters"
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
W3 = parameters["W3"]
b3 = parameters["b3"]
# LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
Z1 = np.dot(W1,X) + b1
A1 = relu(Z1)
# START CODE HERE
D1 = np.random.rand(A1.shape[0],A1.shape[1])
D1 = D1 < keep_prob
A1 = A1 * D1
A1 = A1 / keep_prob
# END CODE HERE
Z2 = np.dot(W2,A1) + b2
A2 = relu(Z2)
D2 = np.random.rand(A2.shape[0],A2.shape[1])
D2 = D2 < keep_prob
A2 = A2 * D2
A2 = A2 / keep_prob
Z3 = np.dot(W3,A2) + b3
A3 = sigmoid(Z3)
# 也需要返回D矩阵,因为需要根据D矩阵进行掩码操作
cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
return A3, cache
# X_assess, parameters = forward_propagation_with_dropout_test_case()
# # 测试forward_propagation_with_dropout函数
# A3, cache = forward_propagation_with_dropout(X_assess,parameters,keep_prob = 0.5)
# print("A3 = " + str(A3))
# 带有dropout的反向传输
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
"""
Implements the backward propagation of our baseline model to which we added dropout.
Arguments:
X -- input dataset, of shape (2, number of examples)
Y -- "true" labels vector, of shape (output size, number of examples)
cache -- cache output from forward_propagation_with_dropout()
keep_prob - probability of keeping a neuron active during drop-out, scalar
Returns:
gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
"""
m = X.shape[1]
(Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
# Retrieve each parameter from the dictionary "parameters"
dZ3 = A3 - Y
dW3 = 1./ m * np.dot(dZ3, A2.T)
db3 = 1./ m * np.sum(dZ3, axis = 1, keepdims = True)
dA2 = np.dot(W3.T,dZ3)
# 关闭操作
dA2 = dA2 * D2
dA2 = dA2 / keep_prob
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1./m * np.dot(dZ2, A1.T)
db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
dA1 = np.dot(W2.T,dZ2)
dA1 = dA1 * D1
dA1 = dA1 / keep_prob
dZ1 = np.multiply(dA1,np.int64(A1 > 0))
dW1 = 1./m * np.dot(dZ1, X.T)
db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
"dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
"dZ1": dZ1, "dW1": dW1, "db1": db1}
return gradients
# X_assess, Y_assess, cache = backward_propagation_with_dropout_test_case()
# gradients = backward_propagation_with_dropout(X_assess, Y_assess, cache, keep_prob = 0.8)
# print ("dA1 = " + str(gradients["dA1"]))
# print ("dA2 = " + str(gradients["dA2"]))
# 测试kee_prob因子,以一定概率的舍弃权重
keep_prob_list = [1.0,0.96,0.86,0.8,0.75,0.7]
lambd_list = [20.0,15.0,10.0,5.0,1.0,0]
# 没使用L2正则化,只测试keep_prob因子
for k in range(0,len(lambd_list)):
parameters = model(train_X,train_Y,learning_rate=0.3,num_iterations=30000,print_cost=True,is_plot=True,lambd=lambd_list[k],keep_prob=1.0,k= k +1)
print("使用正则化,训练集:")
predictions_train = predict(train_X, train_Y, parameters)
print("使用正则化,测试集:")
predictions_test = predict(test_X, test_Y, parameters)
# plt.subplot(3,2,k+1)
# # plt.title("Model with Dropout" + "\nkeep_prob is" + str(keep_prob_list[k]))
# plt.title("Model with regression" + "\nlambd" + str(lambd_list[k]))
# axes = plt.gca()
# axes.set_xlim([-0.75,0.40])
# axes.set_ylim([-0.75,0.65])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
plt.show()
结论
当lambd=0的时候,明显从结果图可以看出,过拟合严重,lambd增大的时候,就在提高w的衰减,过拟合趋势减少,但是一旦lamda过大,就会使得方差过高,导师模型错误,所以lambd不宜过高
2.6 Dropout代码
- Dropout代码
'''
Author: Jean_Leung
Date: 2024-04-16 10:11:58
LastEditors: Jean_Leung
LastEditTime: 2024-04-18 11:45:03
FilePath: \Chap2\chap6exercise.py
Description: 深度神经网络的正则化
具体实现:
1.不使用正则化
2.使用正则化
2.1 使用L2正则化
2.2 使用随机节点删除
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import *
import sklearn
import sklearn.datasets
from testCase import *
# # 设置 plot 的默认大小
# plt.rcParams['figure.figsize'] = (7.0, 4.0)
# plt.rcParams['image.interpolation'] = 'nearest'
# plt.rcParams['image.cmap'] = 'gray'
plt.rcParams['figure.figsize'] = (20.0, 20.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
train_X, train_Y, test_X, test_Y = load_2D_dataset(is_plot=False)
# plt.show()
# 使用正则化正向传播
def compute_cost_with_regularization(A3, Y, parameters, lambd):
"""
实现公式2的L2正则化计算成本
参数:
A3 - 正向传播的输出结果,维度为(输出节点数量,训练/测试的数量)
Y - 标签向量,与数据一一对应,维度为(输出节点数量,训练/测试的数量)
parameters - 包含模型学习后的参数的字典
lambda - 进行正则化因子
返回:
cost - 使用公式2计算出来的正则化损失的值
"""
m = Y.shape[1]
W1 = parameters["W1"]
W2 = parameters["W2"]
W3 = parameters["W3"]
# 先计算原先的compute_cost值
cross_entropy_cost = compute_cost(A3,Y)
# 这就是正则化公式,额外多出来的部分正则式
L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)
# 初始+正则化公式 = 新的costJ函数
cost = cross_entropy_cost + L2_regularization_cost
return cost
# 测试compute_cost_with_regularization函数
# print("============测试正则化后的cost函数==========")
# A3 , Y_assess,parameters = compute_cost_with_regularization_test_case()
# print("cost = " + str(compute_cost_with_regularization(A3,Y_assess,parameters,lambd=0.1)))
# 当然,因为改变了成本函数,我们也必须改变向后传播的函数,
# 所有的梯度都必须根据这个新的成本值来计算。
'''
description: 有正则表达式的反向传输
param {*} X
param {*} Y
param {*} cache
param {*} lambd
return {*}
'''
def backward_propagation_with_regularization(X,Y,cache,lambd):
"""
实现我们添加了L2正则化的模型的后向传播。
参数:
X - 输入数据集,维度为(输入节点数量,数据集里面的数量)
Y - 标签,维度为(输出节点数量,数据集里面的数量)
cache - 来自forward_propagation()的cache输出
lambda - regularization超参数,实数
返回:
gradients - 一个包含了每个参数、激活值和预激活值变量的梯度的字典
"""
m = X.shape[1]
# 首先,从字典 之前正向计算搜索出对应各类参数
(Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
# 正向传播的梯度
dZ3 = A3 - Y
# 加上正则表达式,正则部分的导数为 lambd * W3 / m
dW3 = 1/m * np.dot(dZ3,A2.T) + ((lambd * W3) / m)
db3 = 1/m * np.sum(dZ3,axis=1,keepdims=True)
dA2 = np.dot(W3.T,dZ3)
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1/m * np.dot(dZ2,A1.T) + ((lambd * W2) / m)
db2 = 1/m * np.sum(dZ2,axis=1,keepdims=True)
dA1 = np.dot(W2.T,dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))
dW1 = 1/m * np.dot(dZ1,X.T) + ((lambd * W1) / m)
db1 = 1/m * np.sum(dZ1,axis=1,keepdims=True)
gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
"dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
"dZ1": dZ1, "dW1": dW1, "db1": db1}
return gradients
# X_assess,Y_assess,cache = backward_propagation_with_regularization_test_case()
# # 测试backward_propagation_with_regularization函数
# grads = backward_propagation_with_regularization(X_assess,Y_assess,cache,lambd=0.7)
# print("dW1 = " + str(grads["dW1"]))
# print("dW2 = " + str(grads["dW2"]))
# print("dW3 = " + str(grads["dW3"]))
'''
description: 总模型
param {*} X
param {*} Y
param {*} learning_rate
param {*} num_iterations
param {*} print_cost
param {*} is_plot
param {*} lambd
param {*} keep_prob
param {*} k 用来表示显示图片的图层
return {*}
'''
def model(X,Y,learning_rate=0.3,num_iterations=30000,print_cost=True,is_plot=True,lambd=0,keep_prob=1,k=1):
"""
实现一个三层的神经网络:LINEAR ->RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
参数:
X - 输入的数据,维度为(2, 要训练/测试的数量)
Y - 标签,【0(蓝色) | 1(红色)】,维度为(1,对应的是输入的数据的标签)
learning_rate - 学习速率
num_iterations - 迭代的次数
print_cost - 是否打印成本值,每迭代10000次打印一次,但是每1000次记录一个成本值
is_polt - 是否绘制梯度下降的曲线图
lambd - 正则化的超参数,实数
keep_prob - 随机删除节点的概率
返回
parameters - 学习后的参数
"""
grads = {}
costs = []
m = X.shape[1]
layers_dims = [X.shape[0],20,3,1]
# 初始化参数
parameters = initialize_parameters(layers_dims)
# 开始学习
for i in range(0,num_iterations):
# 前向传播
## 是否随机删除节点
if keep_prob == 1:
### 不随机删除节点
a3 , cache = forward_propagation(X,parameters)
elif keep_prob < 1:
### 随机删除节点
a3 , cache = forward_propagation_with_dropout(X,parameters,keep_prob)
else:
print("keep_prob参数错误!程序退出。")
exit
# 计算成本
## 是否使用二范数
if lambd == 0:
### 不使用 L2 正则化
cost = compute_cost(a3,Y)
else:
### 使用 L2 正则化
cost = compute_cost_with_regularization(a3,Y,parameters,lambd)
# 反向传播
## 可以同时使用 L2 正则化和随机删除节点,但是本次实验不同时使用。
assert(lambd == 0 or keep_prob ==1)
## 两个参数的使用情况
if (lambd == 0 and keep_prob == 1):
### 不使用 L2 正则化和不使用随机删除节点
grads = backward_propagation(X,Y,cache)
elif lambd != 0:
### 使用 L2 正则化,不使用随机删除节点
grads = backward_propagation_with_regularization(X, Y, cache, lambd)
elif keep_prob < 1:
### 使用随机删除节点,不使用 L2 正则化
grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
#更新参数
parameters = update_parameters(parameters, grads, learning_rate)
# 记录并打印成本
if i % 1000 == 0:
## 记录成本
costs.append(cost)
if (print_cost and i % 10000 == 0):
# 打印成本
print("第" + str(i) + "次迭代,成本值为:" + str(cost))
# 是否绘制成本曲线图
if is_plot:
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (x1,000)')
plt.title("Learning rate =" + str(learning_rate))
plt.show()
# 返回学习后的参数
return parameters
# # 测试使用L2正则化,但是不使用删除节点
# parameters = model(train_X,train_Y,lambd=0.7,is_plot=True)
# print("使用正则化,训练集:")
# predictions_train = predict(train_X, train_Y, parameters)
# print("使用正则化,测试集:")
# predictions_test = predict(test_X, test_Y, parameters)
# plt.title("Model with L2-regularization")
# axes = plt.gca()
# axes.set_xlim([-0.75,0.40])
# axes.set_ylim([-0.75,0.65])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
# plt.show()
'''
description: 反向drop_out技术实现
param {*} X
param {*} parameters
param {*} keep_prob
return {*}
'''
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
"""
Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
Arguments:
X -- input dataset, of shape (2, number of examples)
parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
W1 -- weight matrix of shape (20, 2)
b1 -- bias vector of shape (20, 1)
W2 -- weight matrix of shape (3, 20)
b2 -- bias vector of shape (3, 1)
W3 -- weight matrix of shape (1, 3)
b3 -- bias vector of shape (1, 1)
keep_prob - probability of keeping a neuron active during drop-out, scalar
Returns:
A3 -- last activation value, output of the forward propagation, of shape (1,1)
cache -- tuple, information stored for computing the backward propagation
"""
np.random.seed(1)
# Retrieve each parameter from the dictionary "parameters"
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
W3 = parameters["W3"]
b3 = parameters["b3"]
# LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
Z1 = np.dot(W1,X) + b1
A1 = relu(Z1)
# START CODE HERE
D1 = np.random.rand(A1.shape[0],A1.shape[1])
D1 = D1 < keep_prob
A1 = A1 * D1
A1 = A1 / keep_prob
# END CODE HERE
Z2 = np.dot(W2,A1) + b2
A2 = relu(Z2)
D2 = np.random.rand(A2.shape[0],A2.shape[1])
D2 = D2 < keep_prob
A2 = A2 * D2
A2 = A2 / keep_prob
Z3 = np.dot(W3,A2) + b3
A3 = sigmoid(Z3)
# 也需要返回D矩阵,因为需要根据D矩阵进行掩码操作
cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
return A3, cache
# X_assess, parameters = forward_propagation_with_dropout_test_case()
# # 测试forward_propagation_with_dropout函数
# A3, cache = forward_propagation_with_dropout(X_assess,parameters,keep_prob = 0.5)
# print("A3 = " + str(A3))
# 带有dropout的反向传输
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
"""
Implements the backward propagation of our baseline model to which we added dropout.
Arguments:
X -- input dataset, of shape (2, number of examples)
Y -- "true" labels vector, of shape (output size, number of examples)
cache -- cache output from forward_propagation_with_dropout()
keep_prob - probability of keeping a neuron active during drop-out, scalar
Returns:
gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
"""
m = X.shape[1]
(Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
# Retrieve each parameter from the dictionary "parameters"
dZ3 = A3 - Y
dW3 = 1./ m * np.dot(dZ3, A2.T)
db3 = 1./ m * np.sum(dZ3, axis = 1, keepdims = True)
dA2 = np.dot(W3.T,dZ3)
# 关闭操作
dA2 = dA2 * D2
dA2 = dA2 / keep_prob
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1./m * np.dot(dZ2, A1.T)
db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
dA1 = np.dot(W2.T,dZ2)
dA1 = dA1 * D1
dA1 = dA1 / keep_prob
dZ1 = np.multiply(dA1,np.int64(A1 > 0))
dW1 = 1./m * np.dot(dZ1, X.T)
db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
"dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
"dZ1": dZ1, "dW1": dW1, "db1": db1}
return gradients
# X_assess, Y_assess, cache = backward_propagation_with_dropout_test_case()
# gradients = backward_propagation_with_dropout(X_assess, Y_assess, cache, keep_prob = 0.8)
# print ("dA1 = " + str(gradients["dA1"]))
# print ("dA2 = " + str(gradients["dA2"]))
# 测试kee_prob因子,以一定概率的舍弃权重
keep_prob_list = [0.86,0.8,0.75,0.7,0.6]
# 没使用L2正则化,只测试keep_prob因子
for k in range(0,len(keep_prob_list)):
parameters = model(train_X,train_Y,learning_rate=0.3,num_iterations=30000,print_cost=True,is_plot=False,lambd=0,keep_prob=keep_prob_list[k],k=k)
print("使用正则化,训练集:")
predictions_train = predict(train_X, train_Y, parameters)
print("使用正则化,测试集:")
predictions_test = predict(test_X, test_Y, parameters)
plt.subplot(3,2,k+1)
plt.title("Model with Dropout")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
plt.show()
- 测试效果
结论:
- 此次测试网络为(2,20,3,1为三层网络),也就是针对与20的第一层,可以keep_prob调小一点,以舍弃更多的神经元,那么这就可以达到避免 过拟合的效果
- 当 keep_prob= 1的时候,模型的过拟合很明显从上图测试结果图一可以看出,当keep_prob变小的时候,过拟合明显变小,方差变小
3. 深度神经网络的梯度校验
3.1 背景
Back Propagation神经网络有一项重要的测试是梯度检查(gradient checking)。
其目的是检查验证反向传播过程中梯度下降算法是否正确。
3.2 梯度校验原理
- 在高数中,我们有导数的定义
导数定义 : tan α = lim Δ x → 0 tan φ = lim Δ x → 0 f ( x 0 + Δ x ) − f ( x 0 ) Δ x \begin{align} & 导数定义:\tan \alpha=\lim _{\Delta x \rightarrow 0} \tan \varphi=\lim _{\Delta x \rightarrow 0} \frac{f\left(x_{0}+\Delta x\right)-f\left(x_{0}\right)}{\Delta x}\\ \end{align} 导数定义:tanα=Δx→0limtanφ=Δx→0limΔxf(x0+Δx)−f(x0)
- 由前面笔记可得,二元函数的梯度就是在该点的斜率,那么在某点的梯度就表示为
那么我们可以得到
g ′ ( θ ) = lim ε → 0 f ( θ + ε ) − f ( θ − ε ) 2 ε g^{'}(\theta)=\lim_{\varepsilon \to 0} \frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2 \varepsilon} g′(θ)=ε→0lim2εf(θ+ε)−f(θ−ε)
- 一元函数 y = θx 求解过程:
1. θ + = θ + ε 2. θ − = θ − ε 3. J + = J ( θ + ) 4. J − = J ( θ − ) 5. g r a d a p p r o x = J + − J − 2 ε 6. 反向传播计算梯度 , 将结果存储在 g r a d 中 7. 计算 difference = ∥ grad-gradapprox ∥ 2 ∥ grad ∥ 2 + ∥ gradapprox ∥ 2 \begin{align} &1. \theta^{+}= \theta+\varepsilon \\ &2. \theta^{-} = \theta-\varepsilon \\ &3. J^{+} = J\left(\theta^{+}\right) \\ &4. J^{-} = J\left(\theta^{-}\right)\\ &5. gradapprox = \frac{J^{+}-J^{-}}{2 \varepsilon } \\ &6. 反向传播计算梯度,将结果存储在grad中 \\ &7. 计算\text { difference }=\frac{\| \text { grad-gradapprox } \|_{2}}{\| \text { grad }\left\|_{2}+\right\| \text { gradapprox } \|_{2}} \end{align} 1.θ+=θ+ε2.θ−=θ−ε3.J+=J(θ+)4.J−=J(θ−)5.gradapprox=2εJ+−J−6.反向传播计算梯度,将结果存储在grad中7.计算 difference =∥ grad ∥2+∥ gradapprox ∥2∥ grad-gradapprox ∥2
多维度函数 J求解过程
J ( w [ 1 ] , b [ 1 ] , ⋯ , w [ L ] , b [ L ] ) J\left(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}\right) J(w[1],b[1],⋯,w[L],b[L])
根据测试案例
设置
W [ 1 ] = [ W 11 W 12 W 13 W 14 W 21 W 22 W 23 W 24 W 31 W 32 W 33 W 34 W 41 W 42 W 43 W 44 W 51 W 52 W 53 W 54 ] W^{[1]} = \begin{bmatrix} W^{11}& W^{12} & W^{13} & W^{14} \\ W^{21}& W^{22} & W^{23} & W^{24} \\ W^{31}& W^{32} & W^{33} & W^{34} \\ W^{41}& W^{42} & W^{43} & W^{44} \\ W^{51}& W^{52} & W^{53} & W^{54} \end{bmatrix} W[1]= W11W21W31W41W51W12W22W32W42W52W13W23W33W43W53W14W24W34W44W54 W [ 2 ] = [ W 11 W 12 W 13 W 14 W 15 W 21 W 22 W 23 W 24 W 25 W 31 W 32 W 33 W 34 W 35 ] W^{[2]}=\begin{bmatrix} W^{11}& W^{12} & W^{13} & W^{14} & W^{15}\\ W^{21}& W^{22} & W^{23} & W^{24} & W^{25}\\ W^{31}& W^{32} & W^{33} & W^{34} & W^{35}\\ \end{bmatrix} W[2]= W11W21W31W12W22W32W13W23W33W14W24W34W15W25W35
W [ 1 ] = [ W 11 W 12 W 13 ] W^{[1]}=\begin{bmatrix} W^{11}& W^{12} & W^{13} \\ \end{bmatrix} W[1]=[W11W12W13]
b [ 1 ] = [ b 11 b 21 b 31 b 41 b 51 ] b^{[1]}=\begin{bmatrix} b^{11} \\ b^{21}\\ b^{31}\\ b^{41}\\ b^{51} \end{bmatrix} b[1]= b11b21b31b41b51
b [ 2 ] = [ b 11 b 21 b 31 ] b^{[2]}=\begin{bmatrix} b^{11} \\ b^{21}\\ b^{31}\\ \end{bmatrix} b[2]= b11b21b31
b [ 3 ] = b 11 b^{[3]}=b^{11} b[3]=b11
将parameters字典数组转化为列向量,方便运算
3.3 梯度校验代码
- 主代码
'''
Author: Jean_Leung
Date: 2024-04-16 21:05:57
LastEditors: Jean_Leung
LastEditTime: 2024-04-18 17:10:31
FilePath: \Chap2\chap7exercise.py
Description: 深度神经网络的梯度检验
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
import numpy as np
from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector
from testCase import *
import numpy as np
from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector
from testCase import *
# 正向传播
'''
description: 正向传播
param {*} X
param {*} Y
param {*} parameters
return {*}
'''
def forward_propagation_n(X,Y,parameters):
"""
正向传播
Arguments:
X -- 输入数据
Y -- 标签数据
parameters -- 神经网络的参数
Returns:
cost -- 成本函数
cache -- 用于后向传播的缓存
"""
m = X.shape[1]
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
W3 = parameters["W3"]
b3 = parameters["b3"]
# LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
Z1 = np.dot(W1,X) + b1
A1 = relu(Z1)
Z2 = np.dot(W2,A1) + b2
A2 = relu(Z2)
Z3 = np.dot(W3,A2) + b3
A3 = sigmoid(Z3)
# 计算成本
logprobs = np.multiply(-np.log(A3),Y) + np.multiply(-np.log(1-A3),1-Y)
cost = (1/m) * np.sum(logprobs)
cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
return cost, cache
# 反向传播
'''
description: 反向传播
param {*} X
param {*} Y
param {*} cache
return {*}
'''
def backward_propagation_n(X, Y, cache):
"""
反向传播
Arguments:
X -- 输入数据
Y -- 标签数据
cache -- 用于后向传播的缓存
Returns:
grads -- 梯度
"""
m = X.shape[1]
(Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
dZ3 = A3 - Y
dW3 = 1/m * np.dot(dZ3,A2.T)
db3 = 1/m * np.sum(dZ3,axis=1,keepdims=True)
dA2 = np.dot(W3.T, dZ3)
dZ2 = np.multiply(dA2, np.int64(A2 > 0))
dW2 = 1./m * np.dot(dZ2, A1.T)
db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
dA1 = np.dot(W2.T, dZ2)
dZ1 = np.multiply(dA1, np.int64(A1 > 0))
dW1 = 1./m * np.dot(dZ1, X.T)
db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
"dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
"dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
return gradients
'''
description: 梯度检查代码
param {*} parameters
param {*} gradients
param {*} X
param {*} Y
param {*} epsilon
return {*}
'''
def gradient_check_n(parameters,gradients,X,Y,epsilon=1e-7):
"""
检查梯度是否正确
Arguments:
parameters -- 神经网络的参数
gradients -- 梯度字典
X -- 输入数据
Y -- 标签数据
epsilon -- 数值比较精度
Returns:
"""
m = X.shape[1]
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]
W3 = parameters["W3"]
b3 = parameters["b3"]
# 先准备paramters矩阵,然后将parameters矩阵进行向量化
# parameters_values 为parameters向量
parameters_values, key = dictionary_to_vector(parameters)
# gradients_values 为梯度向量
gradients_values = gradients_to_vector(gradients)
num_parameters = parameters_values.shape[0] # 向量维度,有多少个梯度,就需要迭代多少次
# print(num_parameters) # 包括W1,b1,W2,b2,W3,b3的所有参数,共47个
# print(gradients_values.shape[0]) # 一样是47个参数
J_plus = np.zeros((num_parameters,1)) # 将J_plus初始化为(47,1)的列向量,也就是cost损失函数值
J_minus = np.zeros((num_parameters,1))# 将J_minus初始化为(47,1)的列向量,也就是cost损失函数值
gradapprox = np.zeros((num_parameters,1)) # 将gradapprox初始化为(47,1)的列向量
for i in range(0,num_parameters):
# 算左极限函数
thetaplus = np.copy(parameters_values) # 将θ设置为parameters_values
thetaplus[i][0] = thetaplus[i][0] + epsilon # 将θi = θi + epsilon
J_plus[i], cache = forward_propagation_n(X,Y,vector_to_dictionary(thetaplus)) # Step 3 ,cache用不到,正向遍历,算出最后Jplus函数,
# 同理,算右极限函数
thetaminus = np.copy(parameters_values) # 将θ设置为parameters_values
thetaminus[i][0] = thetaminus[i][0] - epsilon # 将θi = θi + epsilon
J_minus[i], cache = forward_propagation_n(X,Y,vector_to_dictionary(thetaminus)) # Step 3 ,cache用不到,正向遍历,算出最后Jplus函数
# 计算 gradapprox[i],这一步就是导数的极限定义
gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
# print(gradapprox[i])
numerator = np.linalg.norm(gradients_values - gradapprox)
denominator = np.linalg.norm(gradients_values) + np.linalg.norm(gradapprox)
difference = numerator / denominator
if difference > 1e-7:
print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
else:
print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
return difference
X, Y, parameters = gradient_check_n_test_case()
cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
difference = gradient_check_n(parameters, gradients, X, Y)
# print(difference)
- gc_utils代码
'''
Author: Jean_Leung
Date: 2024-04-16 21:08:25
LastEditors: Jean_Leung
LastEditTime: 2024-04-18 15:41:20
FilePath: \Chap2\gc_utils.py
Description: 深度神经网络梯度校验工具类
Copyright (c) 2024 by ${robotlive limit}, All Rights Reserved.
'''
import numpy as np
def sigmoid(x):
"""
Compute the sigmoid of x
Arguments:
x -- A scalar or numpy array of any size.
Return:
s -- sigmoid(x)
"""
s = 1/(1+np.exp(-x))
return s
def relu(x):
"""
Compute the relu of x
Arguments:
x -- A scalar or numpy array of any size.
Return:
s -- relu(x)
"""
s = np.maximum(0,x)
return s
'''
description: 字典向量化
param {*} parameters
return {*}
'''
def dictionary_to_vector(parameters):
"""
Roll all our parameters dictionary into a single vector satisfying our specific required shape.
"""
keys = []
count = 0
for key in ["W1", "b1", "W2", "b2", "W3", "b3"]:
# flatten parameter
new_vector = np.reshape(parameters[key], (-1,1))
keys = keys + [key]*new_vector.shape[0]
if count == 0:
theta = new_vector
else:
theta = np.concatenate((theta, new_vector), axis=0)
count = count + 1
return theta, keys
def vector_to_dictionary(theta):
"""
Unroll all our parameters dictionary from a single vector satisfying our specific required shape.
"""
parameters = {}
parameters["W1"] = theta[:20].reshape((5,4))
parameters["b1"] = theta[20:25].reshape((5,1))
parameters["W2"] = theta[25:40].reshape((3,5))
parameters["b2"] = theta[40:43].reshape((3,1))
parameters["W3"] = theta[43:46].reshape((1,3))
parameters["b3"] = theta[46:47].reshape((1,1))
return parameters
'''
description: 将gradietns字典,也就是梯度字典向量化
param {*} gradients
return {*}
'''
def gradients_to_vector(gradients):
"""
Roll all our gradients dictionary into a single vector satisfying our specific required shape.
"""
count = 0
for key in ["dW1", "db1", "dW2", "db2", "dW3", "db3"]:
# flatten parameter
new_vector = np.reshape(gradients[key], (-1,1))
if count == 0:
theta = new_vector
else:
theta = np.concatenate((theta, new_vector), axis=0)
count = count + 1
return theta
- 测试结果
4. 深度神经网络梯度下降优化算法
~待更新
5. 超参数调试、Batch正则化和程序框架
~待更新