I. Fully Connected Neural Networks
In the previous assignment we implemented a two-layer neural network, but that implementation had a problem: it was not modular. For example, the loss() function computed both the loss and the gradients of all parameters. This coupling means that changing the network depth requires extensive modifications. Moreover, the layers of a neural network all share a similar structure, so the naive implementation also duplicates code. In this assignment we build a modular network architecture instead: each functional layer is wrapped in its own object, such as an affine (fully connected) layer or a ReLU layer. Each layer's forward function takes the data produced by the previous layer together with this layer's own parameters, computes this layer's output, and caches the values needed later for gradient computation. Each layer's backward function takes the gradient with respect to its output, passed back from the next layer, together with the cached values, and computes the gradients of this layer's parameters and of its input.
1. Forward pass of the affine layer
The forward pass is straightforward. Compared with the previous assignment, the only addition is caching the intermediate results that will be needed to compute this layer's parameter gradients.
The affine_forward() function in the layers file:
def affine_forward(x, w, b):
    out = None
    # TODO: Implement the affine forward pass.
    batch_size = x.shape[0]
    x_oneline = x.reshape(batch_size, -1)
    out = x_oneline.dot(w) + b
    cache = (x, w, b)
    return out, cache
2. Backward pass of the affine layer
Compared with the previous assignment, the logic is simply pulled out and wrapped in a standalone function.
The affine_backward() function in the layers file:
def affine_backward(dout, cache):
    x, w, b = cache
    x_shape = x.shape
    batch_size = x_shape[0]
    sample_shape = x_shape[1:]
    # TODO: Implement the affine backward pass.
    x_oneline = x.reshape(batch_size, -1)
    dx, dw, db = None, None, None
    dx = dout.dot(w.T).reshape(batch_size, *sample_shape)
    dw = x_oneline.T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db
3. Forward pass of the ReLU layer
The relu_forward() function in the layers file:
def relu_forward(x):
    out = None
    # TODO: Implement the ReLU forward pass.
    out = np.maximum(0, x)
    cache = x
    return out, cache
4. Backward pass of the ReLU layer
In the backward pass of the computational graph, the activation layer acts like a gate: the upstream gradient passes through wherever the input was positive and is blocked elsewhere.
The relu_backward() function in the layers file:
def relu_backward(dout, cache):
    dx, x = None, cache
    # TODO: Implement the ReLU backward pass.
    dx = dout * (x > 0)
    return dx
5. Re-implementing the two-layer network with layer objects
The TwoLayerNet class in the fc_net file (it uses the affine_relu helpers sketched above):
class TwoLayerNet(object):
    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        self.params = {}
        self.reg = reg
        # TODO: Initialize the weights and biases of the two-layer net.
        self.params["W1"] = weight_scale * np.random.randn(input_dim, hidden_dim)
        self.params["b1"] = np.zeros(hidden_dim)
        self.params["W2"] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params["b2"] = np.zeros(num_classes)

    def loss(self, X, y=None):
        scores = None
        # TODO: Implement the forward pass for the two-layer net.
        layer1_relu_out, layer1_relu_cache = affine_relu_forward(X, self.params["W1"], self.params["b1"])
        layer2_out, layer2_cache = affine_forward(layer1_relu_out, self.params["W2"], self.params["b2"])
        scores = layer2_out
        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores
        loss, grads = 0, {}
        # TODO: Implement the backward pass for the two-layer net.
        loss, dloss = softmax_loss(layer2_out, y)
        loss += 0.5 * self.reg * (np.sum(np.square(self.params["W1"])) + np.sum(np.square(self.params["W2"])))
        dlayer2_out, dW2, db2 = affine_backward(dloss, layer2_cache)
        _, dW1, db1 = affine_relu_backward(dlayer2_out, layer1_relu_cache)
        grads["W1"] = dW1 + self.reg * self.params["W1"]
        grads["b1"] = db1
        grads["W2"] = dW2 + self.reg * self.params["W2"]
        grads["b2"] = db2
        return loss, grads
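Before training, it is worth comparing the analytic gradients against numeric ones. The sketch below is not part of the assignment; it assumes only numpy, the TwoLayerNet class above, and the layer functions it depends on (softmax_loss, affine_relu_forward, etc.), and the sizes and seed are arbitrary. It checks every parameter gradient against a central-difference estimate:

import numpy as np

np.random.seed(0)
model = TwoLayerNet(input_dim=5, hidden_dim=4, num_classes=3, weight_scale=1e-2, reg=0.1)
X = np.random.randn(10, 5)
y = np.random.randint(3, size=10)

loss, grads = model.loss(X, y)
h = 1e-5
for name in sorted(model.params):
    w = model.params[name]
    num_grad = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = w[idx]
        w[idx] = old + h
        loss_plus, _ = model.loss(X, y)     # loss with the parameter nudged up
        w[idx] = old - h
        loss_minus, _ = model.loss(X, y)    # loss with the parameter nudged down
        w[idx] = old
        num_grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    rel_err = np.max(np.abs(num_grad - grads[name]) /
                     np.maximum(1e-8, np.abs(num_grad) + np.abs(grads[name])))
    print(name, 'relative error:', rel_err)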
6. Wrapping up the training procedure
The following settings reach roughly 53% accuracy on the test set:
# TODO: Use a Solver instance to train a TwoLayerNet.
model = TwoLayerNet(reg=0.2)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={
                    'learning_rate': 1e-3,
                },
                lr_decay=0.95,
                num_epochs=20, batch_size=500,
                print_every=500)
solver.train()
7. A fully connected network with an arbitrary number of layers
The number of layers is determined by the hidden_dims argument passed in.
The FullyConnectedNet class in the fc_net file (the affine_norm_relu helpers it calls are sketched after the class):
class FullyConnectedNet(object):
    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}
        # TODO: Initialize the parameters of the network.
        param_dims = [input_dim] + hidden_dims + [num_classes]
        for indx in range(1, len(param_dims)):
            self.params["W"+str(indx)] = weight_scale * np.random.randn(param_dims[indx-1], param_dims[indx])
            self.params["b"+str(indx)] = np.zeros(param_dims[indx])
        if self.use_batchnorm:
            for indx in range(1, len(param_dims) - 1):
                self.params["gamma"+str(indx)] = np.ones(param_dims[indx])
                self.params["beta" +str(indx)] = np.zeros(param_dims[indx])
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed
        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)
    def loss(self, X, y=None):
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'
        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        # TODO: Implement the forward pass for the fully-connected net
        layer_relu_out = X
        layer_cache_dict = {}
        if self.use_batchnorm:
            for i in range(1, self.num_layers):
                layer_relu_out, layer_relu_cache = affine_norm_relu_forward(
                    layer_relu_out, self.params["W"+str(i)], self.params["b"+str(i)],
                    self.params["gamma"+str(i)], self.params["beta"+str(i)], self.bn_params[i-1])
                if self.use_dropout:
                    layer_relu_out, dropout_cache = dropout_forward(layer_relu_out, self.dropout_param)
                    layer_cache_dict["dropout"+str(i)] = dropout_cache
                layer_cache_dict[i] = layer_relu_cache
        else:
            for i in range(1, self.num_layers):
                layer_relu_out, layer_relu_cache = affine_relu_forward(layer_relu_out, self.params["W"+str(i)], self.params["b"+str(i)])
                if self.use_dropout:
                    layer_relu_out, dropout_cache = dropout_forward(layer_relu_out, self.dropout_param)
                    layer_cache_dict["dropout"+str(i)] = dropout_cache
                layer_cache_dict[i] = layer_relu_cache
        final_layer_out, final_layer_cache = affine_forward(layer_relu_out, self.params["W"+str(self.num_layers)], self.params["b"+str(self.num_layers)])
        layer_cache_dict[self.num_layers] = final_layer_cache
        scores = final_layer_out
        if mode == 'test':
            return scores
        loss, grads = 0.0, {}
        # TODO: Implement the backward pass for the fully-connected net.
        loss, dloss = softmax_loss(final_layer_out, y)
        for i in range(self.num_layers):
            loss += 0.5 * self.reg * (np.sum(np.square(self.params["W"+str(i+1)])))
        dx, final_dW, final_db = affine_backward(dloss, layer_cache_dict[self.num_layers])
        grads["W"+str(self.num_layers)] = final_dW + self.reg * self.params["W"+str(self.num_layers)]
        grads["b"+str(self.num_layers)] = final_db
        if self.use_batchnorm:
            for i in range(self.num_layers-1, 0, -1):
                if self.use_dropout:
                    dx = dropout_backward(dx, layer_cache_dict["dropout"+str(i)])
                dx, dw, db, dgamma, dbeta = affine_norm_relu_backward(dx, layer_cache_dict[i])
                grads["W"+str(i)] = dw + self.reg * self.params["W"+str(i)]
                grads["b"+str(i)] = db
                grads["gamma"+str(i)] = dgamma
                grads["beta" +str(i)] = dbeta
        else:
            for i in range(self.num_layers-1, 0, -1):
                if self.use_dropout:
                    dx = dropout_backward(dx, layer_cache_dict["dropout"+str(i)])
                dx, dw, db = affine_relu_backward(dx, layer_cache_dict[i])
                grads["W"+str(i)] = dw + self.reg * self.params["W"+str(i)]
                grads["b"+str(i)] = db
        return loss, grads
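The FullyConnectedNet above relies on affine_norm_relu_forward() / affine_norm_relu_backward(), composite helpers that are not part of the provided skeleton. A minimal sketch, assuming the affine and ReLU layer functions above and the batchnorm_forward() / batchnorm_backward_alt() functions implemented in the batch-normalization section below:

def affine_norm_relu_forward(x, w, b, gamma, beta, bn_param):
    a, fc_cache = affine_forward(x, w, b)                       # affine
    a_norm, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)  # batch norm
    out, relu_cache = relu_forward(a_norm)                      # ReLU
    return out, (fc_cache, bn_cache, relu_cache)

def affine_norm_relu_backward(dout, cache):
    fc_cache, bn_cache, relu_cache = cache
    da_norm = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward_alt(da_norm, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db, dgamma, dbeta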
8. Improvements on plain SGD: SGD+Momentum, RMSProp, Adam
SGD+Momentum
$$v_{t+1} = \rho v_t - \nabla f(x_t), \qquad x_{t+1} = x_t + \alpha v_{t+1}$$
Unrolling the recursion (with $v_0 = 0$), the update direction at step $t+1$ is:
$$v_{t+1} = - \nabla f(x_t) - \rho \nabla f(x_{t-1}) - \cdots - \rho^t\nabla f(x_0)$$
The sgd_momentum() function in the optim file:
def sgd_momentum(w, dw, config=None):
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))
    next_w = None
    # TODO: Implement the momentum update formula.
    v = config["momentum"] * v - config["learning_rate"] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config
RMSProp
$$\mathrm{cache}_{t+1} = \rho \cdot \mathrm{cache}_{t} + (1-\rho)\,\big(\nabla f(x_t)\big)^2, \qquad x_{t+1} = x_t - \alpha\,\frac{\nabla f(x_t)}{\sqrt{\mathrm{cache}_{t+1}+\epsilon}}$$
where the square, division, and square root are taken elementwise, matching the code below.
The rmsprop() function in the optim file:
def rmsprop(x, dx, config=None):
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))
    next_x = None
    # TODO: Implement the RMSprop update formula.
    grad_squared = config["decay_rate"] * config["cache"] + (1 - config["decay_rate"]) * dx * dx
    next_x = x - config["learning_rate"] * dx / (np.sqrt(grad_squared + config["epsilon"]))
    config["cache"] = grad_squared
    return next_x, config
Adam
The adam() function in the optim file:
def adam(x, dx, config=None):
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)
    next_x = None
    # TODO: Implement the Adam update formula.
    config["t"] += 1  # advance the step counter so the bias correction decays over time
    first_moment = config["beta1"] * config["m"] + (1 - config["beta1"]) * dx
    second_moment = config["beta2"] * config["v"] + (1 - config["beta2"]) * dx * dx
    first_unbias = first_moment / (1 - np.power(config["beta1"], config["t"]))
    second_unbias = second_moment / (1 - np.power(config["beta2"], config["t"]))
    next_x = x - config["learning_rate"] * first_unbias / (np.sqrt(second_unbias + config["epsilon"]))
    config["m"] = first_moment
    config["v"] = second_moment
    return next_x, config
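As a quick illustration (a sketch, not part of the assignment), the three update rules can be compared on the one-dimensional quadratic f(x) = 0.5 * x^2, whose gradient is simply x. All names come from the functions defined above; only numpy is assumed:

import numpy as np

for rule in [sgd_momentum, rmsprop, adam]:
    x, config = np.array([5.0]), None
    for step in range(100):
        dx = x                      # gradient of 0.5 * x**2 is x itself
        x, config = rule(x, dx, config)
    print(rule.__name__, x)         # all three should have moved toward 0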
II. Batch Normalization
1. Forward pass of the BN layer
Exponentially weighted running averages of the mean and variance of the training data are maintained and then used during prediction.
The batchnorm_forward() function in the layers file:
def batchnorm_forward(x, gamma, beta, bn_param):
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
    out, cache = None, None
    if mode == 'train':
        # TODO: Implement the training-time forward pass for batch norm.
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * batch_mean
        running_var = momentum * running_var + (1 - momentum) * batch_var
        out = (x - batch_mean) / np.sqrt(batch_var)
        cache = {"batch_var": batch_var, "x_norm": out, "gamma": gamma}
        out = out * gamma + beta
    elif mode == 'test':
        # TODO: Implement the test-time forward pass for batch normalization.
        out = (x - running_mean) / np.sqrt(running_var)
        out = out * gamma + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var
    return out, cache
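A small sanity check (a sketch; the shapes and seed are arbitrary) that the training-time output has roughly zero mean and unit variance per feature before the scale and shift:

import numpy as np

np.random.seed(0)
x = 5.0 + 4.0 * np.random.randn(200, 3)
gamma, beta = np.ones(3), np.zeros(3)
out, _ = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
print(out.mean(axis=0))   # ~0 for every feature
print(out.std(axis=0))    # ~1 for every feature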
2. Backward pass of the BN layer
Let one sample be $\vec{x} = [x_1,\, x_2, \cdots,\, x_m]$ and write its normalized version as $\hat{\vec{x}}$. The scale and shift parameters $\vec{\gamma}$ and $\vec{\beta}$ have the same dimension as $\vec{x}$, so the output of the BN layer can be written as:
$$\vec{y} =\vec{\gamma}\odot\hat{X} + \vec{\beta} = \left[ \begin{array}{c} -\,\, \hat{\vec{x}}_1 \odot \vec{\gamma} + \vec{\beta} \,\,-\\ -\,\, \hat{\vec{x}}_2 \odot \vec{\gamma} + \vec{\beta} \,\,-\\ \vdots \\ - \,\, \hat{\vec{x}}_n \odot \vec{\gamma} + \vec{\beta} \,\,- \end{array} \right]$$
Suppose the gradient arriving from the next layer with respect to $\vec{y}$ is $\delta\vec{y}$. The backward pass then has to express $\delta\vec{\gamma}$, $\delta\vec{\beta}$, and $\delta X$ in terms of $\delta\vec{y}$.
First, $\delta\vec{x}$ always has the same shape as $\vec{x}$ (here $\vec{x}$ stands for any of $\vec{\gamma}$, $\vec{\beta}$, $X$). Second, an element $\delta x_i$ of $\delta\vec{x}$ is the accumulation, over all elements of $\vec{y}$, of each element's derivative with respect to $x_i$, weighted by the corresponding upstream gradient.
Consider the derivative of an arbitrary row $\vec{y}_i$ of $\vec{y}$ with respect to $\vec{\gamma}$ and $\vec{\beta}$:
$$\frac{d\,\vec{y}_i}{d\vec{\gamma}} = [-\, \hat{\vec{x}}_i\,-], \qquad \frac{d\,\vec{y}_i}{d\vec{\beta}} = [-\, 1\,-]$$
Therefore $\delta\vec{\gamma}$ and $\delta\vec{\beta}$ are obtained by weighting with $\delta\vec{y}$ and summing over the rows:
$$\delta\vec{\gamma} = \sum_{\mathrm{rows}} \big(\delta\vec{y}\odot\hat{X}\big), \qquad \delta\vec{\beta} = \sum_{\mathrm{rows}} \delta\vec{y}$$
Next compute $\delta X$. For brevity, consider the derivative of an arbitrary row $\vec{y}_i$ of $\vec{y}$ with respect to an arbitrary row $\vec{x}_j$ of $X$. Since every element of $\vec{y}_i$ is related to the element at the same position of $\vec{x}_j$ by the same form of expression, and the BN layer treats each feature column independently, it suffices to consider the scalar forms $y_i$ and $x_j$ within a single column:
$$y_i = \gamma \frac{x_i - \bar{x}}{\sqrt{\mathrm{var}(x)}}+\beta,\quad \bar{x} = \frac{1}{n}\sum x_i,\quad \mathrm{var}(x)=\frac{1}{n}\sum(x_i-\bar{x})^2=\frac{1}{n}\sum x_i^2 - \bar{x}^2$$
Using $\frac{d\bar{x}}{dx_j} = \frac{1}{n}$ and $\frac{d\,\mathrm{var}(x)}{dx_j} = \frac{2(x_j-\bar{x})}{n}$, we get:
$$\frac{dy_i}{dx_j} = \gamma\cdot\frac{\sqrt{\mathrm{var}(x)}\cdot(x_i-\bar{x})'-(x_i-\bar{x})\cdot\big(\sqrt{\mathrm{var}(x)}\big)'}{\mathrm{var}(x)} = \gamma\cdot\left[\frac{\delta_{ij} - 1/n}{\sqrt{\mathrm{var}(x)}}-\frac{x_i-\bar{x}}{\mathrm{var}(x)}\cdot\frac{x_j-\bar{x}}{n\sqrt{\mathrm{var}(x)}}\right] = \frac{\gamma}{\sqrt{\mathrm{var}(x)}}\cdot\left[\delta_{ij} - \frac{1}{n} - \frac{\hat{x}_i\,\hat{x}_j}{n}\right]$$
Then $\delta x_j$ is obtained by summing $\delta y_i \cdot \frac{dy_i}{dx_j}$ over the index $i$ of $y_i$, which gives exactly the vectorized expression used below.
The batchnorm_backward_alt() function in the layers file:
def batchnorm_backward_alt(dout, cache):
    dx, dgamma, dbeta = None, None, None
    # TODO: Implement the backward pass for batch normalization.
    batch_var, x_norm, gamma = cache["batch_var"], cache["x_norm"], cache["gamma"]
    dgamma = np.sum(x_norm * dout, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx = (dout - np.mean(dout, axis=0) - np.mean(x_norm * dout, axis=0) * x_norm) / np.sqrt(batch_var) * gamma
    return dx, dgamma, dbeta
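A quick property check of the expression above (a sketch; sizes and seed are arbitrary): because adding a constant to every sample of a feature does not change the normalized output, the per-feature sum of dx over the batch should vanish, and dbeta is exactly the column sum of dout:

import numpy as np

np.random.seed(1)
x = np.random.randn(50, 6)
gamma, beta = np.random.randn(6), np.random.randn(6)
out, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dout = np.random.randn(*out.shape)
dx, dgamma, dbeta = batchnorm_backward_alt(dout, cache)
print(np.abs(dx.sum(axis=0)).max())           # ~0, up to floating-point error
print(np.allclose(dbeta, dout.sum(axis=0)))   # True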
III. Dropout
1. Forward pass of the dropout layer
During training, each input unit is zeroed out with probability $1-p$; with inverted dropout the surviving units are rescaled by $1/p$, so the expected activation ("energy") of the data at test time already matches the training phase.
The dropout_forward() function in the layers file:
def dropout_forward(x, dropout_param):
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])
    mask = None
    out = None
    if mode == 'train':
        # TODO: Implement training phase forward pass for inverted dropout.
        # Keep each unit with probability p and rescale by 1/p, so the expected
        # activation already matches test time.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask
    elif mode == 'test':
        # TODO: Implement the test phase forward pass for inverted dropout.
        # With inverted dropout, no scaling is needed at test time.
        out = x
    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)
    return out, cache
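A quick check (a sketch; the sizes and keep probabilities are arbitrary) that inverted dropout preserves the expected activation, so no extra scaling is needed at test time:

import numpy as np

np.random.seed(2)
x = 10.0 + np.random.randn(500, 500)
for p in [0.3, 0.6, 0.9]:
    out_train, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})
    print(p, x.mean(), out_train.mean(), out_test.mean())   # the three means should be close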
2. Backward pass of the dropout layer
The dropout layer acts like a gate: the upstream gradient only flows through the units that were kept.
The dropout_backward() function in the layers file:
def dropout_backward(dout, cache):
    dropout_param, mask = cache
    mode = dropout_param['mode']
    dx = None
    if mode == 'train':
        # TODO: Implement training phase backward pass for inverted dropout
        dx = dout * mask
    elif mode == 'test':
        dx = dout
    return dx
IV. Convolutional Neural Networks
1. Forward pass of the convolutional layer
Build the padded input according to the pad and stride parameters, then loop over every output position and correlate the filter with the corresponding window.
The conv_forward_naive() function in the layers file:
def conv_forward_naive(x, w, b, conv_param):
    out = None
    # TODO: Implement the convolutional forward pass.
    # Hint: you can use the function np.pad for padding.
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    n_pad, n_stride = conv_param["pad"], conv_param["stride"]
    if n_pad > 0:
        data = np.zeros((N, C, H+2*n_pad, W+2*n_pad))
        data[:, :, n_pad:H+n_pad, n_pad:W+n_pad] = x
    else:
        data = x
    N, C, H, W = data.shape
    rH, rW = 1 + (H - HH) // n_stride, 1 + (W - WW) // n_stride
    out = np.zeros((N, F, rH, rW))
    for iH in range(0, rH):
        for iW in range(0, rW):
            for iF in range(0, F):
                for iN in range(0, N):
                    out[iN, iF, iH, iW] = np.sum(data[iN, :, iH*n_stride:iH*n_stride+HH, iW*n_stride:iW*n_stride+WW] * w[iF, :, :, :]) + b[iF]
    cache = (x, w, b, conv_param)
    return out, cache
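A shape sanity check (a sketch; the shapes are arbitrary): with pad = (filter_size - 1) // 2 and stride 1, the spatial size of the output equals that of the input, following H' = 1 + (H + 2*pad - HH) // stride:

import numpy as np

x = np.random.randn(2, 3, 8, 8)      # N, C, H, W
w = np.random.randn(4, 3, 3, 3)      # F filters of shape C x HH x WW
b = np.zeros(4)
out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})
print(out.shape)                      # (2, 4, 8, 8)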
2. Backward pass of the convolutional layer
A convolution that produces a single output value is just a fully connected layer. From this point of view, the backward pass simply applies the fully connected backward rule once per output value and accumulates the results.
The conv_backward_naive() function in the layers file:
def conv_backward_naive(dout, cache):
    dx, dw, db = None, None, None
    # TODO: Implement the convolutional backward pass.
    x, w, b, conv_param = cache
    dw = np.zeros_like(w)
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    n_pad, n_stride = conv_param["pad"], conv_param["stride"]
    if n_pad > 0:
        data = np.zeros((N, C, H+2*n_pad, W+2*n_pad))
        data[:, :, n_pad:H+n_pad, n_pad:W+n_pad] = x
    else:
        data = x
    _, _, rH, rW = dout.shape
    db = np.zeros_like(b)
    dx = np.zeros_like(data)
    for iF in range(F):
        for iH in range(rH):
            for iW in range(rW):
                dw[iF, :, :, :] += np.sum(data[:, :, iH*n_stride:iH*n_stride+HH, iW*n_stride:iW*n_stride+WW] * dout[:, iF, iH, iW].reshape((N, 1, 1, 1)), axis=0)
                db[iF] += np.sum(dout[:, iF, iH, iW], axis=0)
    for iF in range(F):
        for iH in range(rH):
            for iW in range(rW):
                for iN in range(N):
                    dx[iN, :, iH*n_stride:iH*n_stride+HH, iW*n_stride:iW*n_stride+WW] += dout[iN, iF, iH, iW] * w[iF, :, :, :]
    dx = dx[:, :, n_pad:n_pad+H, n_pad:n_pad+W]
    return dx, dw, db
3. Forward pass of the max-pooling layer
The max_pool_forward_naive() function in the layers file:
def max_pool_forward_naive(x, pool_param):
    out = None
    # TODO: Implement the max pooling forward pass
    pool_height, pool_width, n_stride = pool_param["pool_height"], pool_param["pool_width"], pool_param["stride"]
    N, F, H, W = x.shape
    rH, rW = 1 + (H - pool_height) // n_stride, 1 + (W - pool_width) // n_stride
    out = np.zeros((N, F, rH, rW))
    for iH in range(rH):
        for iW in range(rW):
            for iF in range(F):
                for iN in range(N):
                    out[iN, iF, iH, iW] = x[iN, iF, iH*n_stride:iH*n_stride+pool_height, iW*n_stride:iW*n_stride+pool_width].max()
    cache = (x, pool_param)
    return out, cache
4. Backward pass of the max-pooling layer
The max_pool_backward_naive() function in the layers file:
def max_pool_backward_naive(dout, cache):
    dx = None
    # TODO: Implement the max pooling backward pass
    x, pool_param = cache
    pool_height, pool_width, n_stride = pool_param["pool_height"], pool_param["pool_width"], pool_param["stride"]
    dx = np.zeros_like(x)
    N, F, H, W = x.shape
    _, _, rH, rW = dout.shape
    for iH in range(rH):
        for iW in range(rW):
            for iF in range(F):
                for iN in range(N):
                    window = x[iN, iF, iH*n_stride:iH*n_stride+pool_height, iW*n_stride:iW*n_stride+pool_width]
                    # route the upstream gradient only to the position of the maximum
                    dx[iN, iF, iH*n_stride:iH*n_stride+pool_height, iW*n_stride:iW*n_stride+pool_width] += dout[iN, iF, iH, iW] * (window == window.max())
    return dx
5. A three-layer convolutional network
The ThreeLayerConvNet class in the cnn file:
class ThreeLayerConvNet(object):
    def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
                 hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
                 dtype=np.float32):
        # Initialize a new network.
        self.params = {}
        self.reg = reg
        self.dtype = dtype
        # TODO: Initialize weights and biases for the three-layer convolutional network.
        channel_no, img_height, img_width = input_dim
        self.params["W1"] = weight_scale * np.random.randn(num_filters, channel_no, filter_size, filter_size)
        self.params["b1"] = np.zeros(num_filters)
        # the 2x2 max pool halves each spatial dimension, so the flattened feature
        # count is num_filters * (img_height/2) * (img_width/2)
        n_features = num_filters * img_height * img_width // 2 // 2
        self.params["W2"] = weight_scale * np.random.randn(n_features, hidden_dim)
        self.params["b2"] = np.zeros(hidden_dim)
        self.params["W3"] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params["b3"] = np.zeros(num_classes)
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)
    def loss(self, X, y=None):
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        W3, b3 = self.params['W3'], self.params['b3']
        # pass conv_param to the forward pass for the convolutional layer
        filter_size = W1.shape[2]
        conv_param = {'stride': 1, 'pad': (filter_size - 1) // 2}
        # pass pool_param to the forward pass for the max-pooling layer
        pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
        scores = None
        # TODO: Implement the forward pass for the three-layer convolutional net
        # conv - relu - 2x2 max pool - affine - relu - affine - softmax
        conv_relu_pool_out, conv_relu_pool_cache = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
        layer2_affine_relu_out, layer2_affine_relu_cache = affine_relu_forward(conv_relu_pool_out, W2, b2)
        final_affine_out, final_affine_cache = affine_forward(layer2_affine_relu_out, W3, b3)
        scores = final_affine_out
        if y is None:
            return scores
        loss, grads = 0, {}
        # TODO: Implement the backward pass for the three-layer convolutional net
        loss, dloss = softmax_loss(final_affine_out, y)
        loss += 0.5 * self.reg * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
        dfinal_affine_out, dW3, db3 = affine_backward(dloss, final_affine_cache)
        dlayer2_affine_relu_out, dW2, db2 = affine_relu_backward(dfinal_affine_out, layer2_affine_relu_cache)
        _, dW1, db1 = conv_relu_pool_backward(dlayer2_affine_relu_out, conv_relu_pool_cache)
        grads["W1"] = dW1 + self.reg * W1
        grads["b1"] = db1
        grads["W2"] = dW2 + self.reg * W2
        grads["b2"] = db2
        grads["W3"] = dW3 + self.reg * W3
        grads["b3"] = db3
        return loss, grads
6. Training the three-layer convolutional network
The following settings reach roughly 50% accuracy on the validation set:
model = ThreeLayerConvNet(weight_scale=0.001, hidden_dim=500, reg=0.001)
solver = Solver(model, data, num_epochs=1, batch_size=50,
                update_rule='adam', optim_config={'learning_rate': 1e-3},
                verbose=True, print_every=20)
solver.train()
7. Forward and backward passes of spatial batch normalization
The spatial_batchnorm_forward() function in the layers file:
def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    out, cache = None, None
    # TODO: Implement the forward pass for spatial batch normalization.
    N, C, H, W = x.shape
    x = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)
    out, cache = batchnorm_forward(x, gamma, beta, bn_param)
    out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return out, cache
The spatial_batchnorm_backward() function in the layers file:
def spatial_batchnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    # TODO: Implement the backward pass for spatial batch normalization.
    N, C, H, W = dout.shape
    dout = dout.transpose(0, 2, 3, 1).reshape(N * H * W, C)
    dx, dgamma, dbeta = batchnorm_backward_alt(dout, cache)
    dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return dx, dgamma, dbeta
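A usage sketch (shapes and seed are arbitrary) showing that spatial batch normalization computes per-channel statistics over the N, H, and W axes:

import numpy as np

np.random.seed(3)
x = 7.0 + 3.0 * np.random.randn(4, 3, 8, 8)   # N, C, H, W
gamma, beta = np.ones(3), np.zeros(3)
out, _ = spatial_batchnorm_forward(x, gamma, beta, {'mode': 'train'})
print(out.mean(axis=(0, 2, 3)))   # ~0 per channel
print(out.std(axis=(0, 2, 3)))    # ~1 per channel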