Preface:
This series follows the Python programming assignments of Stanford's CS231n course as its main thread, developing an understanding of the course's core content along with some of the mathematical derivations. Reading on a PC is recommended. The course materials and code are here:
Videos and slides
Notes
Assignment 2 starter code
Part 1: Deep fully-connected neural networks (Python programming assignment)
In Assignment 1 we built a simple 2-layer fully-connected network, but that code was not modular: all of the computation (the loss, the gradients, and so on) lived in a single function, which left no flexibility to change the network architecture. Here we program in a more modular style, where each component is independent and the pieces call one another at run time, so the network architecture becomes very flexible. Like this:
def layer_forward(x, w):
""" Receive inputs x and weights w """
# Do some computations ...
z = ...  # some intermediate value
# Do some more computations ...
out = ...  # the output
cache = (x, w, z, out) # Values we need to compute gradients
return out, cache
The backward pass will receive upstream derivatives and the cache object,
and will return gradients with respect to the inputs and weights, like this:
def layer_backward(dout, cache):
"""
Receive derivative of loss with respect to outputs and cache,
and compute derivative with respect to inputs.
"""
# Unpack cache values
x, w, z, out = cache
# Use values in cache to compute derivatives
dx = ...  # Derivative of loss with respect to x
dw = ...  # Derivative of loss with respect to w
return dx, dw
In addition, we will fold all of the parameter-update rules we covered earlier into the modules, so that we can explore how different update strategies perform; we will also add Batch Normalization and Dropout to the modules to optimize deep networks more effectively.
Since the programming workload of this part is fairly heavy, we break the task down and complete it step by step:
1. 2-layer fully-connected network
For this part we need to complete the following tasks (and also read and understand solver.py):
--> the TwoLayerNet class in fc_net.py
--> the first four functions in layers.py
--> optim.py
The code is as follows:
---> fc_net.py
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
import numpy as np
from layer_utils import *
class TwoLayerNet(object):
"""
A two-layer fully-connected neural network with ReLU nonlinearity and
softmax loss that uses a modular layer design. We assume an input dimension
of D, a hidden dimension of H, and perform classification over C classes.
The architecture should be affine - relu - affine - softmax.
Note that this class does not implement gradient descent; instead, it
will interact with a separate Solver object that is responsible for running
optimization.
The learnable parameters of the model are stored in the dictionary
self.params that maps parameter names to numpy arrays.
"""
def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
weight_scale=1e-3, reg=0.0):
"""
Initialize a new network.
Inputs:
- input_dim: An integer giving the size of the input
- hidden_dim: An integer giving the size of the hidden layer
- num_classes: An integer giving the number of classes to classify
- weight_scale: Scalar giving the standard deviation for random
initialization of the weights.
- reg: Scalar giving L2 regularization strength.
"""
self.params = {}
self.reg = reg
self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim)
self.params['b1'] = np.zeros((1, hidden_dim))
self.params['W2'] = weight_scale * np.random.randn(hidden_dim, num_classes)
self.params['b2'] = np.zeros((1, num_classes))
def loss(self, X, y=None):
"""
Compute loss and gradient for a minibatch of data.
Inputs:
- X: Array of input data of shape (N, d_1, ..., d_k)
- y: Array of labels, of shape (N,). y[i] gives the label for X[i].
Returns:
If y is None, then run a test-time forward pass of the model and return:
- scores: Array of shape (N, C) giving classification scores, where
scores[i, c] is the classification score for X[i] and class c.
If y is not None, then run a training-time forward and backward pass and
return a tuple of:
- loss: Scalar value giving the loss
- grads: Dictionary with the same keys as self.params, mapping parameter
names to gradients of the loss with respect to those parameters.
"""
scores = None
N = X.shape[0]
# Unpack variables from the params dictionary
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
h1, cache1 = affine_relu_forward(X, W1, b1)
out, cache2 = affine_forward(h1, W2, b2)
scores = out # (N,C)
# If y is None then we are in test mode so just return scores
if y is None:
return scores
loss, grads = 0, {}
data_loss, dscores = softmax_loss(scores, y)
reg_loss = 0.5 * self.reg * np.sum(W1*W1) + 0.5 * self.reg * np.sum(W2*W2)
loss = data_loss + reg_loss
# Backward pass: compute gradients
dh1, dW2, db2 = affine_backward(dscores, cache2)
dX, dW1, db1 = affine_relu_backward(dh1, cache1)
# Add the regularization gradient contribution
dW2 += self.reg * W2
dW1 += self.reg * W1
grads['W1'] = dW1
grads['b1'] = db1
grads['W2'] = dW2
grads['b2'] = db2
return loss, grads
---> layers.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
def affine_forward(x, w, b):
"""
Computes the forward pass for an affine (fully-connected) layer.
The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
examples, where each example x[i] has shape (d_1, ..., d_k). We will
reshape each input into a vector of dimension D = d_1 * ... * d_k, and
then transform it to an output vector of dimension M.
Inputs:
- x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
- w: A numpy array of weights, of shape (D, M)
- b: A numpy array of biases, of shape (M,)
Returns a tuple of:
- out: output, of shape (N, M)
- cache: (x, w, b)
"""
out = None
# Reshape x into rows
N = x.shape[0]
x_row = x.reshape(N, -1) # (N,D)
out = np.dot(x_row, w) + b # (N,M)
cache = (x, w, b)
return out, cache
def affine_backward(dout, cache):
"""
Computes the backward pass for an affine layer.
Inputs:
- dout: Upstream derivative, of shape (N, M)
- cache: Tuple of:
- x: Input data, of shape (N, d_1, ... d_k)
- w: Weights, of shape (D, M)
Returns a tuple of:
- dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
- dw: Gradient with respect to w, of shape (D, M)
- db: Gradient with respect to b, of shape (M,)
"""
x, w, b = cache
dx, dw, db = None, None, None
dx = np.dot(dout, w.T) # (N,D)
dx = np.reshape(dx, x.shape) # (N,d1,...,d_k)
x_row = x.reshape(x.shape[0], -1) # (N,D)
dw = np.dot(x_row.T, dout) # (D,M)
db = np.sum(dout, axis=0, keepdims=True) # (1,M)
return dx, dw, db
def relu_forward(x):
"""
Computes the forward pass for a layer of rectified linear units (ReLUs).
Input:
- x: Inputs, of any shape
Returns a tuple of:
- out: Output, of the same shape as x
- cache: x
"""
out = None
out = ReLU(x)
cache = x
return out, cache
def relu_backward(dout, cache):
"""
Computes the backward pass for a layer of rectified linear units (ReLUs).
Input:
- dout: Upstream derivatives, of any shape
- cache: Input x, of same shape as dout
Returns:
- dx: Gradient with respect to x
"""
dx, x = None, cache
dx = dout.copy()  # copy so the upstream gradient is not modified in place
dx[x <= 0] = 0
return dx
def svm_loss(x, y):
"""
Computes the loss and gradient for multiclass SVM classification.
Inputs:
- x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
for the ith input.
- y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
0 <= y[i] < C
Returns a tuple of:
- loss: Scalar giving the loss
- dx: Gradient of the loss with respect to x
"""
N = x.shape[0]
correct_class_scores = x[np.arange(N), y]
margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
margins[np.arange(N), y] = 0
loss = np.sum(margins) / N
num_pos = np.sum(margins > 0, axis=1)
dx = np.zeros_like(x)
dx[margins > 0] = 1
dx[np.arange(N), y] -= num_pos
dx /= N
return loss, dx
def softmax_loss(x, y):
"""
Computes the loss and gradient for softmax classification. Inputs:
- x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
for the ith input.
- y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
0 <= y[i] < C
Returns a tuple of:
- loss: Scalar giving the loss
- dx: Gradient of the loss with respect to x
"""
probs = np.exp(x - np.max(x, axis=1, keepdims=True))
probs /= np.sum(probs, axis=1, keepdims=True)
N = x.shape[0]
loss = -np.sum(np.log(probs[np.arange(N), y])) / N
dx = probs.copy()
dx[np.arange(N), y] -= 1
dx /= N
return loss, dx
def ReLU(x):
"""ReLU non-linearity."""
return np.maximum(0, x)
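Before moving on to optim.py, it is worth checking a backward pass against numeric gradients; this is essentially what the checks in FullyConnectedNets.ipynb do. A minimal self-contained sketch (the helper below is my own re-implementation of the assignment's eval_numerical_gradient_array, and it assumes the affine_forward/affine_backward defined above are in scope):

import numpy as np

def eval_numerical_gradient_array(f, x, df, h=1e-5):
    """Numeric gradient of f at x, chained with the upstream gradient df."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()            # f evaluated slightly above x[ix]
        x[ix] = old - h
        neg = f(x).copy()            # f evaluated slightly below x[ix]
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

x, w, b = np.random.randn(4, 5), np.random.randn(5, 3), np.random.randn(3)
dout = np.random.randn(4, 3)
_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)
dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)
print(np.max(np.abs(dx - dx_num)))   # should be tiny, on the order of 1e-9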
---> optim.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
def sgd(w, dw, config=None):
"""
Performs vanilla stochastic gradient descent.
config format:
- learning_rate: Scalar learning rate.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
w -= config['learning_rate'] * dw
return w, config
def sgd_momentum(w, dw, config=None):
"""
Performs stochastic gradient descent with momentum.
config format:
- learning_rate: Scalar learning rate.
- momentum: Scalar between 0 and 1 giving the momentum value.
Setting momentum = 0 reduces to sgd.
- velocity: A numpy array of the same shape as w and dw used to store a moving
average of the gradients.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('momentum', 0.9)
v = config.get('velocity', np.zeros_like(w))
next_w = None
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v
config['velocity'] = v
return next_w, config
def rmsprop(x, dx, config=None):
"""
Uses the RMSProp update rule, which uses a moving average of squared gradient
values to set adaptive per-parameter learning rates.
config format:
- learning_rate: Scalar learning rate.
- decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
gradient cache.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- cache: Moving average of second moments of gradients.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('decay_rate', 0.99)
config.setdefault('epsilon', 1e-8)
config.setdefault('cache', np.zeros_like(x))
next_x = None
cache = config['cache']
decay_rate = config['decay_rate']
learning_rate = config['learning_rate']
epsilon = config['epsilon']
cache = decay_rate * cache + (1 - decay_rate) * (dx**2)
x += - learning_rate * dx / (np.sqrt(cache) + epsilon)
config['cache'] = cache
next_x = x
return next_x, config
def adam(x, dx, config=None):
"""
Uses the Adam update rule, which incorporates moving averages of both the
gradient and its square and a bias correction term.
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-3)
config.setdefault('beta1', 0.9)
config.setdefault('beta2', 0.999)
config.setdefault('epsilon', 1e-8)
config.setdefault('m', np.zeros_like(x))
config.setdefault('v', np.zeros_like(x))
config.setdefault('t', 0)
next_x = None
m = config['m']
v = config['v']
beta1 = config['beta1']
beta2 = config['beta2']
learning_rate = config['learning_rate']
epsilon = config['epsilon']
t = config['t']
t += 1
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx**2)
m_bias = m / (1 - beta1**t)
v_bias = v / (1 - beta2**t)
x += - learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)
next_x = x
config['m'] = m
config['v'] = v
config['t'] = t
return next_x, config
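All of these update rules share one signature, which is what lets Solver treat them interchangeably: each takes a parameter array, its gradient, and a config dict, and returns the updated parameter plus the updated config. A minimal usage sketch outside of Solver, on a toy quadratic loss of my own choosing (it assumes the functions above are in scope):

import numpy as np

w = np.array([5.0, -3.0])
config = None
for _ in xrange(100):
    dw = 2.0 * w                             # gradient of the toy loss ||w||^2
    w, config = sgd_momentum(w, dw, config)  # config carries velocity across steps
print(w)                                     # should be near [0, 0]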
Once the code is written, we can run the checks in FullyConnectedNets.ipynb to verify it. After that, we can train on CIFAR-10 and compare against the 2-layer network from Assignment 1; the results should be about the same.
Here is the code I used to train on CIFAR-10, together with the resulting plots:
---> two_layer_fc_net_start.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
import matplotlib.pyplot as plt
from fc_net import *
from data_utils import get_CIFAR10_data
from solver import Solver
data = get_CIFAR10_data()
model = TwoLayerNet(reg=0.9)
solver = Solver(model, data,
lr_decay=0.95,
print_every=100, num_epochs=40, batch_size=400,
update_rule='sgd_momentum',
optim_config={'learning_rate': 5e-4, 'momentum': 0.5})
solver.train()
plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')
plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()
best_model = model
y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print 'Validation set accuracy: ', (y_val_pred == data['y_val']).mean()
print 'Test set accuracy: ', (y_test_pred == data['y_test']).mean()
# Validation set accuracy: about 52.9%
# Test set accuracy: about 54.7%
# Visualize the weights of the best network
from vis_utils import visualize_grid
def show_net_weights(net):
W1 = net.params['W1']
W1 = W1.reshape(3, 32, 32, -1).transpose(3, 1, 2, 0)
plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))
plt.gca().axis('off')
plt.show()
show_net_weights(best_model)
Figure_1.png
2. Multilayer fully-connected network + Batch Normalization
For this part we need to complete the following tasks:
--> the FullyConnectedNet class in fc_net.py
--> the batchnorm_forward and batchnorm_backward functions in layers.py
The code is as follows:
---> fc_net.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
from layer_utils import *
class FullyConnectedNet(object):
"""
A fully-connected neural network with an arbitrary number of hidden layers,
ReLU nonlinearities, and a softmax loss function. This will also implement
dropout and batch normalization as options. For a network with L layers,
the architecture will be
{affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax
where batch normalization and dropout are optional, and the {...} block is
repeated L - 1 times.
Similar to the TwoLayerNet above, learnable parameters are stored in the
self.params dictionary and will be learned using the Solver class.
"""
def __init__(self, hidden_dims, input_dim=3*32*32,
num_classes=10,
dropout=0, use_batchnorm=False, reg=0.0,
weight_scale=1e-2, dtype=np.float32, seed=None):
self.use_batchnorm = use_batchnorm
self.use_dropout = dropout > 0
self.reg = reg
self.num_layers = 1 + len(hidden_dims)
self.dtype = dtype
self.params = {}
layers_dims = [input_dim] + hidden_dims + [num_classes]
for i in xrange(self.num_layers):
self.params['W' + str(i+1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i+1])
self.params['b' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
if self.use_batchnorm and i < len(hidden_dims):
self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))
self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
# When using dropout we need to pass a dropout_param dictionary to each
# dropout layer so that the layer knows the dropout probability and the mode
# (train / test). You can pass the same dropout_param to each dropout layer.
self.dropout_param = {}
if self.use_dropout:
self.dropout_param = {'mode': 'train', 'p': dropout}
if seed is not None:
self.dropout_param['seed'] = seed
# With batch normalization we need to keep track of running means and
# variances, so we need to pass a special bn_param object to each batch
# normalization layer. You should pass self.bn_params[0] to the forward pass
# of the first batch normalization layer, self.bn_params[1] to the forward
# pass of the second batch normalization layer, etc.
self.bn_params = []
if self.use_batchnorm:
self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]
# Cast all parameters to the correct datatype
for k, v in self.params.iteritems():
self.params[k] = v.astype(dtype)
def loss(self, X, y=None):
"""
Compute loss and gradient for the fully-connected net.
Input / output: Same as TwoLayerNet above.
"""
X = X.astype(self.dtype)
mode = 'test' if y is None else 'train'
# Set train/test mode for batchnorm params and dropout param since they
# behave differently during training and testing.
if self.dropout_param is not None:
self.dropout_param['mode'] = mode
if self.use_batchnorm:
for bn_param in self.bn_params:
bn_param['mode'] = mode
scores = None
h, cache1, cache2, cache3, bn, out = {}, {}, {}, {}, {}, {}
out[0] = X
# Forward pass: compute loss
for i in xrange(self.num_layers-1):
# Unpack variables from the params dictionary
W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
if self.use_batchnorm:
gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]
h[i], cache1[i] = affine_forward(out[i], W, b)
bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
out[i+1], cache3[i] = relu_forward(bn[i])
else:
out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
scores, cache = affine_forward(out[self.num_layers-1], W, b)
# If test mode return early
if mode == 'test':
return scores
loss, reg_loss, grads = 0.0, 0.0, {}
data_loss, dscores = softmax_loss(scores, y)
for i in xrange(self.num_layers):
reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
loss = data_loss + reg_loss
# Backward pass: compute gradients
dout, dbn, dh = {}, {}, {}
t = self.num_layers-1
dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
for i in xrange(t):
if self.use_batchnorm:
dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i])
dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])
else:
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])
# Add the regularization gradient contribution
for i in xrange(self.num_layers):
grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]
return loss, grads
Before giving the code for batchnorm_forward and batchnorm_backward, here are the Batch Normalization algorithm and the backward-pass derivative formulas:
Batch Normalization, algorithm1.png
Backpropagate the gradient of loss ℓ .png
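For reference, the two figures above amount to the following equations, restated in LaTeX (these are the standard formulas from the Batch Normalization paper; per feature dimension, over a minibatch of size N). Forward:

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2, \qquad
\hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i+\beta .

Backward, given the upstream gradient \partial\ell/\partial y_i:

\frac{\partial\ell}{\partial\hat{x}_i} = \frac{\partial\ell}{\partial y_i}\,\gamma , \qquad
\frac{\partial\ell}{\partial\sigma^2} = -\frac{1}{2}\sum_i \frac{\partial\ell}{\partial\hat{x}_i}\,(x_i-\mu)\,(\sigma^2+\epsilon)^{-3/2} ,

\frac{\partial\ell}{\partial\mu} = -\frac{1}{\sqrt{\sigma^2+\epsilon}}\sum_i \frac{\partial\ell}{\partial\hat{x}_i} \;-\; \frac{2}{N}\,\frac{\partial\ell}{\partial\sigma^2}\sum_i (x_i-\mu) ,

\frac{\partial\ell}{\partial x_i} = \frac{1}{\sqrt{\sigma^2+\epsilon}}\,\frac{\partial\ell}{\partial\hat{x}_i} + \frac{2(x_i-\mu)}{N}\,\frac{\partial\ell}{\partial\sigma^2} + \frac{1}{N}\,\frac{\partial\ell}{\partial\mu} ,

\frac{\partial\ell}{\partial\gamma} = \sum_i \frac{\partial\ell}{\partial y_i}\,\hat{x}_i , \qquad
\frac{\partial\ell}{\partial\beta} = \sum_i \frac{\partial\ell}{\partial y_i} .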
---> layers.py
__coauthor__ = 'Deeplayer'
# 6.22.2016
import numpy as np
def batchnorm_forward(x, gamma, beta, bn_param):
mode = bn_param['mode']
eps = bn_param.get('eps', 1e-5)
momentum = bn_param.get('momentum', 0.9)
N, D = x.shape
running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
out, cache = None, None
if mode == 'train':
sample_mean = np.mean(x, axis=0, keepdims=True) # [1,D]
sample_var = np.var(x, axis=0, keepdims=True) # [1,D]
x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps) # [N,D]
out = gamma * x_normalized + beta
cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var
elif mode == 'test':
x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
out = gamma * x_normalized + beta
else:
raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
# Store the updated running means back into bn_param
bn_param['running_mean'] = running_mean
bn_param['running_var'] = running_var
return out, cache
def batchnorm_backward(dout, cache):
dx, dgamma, dbeta = None, None, None
x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
N, D = x.shape
dx_normalized = dout * gamma # [N,D]
x_mu = x - sample_mean # [N,D]
sample_std_inv = 1.0 / np.sqrt(sample_var + eps) # [1,D]
dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - \
2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)
dx1 = dx_normalized * sample_std_inv
dx2 = 2.0/N * dsample_var * x_mu
dx = dx1 + dx2 + 1.0/N * dsample_mean
dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
dbeta = np.sum(dout, axis=0, keepdims=True)
return dx, dgamma, dbeta
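The chain of intermediate gradients above can also be collapsed algebraically into a single fused expression. Here is a sketch of that compact form (it corresponds to the assignment's optional batchnorm_backward_alt, using the same cache layout as the code above):

def batchnorm_backward_alt(dout, cache):
    x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
    N = x.shape[0]
    std_inv = 1.0 / np.sqrt(sample_var + eps)                    # [1,D]
    dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
    dbeta = np.sum(dout, axis=0, keepdims=True)
    # Substitute dsample_mean and dsample_var back into dx and simplify:
    dx = (gamma * std_inv / N) * (N * dout
         - np.sum(dout, axis=0, keepdims=True)
         - x_normalized * np.sum(dout * x_normalized, axis=0, keepdims=True))
    return dx, dgamma, dbeta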
Once the code is written, we can check it with BatchNormalization.ipynb. Below I show the performance of a 6-layer network on CIFAR-10 when Batch Normalization is used. As you might expect, the 6-layer network should not do much better than the 2-layer one (because of issue 1, which I raised at the end of Assignment 1).
Before that, let us see how well Batch Normalization mitigates the vanishing-gradient problem, across a range of weight_scales. As test cases we use 6-layer networks with sigmoid and with ReLU as the activation function:
---> batchnorm_and_weight_scales.py
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
from fc_net import *
from solver import *
import matplotlib.pyplot as plt
from data_utils import get_CIFAR10_data
# Load the (preprocessed) CIFAR10 data.
data = get_CIFAR10_data()
hidden_dims = [100, 100, 100, 100, 100]
num_train = 5000
small_data = {
'X_train': data['X_train'][:num_train],
'y_train': data['y_train'][:num_train],
'X_val': data['X_val'],
'y_val': data['y_val'],
}
bn_solvers = {}
solvers = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):
print 'Running weight scale %d / %d' % (i + 1, len(weight_scales))
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)
bn_solver = Solver(bn_model, small_data,
num_epochs=10, batch_size=100,
update_rule='adam',
optim_config={'learning_rate': 1e-3, },
verbose=False, print_every=1000)
bn_solver.train()
bn_solvers[weight_scale] = bn_solver
solver = Solver(model, small_data,
num_epochs=10, batch_size=100,
update_rule='adam',
optim_config={'learning_rate': 1e-3, },
verbose=False, print_every=1000)
solver.train()
solvers[weight_scale] = solver
# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []
for ws in weight_scales:
best_train_accs.append(max(solvers[ws].train_acc_history))
bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))
best_val_accs.append(max(solvers[ws].val_acc_history))
bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))
final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))
bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))
plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')
plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend(loc='upper left')
plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend(loc='upper left')
plt.gcf().set_size_inches(10, 15)
plt.show()
Activation Function: Sigmoid.png
Activation Function: ReLU.png
The figures above show that:
1) Batch Normalization largely fixes the sigmoid saturation (vanishing-gradient) problem that troubled the field for well over a decade, bravo! If the plots above feel too indirect, here are the weight-gradient magnitudes for each layer:
Left: without Batch Normalization --- Right: with Batch Normalization
2) Even without vanishing gradients, sigmoid still does worse than ReLU.
3) With a well-chosen weight_scale, Batch Normalization does not improve accuracy by much when the activation function is ReLU.
Now, here are the results of the 6-layer network on CIFAR-10 (ReLU activations):
· Validation set accuracy: 0.554
· Test set accuracy: 0.54
3. Dropout
For this part we need to complete the following tasks:
--> modify fc_net.py to add dropout
--> the dropout_forward and dropout_backward functions in layers.py
Dropout is a regularization technique used very widely when training (deep) neural networks in practice, and it suppresses overfitting well. Concretely: during training, each neuron is kept active with probability p. Below is a dropout diagram for a 3-layer network:
CS231n Convolutional Neural Networks for Visual Recognition.png
The code is as follows:
For fc_net.py we only need to modify its loss function:
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
def loss(self, X, y=None):
"""
Compute loss and gradient for the fully-connected net.
Input / output: Same as TwoLayerNet above.
"""
X = X.astype(self.dtype)
mode = 'test' if y is None else 'train'
# Set train/test mode for batchnorm params and dropout param since they
# behave differently during training and testing.
if self.dropout_param is not None:
self.dropout_param['mode'] = mode
if self.use_batchnorm:
for bn_param in self.bn_params:
bn_param['mode'] = mode
scores = None
h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}
out[0] = X
# Forward pass: compute loss
for i in xrange(self.num_layers-1):
# Unpack variables from the params dictionary
W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
if self.use_batchnorm:
gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]
h[i], cache1[i] = affine_forward(out[i], W, b)
bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
out[i+1], cache3[i] = relu_forward(bn[i])
if self.use_dropout:
out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)
else:
out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
if self.use_dropout:
out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)
W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
scores, cache = affine_forward(out[self.num_layers-1], W, b)
# If test mode return early
if mode == 'test':
return scores
loss, reg_loss, grads = 0.0, 0.0, {}
data_loss, dscores = softmax_loss(scores, y)
for i in xrange(self.num_layers):
reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
loss = data_loss + reg_loss
# Backward pass: compute gradients
dout, dbn, dh, ddrop = {}, {}, {}, {}
t = self.num_layers-1
dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
for i in xrange(t):
if self.use_batchnorm:
if self.use_dropout:
ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])
dout[t-i] = ddrop[t-1-i]
dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i])
dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])
else:
if self.use_dropout:
ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])
dout[t-i] = ddrop[t-1-i]
dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])
# Add the regularization gradient contribution
for i in xrange(self.num_layers):
grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]
return loss, grads
---> the dropout_forward and dropout_backward functions in layers.py
__coauthor__ = 'Deeplayer'
# 6.22.2016 #
def dropout_forward(x, dropout_param):
p, mode = dropout_param['p'], dropout_param['mode']
if 'seed' in dropout_param:
np.random.seed(dropout_param['seed'])
mask = None
out = None
if mode == 'train':
mask = (np.random.rand(*x.shape) < p) / p
out = x * mask
elif mode == 'test':
out = x
cache = (dropout_param, mask)
out = out.astype(x.dtype, copy=False)
return out, cache
def dropout_backward(dout, cache):
dropout_param, mask = cache
mode = dropout_param['mode']
dx = None
if mode == 'train':
dx = dout * mask
elif mode == 'test':
dx = dout
return dx
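A quick sanity check on the inverted-dropout scaling used above: dividing the mask by p keeps the expected activation unchanged, which is why nothing special is needed at test time. A minimal sketch (my own toy example, numpy only):

import numpy as np

np.random.seed(0)
x = np.random.randn(500, 500) + 10.0        # activations with a nonzero mean
p = 0.5                                     # probability of keeping a unit
mask = (np.random.rand(*x.shape) < p) / p   # inverted dropout mask
print(np.mean(x), np.mean(x * mask))        # the two means should nearly agree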
Once the code is written, we can verify it with Dropout.ipynb; the last part of Dropout.ipynb also lets us compare training with and without dropout:
Dropout vs Overfitting.png
Part 2: Convolutional Neural Networks (CNNs)
Now we come to the core content of this course: convolutional neural networks. For visual recognition tasks, CNNs are without question the standout. Compared with the fully-connected networks discussed earlier, where does the advantage of CNNs lie? I would list the following points:
1) Weight sharing and local connectivity (receptive fields) make CNNs more similar to biological neural networks: neurons in the visual cortex receive information locally (each responds only to stimuli within its receptive field);
2) When images are fairly large (e.g., 96x96, 224x224, 384x384, 512x512), a fully-connected network needs an enormous number of trainable parameters (weights and biases), which not only makes computation very slow but also leads to much more severe overfitting. Weight sharing and local connectivity cut the number of trainable parameters by orders of magnitude;
3) CNNs have a strong feature-extraction ability (from edges, to parts, to whole objects), whereas fully-connected networks have essentially none.
Let us now discuss the structural features of CNNs in detail. Before that, here is a picture to convey the rough structure of a CNN:
CS231n Convolutional Neural Networks for Visual Recognition.png
1. Convolutional layer
The convolutional layer, which could also be called the feature-extraction layer, is the most important part of a CNN. Its trainable parameters are a set of filters (I prefer the term convolution kernels), all of the same size and usually square. Suppose we have n filters, each of size kxk over c input channels (k is usually 3 or 5; c=1 for grayscale images and c=3 for color images); this layer then has nxkxkxc weights plus n biases to train. Weight sharing means that one filter can extract only one kind of feature: as a filter convolves (slides) across the image, it picks out the same feature everywhere in the image. So n filters can extract n different features from the image. Here is an animation of the convolution: it shows 6 kernel slices, but they actually form 2 filters (each filter has three channels), so two features are extracted:
CS231n Convolutional Neural Networks for Visual Recognition.gif
In the animation, you will notice a ring of zeros around the image, and that the filter slides with stride 2. Padding with zeros like this is called zero-padding. Writing p for the number of zero-padding rings and s for the stride, the side length of the output convolved feature (also called the activation map) is L = (input_dim - k + 2p)/s + 1, and the output volume then has dimensions LxLxn. Zero-padding exists so that the filter's sliding works out exactly, i.e., so that the formula above divides evenly. Here p, s and n are three hyperparameters we must choose in advance. As for the stride s: the smaller s is, the richer the extracted information, at a somewhat higher computational cost; the larger s is, the cheaper the computation, but the less information is extracted. The usual choice is s = 1. The small helper below makes this arithmetic concrete.
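A minimal sketch of the output-size formula (plain Python; the function name conv_output_size is mine, not from the assignment):

def conv_output_size(input_dim, k, p, s):
    """Side length of the activation map for a square input.

    input_dim: input side length, k: filter size,
    p: number of zero-padding rings, s: stride.
    """
    assert (input_dim - k + 2 * p) % s == 0, 'filter does not tile the input evenly'
    return (input_dim - k + 2 * p) // s + 1

print(conv_output_size(32, 3, 1, 1))  # 32 -- 3x3 filters, pad 1, stride 1 keep the size
print(conv_output_size(7, 3, 1, 2))   # 4  -- the setting used in the animation above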
---> PS: Why does convolution work?
Natural images have a stationarity property: the statistics of one part of an image are the same as those of any other part. This means features learned on one part of an image can also be applied to other parts, so the same learned features can be used at every position of the image. (from UFLDL)
2. Pooling layer
The layer after the convolutional layer is the pooling layer; note, though, that the convolutional layer's output first passes through an activation function (e.g., ReLU) before entering the pooling layer. The pooling layer further reduces the dimensionality of the convolutional output, which cuts the number of parameters and the computation. Concretely, the convolutional output is split into non-overlapping subregions, and each subregion is reduced to its maximum, its average, or its L2 norm. We take max pooling (the maximum) as our example, since it tends to work better and is therefore the usual choice; here is a diagram:
CS231n Convolutional Neural Networks for Visual Recognition.png
The pooling window is usually 2x2.
Some argue that pooling layers are unnecessary, as in Striving for Simplicity: The All Convolutional Net. Others have found that removing pooling layers matters for generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs). It seems likely that future architectures will use fewer pooling layers, or none.
3. Fully-connected layer
Many current CNN models use fully-connected layers in the last few layers (usually 1-3 of them) to learn higher-level information. Note that the last fully-connected layer is the output layer; every fully-connected layer other than the last is followed by an activation function.
4. CNN architectures
A typical CNN architecture can be expressed as:
INPUT --> [[CONV --> RELU]*N --> POOL?]*M --> [FC --> RELU]*K --> FC(OUTPUT)
where the "?" means the pooling layer is optional; N (usually 0 <= N <= 3), K (usually 0 <= K <= 2) and M (M >= 0) are the repetition counts.
Note that we tend to prefer a stack of small convolutional filters over a single convolutional layer with a large filter.
As an example, compare a stack of three 3x3 convolutional layers with a single 7x7 convolutional layer. The figure below shows that the two produce activation maps of the same final size, but the three 3x3 layers are clearly better:
1) composing three nonlinearities extracts more expressive features than a single nonlinear layer;
2) the stack of small filters has fewer parameters: per channel, 3x3x3 = 27 < 49 = 7x7 (a worked count follows the figure below);
3) similarly, backpropagation needs a gradient for every parameter, so the smaller parameter count of the stacked 3x3 layers means fewer parameter gradients to store (though note that the stack does have to keep more intermediate activations).
3_3x3 VS 1_7x7.png
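The parameter count in point 2), worked out in LaTeX for C channels flowing in and out of each layer (biases ignored):

3 \times \left[\, C \times (3 \times 3 \times C) \,\right] = 27C^2
\qquad \text{vs.} \qquad
C \times (7 \times 7 \times C) = 49C^2 ,

so the stack uses roughly 45% fewer weights while covering the same 7x7 effective receptive field.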
Below is a diagram of the simplest possible CNN architecture (input + 1 conv + 1 pool + 2 fc):
A simple CNNs architecture.png
Here are a few common types of CNN architecture:
· INPUT --> FC/OUT (this is just a linear classifier)
· INPUT --> CONV --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> POOL]*2 --> FC --> RELU --> FC/OUT
· INPUT --> [CONV --> RELU --> CONV --> RELU --> POOL]*3 --> [FC --> RELU]*2 --> FC/OUT
---> PS:
1. For the input (image) layer, we usually resize images to a square whose side length is a power of 2. For example, CIFAR-10 is 32x32x3, STL-10 is 96x96x3, and ImageNet is 224x224x3 or 512x512x3.
2. In real projects we have to estimate memory use and set sensible values accordingly. For example, with 224x224x3 input images and 64 filters of size 3x3 with zero-padding of 1, each image needs about 72MB of memory (this 72MB covers the image together with its associated parameters, gradients and activations). Running on a GPU, that may not fit (GPU memory is much smaller than CPU memory), so the parameters have to be adjusted: e.g., filter size 7x7 with stride 2 (ZF Net), or filter size 11x11 with stride 4 (AlexNet). A rough estimator sketch follows this list.
3. The biggest bottleneck in building a practical deep CNN is GPU memory. Many GPUs have only 3/4/6GB, and the largest single cards have about 12GB (NVIDIA), so when designing a CNN we should think hard about where the memory mainly goes:
the mass of activations and intermediate gradient values;
the parameters, their gradients during backpropagation, and the caches kept by momentum, Adagrad, or RMSProp; so when estimating the memory the parameters occupy, multiply by at least 3;
each batch of data, along with other associated bookkeeping and provenance information, also consumes some memory.
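A back-of-the-envelope estimator for one conv layer's memory, as a sketch under my own assumptions (float32, 'same' padding, and the multiply-by-3 rule of thumb above applied to the parameters). It counts only this single layer, which is why it comes out far below the all-in 72MB figure above:

def conv_layer_memory_mb(H, W, C, num_filters, bytes_per_float=4):
    """Rough float32 memory estimate, per image, for one 3x3 conv layer.

    Counts the output activations once and the parameters three times
    (values, gradients, and the optimizer's cache).
    """
    activations = H * W * num_filters                   # output activation map
    params = num_filters * (3 * 3 * C) + num_filters    # 3x3 weights + biases
    total_floats = activations + 3 * params
    return total_floats * bytes_per_float / 1024.0 / 1024.0

print(conv_layer_memory_mb(224, 224, 3, 64))  # ~12 MB for this one layer alone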
Here are some famous convolutional networks:
· LeNet, the first successfully applied convolutional network, developed by Yann LeCun in the 1990s.
· AlexNet, the network that won the 2012 ILSVRC by a wide margin over the runner-up and set off the wave of deep learning.
· ZF Net, winner of the 2013 ILSVRC; it tuned AlexNet's structural hyperparameters and enlarged the middle convolutional layers.
· GoogLeNet, winner of the 2014 ILSVRC; it cut the number of parameters enormously (from 60M to 4M).
· VGGNet, runner-up at the 2014 ILSVRC; it showed that network depth is critical to final performance.
· ResNet, by Kaiming He et al.; see also the follow-up paper Identity Mappings in Deep Residual Networks.
From Kaiming He's ICML16 tutorial
Part 3: Python programming assignment (3-layer CNN)
For this part we need to complete the following tasks:
1) the following functions in layers.py:
---> conv_forward_naive
---> conv_backward_naive
---> max_pool_forward_naive
---> max_pool_backward_naive
Before giving the convolutional-layer code, let us first understand exactly how the forward and backward passes of a convolutional layer are computed. For concreteness, suppose the first image of some batch is x[0, :, :, :], with three RGB channels, each of size 7x7, with padding 1 and stride 2, so that the padded x[0, :, :, :] has size 1x3x9x9. Suppose further that there are 3 filters, each of size 3x3, and let w hold all the filter weights (e.g., the first channel of the first filter is w[0, 0, :, :]); the bias b has size 1x3; and the activation maps, denoted out, have size 3x4x4 (e.g., the first map is out[0, :, :]).
Under these assumptions, the forward and backward computations proceed as in the figures below (the backward-pass image is high-resolution; open it in a new tab and zoom in, or download it to view):
Forward.png
Backward.jpg
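In symbols, what the figures show is the standard convolution-layer forward pass: with stride s and filters of size HH x WW,

out[f,\, i,\, j] \;=\; \sum_{c=0}^{C-1} \sum_{u=0}^{HH-1} \sum_{v=0}^{WW-1} x_{\mathrm{pad}}[c,\; i s + u,\; j s + v]\; w[f, c, u, v] \;+\; b[f] ,

and the backward pass routes each upstream value dout[f, i, j] back through the same window: dw[f] accumulates window * dout[f, i, j], the corresponding window of dx_pad accumulates w[f] * dout[f, i, j], and db[f] accumulates dout[f, i, j].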
The code is as follows:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
def conv_forward_naive(x, w, b, conv_param):
stride, pad = conv_param['stride'], conv_param['pad']
N, C, H, W = x.shape
F, C, HH, WW = w.shape
x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
H_new = 1 + (H + 2 * pad - HH) / stride
W_new = 1 + (W + 2 * pad - WW) / stride
s = stride
out = np.zeros((N, F, H_new, W_new))
for i in xrange(N): # ith image
for f in xrange(F): # fth filter
for j in xrange(H_new):
for k in xrange(W_new):
out[i, f, j, k] = np.sum(x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] * w[f]) + b[f]
cache = (x, w, b, conv_param)
return out, cache
def conv_backward_naive(dout, cache):
x, w, b, conv_param = cache
pad = conv_param['pad']
stride = conv_param['stride']
F, C, HH, WW = w.shape
N, C, H, W = x.shape
H_new = 1 + (H + 2 * pad - HH) / stride
W_new = 1 + (W + 2 * pad - WW) / stride
dx = np.zeros_like(x)
dw = np.zeros_like(w)
db = np.zeros_like(b)
s = stride
x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
dx_padded = np.pad(dx, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
for i in xrange(N): # ith image
for f in xrange(F): # fth filter
for j in xrange(H_new):
for k in xrange(W_new):
window = x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s]
db[f] += dout[i, f, j, k]
dw[f] += window * dout[i, f, j, k]
dx_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] += w[f] * dout[i, f, j, k]
# Unpad
dx = dx_padded[:, :, pad:pad+H, pad:pad+W]
return dx, dw, db
Once the code is written, the checks in ConvolutionalNetworks.ipynb will verify it.
Next, here is the code for the (max) pooling layer:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
def max_pool_forward_naive(x, pool_param):
HH, WW = pool_param['pool_height'], pool_param['pool_width']
s = pool_param['stride']
N, C, H, W = x.shape
H_new = 1 + (H - HH) / s
W_new = 1 + (W - WW) / s
out = np.zeros((N, C, H_new, W_new))
for i in xrange(N):
for j in xrange(C):
for k in xrange(H_new):
for l in xrange(W_new):
window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
out[i, j, k, l] = np.max(window)
cache = (x, pool_param)
return out, cache
def max_pool_backward_naive(dout, cache):
x, pool_param = cache
HH, WW = pool_param['pool_height'], pool_param['pool_width']
s = pool_param['stride']
N, C, H, W = x.shape
H_new = 1 + (H - HH) / s
W_new = 1 + (W - WW) / s
dx = np.zeros_like(x)
for i in xrange(N):
for j in xrange(C):
for k in xrange(H_new):
for l in xrange(W_new):
window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
m = np.max(window)
dx[i, j, k*s:HH+k*s, l*s:WW+l*s] += (window == m) * dout[i, j, k, l]  # += so overlapping windows accumulate
return dx
Again, the checks in ConvolutionalNetworks.ipynb will verify this code.
The implementations above use many nested for loops, which makes them very slow. To speed things up, Assignment 2 provides fast_layers.py, which uses Cython to build C extensions. Here is a speed comparison between the naive and fast versions; as the figure below shows, the speedup is enormous:
Naive vs Fast.png
2) cnn.py; the code is as follows:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
import numpy as np
from layer_utils import *
class ThreeLayerConvNet(object):
"""
A three-layer convolutional network with the following architecture:
conv - relu - 2x2 max pool - affine - relu - affine - softmax
"""
def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
dtype=np.float32):
self.params = {}
self.reg = reg
self.dtype = dtype
# Initialize weights and biases
C, H, W = input_dim
self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size)
self.params['b1'] = np.zeros(num_filters)  # shape (F,): conv_forward_naive indexes b[f] as a scalar
self.params['W2'] = weight_scale * np.random.randn(num_filters*H*W/4, hidden_dim)
self.params['b2'] = np.zeros((1, hidden_dim))
self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes)
self.params['b3'] = np.zeros((1, num_classes))
for k, v in self.params.iteritems():
self.params[k] = v.astype(dtype)
def loss(self, X, y=None):
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
W3, b3 = self.params['W3'], self.params['b3']
# pass conv_param to the forward pass for the convolutional layer
filter_size = W1.shape[2]
conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2}
# pass pool_param to the forward pass for the max-pooling layer
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
# compute the forward pass
a1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
a2, cache2 = affine_relu_forward(a1, W2, b2)
scores, cache3 = affine_forward(a2, W3, b3)
if y is None:
return scores
# compute the backward pass
data_loss, dscores = softmax_loss(scores, y)
da2, dW3, db3 = affine_backward(dscores, cache3)
da1, dW2, db2 = affine_relu_backward(da2, cache2)
dX, dW1, db1 = conv_relu_pool_backward(da1, cache1)
# Add regularization
dW1 += self.reg * W1
dW2 += self.reg * W2
dW3 += self.reg * W3
reg_loss = 0.5 * self.reg * sum(np.sum(W * W) for W in [W1, W2, W3])
loss = data_loss + reg_loss
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2, 'W3': dW3, 'b3': db3}
return loss, grads
Once the code is written, the checks in ConvolutionalNetworks.ipynb will verify it.
3) the spatial_batchnorm_forward and spatial_batchnorm_backward functions in layers.py. Before the code, here is a picture that should make clear how Batch Normalization in a CNN computes the mean and standard deviation (std) over a convolutional layer's output:
ConvNet Batch Normalization.png
The code is as follows:
__coauthor__ = 'Deeplayer'
# 6.25.2016 #
def spatial_batchnorm_forward(x, gamma, beta, bn_param):
N, C, H, W = x.shape
x_new = x.transpose(0, 2, 3, 1).reshape(N*H*W, C)
out, cache = batchnorm_forward(x_new, gamma, beta, bn_param)
out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)
return out, cache
def spatial_batchnorm_backward(dout, cache):
N, C, H, W = dout.shape
dout_new = dout.transpose(0, 2, 3, 1).reshape(N*H*W, C)
dx, dgamma, dbeta = batchnorm_backward(dout_new, cache)
dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)
return dx, dgamma, dbeta
Once the code is written, the checks in ConvolutionalNetworks.ipynb will verify it.
Taking the ThreeLayerConvNet completed above as an example, we can compare the effect of Batch Normalization on convergence speed. The results in the figure below show that Batch Normalization clearly speeds up convergence, greatly reducing training time (far fewer epochs are needed):
with BN --vs-- without BN.png
---> PS:
1. Data augmentation
When the dataset is small, this technique is very effective and can raise accuracy noticeably. The usual augmentation methods are as follows (a horizontal-flip sketch follows the list):
1) Horizontal flips
Horizontal flips.png
2) Random crops/scales
Random crops/scales.png
3) Color jitter
Randomly jitter contrast.png
4) Get creative
For example: translation, rotation, stretching, shearing, optical distortion, and so on.
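A minimal horizontal-flip sketch for a batch in the assignment's NCHW layout (numpy only; the function name augment_with_flips is mine). It doubles the training set, as in the experiment below:

import numpy as np

def augment_with_flips(X, y):
    """Append a horizontally flipped copy of every image.

    X: (N, C, H, W) batch of images; y: (N,) labels.
    """
    X_flipped = X[:, :, :, ::-1]   # reverse the width axis
    return np.concatenate([X, X_flipped]), np.concatenate([y, y])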
Below I give a CNN model and test its performance on CIFAR-10 (using simple horizontal flips to augment the data); training set: 49000x2, validation set: 1000, test set: 10000. The layer structure of the CNN is:
[[conv - relu]x3 - pool]x3 - affine - relu - affine - softmax
The training results are as follows:
· Validation set accuracy: 0.904
· Test set accuracy: 0.892
Training loss & Accuracy
CONV layer 1: filters
Part 4: Visualizing convolutional neural networks
Visualization is a direct way to lift the veil of mystery from CNNs and to better understand what CNNs actually learn. Below we discuss the main visualization techniques:
1. Visualizing weights and activations
Taking AlexNet as an example, here are visualizations of some of the weights and activations of each layer:
CONV layer 1: filters(left) and activations(right)
CONV layer 2: filters(left) and activations(right)
CONV layer 3: activations
CONV layer 4: activations
CONV layer 5: activations
Fully-connected layer 1 & 2
Output layer
2. Retrieving images that maximally activate a neuron
We can feed large numbers of images through the network, track which ones maximally activate a given neuron, and then visualize those images to understand what the neuron is looking for within its receptive field in order to classify images correctly. The image below shows AlexNet's fifth pooling layer (bald heads caught in the crossfire O__O "…):
AlexNet: pooling layer 5
3. Visualizing images with t-SNE on CNN feature vectors
A CNN can be viewed as transforming the input image layer by layer into a representation that a linear classifier can separate; this final representation is the CNN code (e.g., the 4096-dimensional vector just before the classifier in AlexNet), i.e., the feature vector.
t-SNE is among the best methods for reducing high-dimensional data to a visualizable space, and its results look great. We can feed the CNN codes into t-SNE to obtain a two-dimensional vector for each image (one per feature vector), then lay the images out accordingly, as below (the closer two images are, the more similar they look to the CNN):
t-SNE visualization of CNN codes
4. Occluding parts of the image
To judge whether a CNN classifies by relying on the correct object in the image (rather than by luck), we can occlude parts of the image and test the CNN. The figure below shows that CNNs do indeed rely on the correct object when classifying:
Occluding parts of the image
Part 5: Transfer learning
In practice we rarely train a CNN from scratch, because we usually do not have enough data. The common approach is to take a CNN already trained on a large dataset (e.g., ImageNet) and use it as an initialization or as a fixed feature extractor for the new dataset. A picture to illustrate:
CS231n Convolutional Neural Networks for Visual Recognition.png
When the new dataset is not similar to the pretraining dataset (e.g., medical images), the strategy in the figure needs a small adjustment: if the new dataset is small, we should also train a few layers ahead of the linear classifier; if the new dataset is large, we should fine-tune all the layers.