Assignment 2 code: can be downloaded here
Q1: Multi-Layer Fully Connected Neural Networks
In this question, we need to implement a fully connected network with an arbitrary number of hidden layers.
The main work is implementing the initialization, the forward pass, and the backward pass. We can reuse the functions affine_forward, affine_backward, relu_forward, relu_backward, and softmax_loss.
Recap:
- What is a multi-layer fully connected neural network (MLP)?
- Architecture:
  - at least 3 layers: input, hidden, and output, with an activation function after each hidden layer.
  - Briefly: input -> hidden layer -> activation function -> output
- Formula (two-layer example): scores = ReLU(x·W1 + b1)·W2 + b2
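To make the architecture concrete, here is a minimal NumPy sketch of a two-layer forward pass; the dimensions and variable names are illustrative only (the assignment stores the parameters in self.params):

import numpy as np

# Minimal sketch of a two-layer MLP forward pass (illustration only).
N, D, H, C = 2, 15, 20, 10          # batch size, input dim, hidden dim, classes
x = np.random.randn(N, D)
W1, b1 = np.random.randn(D, H) * 5e-2, np.zeros(H)
W2, b2 = np.random.randn(H, C) * 5e-2, np.zeros(C)

hidden = np.maximum(0, x.dot(W1) + b1)   # affine -> ReLU
scores = hidden.dot(W2) + b2             # affine -> class scores, shape (N, C)
print(scores.shape)                      # (2, 10)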
Initial loss and Gradient Check
Suppose the batch size is 2, the input dimension is 15, the first hidden layer has 20 units, the second hidden layer has 30 units, and there are 10 output classes.
The input X therefore has shape (2, 15), and there are 10 classes.
N, D, H1, H2, C = [2, 15, 20, 30, 10]
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))
The check uses two regularization strengths, 0 and 3.14, and calls the FullyConnectedNet's loss method to compute the loss and gradients.
for reg in [0, 3.14]:
    print("Running check with reg = ", reg)
    model = FullyConnectedNet(
        [H1, H2],          # [20, 30]
        input_dim=D,       # 15
        num_classes=C,     # 10
        reg=reg,           # 0, 3.14
        weight_scale=5e-2,
        dtype=np.float64
    )
    loss, grads = model.loss(X, y)
    print("Initial loss: ", loss)
1. Initialize parameters
In the FullyConnectedNet class, we are supposed to initialize the parameters of the network.
# TODO: Initialize the parameters of the network, storing all values in
# the self.params dictionary. Store weights and biases for the first layer
# in W1 and b1; for the second layer use W2 and b2, etc. Weights should be
# initialized from a normal distribution centered at 0 with standard
# deviation equal to weight_scale. Biases should be initialized to zero.
# When using batch normalization, store scale and shift parameters for the
# first layer in gamma1 and beta1; for the second layer use gamma2 and
# beta2, etc. Scale parameters should be initialized to ones and shift
# parameters should be initialized to zeros.
In this example, there are two hidden layers.
Use a for loop: line up input_dim, hidden_dims, and num_classes into two aligned lists, zip() them together, and iterate over the resulting (fan_in, fan_out) tuples.
# note: hidden_dims is a list, so we use the * notation to unpack it
for n, (i, j) in enumerate(zip([input_dim, *hidden_dims],
                               [*hidden_dims, num_classes])):
    '''
    n = 0, i = 15, j = 20
    n = 1, i = 20, j = 30
    n = 2, i = 30, j = 10
    '''
    self.params[f'W{n+1}'] = np.random.randn(i, j) * weight_scale
    self.params[f'b{n+1}'] = np.zeros(j)
    # with normalization, each hidden layer also gets a scale (gamma) and shift (beta)
    if self.normalization and n < self.num_layers - 1:
        self.params[f'gamma{n+1}'] = np.ones(j)
        self.params[f'beta{n+1}'] = np.zeros(j)
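As a quick sanity check (a hypothetical snippet, reusing the model from the gradient-check cell above), the stored parameter shapes for input_dim=15, hidden_dims=[20, 30], num_classes=10 should be:

# Print the shape of every stored parameter.
for name in sorted(model.params):
    print(name, model.params[name].shape)
# Expected: W1 (15, 20), b1 (20,), W2 (20, 30), b2 (30,), W3 (30, 10), b3 (10,)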
2. Implement forward pass
# TODO: Implement the forward pass for the fully connected net, computing
# the class scores for X and storing them in the scores variable.
# When using dropout, you'll need to pass self.dropout_param to each
# dropout forward pass.
# When using batch normalization, you'll need to pass self.bn_params[0] to
# the forward pass for the first batch normalization layer, pass
# self.bn_params[1] to the forward pass for the second batch normalization
# layer, etc.
- loop over the layers
- use affine_forward to compute each layer's pre-activation output
- apply batch/layer normalization if it is enabled
- pass the result through relu_forward (for all but the last layer)
- apply dropout_forward if dropout is enabled
# loop over the layers
caches = []
out = X
for l in range(self.num_layers):
    # parameter names of this layer
    keys = [f'W{l+1}', f'b{l+1}', f'gamma{l+1}', f'beta{l+1}']
    # parameter values (gamma/beta may be absent)
    w, b, gamma, beta = (self.params.get(k, None) for k in keys)
    # normalization params exist only for hidden layers
    bn = self.bn_params[l] if gamma is not None else None
    # dropout params exist only when dropout is enabled
    do = self.dropout_param if self.use_dropout else None
    # affine forward
    out, fc_cache = affine_forward(out, w, b)
    bn_cache = ln_cache = relu_cache = dropout_cache = None
    # every layer except the last gets normalization / ReLU / dropout
    if l != self.num_layers - 1:
        if bn is not None:
            if 'mode' in bn:
                out, bn_cache = batchnorm_forward(out, gamma, beta, bn)
            else:
                out, ln_cache = layernorm_forward(out, gamma, beta, bn)
        # activation function
        out, relu_cache = relu_forward(out)
        # dropout
        if do is not None:
            out, dropout_cache = dropout_forward(out, do)
    # save this layer's caches for the backward pass
    caches.append((fc_cache, bn_cache, ln_cache, relu_cache, dropout_cache))
scores = out
3. Implement backward pass
- calculate the loss by ‘softmax_loss’
- add reg to loss
- calculate the grads of each layer
- add reg to dW
# calculate the loss
loss, dloss = softmax_loss(scores, y)
# add L2 regularization to the loss (sum over all weight matrices)
loss += 0.5 * self.reg * sum(
    np.sum(self.params[f'W{l+1}'] ** 2) for l in range(self.num_layers)
)
# loop over the layers from the last to the first
dout = dloss
for l in reversed(range(self.num_layers)):
    fc_cache, bn_cache, ln_cache, relu_cache, dropout_cache = caches[l]
    dgamma = dbeta = None
    # every layer except the last has dropout / ReLU / norm to undo
    if l != self.num_layers - 1:
        # dropout backward
        if dropout_cache is not None:
            dout = dropout_backward(dout, dropout_cache)
        # if ReLU was performed
        if relu_cache is not None:
            dout = relu_backward(dout, relu_cache)
        # if normalization was performed
        if bn_cache is not None:
            dout, dgamma, dbeta = batchnorm_backward(dout, bn_cache)
        elif ln_cache is not None:
            dout, dgamma, dbeta = layernorm_backward(dout, ln_cache)
    # affine backward is always needed
    dout, dw, db = affine_backward(dout, fc_cache)
    # save the gradients, adding the regularization term to dW
    grads[f'W{l+1}'] = dw + self.reg * self.params[f'W{l+1}']
    grads[f'b{l+1}'] = db
    if dgamma is not None and l < self.num_layers - 1:
        grads[f'gamma{l+1}'] = dgamma
        grads[f'beta{l+1}'] = dbeta
Optimization
NOTE: more details about optimization can be found in this article.
1. SGD+Momentum
Formula of SGD with momentum:

$$v_{new} = \beta \cdot v_{old} - \alpha \cdot \frac{\partial \, loss}{\partial w_{old}} = \beta \cdot v_{old} - \alpha \cdot dW, \qquad w_{new} = w_{old} + v_{new}$$

Here β is the momentum coefficient and α is the learning rate.
if config is None:
    config = {}
config.setdefault("learning_rate", 1e-2)
config.setdefault("momentum", 0.9)
v = config.get("velocity", np.zeros_like(w))

# momentum update: the velocity accumulates a decaying sum of past gradients
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v
config["velocity"] = v
2. RMSProp
Formula:
$$E[grad^2_{new}] = \beta \, E[grad^2_{old}] + (1-\beta)\left(\frac{\partial Loss}{\partial w_{old}}\right)^2, \qquad w_{new} = w_{old} - \frac{\alpha}{\sqrt{E[grad^2_{new}]} + \epsilon} \cdot \frac{\partial Loss}{\partial w_{old}}$$

E[grad²_new] is a moving average of the squared gradient; ε is a small constant for numerical stability.
if config is None:
    config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('decay_rate', 0.99)
config.setdefault('epsilon', 1e-8)
config.setdefault('cache', np.zeros_like(w))

# moving average of the squared gradient
gsq = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw * dw
# scale the step by the root of the squared-gradient average
next_w = w - config['learning_rate'] * dw / (np.sqrt(gsq) + config['epsilon'])
config['cache'] = gsq
3. Adam
Adam combines momentum (as in SGD+Momentum) with RMSProp's squared-gradient scaling, plus bias correction.
"""
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None:
    config = {}
config.setdefault('learning_rate', 1e-3)
config.setdefault('beta1', 0.9)
config.setdefault('beta2', 0.999)
config.setdefault('epsilon', 1e-8)
config.setdefault('m', np.zeros_like(w))
config.setdefault('v', np.zeros_like(w))
config.setdefault('t', 0)

config['t'] += 1
# moving average of the gradient (first moment)
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
# moving average of the squared gradient (second moment)
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dw * dw
# bias-corrected estimates
m_hat = config['m'] / (1 - config['beta1'] ** config['t'])
v_hat = config['v'] / (1 - config['beta2'] ** config['t'])
# update
next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
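These update rules are consumed by the assignment's Solver. A minimal usage sketch, assuming the standard cs231n Solver interface and a data dictionary with X_train/y_train/X_val/y_val already prepared:

from cs231n.solver import Solver

# Train the same architecture with different update rules to compare convergence.
for update_rule in ['sgd', 'sgd_momentum', 'rmsprop', 'adam']:
    model = FullyConnectedNet([100, 100], weight_scale=5e-2)
    solver = Solver(
        model, data,
        num_epochs=5, batch_size=100,
        update_rule=update_rule,
        optim_config={'learning_rate': 1e-3},
        verbose=True,
    )
    solver.train()
    print(update_rule, 'best val accuracy:', solver.best_val_acc)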
Q2: Batch Normalization
The course note that discusses batch normalization in detail.
1. forward
PIPELINE:
- Calculate the mean of each feature over the batch (sum over the batch, divided by N):
  mu = x.mean(axis=0)   # shape (D,)
- Calculate the variance of each feature over the batch:
  var = x.var(axis=0)   # shape (D,)
- Normalize:
  eps = bn_param.get("eps", 1e-5)   # eps: constant for numeric stability
  std = np.sqrt(var + eps)          # batch standard deviation for each feature
  x_hat = (x - mu) / std            # standardized x
- Scale and shift:
  # gamma: scale parameter of shape (D,)
  # beta: shift parameter of shape (D,)
  out = gamma * x_hat + beta
  shape = bn_param.get('shape', (N, D))   # reshape used in backprop
  axis = bn_param.get('axis', 0)          # axis to sum over in backprop
  cache = x, mu, var, std, gamma, x_hat, shape, axis   # save for backprop
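The pipeline above covers training mode; the full batchnorm_forward also has to maintain running statistics for test time. A minimal sketch of the train/test branching, assuming the cache layout above and the usual bn_param keys (mode, momentum, running_mean, running_var):

def batchnorm_forward_sketch(x, gamma, beta, bn_param):
    """Minimal sketch of batch normalization forward (train/test), for illustration."""
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    if mode == 'train':
        mu, var = x.mean(axis=0), x.var(axis=0)
        std = np.sqrt(var + eps)
        x_hat = (x - mu) / std
        # exponentially decayed running statistics, used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
        cache = (x, mu, var, std, gamma, x_hat, (N, D), 0)
    else:  # test mode: normalize with the stored running statistics
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
        cache = None

    out = gamma * x_hat + beta
    bn_param['running_mean'], bn_param['running_var'] = running_mean, running_var
    return out, cache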
2. backward
PIPELINE (first unpack the cache: x, mu, var, std, gamma, x_hat, shape, axis = cache):
- dbeta = dout.sum(axis=axis)
- dgamma = np.sum(dout * x_hat, axis=axis)
- dx_hat = dout * gamma
- dstd = -np.sum(dx_hat * (x - mu), axis=0) / (std**2)
- dvar = 0.5 * dstd / std
- dx1 = dx_hat / std + 2 * (x - mu) * dvar / len(dout)
- dmu = -np.sum(dx1, axis=0)
- dx2 = dmu / len(dout)
- dx = dx1 + dx2
Q3. Dropout
The principle of dropout is to randomly set neurons' outputs to 0 during training.
1.forward
# 1. forward: drop neurons with inverted dropout
"""
x: Input data, of any shape.
In training mode, mask is the dropout mask used to multiply the input;
in test mode, mask is None.
p is the probability of keeping a neuron's output, as opposed to the probability of dropping it.
"""
p, mode = dropout_param["p"], dropout_param["mode"]
mask = None
out = None
if mode == "train":
    # inverted dropout: keep each neuron with probability p and rescale by 1/p
    mask = (np.random.rand(*x.shape) < p) / p
    out = x * mask
elif mode == "test":
    # at test time dropout is a no-op
    out = x
2. backward
dx = dout * mask  # train mode; in test mode, dx = dout
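A quick way to see why the mask is divided by p (inverted dropout): the expected value of the output then matches the input, so nothing needs to change at test time. A small hypothetical check:

import numpy as np

np.random.seed(0)
x = np.random.randn(500, 500) + 10   # toy activations
p = 0.6                              # keep probability

mask = (np.random.rand(*x.shape) < p) / p   # inverted dropout mask
out = x * mask

# The means should be close: dividing by p compensates for the dropped neurons.
print('x mean  :', x.mean())
print('out mean:', out.mean())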
Q.4 Convolutional Neural Networks
The convolutional layer takes an input x with shape (N, C, H, W), where each parameter represents the batch size, number of channels, height, and width, respectively. This input is then convolved with a filter. There are various ways to perform the convolution, and one important aspect is the stride. To ensure that edge pixels are captured, padding is often used during the convolution process.
1. Conv
1. Forward
"""
params:
Returns a tuple of:
- out: Output data, of shape (N, F, H', W') where H' and W' are given by
H' = 1 + (H + 2 * pad - HH) / stride
W' = 1 + (W + 2 * pad - WW) / stride
- cache: (x, w, b, conv_param)
"""
"""
-
get the padding
pad = conv_param['pad'] # up = right = down = left
-
get stride
stride = conv_param['stride']
-
get the input dim of x
N, C, HI, WI = x.shape
-
get the filter dims of w
F, _, HF, WF = w.shape
-
output height
HO = 1 + (HI + 2 * pad - HF) // stride # H' = 1 + (H + 2 * pad - HH) / stride
-
output width
WO = 1 + (WI + 2 * pad - WF) // stride # W' = 1 + (W + 2 * pad - WW) / stride
-
create output tensor after convolution layer
# create output tensor after convolution layer out = np.zeros((N, F, HO, WO))
-
padding all output data
x_pad = np.pad(x, ((0,0), (0,0),(pad,pad),(pad,pad)), 'constant') H_pad, W_pad = x_pad.shape[2], x_pad.shape[3]
-
create w_row matrix and x_col
w_row = w.reshape(F, C*FH*FW) x_col = np.zeros((C*FH*FW, outH*outW))
-
implement stride to each input
# loop all the batch for index in range(N): neuron = 0 # loop height for i in range(0, H_pad-FH+1, stride): # loop width for j in range(0, W_pad-FW+1,stride): x_col[:,neuron] = x_pad[index,:,i:i+FH,j:j+FW].reshape(C*FH*FW) neuron += 1 out[index] = (w_row.dot(x_col) + b.reshape(F,1)).reshape(F, outH, outW)
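As a quick sanity check, a hedged usage example (assuming the code above is wrapped in conv_forward_naive(x, w, b, conv_param), as in the assignment) that verifies the output shape:

# 4 images, 3 channels, 32x32; 8 filters of size 3x5x5; stride 1, pad 2
x = np.random.randn(4, 3, 32, 32)
w = np.random.randn(8, 3, 5, 5)
b = np.random.randn(8)
conv_param = {'stride': 1, 'pad': 2}

out, _ = conv_forward_naive(x, w, b, conv_param)
# H' = 1 + (32 + 2*2 - 5) / 1 = 32, and likewise for W'
print(out.shape)   # (4, 8, 32, 32)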
2. backward
# unpack the cache from the forward pass
x, w, b, conv_param = cache
pad, stride = conv_param['pad'], conv_param['stride']
N, F, outH, outW = dout.shape
N, C, HI, WI = x.shape
F, _, FH, FW = w.shape
x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
Hpad, Wpad = x_pad.shape[2], x_pad.shape[3]
# initialize gradients
dx = np.zeros((N, C, Hpad - 2 * pad, Wpad - 2 * pad))
dw, db = np.zeros(w.shape), np.zeros(b.shape)
# create w_row matrix
w_row = w.reshape(F, C * FH * FW)
# create x_col matrix with the values each neuron is connected to
x_col = np.zeros((C * FH * FW, outH * outW))
for index in range(N):
    out_col = dout[index].reshape(F, outH * outW)
    w_out = w_row.T.dot(out_col)
    dx_cur = np.zeros((C, Hpad, Wpad))
    neuron = 0
    for i in range(0, Hpad - FH + 1, stride):
        for j in range(0, Wpad - FW + 1, stride):
            dx_cur[:, i:i+FH, j:j+FW] += w_out[:, neuron].reshape(C, FH, FW)
            x_col[:, neuron] = x_pad[index, :, i:i+FH, j:j+FW].reshape(C * FH * FW)
            neuron += 1
    dx[index] = dx_cur[:, pad:-pad, pad:-pad]
    dw += out_col.dot(x_col.T).reshape(F, C, FH, FW)
    db += out_col.sum(axis=1)
2. Max-pool
1. forward
N, C, H, W = x.shape
stride = pool_param['stride']
PH = pool_param['pool_height']
PW = pool_param['pool_width']
outH = 1 + (H - PH) // stride
outW = 1 + (W - PW) // stride
# create the output tensor of the pooling layer
out = np.zeros((N, C, outH, outW))
for index in range(N):
    out_col = np.zeros((C, outH * outW))
    neuron = 0
    for i in range(0, H - PH + 1, stride):
        for j in range(0, W - PW + 1, stride):
            pool_region = x[index, :, i:i+PH, j:j+PW].reshape(C, PH * PW)
            out_col[:, neuron] = pool_region.max(axis=1)
            neuron += 1
    out[index] = out_col.reshape(C, outH, outW)
2. backward
x, pool_param = cache
N, C, outH, outW = dout.shape
H, W = x.shape[2], x.shape[3]
stride = pool_param['stride']
PH, PW = pool_param['pool_height'], pool_param['pool_width']
# initialize gradient
dx = np.zeros(x.shape)
for index in range(N):
    dout_row = dout[index].reshape(C, outH * outW)
    neuron = 0
    for i in range(0, H - PH + 1, stride):
        for j in range(0, W - PW + 1, stride):
            pool_region = x[index, :, i:i+PH, j:j+PW].reshape(C, PH * PW)
            max_pool_indices = pool_region.argmax(axis=1)
            dout_cur = dout_row[:, neuron]
            neuron += 1
            # pass the gradient only through the indices of the max pool
            dmax_pool = np.zeros(pool_region.shape)
            dmax_pool[np.arange(C), max_pool_indices] = dout_cur
            dx[index, :, i:i+PH, j:j+PW] += dmax_pool.reshape(C, PH, PW)
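For intuition, a tiny hedged example (assuming the forward code above is wrapped in max_pool_forward_naive(x, pool_param), as in the assignment): a 4x4 input with a 2x2 pool and stride 2 produces a 2x2 output holding the max of each window.

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)   # one image, one channel
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

out, _ = max_pool_forward_naive(x, pool_param)
print(out[0, 0])
# [[ 5.  7.]
#  [13. 15.]]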
Q.5 PyTorch
GPU
Train the model on a GPU if one is available; first detect whether the machine has one.
# using CUDA if available
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# using an Apple silicon chip (MPS backend)
if torch.backends.mps.is_available():
    device = torch.device('mps')
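A compact way to combine both checks into a single preference order (CUDA, then MPS, then CPU), just as a convenience sketch:

import torch

# pick the best available device: CUDA GPU > Apple silicon MPS > CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
print('using device:', device)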
Part I. Preparation
Use torchvision to load the CIFAR-10 dataset.
torchvision.transforms:
- data preprocessing
- data augmentation
import torchvision.transforms as T
# set up a transform to preprocess the data by subtracting the mean RGB value and dividing by the standard deviation of each RGB value
transform = T.Compose([
T.ToTensor(), # convert PIL images to tensors
T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)) # normalize
])
import torchvision.datasets as dset
from torch.utils.data import DataLoader, sampler

# split the dataset into train/val/test
# wrap each dataset in a DataLoader, which iterates through the Dataset and forms minibatches
NUM_TRAIN = 49000
cifar10_train = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64,
sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))
cifar10_val = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64,
sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))
cifar10_test = dset.CIFAR10('./cs231n/datasets', train=False, download=True,
transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)
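To confirm the loaders work, you can pull a single minibatch and check its shape (a quick sketch; loader_train comes from the cell above):

# grab one minibatch from the training loader and inspect it
x_batch, y_batch = next(iter(loader_train))
print(x_batch.shape)   # torch.Size([64, 3, 32, 32])
print(y_batch.shape)   # torch.Size([64])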
Part II. Barebones PyTorch
PyTorch Tensors: Flatten Function
Flatten each input into a vector.
def flatten(x):
    return x.view(x.shape[0], -1)
Barebones PyTorch: Two-Layer Network
Define two_layer_fc, which performs the forward pass of a two-layer fully connected ReLU network on a batch of image data.
import torch.nn.functional as F

def two_layer_fc(x, params):
    x = flatten(x)
    w1, w2 = params
    # fully connected -> ReLU -> fully connected layer
    x = F.relu(x.mm(w1))
    x = x.mm(w2)
    return x
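A quick shape check, mirroring the assignment's test cell (the hidden size 42 here is arbitrary):

# feed a dummy minibatch of zeros through the network and check the output shape
hidden_size = 42
x = torch.zeros((64, 3, 32, 32), dtype=torch.float32)
w1 = torch.zeros((3 * 32 * 32, hidden_size), dtype=torch.float32)
w2 = torch.zeros((hidden_size, 10), dtype=torch.float32)
scores = two_layer_fc(x, [w1, w2])
print(scores.size())   # torch.Size([64, 10])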
Barebones PyTorch: Three-Layer ConvNet
The network should have the following architecture:
1. A convolutional layer (with bias) with `channel_1` filters, each with shape `KW1 x KH1`, and zero-padding of two
2. ReLU nonlinearity
3. A convolutional layer (with bias) with `channel_2` filters, each with shape `KW2 x KH2`, and zero-padding of one
4. ReLU nonlinearity
5. Fully-connected layer with bias, producing scores for C classes
def three_layer_convnet(x, params):
    conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params
    # 1. convolutional layer (with bias) with `channel_1` filters of shape `KW1 x KH1`, zero-padding of two
    # 2. ReLU nonlinearity
    x = F.relu(F.conv2d(x, conv_w1, conv_b1, padding=2))
    # 3. convolutional layer (with bias) with `channel_2` filters of shape `KW2 x KH2`, zero-padding of one
    # 4. ReLU nonlinearity
    x = F.relu(F.conv2d(x, conv_w2, conv_b2, padding=1))
    # 5. fully-connected layer with bias, producing scores for C classes
    scores = flatten(x).mm(fc_w) + fc_b
    return scores
Barebones PyTorch: Initialization
Initialize the weights:
- random_weight(shape) initializes a weight tensor with the Kaiming normalization method.
- zero_weight(shape) initializes a weight tensor with all zeros; useful for instantiating bias parameters.
def random_weight(shape):
    if len(shape) == 2:  # FC weight
        fan_in = shape[0]
    else:
        fan_in = np.prod(shape[1:])  # conv weight: [out_channel, in_channel, kH, kW]
    # Kaiming initialization: scale by sqrt(2 / fan_in)
    w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)
    w.requires_grad = True
    return w

def zero_weight(shape):
    return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)
Barebones PyTorch: Check Accuracy
def check_accuracy_part2(loader, model_fn, params):
    """
    Inputs:
    - loader: A DataLoader for the data split we want to check
    - model_fn: A function that performs the forward pass of the model,
      with the signature scores = model_fn(x, params)
    - params: List of PyTorch Tensors giving parameters of the model
    """
    # which split are we checking?
    split = 'val' if loader.dataset.train else 'test'
    print('Checking accuracy on the %s set' % split)
    num_correct, num_samples = 0, 0
    # no need to compute gradients here
    with torch.no_grad():
        for x, y in loader:
            # move to device
            x = x.to(device=device, dtype=dtype)
            y = y.to(device=device, dtype=torch.int64)  # y is the label
            scores = model_fn(x, params)
            # predicted class = argmax over the scores
            _, preds = scores.max(1)
            # count the correct predictions
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))
BareBones PyTorch: Training Loop
train the model using stochastic gradient descent without momentum
def train_part2(model_fn, params, learning_rate):
    for t, (x, y) in enumerate(loader_train):
        # move to device
        x = x.to(device=device, dtype=dtype)
        y = y.to(device=device, dtype=torch.long)
        # compute scores and loss
        scores = model_fn(x, params)
        loss = F.cross_entropy(scores, y)
        # backward pass
        loss.backward()
        # update the weights
        with torch.no_grad():
            # vanilla SGD step
            for w in params:
                w -= learning_rate * w.grad
                w.grad.zero_()
        if t % print_every == 0:
            print('Iteration %d, loss = %.4f' % (t, loss.item()))
            check_accuracy_part2(loader_val, model_fn, params)
            print()
BareBones PyTorch: Train a Two-Layer Network
hidden_layer_size = 4000
learning_rate = 1e-2
w1 = random_weight((3 * 32 * 32, hidden_layer_size))
w2 = random_weight((hidden_layer_size, 10))
train_part2(two_layer_fc, [w1, w2], learning_rate)
BareBones PyTorch: Training a ConvNet
The network should have the following architecture:
- Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
- ReLU
- Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
- ReLU
- Fully-connected layer (with bias) to compute scores for 10 classes
# init params
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16
conv_w1 = None
conv_b1 = None
conv_w2 = None
conv_b2 = None
fc_w = None
fc_b = None
# Conv1
conv_w1 = random_weight((channel_1, 3, 5, 5)) # (output_channel_size, input_channel_size, KH, KW)
conv_b1 = zero_weight(channel_1)
#Conv 2
conv_w2 = random_weight((channel_2, channel_1, 3, 3))
conv_b2 = zero_weight(channel_2)
# fc
fc_w = random_weight((channel_2 * 32 * 32, 10))
fc_b = zero_weight(10)
params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]
train_part2(three_layer_convnet, params, learning_rate)
Part III. PyTorch Module API
Module API: Two-Layer Network
# define TwoLayerFC class
class TwoLayerFC(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # assign layer objects to class attributes
        # fc1
        self.fc1 = nn.Linear(input_size, hidden_size)
        nn.init.kaiming_normal_(self.fc1.weight)
        # fc2
        self.fc2 = nn.Linear(hidden_size, num_classes)
        nn.init.kaiming_normal_(self.fc2.weight)

    def forward(self, x):
        # forward always defines connectivity
        x = flatten(x)
        scores = self.fc2(F.relu(self.fc1(x)))
        return scores

def test_TwoLayerFC():
    input_size = 50
    x = torch.zeros((64, input_size), dtype=dtype)  # minibatch size 64, feature dimension 50
    model = TwoLayerFC(input_size, 42, 10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]

test_TwoLayerFC()
Module API: Three-Layer ConvNet
Implement a three-layer ConvNet followed by a fully connected layer:
- Convolutional layer with channel_1 5x5 filters with zero-padding of 2
- ReLU
- Convolutional layer with channel_2 3x3 filters with zero-padding of 1
- ReLU
- Fully-connected layer to num_classes classes
class ThreeLayerConvNet(nn.Module):
    def __init__(self, in_channel, channel_1, channel_2, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channel, channel_1, 5, padding=2)
        nn.init.kaiming_normal_(self.conv1.weight)
        self.conv2 = nn.Conv2d(channel_1, channel_2, 3, padding=1)
        nn.init.kaiming_normal_(self.conv2.weight)
        self.fc = nn.Linear(channel_2 * 32 * 32, num_classes)
        nn.init.kaiming_normal_(self.fc.weight)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        scores = self.fc(flatten(x))
        return scores

def test_ThreeLayerConvNet():
    x = torch.zeros((64, 3, 32, 32), dtype=dtype)  # minibatch size 64, image size [3, 32, 32]
    model = ThreeLayerConvNet(in_channel=3, channel_1=12, channel_2=8, num_classes=10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]

test_ThreeLayerConvNet()
# torch.Size([64, 10])
Module API: Check Accuracy
def check_accuracy_part34(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
Module API: Training Loop
def train_part34(model, optimizer, epochs=1):
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Iteration %d, loss = %.4f' % (t, loss.item()))
                check_accuracy_part34(loader_val, model)
                print()
Module API: Train a Two-Layer Network
hidden_layer_size = 4000
learning_rate = 1e-2
model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10)
# SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
Module API: Train a Three-Layer ConvNet
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16
model = None
optimizer = None
model = ThreeLayerConvNet(3, channel_1, channel_2, 10)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
Part IV. PyTorch Sequential API
Sequential API: Two-Layer Network
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)
hidden_layer_size = 4000
learning_rate = 1e-2
model = nn.Sequential(
Flatten(),
nn.Linear(3 * 32 * 32, hidden_layer_size),
nn.ReLU(),
nn.Linear(hidden_layer_size, 10),
)
# you can use Nesterov momentum in optim.SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
momentum=0.9, nesterov=True)
train_part34(model, optimizer)
Sequential API: Three-Layer ConvNet
train a three-layer ConvNet with the same architecture
- Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
- ReLU
- Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
- ReLU
- Fully-connected layer (with bias) to compute scores for 10 classes
channel_1 = 32
channel_2 = 16
learning_rate = 1e-2
model = None
optimizer = None
model = nn.Sequential(
nn.Conv2d(3, channel_1, 5, padding=2),
nn.ReLU(),
nn.Conv2d(channel_1, channel_2, 3, padding=1),
nn.ReLU(),
nn.Flatten(),
nn.Linear(channel_2 * 32 * 32, 10)
)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)
train_part34(model, optimizer)
Part V. CIFAR-10 open-ended challenge
- DenseNet
Reference
- https://blog.csdn.net/weixin_43399179/article/details/134241238?spm=1001.2014.3001.5501