Assignment 2 code: can be downloaded here
Q1: Multi-Layer Fully Connected Neural Networks
In this question, we need to implement a fully connected network with an arbitrary number of hidden layers.
The main work is implementing the initialization, the forward pass, and the backward pass. We can reuse the functions affine_forward, affine_backward, relu_forward, relu_backward, and softmax_loss.
Recap:
- What is a multi-layer fully connected neural network (MLP)?
- Architecture:
  - at least 3 layers: input, hidden, and output, with an activation function after each hidden layer.
  - Briefly: input -> hidden layer -> activation function -> output
- Formula (two-layer example): scores = ReLU(x·W1 + b1)·W2 + b2
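To make the architecture concrete, here is a minimal NumPy sketch of a two-layer forward pass; the dimensions and variable names are illustrative only (the assignment stores the parameters in self.params):

import numpy as np

# Minimal sketch of a two-layer MLP forward pass (illustration only).
N, D, H, C = 2, 15, 20, 10          # batch size, input dim, hidden dim, classes
x = np.random.randn(N, D)
W1, b1 = np.random.randn(D, H) * 5e-2, np.zeros(H)
W2, b2 = np.random.randn(H, C) * 5e-2, np.zeros(C)

hidden = np.maximum(0, x.dot(W1) + b1)   # affine -> ReLU
scores = hidden.dot(W2) + b2             # affine -> class scores, shape (N, C)
print(scores.shape)                      # (2, 10)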
Initial loss and Gradient Check
Suppose the batch size is 2, the input dimension is 15, the first hidden layer has 20 units, the second hidden layer has 30 units, and there are 10 output classes.
The input X therefore has shape (2, 15), and there are 10 classes.
N, D, H1, H2, C = [2, 15, 20, 30, 10]
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))
The check uses two regularization strengths, 0 and 3.14, and calls the FullyConnectedNet's loss method to compute the loss and gradients.
for reg in [0, 3.14]:
    print("Running check with reg = ", reg)
    model = FullyConnectedNet(
        [H1, H2],          # [20, 30]
        input_dim=D,       # 15
        num_classes=C,     # 10
        reg=reg,           # 0, 3.14
        weight_scale=5e-2,
        dtype=np.float64
    )
    loss, grads = model.loss(X, y)
    print("Initial loss: ", loss)
1. Initialize parameters
In the FullyConnectedNet class, we are supposed to initialize the parameters of the network.
# TODO: Initialize the parameters of the network, storing all values in
# the self.params dictionary. Store weights and biases for the first layer
# in W1 and b1; for the second layer use W2 and b2, etc. Weights should be
# initialized from a normal distribution centered at 0 with standard
# deviation equal to weight_scale. Biases should be initialized to zero.
# When using batch normalization, store scale and shift parameters for the
# first layer in gamma1 and beta1; for the second layer use gamma2 and
# beta2, etc. Scale parameters should be initialized to ones and shift
# parameters should be initialized to zeros.
In this example, there are two hidden layers.
Use a for loop: line up input_dim, hidden_dims, and num_classes into two aligned lists, zip() them together, and iterate over the resulting (fan_in, fan_out) tuples.
# note: hidden_dims is a list, so we use the * notation to unpack it
for n, (i, j) in enumerate(zip([input_dim, *hidden_dims],
                               [*hidden_dims, num_classes])):
    '''
    n = 0, i = 15, j = 20
    n = 1, i = 20, j = 30
    n = 2, i = 30, j = 10
    '''
    self.params[f'W{n+1}'] = np.random.randn(i, j) * weight_scale
    self.params[f'b{n+1}'] = np.zeros(j)
    # with normalization, each hidden layer also gets a scale (gamma) and shift (beta)
    if self.normalization and n < self.num_layers - 1:
        self.params[f'gamma{n+1}'] = np.ones(j)
        self.params[f'beta{n+1}'] = np.zeros(j)
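As a quick sanity check (a hypothetical snippet, reusing the model from the gradient-check cell above), the stored parameter shapes for input_dim=15, hidden_dims=[20, 30], num_classes=10 should be:

# Print the shape of every stored parameter.
for name in sorted(model.params):
    print(name, model.params[name].shape)
# Expected: W1 (15, 20), b1 (20,), W2 (20, 30), b2 (30,), W3 (30, 10), b3 (10,)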
2. Implement forward pass
# TODO: Implement the forward pass for the fully connected net, computing
# the class scores for X and storing them in the scores variable.
# When using dropout, you'll need to pass self.dropout_param to each
# dropout forward pass.
# When using batch normalization, you'll need to pass self.bn_params[0] to
# the forward pass for the first batch normalization layer, pass
# self.bn_params[1] to the forward pass for the second batch normalization
# layer, etc.
- loop over the layers
- use affine_forward to compute each layer's pre-activation output
- apply batch/layer normalization if it is enabled
- pass the result through relu_forward (for all but the last layer)
- apply dropout_forward if dropout is enabled
# loop over the layers
caches = []
out = X
for l in range(self.num_layers):
    # parameter names of this layer
    keys = [f'W{l+1}', f'b{l+1}', f'gamma{l+1}', f'beta{l+1}']
    # parameter values (gamma/beta may be absent)
    w, b, gamma, beta = (self.params.get(k, None) for k in keys)
    # normalization params exist only for hidden layers
    bn = self.bn_params[l] if gamma is not None else None
    # dropout params exist only when dropout is enabled
    do = self.dropout_param if self.use_dropout else None
    # affine forward
    out, fc_cache = affine_forward(out, w, b)
    bn_cache = ln_cache = relu_cache = dropout_cache = None
    # every layer except the last gets normalization / ReLU / dropout
    if l != self.num_layers - 1:
        if bn is not None:
            if 'mode' in bn:
                out, bn_cache = batchnorm_forward(out, gamma, beta, bn)
            else:
                out, ln_cache = layernorm_forward(out, gamma, beta, bn)
        # activation function
        out, relu_cache = relu_forward(out)
        # dropout
        if do is not None:
            out, dropout_cache = dropout_forward(out, do)
    # save this layer's caches for the backward pass
    caches.append((fc_cache, bn_cache, ln_cache, relu_cache, dropout_cache))
scores = out
3. Implement backward pass
- calculate the loss by ‘softmax_loss’
- add reg to loss
- calculate the grads of each layer
- add reg to dW
# calculate the loss
loss, dloss = softmax_loss(scores, y)
# add L2 regularization to the loss (sum over all weight matrices)
loss += 0.5 * self.reg * sum(
    np.sum(self.params[f'W{l+1}'] ** 2) for l in range(self.num_layers)
)
# loop over the layers from the last to the first
dout = dloss
for l in reversed(range(self.num_layers)):
    fc_cache, bn_cache, ln_cache, relu_cache, dropout_cache = caches[l]
    dgamma = dbeta = None
    # every layer except the last has dropout / ReLU / norm to undo
    if l != self.num_layers - 1:
        # dropout backward
        if dropout_cache is not None:
            dout = dropout_backward(dout, dropout_cache)
        # if ReLU was performed
        if relu_cache is not None:
            dout = relu_backward(dout, relu_cache)
        # if normalization was performed
        if bn_cache is not None:
            dout, dgamma, dbeta = batchnorm_backward(dout, bn_cache)
        elif ln_cache is not None:
            dout, dgamma, dbeta = layernorm_backward(dout, ln_cache)
    # affine backward is always needed
    dout, dw, db = affine_backward(dout, fc_cache)
    # save the gradients, adding the regularization term to dW
    grads[f'W{l+1}'] = dw + self.reg * self.params[f'W{l+1}']
    grads[f'b{l+1}'] = db
    if dgamma is not None and l < self.num_layers - 1:
        grads[f'gamma{l+1}'] = dgamma
        grads[f'beta{l+1}'] = dbeta
Optimization
NOTE: more details about optimization can be found in this article.
1. SGD+Momentum
Formula of SGD with momentum:

$$v_{new} = \beta \cdot v_{old} - \alpha \cdot \frac{\partial \, loss}{\partial w_{old}} = \beta \cdot v_{old} - \alpha \cdot dW, \qquad w_{new} = w_{old} + v_{new}$$

Here β is the momentum coefficient and α is the learning rate.
if config is None:
    config = {}
config.setdefault("learning_rate", 1e-2)
config.setdefault("momentum", 0.9)
v = config.get("velocity", np.zeros_like(w))

# momentum update: the velocity accumulates a decaying sum of past gradients
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v
config["velocity"] = v
2. RMSProp
Formula:
$$E[grad^2_{new}] = \beta \, E[grad^2_{old}] + (1-\beta)\left(\frac{\partial Loss}{\partial w_{old}}\right)^2, \qquad w_{new} = w_{old} - \frac{\alpha}{\sqrt{E[grad^2_{new}]} + \epsilon} \cdot \frac{\partial Loss}{\partial w_{old}}$$

E[grad²_new] is a moving average of the squared gradient; ε is a small constant for numerical stability.
if config is None:
    config = {}
config.setdefault('learning_rate', 1e-2)
config.setdefault('decay_rate', 0.99)
config.setdefault('epsilon', 1e-8)
config.setdefault('cache', np.zeros_like(w))

# moving average of the squared gradient
gsq = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw * dw
# scale the step by the root of the squared-gradient average
next_w = w - config['learning_rate'] * dw / (np.sqrt(gsq) + config['epsilon'])
config['cache'] = gsq
3. Adam
Adam combines momentum (as in SGD+Momentum) with RMSProp's squared-gradient scaling, plus bias correction.
"""
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None:
    config = {}
config.setdefault('learning_rate', 1e-3)
config.setdefault('beta1', 0.9)
config.setdefault('beta2', 0.999)
config.setdefault('epsilon', 1e-8)
config.setdefault('m', np.zeros_like(w))
config.setdefault('v', np.zeros_like(w))
config.setdefault('t', 0)

config['t'] += 1
# moving average of the gradient (first moment)
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
# moving average of the squared gradient (second moment)
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dw * dw
# bias-corrected estimates
m_hat = config['m'] / (1 - config['beta1'] ** config['t'])
v_hat = config['v'] / (1 - config['beta2'] ** config['t'])
# update
next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
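These update rules are consumed by the assignment's Solver. A minimal usage sketch, assuming the standard cs231n Solver interface and a data dictionary with X_train/y_train/X_val/y_val already prepared:

from cs231n.solver import Solver

# Train the same architecture with different update rules to compare convergence.
for update_rule in ['sgd', 'sgd_momentum', 'rmsprop', 'adam']:
    model = FullyConnectedNet([100, 100], weight_scale=5e-2)
    solver = Solver(
        model, data,
        num_epochs=5, batch_size=100,
        update_rule=update_rule,
        optim_config={'learning_rate': 1e-3},
        verbose=True,
    )
    solver.train()
    print(update_rule, 'best val accuracy:', solver.best_val_acc)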
Q2: Batch Normalization
The course note that discusses batch normalization in detail.
1. forward
PIPELINE:
- Calculate the mean of each feature over the batch (sum over the batch, divided by N):
  mu = x.mean(axis=0)   # shape (D,)
- Calculate the variance of each feature over the batch:
  var = x.var(axis=0)   # shape (D,)
- Normalize:
  eps = bn_param.get("eps", 1e-5)   # eps: constant for numeric stability
  std = np.sqrt(var + eps)          # batch standard deviation for each feature
  x_hat = (x - mu) / std            # standardized x
- Scale and shift:
  # gamma: scale parameter of shape (D,)
  # beta: shift parameter of shape (D,)
  out = gamma * x_hat + beta
  shape = bn_param.get('shape', (N, D))   # reshape used in backprop
  axis = bn_param.get('axis', 0)          # axis to sum over in backprop
  cache = x, mu, var, std, gamma, x_hat, shape, axis   # save for backprop
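The pipeline above covers training mode; the full batchnorm_forward also has to maintain running statistics for test time. A minimal sketch of the train/test branching, assuming the cache layout above and the usual bn_param keys (mode, momentum, running_mean, running_var):

def batchnorm_forward_sketch(x, gamma, beta, bn_param):
    """Minimal sketch of batch normalization forward (train/test), for illustration."""
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    if mode == 'train':
        mu, var = x.mean(axis=0), x.var(axis=0)
        std = np.sqrt(var + eps)
        x_hat = (x - mu) / std
        # exponentially decayed running statistics, used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
        cache = (x, mu, var, std, gamma, x_hat, (N, D), 0)
    else:  # test mode: normalize with the stored running statistics
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
        cache = None

    out = gamma * x_hat + beta
    bn_param['running_mean'], bn_param['running_var'] = running_mean, running_var
    return out, cache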
2. backward
PIPELINE (first unpack the cache: x, mu, var, std, gamma, x_hat, shape, axis = cache):
- dbeta = dout.sum(axis=axis)
- dgamma = np.sum(dout * x_hat, axis=axis)
- dx_hat = dout * gamma
- dstd = -np.sum(dx_hat * (x - mu), axis=0) / (std**2)
- dvar = 0.5 * dstd / std
- dx1 = dx_hat / std + 2 * (x - mu) * dvar / len(dout)
- dmu = -np.sum(dx1, axis=0)
- dx2 = dmu / len(dout)
- dx = dx1 + dx2
Q3. Dropout
The principle of dropout is to randomly set neurons' outputs to 0 during training.
1.forward
# 1. forward: drop neurons with inverted dropout
"""
x: Input data, of any shape.
In training mode, mask is the dropout mask used to multiply the input;
in test mode, mask is None.
p is the probability of keeping a neuron's output, as opposed to the probability of dropping it.
"""
p, mode = dropout_param["p"], dropout_param["mode"]
mask = None
out = None
if mode == "train":
    # inverted dropout: keep each neuron with probability p and rescale by 1/p
    mask = (np.random.rand(*x.shape) < p) / p
    out = x * mask
elif mode == "test":
    # at test time dropout is a no-op
    out = x
2. backward
dx = dout * mask  # train mode; in test mode, dx = dout
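A quick way to see why the mask is divided by p (inverted dropout): the expected value of the output then matches the input, so nothing needs to change at test time. A small hypothetical check:

import numpy as np

np.random.seed(0)
x = np.random.randn(500, 500) + 10   # toy activations
p = 0.6                              # keep probability

mask = (np.random.rand(*x.shape) < p) / p   # inverted dropout mask
out = x * mask

# The means should be close: dividing by p compensates for the dropped neurons.
print('x mean  :', x.mean())
print('out mean:', out.mean())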
Q.4 Convolutional Neural Networks
The convolutional layer takes an input x with shape (N, C, H, W), where each parameter represents the batch size, number of channels, height, and width, respectively. This input is then convolved with a filter. There are various ways to perform the convolution, and one important aspect is the stride. To ensure that edge pixels are captured, padding is often used during the convolution process.
1. Conv
1. Forward
"""
params:
Returns a tuple of:
- out: Output data, of shape (N, F, H', W') where H' and W' are given by
H' = 1 + (H + 2 * pad - HH) / stride
W' = 1 + (W + 2 * pad - WW) / stride
- cache: (x, w, b, conv_param)
"""
"""
-
get the padding
pad = conv_param['pad'] # up = right = down = left
-
get stride
stride = conv_param['stride']
-
get the input dim of x
N, C, HI, WI = x.shape
-
get the filter dims of w
F, _, HF, WF = w.shape
-
output height
HO = 1 + (HI + 2 * pad - HF) // stride # H' = 1 + (H + 2 * pad - HH) / stride
-
output width
WO = 1 + (WI + 2 * pad - WF) // stride # W' = 1 + (W + 2 * pad - WW) / stride
-
create output tensor after convolution layer
# create output tensor after convolution layer out = np.zeros((N, F, HO, WO))
-
padding all output data
x_pad = np.pad(x, ((0,0), (0,0),(pad,pad),(pad,pad)), 'constant') H_pad, W_pad = x_pad.shape[2], x_pad.shape[3]
-
create w_row matrix and x_col
w_row = w.reshape(F, C*FH*FW) x_col = np.zeros((C*FH*FW, outH*outW))
-
implement stride to each input
# loop all the batch for index in range(N): neuron = 0 # loop height for i in range(0, H_pad-FH+1, stride): # loop width for j in range(0, W_pad-FW+1,stride): x_col[:,neuron] = x_pad[index,:,i:i+FH,j:j+FW].reshape(C*FH*FW) neuron += 1 out[index] = (w_row.dot(x_col) + b.reshape(F,1)).reshape(F, outH, outW)
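As a quick sanity check, a hedged usage example (assuming the code above is wrapped in conv_forward_naive(x, w, b, conv_param), as in the assignment) that verifies the output shape:

# 4 images, 3 channels, 32x32; 8 filters of size 3x5x5; stride 1, pad 2
x = np.random.randn(4, 3, 32, 32)
w = np.random.randn(8, 3, 5, 5)
b = np.random.randn(8)
conv_param = {'stride': 1, 'pad': 2}

out, _ = conv_forward_naive(x, w, b, conv_param)
# H' = 1 + (32 + 2*2 - 5) / 1 = 32, and likewise for W'
print(out.shape)   # (4, 8, 32, 32)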
2. backward
# unpack the cache from the forward pass
x, w, b, conv_param = cache
pad, stride = conv_param['pad'], conv_param['stride']
N, F, outH, outW = dout.shape
N, C, HI, WI = x.shape
F, _, FH, FW = w.shape
x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
Hpad, Wpad = x_pad.shape[2], x_pad.shape[3]
# initialize gradients
dx = np.zeros((N, C, Hpad - 2 * pad, Wpad - 2 * pad))
dw, db = np.zeros(w.shape), np.zeros(b.shape)
# create w_row matrix
w_row = w.reshape(F, C * FH * FW)
# create x_col matrix with the values each neuron is connected to
x_col = np.zeros((C * FH * FW, outH * outW))
for index in range(N):
    out_col = dout[index].reshape(F, outH * outW)
    w_out = w_row.T.dot(out_col)
    dx_cur = np.zeros((C, Hpad, Wpad))
    neuron = 0
    for i in range(0, Hpad - FH + 1, stride):
        for j in range(0, Wpad - FW + 1, stride):
            dx_cur[:, i:i+FH, j:j+FW] += w_out[:, neuron].reshape(C, FH, FW)
            x_col[:, neuron] = x_pad[index, :, i:i+FH, j:j+FW].reshape(C * FH * FW)
            neuron += 1
    dx[index] = dx_cur[:, pad:-pad, pad:-pad]
    dw += out_col.dot(x_col.T).reshape(F, C, FH, FW)
    db += out_col.sum(axis=1)
2. Max-pool
1. forward
N, C, H, W = x.shape
stride = pool_param['stride']
PH = pool_param['pool_height']
PW = pool_param['pool_width']
outH = 1 + (H - PH) // stride
outW = 1 + (W - PW) // stride
# create the output tensor of the pooling layer
out = np.zeros((N, C, outH, outW))
for index in range(N):
    out_col = np.zeros((C, outH * outW))
    neuron = 0
    for i in range(0, H - PH + 1, stride):
        for j in range(0, W - PW + 1, stride):
            pool_region = x[index, :, i:i+PH, j:j+PW].reshape(C, PH * PW)
            out_col[:, neuron] = pool_region.max(axis=1)
            neuron += 1
    out[index] = out_col.reshape(C, outH, outW)
2. backward
x, pool_param = cache
N, C, outH, outW = dout.shape
H, W = x.shape[2], x.shape[3]
stride = pool_param['stride']
PH, PW = pool_param['pool_height'], pool_param['pool_width']
# initialize gradient
dx = np.zeros(x.shape)
for index in range(N):
    dout_row = dout[index].reshape(C, outH * outW)
    neuron = 0
    for i in range(0, H - PH + 1, stride):
        for j in range(0, W - PW + 1, stride):
            pool_region = x[index, :, i:i+PH, j:j+PW].reshape(C, PH * PW)
            max_pool_indices = pool_region.argmax(axis=1)
            dout_cur = dout_row[:, neuron]
            neuron += 1
            # pass the gradient only through the indices of the max pool
            dmax_pool = np.zeros(pool_region.shape)
            dmax_pool[np.arange(C), max_pool_indices] = dout_cur
            dx[index, :, i:i+PH, j:j+PW] += dmax_pool.reshape(C, PH, PW)
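For intuition, a tiny hedged example (assuming the forward code above is wrapped in max_pool_forward_naive(x, pool_param), as in the assignment): a 4x4 input with a 2x2 pool and stride 2 produces a 2x2 output holding the max of each window.

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)   # one image, one channel
pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

out, _ = max_pool_forward_naive(x, pool_param)
print(out[0, 0])
# [[ 5.  7.]
#  [13. 15.]]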
Q.5 PyTorch
GPU
Train the model on a GPU if one is available; first detect whether the machine has one.
# using CUDA if available
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# using an Apple silicon chip (MPS backend)
if torch.backends.mps.is_available():
    device = torch.device('mps')
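A compact way to combine both checks into a single preference order (CUDA, then MPS, then CPU), just as a convenience sketch:

import torch

# pick the best available device: CUDA GPU > Apple silicon MPS > CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
print('using device:', device)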
Part I. Preparation
Use torchvision to load the CIFAR-10 dataset.
torchvision.transforms:
- data preprocessing
- data augmentation
import torchvision.transforms as T
# set up a transform to preprocess the data by subtracting the mean RGB value and dividing by the standard deviation of each RGB value
transform = T.Compose([
T.ToTensor(), # convert PIL images to tensors
T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)) # normalize
])
import torchvision.datasets as dset
from torch.utils.data import DataLoader, sampler

# split the dataset into train/val/test
# wrap each dataset in a DataLoader, which iterates through the Dataset and forms minibatches
NUM_TRAIN = 49000
cifar10_train = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64,
sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))
cifar10_val = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64,
sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))
cifar10_test = dset.CIFAR10('./cs231n/datasets', train=False, download=True,
transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)
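To confirm the loaders work, you can pull a single minibatch and check its shape (a quick sketch; loader_train comes from the cell above):

# grab one minibatch from the training loader and inspect it
x_batch, y_batch = next(iter(loader_train))
print(x_batch.shape)   # torch.Size([64, 3, 32, 32])
print(y_batch.shape)   # torch.Size([64])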
Part II. Barebones PyTorch
PyTorch Tensors: Flatten Function
Flatten each input into a vector.
def flatten(x):
    return x.view(x.shape[0], -1)
Barebones PyTorch: Two-Layer Network
Define two_layer_fc, which performs the forward pass of a two-layer fully connected ReLU network on a batch of image data.
import torch.nn.functional as F

def two_layer_fc(x, params):
    x = flatten(x)
    w1, w2 = params
    # fully connected -> ReLU -> fully connected layer
    x = F.relu(x.mm(w1))
    x = x.mm(w2)
    return x
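A quick shape check, mirroring the assignment's test cell (the hidden size 42 here is arbitrary):

# feed a dummy minibatch of zeros through the network and check the output shape
hidden_size = 42
x = torch.zeros((64, 3, 32, 32), dtype=torch.float32)
w1 = torch.zeros((3 * 32 * 32, hidden_size), dtype=torch.float32)
w2 = torch.zeros((hidden_size, 10), dtype=torch.float32)
scores = two_layer_fc(x, [w1, w2])
print(scores.size())   # torch.Size([64, 10])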
Barebones PyTorch: Three-Layer ConvNet
The network should have the following architecture:
1. A convolutional layer (with bias) with `channel_1` filters, each with shape `KW1 x KH1`, and zero-padding of two
2. ReLU nonlinearity
3. A convolutional layer (with bias) with `channel_2` filters, each with shape `KW2 x KH2`, and zero-padding of one
4. ReLU nonlinearity
5. Fully-connected layer with bias, producing scores for C classes
def three_layer_convnet(x, params):
    conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params
    # 1. convolutional layer (with bias) with `channel_1` filters of shape `KW1 x KH1`, zero-padding of two
    # 2. ReLU nonlinearity
    x = F.relu(F.conv2d(x, conv_w1, conv_b1, padding=2))
    # 3. convolutional layer (with bias) with `channel_2` filters of shape `KW2 x KH2`, zero-padding of one
    # 4. ReLU nonlinearity
    x = F.relu(F.conv2d(x, conv_w2, conv_b2, padding=1))
    # 5. fully-connected layer with bias, producing scores for C classes
    scores = flatten(x).mm(fc_w) + fc_b
    return scores
Barebones PyTorch: Initialization
Initialize the weights:
- random_weight(shape) initializes a weight tensor with the Kaiming normalization method.
- zero_weight(shape) initializes a weight tensor with all zeros; useful for instantiating bias parameters.
def random_weight(shape):
    if len(shape) == 2:  # FC weight
        fan_in = shape[0]
    else:
        fan_in = np.prod(shape[1:])  # conv weight: [out_channel, in_channel, kH, kW]
    # Kaiming initialization: scale by sqrt(2 / fan_in)
    w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)
    w.requires_grad = True
    return w

def zero_weight(shape):
    return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)
Barebones PyTorch: Check Accuracy
def check_accuracy_part2(loader, model_fn, params):
    """
    Inputs:
    - loader: A DataLoader for the data split we want to check
    - model_fn: A function that performs the forward pass of the model,
      with the signature scores = model_fn(x, params)
    - params: List of PyTorch Tensors giving parameters of the model
    """
    # which split are we checking?
    split = 'val' if loader.dataset.train else 'test'
    print('Checking accuracy on the %s set' % split)
    num_correct, num_samples = 0, 0
    # no need to compute gradients here
    with torch.no_grad():
        for x, y in loader:
            # move to device
            x = x.to(device=device, dtype=dtype)
            y = y.to(device=device, dtype=torch.int64)  # y is the label
            scores = model_fn(x, params)
            # predicted class = argmax over the scores
            _, preds = scores.max(1)
            # count the correct predictions
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))
BareBones PyTorch: Training Loop
train the model using stochastic gradient descent without momentum
def train_part2(model_fn, params, learning_rate):
    for t, (x, y) in enumerate(loader_train):
        # move to device
        x = x.to(device=device, dtype=dtype)
        y = y.to(device=device, dtype=torch.long)
        # compute scores and loss
        scores = model_fn(x, params)
        loss = F.cross_entropy(scores, y)
        # backward pass
        loss.backward()
        # update the weights
        with torch.no_grad():
            # vanilla SGD step
            for w in params:
                w -= learning_rate * w.grad
                w.grad.zero_()
        if t % print_every == 0:
            print('Iteration %d, loss = %.4f' % (t, loss.item()))
            check_accuracy_part2(loader_val, model_fn, params)
            print()
BareBones PyTorch: Train a Two-Layer Network
hidden_layer_size = 4000
learning_rate = 1e-2
w1 = random_weight((3 * 32 * 32, hidden_layer_size))
w2 = random_weight((hidden_layer_size, 10))
train_part2(two_layer_fc, [w1, w2], learning_rate)
BareBones PyTorch: Training a ConvNet
The network should have the following architecture:
- Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
- ReLU
- Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
- ReLU
- Fully-connected layer (with bias) to compute scores for 10 classes
# init params
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16
conv_w1 = None
conv_b1 = None
conv_w2 = None
conv_b2 = None
fc_w = None
fc_b = None
# Conv1
conv_w1 = random_weight((channel_1, 3, 5, 5)) # (output_channel_size, input_channel_size, KH, KW)
conv_b1 = zero_weight(channel_1)
#Conv 2
conv_w2 = random_weight((channel_2, channel_1, 3, 3))
conv_b2 = zero_weight(channel_2)
# fc
fc_w = random_weight((channel_2 * 32 * 32, 10))
fc_b = zero_weight(10)
params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]
train_part2(three_layer_convnet, params, learning_rate)
Part III. PyTorch Module API
Module API: Two-Layer Network
# define TwoLayerFC class
class TwoLayerFC(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # assign layer objects to class attributes
        # fc1
        self.fc1 = nn.Linear(input_size, hidden_size)
        nn.init.kaiming_normal_(self.fc1.weight)
        # fc2
        self.fc2 = nn.Linear(hidden_size, num_classes)
        nn.init.kaiming_normal_(self.fc2.weight)

    def forward(self, x):
        # forward always defines connectivity
        x = flatten(x)
        scores = self.fc2(F.relu(self.fc1(x)))
        return scores

def test_TwoLayerFC():
    input_size = 50
    x = torch.zeros((64, input_size), dtype=dtype)  # minibatch size 64, feature dimension 50
    model = TwoLayerFC(input_size, 42, 10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]

test_TwoLayerFC()
Module API: Three-Layer ConvNet
Implement a three-layer ConvNet followed by a fully connected layer:
- Convolutional layer with channel_1 5x5 filters with zero-padding of 2
- ReLU
- Convolutional layer with channel_2 3x3 filters with zero-padding of 1
- ReLU
- Fully-connected layer to num_classes classes
class ThreeLayerConvNet(nn.Module):
    def __init__(self, in_channel, channel_1, channel_2, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channel, channel_1, 5, padding=2)
        nn.init.kaiming_normal_(self.conv1.weight)
        self.conv2 = nn.Conv2d(channel_1, channel_2, 3, padding=1)
        nn.init.kaiming_normal_(self.conv2.weight)
        self.fc = nn.Linear(channel_2 * 32 * 32, num_classes)
        nn.init.kaiming_normal_(self.fc.weight)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        scores = self.fc(flatten(x))
        return scores

def test_ThreeLayerConvNet():
    x = torch.zeros((64, 3, 32, 32), dtype=dtype)  # minibatch size 64, image size [3, 32, 32]
    model = ThreeLayerConvNet(in_channel=3, channel_1=12, channel_2=8, num_classes=10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]

test_ThreeLayerConvNet()
# torch.Size([64, 10])
Module API: Check Accuracy
def check_accuracy_part34(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
Module API: Training Loop
def train_part34(model, optimizer, epochs=1):
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Iteration %d, loss = %.4f' % (t, loss.item()))
                check_accuracy_part34(loader_val, model)
                print()
Module API: Train a Two-Layer Network
hidden_layer_size = 4000
learning_rate = 1e-2
model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10)
# SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
Module API: Train a Three-Layer ConvNet
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16
model = None
optimizer = None
model = ThreeLayerConvNet(3, channel_1, channel_2, 10)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
Part IV. PyTorch Sequential API
Sequential API: Two-Layer Network
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)
hidden_layer_size = 4000
learning_rate = 1e-2
model = nn.Sequential(
Flatten(),
nn.Linear(3 * 32 * 32, hidden_layer_size),
nn.ReLU(),
nn.Linear(hidden_layer_size, 10),
)
# you can use Nesterov momentum in optim.SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
momentum=0.9, nesterov=True)
train_part34(model, optimizer)
Sequential API: Three-Layer ConvNet
train a three-layer ConvNet with the same architecture
- Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
- ReLU
- Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
- ReLU
- Fully-connected layer (with bias) to compute scores for 10 classes
channel_1 = 32
channel_2 = 16
learning_rate = 1e-2
model = None
optimizer = None
model = nn.Sequential(
nn.Conv2d(3, channel_1, 5, padding=2),
nn.ReLU(),
nn.Conv2d(channel_1, channel_2, 3, padding=1),
nn.ReLU(),
nn.Flatten(),
nn.Linear(channel_2 * 32 * 32, 10)
)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)
train_part34(model, optimizer)
Part V. CIFAR-10 open-ended challenge
- DenseNet
Reference
- https://blog.csdn.net/weixin_43399179/article/details/134241238?spm=1001.2014.3001.5501