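What is a fully connected network?
Every neuron in a layer is connected to every neuron in the adjacent layers.
Linear layer
Forward pass (forward)
output = input * weight + bias; out = x*w + b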
out = x.view(x.shape[0], -1).mm(w) + b
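x has shape (N, d_1, ..., d_k), where N is the number of samples.
Each sample has shape (d_1, ..., d_k).
w shape: (D, M), where D = d_1 * ... * d_k
b shape: (M,)
Because the input and the weight have incompatible shapes, x.view() is used to reshape x to (N, D).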
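The shape bookkeeping above can be sketched as follows. This is a minimal illustration, not the notes' own code: the function name `linear_forward` and the toy sizes are assumptions, and `reshape` is used in place of `view` only because it also accepts non-contiguous tensors.

```python
import torch

def linear_forward(x, w, b):
    """Flatten each sample to a row vector, then compute out = x*w + b."""
    N = x.shape[0]
    x_flat = x.reshape(N, -1)   # (N, D), with D = d_1 * ... * d_k
    return x_flat.mm(w) + b     # (N, D) @ (D, M) + (M,) -> (N, M)

# Toy sizes, chosen only for illustration.
N, D, M = 2, 3 * 4 * 4, 5
x = torch.randn(N, 3, 4, 4)     # each sample has shape (3, 4, 4)
w = torch.randn(D, M)
b = torch.zeros(M)

out = linear_forward(x, w, b)
print(tuple(out.shape))         # (2, 5)
```

The bias of shape (M,) is broadcast across the N rows, so no explicit tiling is needed.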
Backward pass (backward)
Compute each gradient from the upstream gradient:
downstream gradient = local gradient × upstream gradient
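# dout: upstream gradient, shape (N, M)
# dw = x^T * dout, shape (D, M), same as w
dw = x.view(x.shape[0], -1).t().mm(dout)
# dx = dout * w^T, shape (N, D), reshaped back to the shape of x
dx = dout.mm(w.t()).reshape(x.shape)
# db sums dout over the N samples, shape (M,), same as b
db = torch.sum(dout, dim=0)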
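One way to convince yourself that these three formulas are right is to compare them with PyTorch's autograd. The sketch below assumes a helper named `linear_backward` (a hypothetical name) and small random tensors in float64.

```python
import torch

def linear_backward(dout, x, w):
    """Gradients of out = x_flat @ w + b given the upstream gradient dout."""
    x_flat = x.reshape(x.shape[0], -1)      # (N, D)
    dx = dout.mm(w.t()).reshape(x.shape)    # (N, M) @ (M, D) -> shape of x
    dw = x_flat.t().mm(dout)                # (D, N) @ (N, M) -> (D, M)
    db = dout.sum(dim=0)                    # (M,)
    return dx, dw, db

torch.manual_seed(0)
x = torch.randn(4, 2, 3, dtype=torch.float64, requires_grad=True)
w = torch.randn(6, 5, dtype=torch.float64, requires_grad=True)
b = torch.randn(5, dtype=torch.float64, requires_grad=True)
dout = torch.randn(4, 5, dtype=torch.float64)

# Reference gradients from autograd.
out = x.reshape(4, -1).mm(w) + b
out.backward(dout)

dx, dw, db = linear_backward(dout, x.detach(), w.detach())
print(torch.allclose(dx, x.grad), torch.allclose(dw, w.grad), torch.allclose(db, b.grad))
```

All three comparisons should agree, since the manual formulas and autograd compute the same matrix products.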
Activation functions:
1. ReLU
forward
out = x.clone()
out[out < 0] = 0
backward
dx = dout * (x > 0)
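A small worked example of the two ReLU rules above. The function names `relu_forward` and `relu_backward` are assumptions introduced for illustration.

```python
import torch

def relu_forward(x):
    out = x.clone()          # copy so the input is left untouched
    out[out < 0] = 0         # negative activations are clamped to zero
    return out

def relu_backward(dout, x):
    # The gradient passes through only where the input was positive.
    return dout * (x > 0)

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
dout = torch.ones_like(x)

print(relu_forward(x).tolist())         # [0.0, 0.0, 0.0, 1.5]
print(relu_backward(dout, x).tolist())  # [0.0, 0.0, 0.0, 1.0]
```

Note that at x == 0 this convention passes no gradient, which matches the `x > 0` test in the notes.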
"Sandwich" layers
input layer -> hidden layer (activation function) -> output layer
ReLU activation function
Layer 1: linear layer
Layer 2: ReLU
Layer 3: output layer
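The sandwich idea can be sketched by composing the linear and ReLU layers. The class names `Linear` and `Linear_ReLU` follow the calls used later in these notes, but the bodies and the cache layout below are assumptions, not the original implementation.

```python
import torch

class Linear:
    @staticmethod
    def forward(x, w, b):
        out = x.reshape(x.shape[0], -1).mm(w) + b
        return out, (x, w, b)               # cache what backward needs

    @staticmethod
    def backward(dout, cache):
        x, w, b = cache
        dx = dout.mm(w.t()).reshape(x.shape)
        dw = x.reshape(x.shape[0], -1).t().mm(dout)
        db = dout.sum(dim=0)
        return dx, dw, db

class ReLU:
    @staticmethod
    def forward(x):
        return x.clamp(min=0), x            # cache the input

    @staticmethod
    def backward(dout, cache):
        return dout * (cache > 0)

class Linear_ReLU:
    """Sandwich layer: linear followed by ReLU."""
    @staticmethod
    def forward(x, w, b):
        a, fc_cache = Linear.forward(x, w, b)
        out, relu_cache = ReLU.forward(a)
        return out, (fc_cache, relu_cache)

    @staticmethod
    def backward(dout, cache):
        fc_cache, relu_cache = cache
        da = ReLU.backward(dout, relu_cache)   # undo ReLU first
        return Linear.backward(da, fc_cache)   # then the linear layer

torch.manual_seed(0)
x = torch.randn(3, 4, dtype=torch.float64)
w = torch.randn(4, 2, dtype=torch.float64)
b = torch.randn(2, dtype=torch.float64)
out, cache = Linear_ReLU.forward(x, w, b)
dx, dw, db = Linear_ReLU.backward(torch.ones_like(out), cache)
print(tuple(out.shape), tuple(dx.shape), tuple(dw.shape), tuple(db.shape))
```

The backward pass visits the layers in reverse order, which is why the caches are unpacked ReLU first.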
Two-layer network
Layer 1: h1 = relu(x*W1 + b1)
Layer 2: scores = h1*W2 + b2
Initialization
weight_scale=1e-3
input_dim=3 * 32 * 32
hidden_dim=100
# first layer
# weights are drawn from a Gaussian centered at 0.0 with standard deviation weight_scale
self.W1 = weight_scale * torch.randn(input_dim, hidden_dim, dtype=dtype).to(device)
# biases are initialized to zero
self.b1 = torch.zeros(hidden_dim, dtype=dtype).to(device)
# second layer: maps the hidden_dim features to num_classes scores
self.W2 = weight_scale * torch.randn(hidden_dim, num_classes, dtype=dtype).to(device)
self.b2 = torch.zeros(num_classes, dtype=dtype).to(device)
# all parameters are stored in the dictionary self.params
self.params = {'W1': self.W1, 'b1': self.b1, 'W2': self.W2, 'b2': self.b2}
Computing the loss
Step 1: compute h1
N = X.shape[0]
X_mat = X.view(N, -1) # reshape X to (N, D) so it can be multiplied with W
# h1 = relu(W1*x+b1)
h1, cache1 = Linear_ReLU.forward(X_mat, self.params['W1'], self.params['b1'])
Step 2: compute the scores (predictions)
scores, cache2 = Linear.forward(h1, self.params['W2'], self.params['b2'])
Step 3: compute the loss with softmax, add the L2 regularization penalty, and compute the gradients with respect to the parameters
loss, dloss = softmax_loss(scores, y)
loss += self.reg * (torch.sum(self.params['W1'] * self.params['W1']) + torch.sum(self.params['W2'] * self.params['W2']))
dh1, dW2, db2 = Linear.backward(dloss, cache2)
dx, dW1, db1 = Linear_ReLU.backward(dh1, cache1)
dW1 += 2 * self.reg * self.params['W1']
dW2 += 2 * self.reg * self.params['W2']
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
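The snippet above calls `softmax_loss` without showing it. The following is a minimal sketch of what such a helper could look like, assuming it returns the mean cross-entropy loss and its gradient with respect to the scores; the body is an assumption, not the notes' implementation. It is checked against `torch.nn.functional.cross_entropy`.

```python
import torch
import torch.nn.functional as F

def softmax_loss(scores, y):
    """Mean cross-entropy loss and its gradient with respect to scores."""
    N = scores.shape[0]
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(dim=1, keepdim=True).values
    log_probs = shifted - shifted.exp().sum(dim=1, keepdim=True).log()
    loss = -log_probs[torch.arange(N), y].mean()
    # d(loss)/d(scores) = (softmax(scores) - one_hot(y)) / N
    dscores = log_probs.exp()
    dscores[torch.arange(N), y] -= 1
    return loss, dscores / N

torch.manual_seed(0)
scores = torch.randn(5, 3, dtype=torch.float64, requires_grad=True)
y = torch.tensor([0, 2, 1, 1, 0])

loss, dscores = softmax_loss(scores.detach(), y)

# Reference values from PyTorch.
ref = F.cross_entropy(scores, y)
ref.backward()
print(torch.allclose(loss, ref), torch.allclose(dscores, scores.grad))
```

Because the loss is averaged over N, the gradient handed to `Linear.backward` is already divided by N, so the layer gradients need no further scaling.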
Multi-layer network
{linear - relu - [dropout]} x (L - 1) - linear - softmax
Initialization, loss, and gradient check
Step 1: initialize w and b
params = []
self.hidden_dims = hidden_dims
# input layer
params.append(('W1', weight_scale * torch.randn(input_dim, hidden_dims[0], device=device)))
params.append(('b1', torch.zeros(hidden_dims[0]).to(device)))
# hidden layer
for i in range(2, len(hidden_dims) + 1):
    params.append(('W' + str(i), weight_scale * torch.randn(hidden_dims[i - 2], hidden_dims[i - 1], device=device)))
    params.append(('b' + str(i), torch.zeros(hidden_dims[i - 1], device=device)))
# output layer
params.append(('W' + str(len(hidden_dims) + 1),weight_scale * torch.randn(hidden_dims[-1], num_classes, device=device)))
params.append(('b' + str(len(hidden_dims) + 1), torch.zeros(num_classes, device=device)))
self.params = dict(params)
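The three cases above (input, hidden, output) all follow one pattern: layer i maps dims[i-1] to dims[i]. A compact sketch of the same initialization, assuming a hypothetical free function `init_params` and example sizes that are not taken from the notes:

```python
import torch

def init_params(input_dim, hidden_dims, num_classes, weight_scale=1e-3):
    # Layer i maps dims[i - 1] features to dims[i] features.
    dims = [input_dim] + list(hidden_dims) + [num_classes]
    params = {}
    for i in range(1, len(dims)):
        params['W{}'.format(i)] = weight_scale * torch.randn(dims[i - 1], dims[i])
        params['b{}'.format(i)] = torch.zeros(dims[i])
    return params

# Example sizes chosen for illustration only.
params = init_params(3 * 32 * 32, [100, 50], 10)
for name, p in params.items():
    print(name, tuple(p.shape))
# W1 (3072, 100)
# b1 (100,)
# W2 (100, 50)
# b2 (50,)
# W3 (50, 10)
# b3 (10,)
```

Printing the shapes is a quick sanity check that consecutive layers line up before running a forward pass.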
Step 2: forward pass to compute the scores
cache_dict = {}
last_out = X  # X is the input batch
# input layer + hidden layers: linear - relu - [dropout]
for n in range(self.num_layers - 1):
    i = n + 1
    last_out, cache_dict['cache_LR{}'.format(i)] = Linear_ReLU.forward(last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
    if self.use_dropout:
        last_out, cache_dict['cache_Dropout{}'.format(i)] = Dropout.forward(last_out, self.dropout_param)
# output layer: plain linear layer, no ReLU
i = self.num_layers
last_out, cache_dict['cache_L{}'.format(i)] = Linear.forward(last_out, self.params['W{}'.format(i)], self.params['b{}'.format(i)])
scores = last_out
Step 3: compute the loss (softmax), then backpropagate to compute the gradients
grads = {}
# softmax loss and its gradient with respect to the scores
loss, dout = softmax_loss(scores, y)
# output layer (i is still the index of the last layer): regularization and gradients
loss += (self.params['W{}'.format(i)] * self.params['W{}'.format(i)]).sum() * self.reg
last_dout, dw, db = Linear.backward(dout, cache_dict['cache_L{}'.format(i)])
grads['W{}'.format(i)] = dw + 2 * self.params['W{}'.format(i)] * self.reg
grads['b{}'.format(i)] = db
# remaining layers, from last to first
for n in range(self.num_layers - 1)[::-1]:
    i = n + 1
    if self.use_dropout:
        last_dout = Dropout.backward(last_dout, cache_dict['cache_Dropout{}'.format(i)])
    last_dout, dw, db = Linear_ReLU.backward(last_dout, cache_dict['cache_LR{}'.format(i)])
    grads['W{}'.format(i)] = dw + 2 * self.params['W{}'.format(i)] * self.reg
    grads['b{}'.format(i)] = db
    loss += (self.params['W{}'.format(i)] * self.params['W{}'.format(i)]).sum() * self.reg
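A gradient check compares an analytic gradient with a finite-difference estimate. The sketch below is one possible way to do it, not the notes' own checker: `numeric_gradient` is a hypothetical helper, and the toy loss (a linear term plus the same `reg * sum(W * W)` penalty used above) was chosen so that the `2 * reg * W` term can be verified by hand.

```python
import torch

def numeric_gradient(f, x, h=1e-6):
    """Centered-difference estimate of df/dx for a scalar-valued f."""
    grad = torch.zeros_like(x)
    flat_x, flat_g = x.view(-1), grad.view(-1)   # views share memory with x and grad
    for i in range(flat_x.numel()):
        old = flat_x[i].item()
        flat_x[i] = old + h
        f_plus = f(x).item()
        flat_x[i] = old - h
        f_minus = f(x).item()
        flat_x[i] = old                          # restore the original value
        flat_g[i] = (f_plus - f_minus) / (2 * h)
    return grad

torch.manual_seed(0)
reg = 0.1
x = torch.randn(4, 3, dtype=torch.float64)
W = torch.randn(3, 2, dtype=torch.float64)

def loss_fn(W):
    # data term + L2 penalty, mirroring loss += reg * sum(W * W)
    return x.mm(W).sum() + reg * (W * W).sum()

# Analytic gradient: x^T @ ones for the data term, 2 * reg * W for the penalty.
analytic = x.t().mm(torch.ones(4, 2, dtype=torch.float64)) + 2 * reg * W
numeric = numeric_gradient(loss_fn, W)
print((analytic - numeric).abs().max().item() < 1e-6)
```

Using float64 matters here: in float32 the rounding error of the finite difference can be as large as the mismatch you are trying to detect.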
Optimizer
Dropout
Some neurons in the input and hidden layers are randomly dropped.
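Purpose: to prevent overfitting. If every layer learns from all of its features, the network ends up fitting too many of them; dropping some keeps it from learning unnecessary features.
Forward pass:
Assume each neuron is dropped with probability p. Then: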
# mode = 'train'
# mask marks the neurons to drop; each one is dropped with probability p
mask = torch.rand(x.shape, device=x.device) < p
out = x.clone()
out[mask] = 0
# mode = 'test'
out = x
Backward pass
# mode = 'train'
# dout: upstream derivatives, of any shape
dx = dout.clone()  # copy so the upstream gradient is not modified in place
dx[mask] = 0
# mode = 'test'
dx = dout
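The version above drops activations in training but leaves them unscaled, so the expected activation differs between train and test mode. A common variant, inverted dropout, rescales the surviving activations by 1 / (1 - p) during training so that test mode can stay a plain identity. The sketch below shows that variant, not the code from these notes; the names `dropout_forward` and `dropout_backward` are assumptions.

```python
import torch

def dropout_forward(x, p, mode):
    """Inverted dropout: p is the probability of dropping a neuron."""
    if mode == 'train':
        keep = (torch.rand(x.shape, device=x.device) >= p).to(x.dtype)
        mask = keep / (1 - p)        # survivors are scaled up by 1 / (1 - p)
        return x * mask, mask
    return x, None                   # test mode: identity

def dropout_backward(dout, mask):
    # The same mask (including the scaling) is applied to the gradient.
    return dout if mask is None else dout * mask

torch.manual_seed(0)
x = torch.ones(1000, 100)
out, mask = dropout_forward(x, p=0.3, mode='train')

dropped_fraction = (out == 0).float().mean().item()
print(round(dropped_fraction, 2))    # close to p = 0.3
print(round(out.mean().item(), 2))   # close to 1.0, the mean of x

test_out, _ = dropout_forward(x, p=0.3, mode='test')
dx = dropout_backward(torch.ones_like(x), mask)
```

With this scaling the mean activation is roughly preserved in training, which is why no correction is needed at test time.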
A more detailed explanation of dropout, by Lei Mao
Ref:
1. Upstream, Downstream, and Local Gradients