A simple example
import torch
import torch.nn as nn
x = torch.randn(10, 3)
y = torch.randn(10, 2)
# Build a fully connected layer.
linear = nn.Linear(3, 2)
# Build loss function and optimizer.
criterion = nn.MSELoss()
# Use stochastic gradient descent (SGD) as the optimizer, with a learning rate of 0.01.
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)
# Forward pass.
pred = linear(x)
# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())
# Backward pass.
loss.backward()
print('dL/dw: ', linear.weight.grad)
print('dL/db: ', linear.bias.grad)
# 1-step gradient descent.
optimizer.step()
# Print out the loss after 1-step gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('loss after 1-step gradient descent:', loss.item())
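The single step above can be repeated in a loop. The following is a minimal sketch, assuming the same shapes and learning rate as the example; the fixed seed, the step count of 5, and the `losses` list are illustrative additions. The call to `optimizer.zero_grad()` is also not in the original: it is needed once there is more than one step, because PyTorch accumulates gradients in `.grad` across `backward()` calls.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed so the run is reproducible

x = torch.randn(10, 3)
y = torch.randn(10, 2)

linear = nn.Linear(3, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

losses = []  # hypothetical helper list, used only to inspect progress
for step in range(5):
    optimizer.zero_grad()      # clear gradients left over from the previous step
    pred = linear(x)           # forward pass
    loss = criterion(pred, y)  # compute loss
    loss.backward()            # backward pass
    optimizer.step()           # one gradient descent update
    losses.append(loss.item())

print(losses)
```

With a learning rate this small, the recorded loss should shrink slowly from one step to the next.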
An intuitive implementation of MSELoss
# MSELoss() is equivalent to:
def mseLoss(pred, y):
    return ((pred - y) ** 2).mean()
$$\mathrm{MSE}=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^2$$
where m is the total number of elements in pred.
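One way to check the equivalence is to compare the hand-written mean of squared differences with `nn.MSELoss()` on the same tensors. This is a small sketch; the name `mse_loss_manual`, the seed, and the tensor shapes are assumptions chosen for illustration. It relies on `nn.MSELoss` using its default `reduction='mean'`.

```python
import torch
import torch.nn as nn


def mse_loss_manual(pred, y):
    # Mean of the squared differences over every element of the tensors.
    return ((pred - y) ** 2).mean()


torch.manual_seed(0)  # assumption: fixed seed for reproducibility
pred = torch.randn(10, 2)
y = torch.randn(10, 2)

builtin = nn.MSELoss()(pred, y)     # default reduction='mean'
manual = mse_loss_manual(pred, y)

# Compare the two scalar losses up to floating-point tolerance.
print(torch.allclose(builtin, manual))
```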
An intuitive implementation of SGD
optimizer.step()
# optimizer.step() is equivalent to:
linear.weight.data.sub_(0.01 * linear.weight.grad.data)
linear.bias.data.sub_(0.01 * linear.bias.grad.data)
Here, 0.01 is lr (the learning rate), and sub_() subtracts in place, just as t_() transposes in place.
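The equivalence can be checked by applying both kinds of update to two copies of the same layer. This sketch assumes plain SGD with no momentum or weight decay, which is the only case where the two paths match. The names `linear_a` and `linear_b` are illustrative, and the manual update is written under `torch.no_grad()` rather than through `.data`, which is a substitution for the form shown above.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed for reproducibility
x = torch.randn(10, 3)
y = torch.randn(10, 2)

lr = 0.01
linear_a = nn.Linear(3, 2)
linear_b = copy.deepcopy(linear_a)  # identical starting weights and bias
criterion = nn.MSELoss()

# Path A: let the optimizer apply the step.
optimizer = torch.optim.SGD(linear_a.parameters(), lr=lr)
criterion(linear_a(x), y).backward()
optimizer.step()

# Path B: apply the same update by hand, outside autograd tracking.
criterion(linear_b(x), y).backward()
with torch.no_grad():
    linear_b.weight.sub_(lr * linear_b.weight.grad)
    linear_b.bias.sub_(lr * linear_b.bias.grad)

# Both layers should now hold the same parameters.
print(torch.allclose(linear_a.weight, linear_b.weight))
print(torch.allclose(linear_a.bias, linear_b.bias))
```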
The initial values of w and b are chosen at random and then updated according to

$$
\begin{aligned}
w^1 &\leftarrow w^0-\eta\,\frac{dL}{dw}\Big|_{w=w^0,\,b=b^0}\\
b^1 &\leftarrow b^0-\eta\,\frac{dL}{db}\Big|_{w=w^0,\,b=b^0}
\end{aligned}
$$

where η is the learning rate (lr = 0.01 above).
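For this particular model the two derivatives in the update rule can be written out by hand and compared with autograd. The sketch below assumes the linear layer and element-wise mean loss from the example, so m = 10 × 2 = 20. The hand-derived gradient expressions and the names `residual`, `grad_w`, `grad_b`, `w0`, `b0`, `w1`, `b1` are illustrative and do not come from the original text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed for reproducibility
x = torch.randn(10, 3)
y = torch.randn(10, 2)
eta = 0.01  # learning rate

linear = nn.Linear(3, 2)
w0 = linear.weight.detach().clone()  # shape (2, 3)
b0 = linear.bias.detach().clone()    # shape (2,)

# Gradients computed by autograd at (w0, b0).
loss = nn.MSELoss()(linear(x), y)
loss.backward()

# Hand-derived gradients of L = (1/m) * sum((x @ w^T + b - y)^2).
residual = x @ w0.t() + b0 - y            # shape (10, 2)
m = residual.numel()                      # 20 elements in total
grad_w = (2.0 / m) * residual.t() @ x     # dL/dw, shape (2, 3)
grad_b = (2.0 / m) * residual.sum(dim=0)  # dL/db, shape (2,)

# Both checks compare the hand-derived values against autograd.
print(torch.allclose(grad_w, linear.weight.grad, atol=1e-6))
print(torch.allclose(grad_b, linear.bias.grad, atol=1e-6))

# One step of the update rule.
w1 = w0 - eta * grad_w
b1 = b0 - eta * grad_b
```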