These are my notes for the 14-day public *Dive into Deep Learning* course. I hope to keep at it and study well.
1.1 Linear Regression
Linear regression assumes a linear relationship between the output and the inputs.
We generate a dataset with a linear model:
y = \omega_1 x_1 + \omega_2 x_2 + b
where \omega denotes the weights and b is the bias, a single scalar.
import torch
import numpy as np

# number of features
num_inputs = 2
# number of samples
num_examples = 1000
# true weights and bias
true_w = [2.5, -1.8]
true_b = 2.1
features = torch.randn(num_examples, num_inputs, dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
# add Gaussian noise with standard deviation 0.01
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)
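The training loop further down reads the dataset in random mini-batches through a `data_iter` function that these notes never define; the following is a minimal sketch in the spirit of the course code:

```python
import random
import torch

def data_iter(batch_size, features, labels):
    """Yield random mini-batches of (features, labels)."""
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # read samples in random order
    for i in range(0, num_examples, batch_size):
        # the last batch may be smaller than batch_size
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)
```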
Define the model:
def linreg(X, w, b):
    return torch.mm(X, w) + b
The loss function measures the error between the predicted and true values; the squared loss is a common choice:
l^{(i)}(\omega, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2
def squared_loss(y_hat, y):
    return (y_hat - y.view(y_hat.size())) ** 2 / 2
Most deep learning models have no analytical solution; instead, an optimization algorithm is used to lower the value of the loss function, yielding a numerical solution. One example is mini-batch stochastic gradient descent: choose initial values for the parameters, then iteratively update them in the direction of the negative gradient. In each iteration, randomly sample a mini-batch of training examples, compute the derivative (gradient) of the average loss over these examples with respect to the model parameters, and subtract this gradient multiplied by a preset positive number (the learning rate).
def sgd(params, lr, batch_size):
    for param in params:
        # use .data to update param without gradient tracking;
        # param.grad is the gradient, lr is the learning rate (step size)
        param.data -= lr * param.grad / batch_size
Model training:
lr = 0.03
num_epochs = 5
batch_size = 10
net = linreg
loss = squared_loss

# initialize the parameters to be learned
w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)),
                 dtype=torch.float32, requires_grad=True)
b = torch.zeros(1, dtype=torch.float32, requires_grad=True)

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, every sample in the dataset is used once;
    # X holds the features and y the labels of a mini-batch
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # compute the gradient of the mini-batch loss
        l.backward()
        # mini-batch stochastic gradient descent to update the parameters
        sgd([w, b], lr, batch_size)
        # reset the parameter gradients
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
# the learned weights are [2.4999], [-1.8002] and the bias is 2.1004, close to the true values
Define the model with PyTorch:
class LinearNet(nn.Module):
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()  # call the parent constructor
        # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`
        self.linear = nn.Linear(n_feature, 1)

    def forward(self, x):
        y = self.linear(x)
        return y

net = LinearNet(num_inputs)
# ways to define a multilayer network
# method one
net = nn.Sequential(
    nn.Linear(num_inputs, 1)
    # other layers can be added here
)
# method two: build an empty Sequential and add modules to it
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......
# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
    ('linear', nn.Linear(num_inputs, 1))
    # ......
]))
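Before training, the parameters of the nn-based model can be initialized, for example with `torch.nn.init` (a sketch; the weight is drawn from N(0, 0.01) and the bias set to zero, matching the from-scratch version):

```python
import torch.nn as nn
from torch.nn import init

num_inputs = 2  # matches the data-generation code above
net = nn.Sequential(nn.Linear(num_inputs, 1))

# draw the weight from N(0, 0.01) and set the bias to zero
init.normal_(net[0].weight, mean=0.0, std=0.01)
init.constant_(net[0].bias, val=0.0)
```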
Call nn's mean squared error loss directly:
loss = nn.MSELoss()
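With the nn-based model and MSELoss, an optimizer from `torch.optim` completes the PyTorch version. A minimal end-to-end sketch (the batch size, epoch count, and DataLoader usage here are my assumptions, not from the notes):

```python
import torch
import torch.nn as nn
import torch.utils.data as Data

torch.manual_seed(1)

# same synthetic data as the from-scratch version
num_inputs, num_examples = 2, 1000
true_w, true_b = [2.5, -1.8], 2.1
features = torch.randn(num_examples, num_inputs)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.normal(0, 0.01, labels.shape)

# wrap the data in a DataLoader to get shuffled mini-batches
dataset = Data.TensorDataset(features, labels)
data_iter = Data.DataLoader(dataset, batch_size=10, shuffle=True)

net = nn.Sequential(nn.Linear(num_inputs, 1))
loss = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.03)

for epoch in range(3):
    for X, y in data_iter:
        l = loss(net(X), y.view(-1, 1))
        optimizer.zero_grad()  # clear old gradients
        l.backward()
        optimizer.step()       # update the parameters
```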
1.2 Softmax and Classification Models
Softmax regression is a single-layer neural network for discrete classification. Its output layer is a fully connected layer.
o_i = x\omega_i + b_i
The softmax operator transforms the outputs into a probability distribution whose values are positive and sum to 1:
\hat{y}_1, \hat{y}_2, \hat{y}_3 = \text{softmax}(o_1, o_2, o_3)
where
\hat{y}_j = \frac{\exp(o_j)}{\sum_{i=1}^{3}\exp(o_i)}
The softmax operator does not change the predicted class, since it preserves the ordering of the outputs.
def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    return X_exp / partition  # broadcasting is applied here

def net(X):
    return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)
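A quick check (with made-up logits) that each row of the softmax output is a valid probability distribution and that the predicted class is unchanged:

```python
import torch

def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    return X_exp / partition  # broadcasting is applied here

# two samples, three made-up output values (logits) each
O = torch.tensor([[0.1, 2.0, -1.0],
                  [3.0, 0.5, 0.5]])
P = softmax(O)
print(P.sum(dim=1))                       # each row sums to 1
print(O.argmax(dim=1), P.argmax(dim=1))   # argmax is identical before and after
```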
The cross-entropy loss function is better suited to measuring the difference between two probability distributions. The cross entropy is
H\left(y^{(i)}, \hat{y}^{(i)}\right) = -\sum_{j=1}^{q} y_j^{(i)} \log \hat{y}_j^{(i)}
The cross-entropy loss function is simply the mean of the cross entropies over all samples:
def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))
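A small worked example (made-up numbers) of how `gather` picks out each sample's predicted probability for its true label:

```python
import torch

def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))

# two samples, three classes; the true labels are class 0 and class 2
y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])
y = torch.tensor([0, 2])
# gather picks y_hat[0][0] = 0.1 and y_hat[1][2] = 0.5,
# so the losses are -log(0.1) and -log(0.5)
print(cross_entropy(y_hat, y))
```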
Model training:
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()
            # zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            if optimizer is None:
                d2l.sgd(params, lr, batch_size)
            else:
                optimizer.step()
            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, batch_size, [W, b], lr)
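train_ch3 calls an evaluate_accuracy helper that these notes do not define; a minimal sketch of what it computes:

```python
import torch

def evaluate_accuracy(data_iter, net):
    """Fraction of samples in data_iter that net classifies correctly."""
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n
```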
1.3 Multilayer Perceptron
Assume the multilayer perceptron has a single hidden layer whose output is H. Both the hidden layer and the output layer are fully connected layers with their own weights and biases
W_h, b_h, W_o, b_o
The output is computed as
H = XW_h + b_h
O = HW_o + b_o
Substituting the first equation into the second shows that this is still equivalent to a single-layer neural network.
The solution is to introduce a non-linear transformation, so that the hidden layer's output relates non-linearly to the output layer's output. Such a non-linear function is called an activation function.
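The equivalence to a single layer can be checked numerically: composing two Linear layers with no activation in between yields one affine map (a sketch with arbitrary layer sizes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(4, 3)

# two stacked linear layers with no activation in between ...
stacked = nn.Sequential(nn.Linear(3, 5), nn.Linear(5, 2))
H = stacked(X)

# ... equal one linear layer with composed weight and bias
W1, b1 = stacked[0].weight, stacked[0].bias
W2, b2 = stacked[1].weight, stacked[1].bias
W = W2 @ W1           # composed weight
b = W2 @ b1 + b2      # composed bias
single = X @ W.t() + b
print(torch.allclose(H, single, atol=1e-5))  # the two outputs agree
```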
Commonly used activation functions:
ReLU function
ReLU(x) = \max(x, 0)
def relu(X):
    return torch.max(input=X, other=torch.tensor(0.0))
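A quick check of relu on a few values (negative inputs are clamped to zero, positive inputs pass through):

```python
import torch

def relu(X):
    # elementwise max against a scalar tensor, via broadcasting
    return torch.max(input=X, other=torch.tensor(0.0))

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # elementwise: 0, 0, 0, 1.5
```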
ReLU is used only in hidden layers. Since it is cheap to compute, it is the preferred choice when the network has many layers.
Sigmoid function
sigmoid(x) = \frac{1}{1 + \exp(-x)}
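Putting the pieces together, an MLP with one hidden ReLU layer can be written with nn.Sequential. This is a sketch; the layer sizes (784 inputs, 256 hidden units, 10 output classes) are my assumption, matching the course's Fashion-MNIST setting:

```python
import torch
import torch.nn as nn

# assumed sizes: 28x28 images flattened to 784 features, 10 classes
num_inputs, num_hiddens, num_outputs = 784, 256, 10
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(num_inputs, num_hiddens),
    nn.ReLU(),  # the non-linear activation between the two affine layers
    nn.Linear(num_hiddens, num_outputs),
)

X = torch.randn(2, 1, 28, 28)  # a dummy batch of two images
print(net(X).shape)            # one score per class for each sample
```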
Model training:
The MLP is trained with the same train_ch3 function defined in Section 1.2, so the code is not repeated here.