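This article shows how to save a model and load it back, so that you can continue training from where the last epoch left off and run predictions. If you are reading this, I assume you are familiar with the basics of deep learning and PyTorch.
Have you ever spent hours or days training a model, only to have the run stop halfway? Or been unhappy with your model's performance and wanted to train it further? For many reasons, we need a flexible way to save and load our models.
Many free cloud services, such as Kaggle and Google Colab, have an idle timeout that disconnects your notebook, and once the timeout hits, the notebook session is interrupted. Unless you are training for only a few epochs on a GPU, training takes time. Being able to save the model gives you a big advantage and can save the day. For flexibility, I will save both the latest checkpoint and the best checkpoint.
The dataset used in this article is the widely used Fashion-MNIST, and we will write a complete pipeline from importing the data through to making predictions. (This article uses Kaggle for training.)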
Step 1: Setup
On Kaggle, the file you are working in is named __notebook__.ipynb by default.
Create two directories to store the checkpoint and the best model:
# uncomment if you want to create directory checkpoint, best_model
%mkdir checkpoint best_model
Step 2: Import libraries and create helper functions
Import libraries
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import torch
import shutil
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import numpy as np
# check if CUDA is available
use_cuda = torch.cuda.is_available()
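Save function
save_ckp saves a checkpoint file for the latest state and, when that state is also the best so far, a copy of it as the best model. This gives you flexibility: you may be interested in the state of the latest checkpoint or in the best checkpoint.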
def save_ckp(state, is_best, checkpoint_path, best_model_path):
    """
    state: checkpoint we want to save
    is_best: is this the best checkpoint; min validation loss
    checkpoint_path: path to save checkpoint
    best_model_path: path to save best model
    """
    f_path = checkpoint_path
    # save checkpoint data to the path given, checkpoint_path
    torch.save(state, f_path)
    # if it is a best model, min validation loss
    if is_best:
        best_fpath = best_model_path
        # copy that checkpoint file to best path given, best_model_path
        shutil.copyfile(f_path, best_fpath)
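To make the "latest versus best" behaviour concrete, here is a minimal sketch that calls a copy of save_ckp twice in a temporary directory. The toy nn.Linear model, the temporary paths, and the epoch numbers are assumptions made for illustration only; they are not part of the original notebook.

```python
import os
import shutil
import tempfile

import torch
from torch import nn


def save_ckp(state, is_best, checkpoint_path, best_model_path):
    # Same logic as the helper above: always write the latest checkpoint,
    # and copy it to the best-model path when is_best is true.
    torch.save(state, checkpoint_path)
    if is_best:
        shutil.copyfile(checkpoint_path, best_model_path)


# Hypothetical stand-in model, used only so there is something to save.
toy_model = nn.Linear(4, 2)
workdir = tempfile.mkdtemp()
ckpt_path = os.path.join(workdir, "current_checkpoint.pt")
best_path = os.path.join(workdir, "best_model.pt")

# First call is flagged as the best so far: both files are written.
save_ckp({"epoch": 2, "state_dict": toy_model.state_dict()}, True, ckpt_path, best_path)
# Second call is not an improvement: only the latest checkpoint is overwritten.
save_ckp({"epoch": 3, "state_dict": toy_model.state_dict()}, False, ckpt_path, best_path)

latest = torch.load(ckpt_path)
best = torch.load(best_path)
print("latest epoch:", latest["epoch"], "| best epoch:", best["epoch"])
```

After the second call the two files hold different epochs, which is the point of keeping both: the latest file is where training resumes, the best file is what you would normally use for inference.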
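In our case, we want to save a checkpoint that lets us continue training the model from the saved information. Here is what we need:
epoch: the number of complete passes over the training data that have been used to update the weights.
valid_loss_min: the minimum validation loss. It is needed so that when we continue training we can start from this value rather than from np.inf.
state_dict: the model's learned parameters. It contains the parameter tensors (weights and biases) of each layer; the architecture itself comes from the model class.
optimizer: the optimizer state needs to be saved, especially when using Adam. Adam is an adaptive learning rate method, meaning it computes individual learning rates for different parameters, and we need that state if we want to continue training.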
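To see what the optimizer entry actually carries, the following sketch inspects an Adam state dict after a single update. The toy model and the random input are assumptions for illustration; the article's own model is defined later.

```python
import torch
from torch import nn, optim

# Hypothetical toy model and data, used only to trigger one optimizer step.
toy_model = nn.Linear(3, 1)
optimizer = optim.Adam(toy_model.parameters(), lr=0.001)

# Before any step, Adam has not yet created per-parameter state.
print("entries before first step:", len(optimizer.state_dict()["state"]))

loss = toy_model(torch.randn(8, 3)).pow(2).mean()
loss.backward()
optimizer.step()

opt_state = optimizer.state_dict()
# State is keyed by parameter index; index 0 is the weight of the linear layer.
first_param_state = opt_state["state"][0]
print("per-parameter keys:", sorted(first_param_state.keys()))
print("learning rate:", opt_state["param_groups"][0]["lr"])
```

The per-parameter moment estimates (exp_avg and exp_avg_sq) are what would be lost if only the model weights were saved, which is why the checkpoint stores optimizer.state_dict() as well.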
Load function
def load_ckp(checkpoint_fpath, model, optimizer):
    """
    checkpoint_fpath: path of the checkpoint to load
    model: model that we want to load checkpoint parameters into
    optimizer: optimizer we defined in previous training
    """
    # load check point
    checkpoint = torch.load(checkpoint_fpath)
    # initialize state_dict from checkpoint to model
    model.load_state_dict(checkpoint['state_dict'])
    # initialize optimizer from checkpoint to optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])
    # initialize valid_loss_min from checkpoint to valid_loss_min
    valid_loss_min = checkpoint['valid_loss_min']
    # return model, optimizer, epoch value, min validation loss
    return model, optimizer, checkpoint['epoch'], valid_loss_min.item()
load_ckp is created to load the model. It takes:
the location of the saved checkpoint
the model instance to load the state into
the optimizer
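One situation worth planning for is loading a checkpoint on a machine whose device differs from the one that saved it, for example after a session restarts without a GPU. The sketch below is a hypothetical variant of the loader, named load_ckp_on_device here, that passes map_location to torch.load. The toy model, the temporary path, and the saved values are assumptions for illustration.

```python
import os
import tempfile

import torch
from torch import nn, optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def load_ckp_on_device(checkpoint_fpath, model, optimizer, device):
    # Hypothetical variant of load_ckp: map_location remaps saved tensors
    # onto the requested device while the file is being read.
    checkpoint = torch.load(checkpoint_fpath, map_location=device)
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    # float() accepts both a 0-dim tensor and a plain Python number.
    return model, optimizer, checkpoint["epoch"], float(checkpoint["valid_loss_min"])


# Save a made-up checkpoint from a toy model, then restore it into a fresh one.
toy_model = nn.Linear(4, 2).to(device)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.001)
path = os.path.join(tempfile.mkdtemp(), "current_checkpoint.pt")
torch.save({
    "epoch": 4,
    "valid_loss_min": torch.tensor(0.25),
    "state_dict": toy_model.state_dict(),
    "optimizer": toy_optimizer.state_dict(),
}, path)

restored_model = nn.Linear(4, 2).to(device)
restored_optimizer = optim.Adam(restored_model.parameters(), lr=0.001)
restored_model, restored_optimizer, start_epoch, valid_loss_min = load_ckp_on_device(
    path, restored_model, restored_optimizer, device)
print("resume at epoch", start_epoch, "with best validation loss", valid_loss_min)
```

The restored model ends up with the same weights as the one that was saved, and the returned epoch and loss are exactly the values that would be handed to the training function to resume.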
Step 3: Import the Fashion-MNIST dataset and create the data loaders
# Define a transform to normalize the data
# Fashion-MNIST images have a single channel, so mean and std take one value each
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
# Download and load the training data
trainset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=True, transform=transform)
# Download and load the test data
testset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=False, transform=transform)
loaders = {
    'train': torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True),
    'test': torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True),
}
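As a quick check on what the normalization step does to pixel values, here is the arithmetic in plain Python. The normalize function is a hypothetical stand-in that mirrors the (x - mean) / std rule with the mean and standard deviation of 0.5 used above; it is not the torchvision implementation.

```python
def normalize(pixel, mean=0.5, std=0.5):
    # Per-channel rule applied by a normalize transform: (x - mean) / std
    return (pixel - mean) / std


# ToTensor scales pixels to [0, 1]; these three values cover the range.
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```

With a mean and standard deviation of 0.5, inputs in [0, 1] are mapped to [-1, 1], centred on zero.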
Step 4: Define and create the model
# Define your network ( Simple Example )
class FashionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        input_size = 784
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64,10)
        self.dropout = nn.Dropout(p=0.2)
    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.dropout(F.relu(self.fc3(x)))
        x = self.dropout(F.relu(self.fc4(x)))
        x = F.log_softmax(self.fc5(x), dim=1)
        return x
# Create the network, define the criterion and optimizer
model = FashionClassifier()
# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()
print(model)
Model structure output:
FashionClassifier(
(fc1): Linear(in_features=784, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=256, bias=True)
(fc3): Linear(in_features=256, out_features=128, bias=True)
(fc4): Linear(in_features=128, out_features=64, bias=True)
(fc5): Linear(in_features=64, out_features=10, bias=True)
(dropout): Dropout(p=0.2)
)
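Step 5: Train the network and save the model
The training function lets us set the epoch range and other parameters; the learning rate is set on the optimizer.
Define the loss function and optimizer
Below we use the Adam optimizer and a cross-entropy loss, since the model outputs class scores. Because the network already ends in log_softmax, the loss is expressed as nn.NLLLoss, which together with log_softmax amounts to cross-entropy. We compute the loss and perform backpropagation.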
#define loss function and optimizer
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
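The criterion here is nn.NLLLoss, while the model's forward pass ends in log_softmax. The short check below illustrates that this pairing gives the same value as nn.CrossEntropyLoss applied to raw scores. The logits and targets are made-up random tensors, not data from the article.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)            # made-up raw scores: 5 samples, 10 classes
targets = torch.randint(0, 10, (5,))   # made-up class labels

# Path used in this article: log_softmax inside the model, NLLLoss outside.
nll_on_log_probs = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
# Equivalent single-step formulation on raw logits.
cross_entropy = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(nll_on_log_probs, cross_entropy))
```

Either formulation works; what matters is not applying log_softmax twice, which would happen if nn.CrossEntropyLoss were used on this model's output.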
Define the training function
def train(start_epochs, n_epochs, valid_loss_min_input, loaders, model, optimizer, criterion, use_cuda, checkpoint_path, best_model_path):
    """
    Keyword arguments:
    start_epochs -- first epoch number to run
    n_epochs -- last epoch number to run (inclusive)
    valid_loss_min_input -- minimum validation loss so far (np.inf for a fresh run)
    loaders -- dict of data loaders with 'train' and 'test' keys
    model -- model to train
    optimizer -- optimizer that updates the model parameters
    criterion -- loss function
    use_cuda -- whether to move batches to the GPU
    checkpoint_path -- path for the latest checkpoint
    best_model_path -- path for the best checkpoint
    returns trained model
    """
    # initialize tracker for minimum validation loss
    valid_loss_min = valid_loss_min_input
    for epoch in range(start_epochs, n_epochs+1):
        # initialize variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0
        ###################
        # train the model #
        ###################
        model.train()
        for batch_idx, (data, target) in enumerate(loaders['train']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            ## find the loss and update the model parameters accordingly
            # clear the gradients of all optimized variables
            optimizer.zero_grad()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # backward pass: compute gradient of the loss with respect to model parameters
            loss.backward()
            # perform a single optimization step (parameter update)
            optimizer.step()
            # record the running average of the training loss over batches
            train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))
        ######################
        # validate the model #
        ######################
        model.eval()
        # gradients are not needed for validation
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(loaders['test']):
                # move to GPU
                if use_cuda:
                    data, target = data.cuda(), target.cuda()
                # forward pass: compute predicted outputs by passing inputs to the model
                output = model(data)
                # calculate the batch loss
                loss = criterion(output, target)
                # update the running average of the validation loss
                valid_loss = valid_loss + ((1 / (batch_idx + 1)) * (loss.data - valid_loss))
        # train_loss and valid_loss are already per-batch averages; dividing by the
        # dataset size only rescales them (hence the very small printed values).
        # The division is kept so the numbers match the output shown below.
        train_loss = train_loss/len(loaders['train'].dataset)
        valid_loss = valid_loss/len(loaders['test'].dataset)
        # print training/validation statistics
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
            epoch,
            train_loss,
            valid_loss
            ))
        # check whether this epoch matched or improved the best validation loss so far
        is_best = bool(valid_loss <= valid_loss_min)
        if is_best:
            print('Validation loss decreased ({:.6f} --> {:.6f}). Saving model ...'.format(valid_loss_min,valid_loss))
            valid_loss_min = valid_loss
        # create checkpoint variable and add important data
        checkpoint = {
            'epoch': epoch + 1,
            # store the running minimum (not just this epoch's loss) as a tensor,
            # so load_ckp can call .item() on it
            'valid_loss_min': torch.as_tensor(valid_loss_min),
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }
        # save the latest checkpoint; also copy it to the best-model path if it is the best
        save_ckp(checkpoint, is_best, checkpoint_path, best_model_path)
    # return trained model
    return model
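The loss bookkeeping inside the training loop uses the update loss_avg + (1 / (batch_idx + 1)) * (loss - loss_avg). The plain-Python sketch below, with made-up batch losses, illustrates that this update is an incremental way of computing the ordinary mean over the batches seen so far.

```python
# Made-up per-batch losses, for illustration only.
batch_losses = [0.9, 0.7, 0.5, 0.3]

running = 0.0
for batch_idx, loss in enumerate(batch_losses):
    # Same update rule as in the training loop above.
    running = running + (1 / (batch_idx + 1)) * (loss - running)

plain_mean = sum(batch_losses) / len(batch_losses)
print(running, plain_mean)
```

Since the running value is already a per-batch average, the later division by the dataset size in the training function scales it down by another factor of tens of thousands, which is consistent with the very small loss values printed in the training output.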
Train the model
# np.inf is used because the np.Inf alias was removed in NumPy 2.0
trained_model = train(1, 3, np.inf, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")
Output:
Epoch: 1 Training Loss: 0.000010 Validation Loss: 0.000044
Validation loss decreased (inf --> 0.000044). Saving model ...
Epoch: 2 Training Loss: 0.000007 Validation Loss: 0.000040
Validation loss decreased (0.000044 --> 0.000040). Saving model ...
Epoch: 3 Training Loss: 0.000007 Validation Loss: 0.000040
Validation loss decreased (0.000040 --> 0.000040). Saving model ...
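Let's look at a few of the parameters used above:
start_epochs: the starting epoch number for training
n_epochs: the ending epoch number for training
valid_loss_min_input = np.inf: the initial minimum validation loss. It is infinity for a fresh run, so the first epoch always counts as an improvement.
checkpoint_path: full path where the latest checkpoint is saved
best_model_path: full path where the best checkpoint is saved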
Verify that the model was saved
List all files in the best_model directory
%ls ./best_model/
Output:
best_model.pt
List all files in the checkpoint directory
%ls ./checkpoint/
Output:
current_checkpoint.pt
Step 6: Load the model
Rebuild the model
model = FashionClassifier()
# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()
print(model)
Output:
FashionClassifier(
(fc1): Linear(in_features=784, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=256, bias=True)
(fc3): Linear(in_features=256, out_features=128, bias=True)
(fc4): Linear(in_features=128, out_features=64, bias=True)
(fc5): Linear(in_features=64, out_features=10, bias=True)
(dropout): Dropout(p=0.2)
)
Define the optimizer and the checkpoint file path
# define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# define checkpoint saved path
ckp_path = "./checkpoint/current_checkpoint.pt"
Load the model with the load_ckp function
# load the saved checkpoint
model, optimizer, start_epoch, valid_loss_min = load_ckp(ckp_path, model, optimizer)
I printed the values returned by load_ckp to make sure everything is correct.
print("model = ", model)
print("optimizer = ", optimizer)
print("start_epoch = ", start_epoch)
print("valid_loss_min = ", valid_loss_min)
print("valid_loss_min = {:.6f}".format(valid_loss_min))
Output:
model = FashionClassifier(
(fc1): Linear(in_features=784, out_features=512, bias=True)
(fc2): Linear(in_features=512, out_features=256, bias=True)
(fc3): Linear(in_features=256, out_features=128, bias=True)
(fc4): Linear(in_features=128, out_features=64, bias=True)
(fc5): Linear(in_features=64, out_features=10, bias=True)
(dropout): Dropout(p=0.2)
)
optimizer = Adam (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.999)
eps: 1e-08
lr: 0.001
weight_decay: 0
)
start_epoch = 4
valid_loss_min = 3.952759288949892e-05
valid_loss_min = 0.000040
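With everything we need loaded, we can continue training from epoch = 4. Previously, we trained the model from epoch 1 to 3.
Step 7: Continue training and/or run inference
Continue training
We can keep using the train function to train our model, passing in the checkpoint values we got from load_ckp above.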
trained_model = train(start_epoch, 6, valid_loss_min, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")
Output:
Epoch: 4 Training Loss: 0.000006 Validation Loss: 0.000040
Epoch: 5 Training Loss: 0.000006 Validation Loss: 0.000037
Validation loss decreased (0.000040 --> 0.000037). Saving model ...
Epoch: 6 Training Loss: 0.000006 Validation Loss: 0.000036
Validation loss decreased (0.000037 --> 0.000036). Saving model ...
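Note: the epochs now run from 4 to 6 (start_epoch = 4).
The validation loss carries on from the last training checkpoint.
At epoch = 3, the minimum validation loss was 0.000040.
Here, the minimum validation loss starts at 0.000040 rather than inf.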
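Model inference
Before running inference, you must call model.eval() to put the dropout and batch normalization layers into evaluation mode. Failing to do so will give inconsistent inference results.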
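To illustrate the effect of evaluation mode on dropout, the sketch below runs the same input through a small network twice in each mode. The two-layer toy network and the random input are assumptions for illustration and are unrelated to the trained model.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical toy network with a dropout layer in the middle.
toy_model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

# Evaluation mode: dropout is disabled, so repeated passes agree.
toy_model.eval()
with torch.no_grad():
    eval_a = toy_model(x)
    eval_b = toy_model(x)

# Training mode: dropout samples a fresh random mask on every pass.
toy_model.train()
with torch.no_grad():
    train_a = toy_model(x)
    train_b = toy_model(x)

print("eval passes identical: ", torch.equal(eval_a, eval_b))
print("train passes identical:", torch.equal(train_a, train_b))
```

Note that torch.no_grad() only turns off gradient tracking; it does not switch dropout off, so the eval() call is still needed.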
trained_model.eval()
test_acc = 0.0
for samples, labels in loaders['test']:
    with torch.no_grad():
        # move the batch to the GPU only if CUDA is available
        if use_cuda:
            samples, labels = samples.cuda(), labels.cuda()
        output = trained_model(samples)
        # calculate accuracy
        pred = torch.argmax(output, dim=1)
        correct = pred.eq(labels)
        test_acc += torch.mean(correct.float())
print('Accuracy of the network on {} test images: {}%'.format(len(testset), round(test_acc.item()*100.0/len(loaders['test']), 2)))
Output:
Accuracy of the network on 10000 test images: 86.58%
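A small caveat on how this figure is computed: the loop averages the per-batch accuracies, and with 10000 test images and a batch size of 64 the last batch holds only 16 images yet counts as much as a full one. The plain-Python sketch below uses deliberately exaggerated, made-up counts to show how the average of batch accuracies can differ from the overall fraction of correct predictions.

```python
# (correct predictions, batch size) per batch; made-up numbers with a small last batch.
batches = [(60, 64), (58, 64), (4, 16)]

# What the inference loop above computes: the average of per-batch accuracies.
mean_of_batch_means = sum(c / n for c, n in batches) / len(batches)
# Alternative: total correct predictions over total samples.
overall_accuracy = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(round(mean_of_batch_means * 100, 2), round(overall_accuracy * 100, 2))
```

On the real test set the gap is likely to be tiny, because only one of 157 batches is short, but counting correct predictions and dividing by the number of samples avoids the issue entirely.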
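Where to find the output/saved files of a Kaggle notebook:
In your Kaggle notebook, scroll down to the bottom of the page. The files saved by the steps above are listed there.
Full code: https://www.kaggle.com/vortanasay/saving-loading-and-cont-training-model-in-pytorch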