Saving and Loading Models with PyTorch | Full Code Included

小北的北

Article link: https://blog.csdn.net/weixin_38739735/article/details/114317581

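The goal of this article is to show how to save a model and load it again, so that you can resume training after the last epoch and run predictions. If you are reading this, I assume you are familiar with the basics of deep learning and PyTorch.

Have you ever spent hours or days training a model, only to have it stop partway through? Or been unhappy with your model's performance and wanted to keep training? For many reasons, we need a flexible way to save and load our models.

Many free cloud services, such as Kaggle and Google Colab, have an idle timeout that disconnects your notebook, and once that happens the session is interrupted. Unless you are training for only a few epochs on a GPU, training takes time. Being able to save the model gives you a huge advantage and can save the day. For flexibility, I will save both the latest checkpoint and the best checkpoint.

This article uses the widely used Fashion-MNIST dataset, and we will write a complete workflow from importing the data through to making predictions. (Training is done on Kaggle.)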

Step 1: Setup

On Kaggle, the file you are working in is named __notebook__.ipynb by default.

# create the checkpoint and best_model directories (comment this out if they already exist)
%mkdir checkpoint best_model

Step 2: Import libraries and create helper functions

Import libraries

%matplotlib inline
%config InlineBackend.figure_format = 'retina'


import matplotlib.pyplot as plt
import torch
import shutil
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import numpy as np

# check if CUDA is available
use_cuda = torch.cuda.is_available()

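Save function

save_ckp is created to save checkpoint files: it always writes the latest checkpoint and, when asked, also copies it as the best one. This gives you flexibility, since you may be interested in the state of the latest checkpoint or in the best checkpoint.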

def save_ckp(state, is_best, checkpoint_path, best_model_path):
    """
    state: checkpoint we want to save
    is_best: is this the best checkpoint; min validation loss
    checkpoint_path: path to save checkpoint
    best_model_path: path to save best model
    """
    f_path = checkpoint_path
    # save checkpoint data to the path given, checkpoint_path
    torch.save(state, f_path)
    # if it is a best model, min validation loss
    if is_best:
        best_fpath = best_model_path
        # copy that checkpoint file to best path given, best_model_path
        shutil.copyfile(f_path, best_fpath)
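To illustrate how the latest and best files diverge, here is a minimal sketch that calls the same save logic twice inside a temporary directory. The state dictionaries and loss values are made up for illustration; only torch.save, torch.load, and shutil.copyfile are real library calls.

```python
import os
import shutil
import tempfile
import torch

def save_ckp(state, is_best, checkpoint_path, best_model_path):
    # Same logic as the function above: always write the latest checkpoint,
    # and copy it to the best-model path when it is flagged as the best.
    torch.save(state, checkpoint_path)
    if is_best:
        shutil.copyfile(checkpoint_path, best_model_path)

with tempfile.TemporaryDirectory() as tmp_dir:
    latest = os.path.join(tmp_dir, 'current_checkpoint.pt')
    best = os.path.join(tmp_dir, 'best_model.pt')

    # Epoch 1 is flagged as the best, epoch 2 is not (made-up values).
    save_ckp({'epoch': 1, 'valid_loss': 0.50}, True, latest, best)
    save_ckp({'epoch': 2, 'valid_loss': 0.60}, False, latest, best)

    latest_epoch = torch.load(latest)['epoch']
    best_epoch = torch.load(best)['epoch']

print(latest_epoch, best_epoch)  # prints: 2 1
```

The latest file always reflects the most recent epoch, while the best file is only overwritten when is_best is True.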

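In our example, we want to save a checkpoint that contains everything needed to resume training. Here is the information we need:

epoch: the number of times the full training set has been used to update the weights
valid_loss_min: the minimum validation loss, needed so that when we resume training we start from this value rather than from np.inf.
state_dict: the model's learnable parameters (the weights and biases of each layer). It does not store the architecture itself, which is why the model class must be instantiated again before loading.
optimizer: the optimizer state needs to be saved, especially when using Adam. Adam is an adaptive learning rate method, meaning it computes individual learning rates for different parameters, and we need that state if we want to continue training.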
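To make the list above concrete, the following minimal sketch prints what a model's and an optimizer's state_dict actually contain. The tiny nn.Linear model and the random batch are illustrative assumptions, not part of the original notebook.

```python
import torch
from torch import nn, optim

# A tiny stand-in model (hypothetical; the article's model is defined later).
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Take one optimization step so that Adam has per-parameter state to save.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

# The model state_dict maps parameter names to tensors.
model_keys = list(model.state_dict().keys())
print(model_keys)

# The optimizer state_dict holds hyperparameters plus Adam's running averages.
opt_state = optimizer.state_dict()
print(list(opt_state.keys()))
first_param_state = opt_state['state'][0]
print(sorted(first_param_state.keys()))
```

The exp_avg and exp_avg_sq entries are the running averages Adam uses to adapt each parameter's step size, which is why losing them changes how training resumes.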

Load function

def load_ckp(checkpoint_fpath, model, optimizer):
    """
    checkpoint_fpath: path of the saved checkpoint to load
    model: model that we want to load checkpoint parameters into
    optimizer: optimizer we defined in previous training
    """
    # load checkpoint; map_location lets a checkpoint saved on a GPU load on a CPU-only machine
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    checkpoint = torch.load(checkpoint_fpath, map_location=device)
    # initialize state_dict from checkpoint to model
    model.load_state_dict(checkpoint['state_dict'])
    # initialize optimizer from checkpoint to optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])
    # initialize valid_loss_min from checkpoint to valid_loss_min
    valid_loss_min = checkpoint['valid_loss_min']
    # valid_loss_min is saved as a tensor; convert it to a plain Python float
    if torch.is_tensor(valid_loss_min):
        valid_loss_min = valid_loss_min.item()
    # return model, optimizer, epoch value, min validation loss
    return model, optimizer, checkpoint['epoch'], valid_loss_min

load_ckp is created for loading the model. It takes:

The location of the saved checkpoint
The model instance to load the state into
The optimizer
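As a quick sanity check of the save and load round trip, the sketch below writes a checkpoint with the same keys to a temporary directory and restores it into a fresh model and optimizer. The small nn.Linear model and the stored values are assumptions for illustration only.

```python
import os
import tempfile
import torch
from torch import nn, optim

model = nn.Linear(4, 2)  # hypothetical stand-in model
optimizer = optim.Adam(model.parameters(), lr=0.001)

checkpoint = {
    'epoch': 4,
    'valid_loss_min': torch.tensor(0.25),  # made-up value
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'current_checkpoint.pt')
    torch.save(checkpoint, path)
    # map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only machine.
    restored = torch.load(path, map_location='cpu')

# The architecture is not stored, so a new instance must be created first.
new_model = nn.Linear(4, 2)
new_model.load_state_dict(restored['state_dict'])
new_optimizer = optim.Adam(new_model.parameters(), lr=0.001)
new_optimizer.load_state_dict(restored['optimizer'])

weights_match = torch.equal(model.weight, new_model.weight)
print(restored['epoch'], restored['valid_loss_min'].item(), weights_match)
```

The new model starts with different random weights, so weights_match being True confirms that the saved parameters were actually restored.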

Step 3: Import the Fashion-MNIST dataset and create data loaders

# Define a transform to normalize the data
# Fashion-MNIST images are grayscale (one channel), so mean and std each take a single value
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
# Download and load the training data
trainset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=True, transform=transform)


# Download and load the test data
testset = datasets.FashionMNIST('F_MNIST_data/', download=True, train=False, transform=transform)


loaders = {
    'train' : torch.utils.data.DataLoader(trainset,batch_size = 64,shuffle=True),
    'test'  : torch.utils.data.DataLoader(testset,batch_size = 64,shuffle=True),
}

Step 4: Define and create the model

# Define your network ( Simple Example )
class FashionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        input_size = 784
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64,10)
        self.dropout = nn.Dropout(p=0.2)
        
    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.dropout(F.relu(self.fc3(x)))
        x = self.dropout(F.relu(self.fc4(x)))
        x = F.log_softmax(self.fc5(x), dim=1)
        return x

# Create the network, define the criterion and optimizer
model = FashionClassifier()


# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()
    
print(model)

Model structure output:

FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)

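Step 5: Train the network and save the model

The training function lets us set the epoch range and other parameters; the learning rate is set through the optimizer that is passed in.

Define the loss function and optimizer

Below, we use the Adam optimizer and the negative log-likelihood loss (NLLLoss). Because the model outputs log_softmax class scores, this is equivalent to cross-entropy loss on the raw scores. We compute the loss and perform backpropagation.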

#define loss function and optimizer
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
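The code above applies nn.NLLLoss to a model that ends in log_softmax. This pairing computes the same value as nn.CrossEntropyLoss applied to raw scores, which the following small check demonstrates. The tensors are random, made-up inputs.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)            # made-up raw scores: 5 samples, 10 classes
target = torch.randint(0, 10, (5,))    # made-up class labels

# NLLLoss expects log-probabilities, as produced by log_softmax.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
# CrossEntropyLoss applies log_softmax internally, so it takes raw scores.
ce = nn.CrossEntropyLoss()(logits, target)

print(torch.allclose(nll, ce))
```

If you switch the criterion to nn.CrossEntropyLoss, the log_softmax in the model's forward method should be removed so it is not applied twice.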

Define the training method

def train(start_epochs, n_epochs, valid_loss_min_input, loaders, model, optimizer, criterion, use_cuda, checkpoint_path, best_model_path):
    """
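    Keyword arguments:
    start_epochs -- epoch number to start from (1 for a fresh run)
    n_epochs -- last epoch number to train (inclusive)
    valid_loss_min_input -- lowest validation loss so far (np.inf for a fresh run)
    loaders -- dict of DataLoaders with 'train' and 'test' keys
    model -- model to train
    optimizer -- optimizer that updates the model parameters
    criterion -- loss function
    use_cuda -- whether to move batches to the GPU
    checkpoint_path -- where to save the latest checkpoint
    best_model_path -- where to save the best checkpoint
    
    returns trained model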
    """
    # initialize tracker for minimum validation loss
    valid_loss_min = valid_loss_min_input 
    
    for epoch in range(start_epochs, n_epochs+1):
        # initialize variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0
        
        ###################
        # train the model #
        ###################
        model.train()
        for batch_idx, (data, target) in enumerate(loaders['train']):
            # move to GPU
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            ## find the loss and update the model parameters accordingly
            # clear the gradients of all optimized variables
            optimizer.zero_grad()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # backward pass: compute gradient of the loss with respect to model parameters
            loss.backward()
            # perform a single optimization step (parameter update)
            optimizer.step()
            ## record the average training loss, using something like
            ## train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))
            train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))
        
        ######################    
        # validate the model #
        ######################
        model.eval()
        # gradients are not needed for validation, so disable autograd tracking
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(loaders['test']):
                # move to GPU
                if use_cuda:
                    data, target = data.cuda(), target.cuda()
                # forward pass: compute predicted outputs by passing inputs to the model
                output = model(data)
                # calculate the batch loss
                loss = criterion(output, target)
                # update the running average of the validation loss
                valid_loss = valid_loss + ((1 / (batch_idx + 1)) * (loss.data - valid_loss))

        # train_loss and valid_loss are already averages over batches; dividing by the
        # dataset size rescales them, which is why the printed values are so small
        train_loss = train_loss/len(loaders['train'].dataset)
        valid_loss = valid_loss/len(loaders['test'].dataset)


        # print training/validation statistics 
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
            epoch, 
            train_loss,
            valid_loss
            ))
        
        # check whether this epoch achieved the lowest validation loss so far
        is_best = valid_loss <= valid_loss_min
        if is_best:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,valid_loss))
            valid_loss_min = valid_loss

        # create checkpoint variable and add important data
        checkpoint = {
            'epoch': epoch + 1,  # the epoch to resume from
            # lowest validation loss so far, stored as a tensor so load_ckp can call .item()
            'valid_loss_min': torch.as_tensor(valid_loss_min),
            'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict(),
        }

        # always save the latest checkpoint; also copy it to best_model_path when it is the best
        save_ckp(checkpoint, is_best, checkpoint_path, best_model_path)
            
    # return trained model
    return model
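The update train_loss + (1 / (batch_idx + 1)) * (loss - train_loss) used in the training function is an incremental mean: after each batch it equals the average of all batch losses seen so far. A minimal sketch with made-up loss values shows that it matches the plain average.

```python
# Made-up per-batch losses, used only to illustrate the update rule.
batch_losses = [0.9, 0.7, 0.8, 0.4]

running = 0.0
for batch_idx, loss in enumerate(batch_losses):
    # Same update as in train(): move the running value toward the new
    # loss by a fraction 1 / (number of batches seen so far).
    running = running + (1 / (batch_idx + 1)) * (loss - running)

plain_mean = sum(batch_losses) / len(batch_losses)
print(abs(running - plain_mean) < 1e-12)  # prints: True
```

Because the running value is already a per-batch average, the later division by the dataset size in train() scales it down again. This likely explains the very small loss values (on the order of 0.00004) in the printed output.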

Train the model

# np.inf works across NumPy versions; the np.Inf alias was removed in NumPy 2.0
trained_model = train(1, 3, np.inf, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")

Output:

Epoch: 1  Training Loss: 0.000010  Validation Loss: 0.000044
Validation loss decreased (inf --> 0.000044).  Saving model ...
Epoch: 2  Training Loss: 0.000007  Validation Loss: 0.000040
Validation loss decreased (0.000044 --> 0.000040).  Saving model ...
Epoch: 3  Training Loss: 0.000007  Validation Loss: 0.000040
Validation loss decreased (0.000040 --> 0.000040).  Saving model ...

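Let's look at a few of the arguments used above:

start_epochs: the epoch number at which training starts
n_epochs: the epoch number at which training ends
valid_loss_min_input: the initial minimum validation loss, np.inf for a fresh run
checkpoint_path: full path where the latest checkpoint is saved
best_model_path: full path where the best checkpoint is saved

Verify that the model was saved

List all files in the best_model directory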

%ls ./best_model/

Output:

best_model.pt

%ls ./checkpoint/

Output:

current_checkpoint.pt

Step 6: Load the model

Rebuild the model

model = FashionClassifier()


# move model to GPU if CUDA is available
if use_cuda:
    model = model.cuda()
    
print(model)

Output:

FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)

Define the optimizer and the checkpoint file path

# define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)


# define checkpoint saved path
ckp_path = "./checkpoint/current_checkpoint.pt"

Load the model with the load_ckp function

# load the saved checkpoint
model, optimizer, start_epoch, valid_loss_min = load_ckp(ckp_path, model, optimizer)

I print the values returned by load_ckp to make sure everything is correct.

print("model = ", model)
print("optimizer = ", optimizer)
print("start_epoch = ", start_epoch)
print("valid_loss_min = ", valid_loss_min)
print("valid_loss_min = {:.6f}".format(valid_loss_min))

Output:

model =  FashionClassifier(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=256, bias=True)
  (fc3): Linear(in_features=256, out_features=128, bias=True)
  (fc4): Linear(in_features=128, out_features=64, bias=True)
  (fc5): Linear(in_features=64, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)
optimizer =  Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)
start_epoch =  4
valid_loss_min =  3.952759288949892e-05
valid_loss_min = 0.000040

With all the required information loaded, we can continue training from epoch 4. Previously, we trained the model from epoch 1 to epoch 3.
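In practice it can be convenient to decide automatically whether to resume or start fresh. The sketch below assumes a hypothetical helper, starting_point, and a stand-in loader, fake_load, that returns made-up values. Neither is part of the original notebook, where load_ckp plays the loader's role and also restores the model and optimizer.

```python
import math
import os
import tempfile

def starting_point(ckp_path, load_fn):
    """Hypothetical helper: return (start_epoch, valid_loss_min) for a run."""
    if os.path.isfile(ckp_path):
        # A checkpoint exists, so resume from the values stored in it.
        return load_fn(ckp_path)
    # No checkpoint yet: start at epoch 1 with an infinite minimum loss.
    return 1, math.inf

def fake_load(path):
    # Stand-in for load_ckp, returning made-up values.
    return 4, 0.00004

with tempfile.TemporaryDirectory() as tmp_dir:
    missing = os.path.join(tmp_dir, 'missing.pt')
    existing = os.path.join(tmp_dir, 'current_checkpoint.pt')
    open(existing, 'wb').close()  # create an empty placeholder file

    fresh = starting_point(missing, fake_load)
    resumed = starting_point(existing, fake_load)

print(fresh, resumed)
```

The two returned values can then be passed to train as start_epochs and valid_loss_min_input.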

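Step 7: Continue training and/or run inference

Continue training

We can keep training the model with the same train function, passing in the checkpoint values returned by load_ckp above.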

trained_model = train(start_epoch, 6, valid_loss_min, loaders, model, optimizer, criterion, use_cuda, "./checkpoint/current_checkpoint.pt", "./best_model/best_model.pt")

Output:

Epoch: 4   Training Loss: 0.000006   Validation Loss: 0.000040
Epoch: 5   Training Loss: 0.000006   Validation Loss: 0.000037
Validation loss decreased (0.000040 --> 0.000037).  Saving model ...
Epoch: 6   Training Loss: 0.000006   Validation Loss: 0.000036
Validation loss decreased (0.000037 --> 0.000036).  Saving model ...

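Note: the epochs now run from 4 to 6 (start_epoch = 4).
The validation loss continues from the last training checkpoint.
At epoch 3, the minimum validation loss was 0.000040.
Here, the minimum validation loss starts at 0.000040 rather than inf.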

Model inference

Before running inference, you must call model.eval() to set dropout and batch normalization layers to evaluation mode. Failing to do so will yield inconsistent inference results.

trained_model.eval()

test_acc = 0.0
for samples, labels in loaders['test']:
    with torch.no_grad():
        # move to GPU only if CUDA is available
        if use_cuda:
            samples, labels = samples.cuda(), labels.cuda()
        output = trained_model(samples)
        # calculate accuracy
        pred = torch.argmax(output, dim=1)
        correct = pred.eq(labels)
        test_acc += torch.mean(correct.float())
print('Accuracy of the network on {} test images: {}%'.format(len(testset), round(test_acc.item()*100.0/len(loaders['test']), 2)))

Output:

Accuracy of the network on 10000 test images: 86.58%
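To see why model.eval() matters for the dropout layers in this model, here is a minimal sketch comparing a standalone nn.Dropout layer in training and evaluation modes. The all-ones input tensor is an illustrative assumption.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.2)
x = torch.ones(1000)

layer.train()
train_out = layer(x)   # entries are either zeroed or scaled by 1 / (1 - 0.2)

layer.eval()
eval_out = layer(x)    # dropout is a no-op in evaluation mode

print(torch.equal(eval_out, x), torch.equal(train_out, x))  # prints: True False
```

In training mode the output changes from call to call, so predictions made without model.eval() would vary between runs on the same input.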

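Where to find the output/saved files of a Kaggle notebook:

In your Kaggle notebook, scroll down to the bottom of the page. The files saved in the previous steps are listed there.

Full code: https://www.kaggle.com/vortanasay/saving-loading-and-cont-training-model-in-pytorch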
