Resuming PyTorch training after a power failure

@勿用
Last modified 2024-01-16 10:33:12
Category column: 机器学习-Bert (Machine Learning - BERT). Article tag: artificial intelligence
First published 2023-10-11 17:07:34
Article link: https://blog.csdn.net/tm_tsm/article/details/133774574
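PS: I took the training and validation sets that the tokenizer produced on machine A and moved them to machine B to continue training. The F1 score and other metrics differed from an uninterrupted run on machine A, so resuming after a power failure still needs more testing.
1. Example: "PyTorch checkpoint-and-resume training, and plotting the learning rate curve" (Zhihu)
2. Q&A with ChatGPT 3.5:
How do I resume training after an interruption when my torch model uses gradient accumulation?
ChatGPT
If you are training a model with PyTorch and use gradient accumulation to handle a large dataset or to simulate a larger batch size, you will probably want to pick up where you left off after an interruption. The steps below show how: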

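Save the model and optimizer state: checkpoint periodically during training. Save the optimizer state as well as the model weights, because optimizers such as Adam keep internal state of their own (for example, moment estimates).
python
torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            # ... (any other state you want to keep)
            }, 'checkpoint.pth')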
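Since the scenario here is a power failure, it may be worth guarding against a checkpoint that is only half written when the machine goes down. The sketch below is one possible approach, not part of the original answer: the helper name `save_checkpoint_atomic` is hypothetical, and it assumes the temporary file and the final file sit on the same filesystem so that `os.replace` is an atomic rename. It does not call `fsync`, so it reduces the risk rather than eliminating it.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint_atomic(state, path):
    """Write `state` to a temporary file first, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        torch.save(state, tmp_path)
        # The previous checkpoint stays intact until this rename succeeds.
        os.replace(tmp_path, path)
    finally:
        # Clean up the temporary file if saving failed part-way.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")

save_checkpoint_atomic(
    {
        "epoch": 3,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    ckpt_path,
)
restored = torch.load(ckpt_path)
```

With this pattern, a crash during `torch.save` leaves the older `checkpoint.pth` untouched, so there is still something valid to resume from.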
Load the model and optimizer state: if training is interrupted, restore both from the most recent checkpoint.
python
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
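A quick way to convince yourself that saving and loading both state dicts is enough is to compare an interrupted run against an uninterrupted one on a tiny deterministic problem. The following is a minimal sketch under stated assumptions: the data, the `train_steps` helper and the in-memory `io.BytesIO` buffer (standing in for a file on disk) are all made up for illustration.

```python
import copy
import io

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 10)
y = torch.randn(8, 1)


def train_steps(model, optimizer, n):
    """Run n full-batch training steps on the fixed toy data."""
    for _ in range(n):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()


# Reference run: 4 uninterrupted steps.
ref_model = nn.Linear(10, 1)
init_state = copy.deepcopy(ref_model.state_dict())
ref_opt = torch.optim.Adam(ref_model.parameters(), lr=0.01)
train_steps(ref_model, ref_opt, 4)

# Interrupted run: 2 steps, checkpoint, then "restart".
model_a = nn.Linear(10, 1)
model_a.load_state_dict(init_state)
opt_a = torch.optim.Adam(model_a.parameters(), lr=0.01)
train_steps(model_a, opt_a, 2)

buffer = io.BytesIO()
torch.save(
    {
        "model_state_dict": model_a.state_dict(),
        "optimizer_state_dict": opt_a.state_dict(),
    },
    buffer,
)
buffer.seek(0)

# Fresh objects, as after a process restart; map_location keeps tensors on CPU.
checkpoint = torch.load(buffer, map_location="cpu")
model_b = nn.Linear(10, 1)
opt_b = torch.optim.Adam(model_b.parameters(), lr=0.01)
model_b.load_state_dict(checkpoint["model_state_dict"])
opt_b.load_state_dict(checkpoint["optimizer_state_dict"])
train_steps(model_b, opt_b, 2)

weights_match = torch.allclose(ref_model.weight, model_b.weight)
```

If `weights_match` is false in a setup like this, something other than the model and optimizer state is influencing training, which narrows down where to look.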
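Restore the gradient accumulation state: with gradient accumulation you also need to track how many micro-batches have been accumulated so far and restore that counter after loading the checkpoint. To do so:

Store the current accumulation count when saving the checkpoint.
Restore the count when loading the checkpoint.
Note that the counter alone does not bring back gradients that were accumulated but not yet applied: those live in each parameter's .grad, which state_dict() does not include. The simplest way around this is to save checkpoints only right after optimizer.step(), when the counter is 0.
python
# When saving
torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
            'accumulation_step': accumulation_step,
            # ... (any other state you want to keep)
            }, 'checkpoint.pth')

# When loading
checkpoint = torch.load('checkpoint.pth')
...
accumulation_step = checkpoint['accumulation_step']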
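One detail the counter does not cover: gradients that have been accumulated but not yet applied are stored in each parameter's `.grad`, and `model.state_dict()` does not contain them. If a checkpoint really has to be taken in the middle of an accumulation cycle, the pending gradients could be saved alongside the counter. The sketch below illustrates that idea; the `pending_grads` key is a hypothetical name, and the in-memory buffer stands in for a checkpoint file.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)

# Accumulate gradients from two micro-batches without calling optimizer.step().
for _ in range(2):
    loss = nn.functional.mse_loss(model(torch.randn(4, 10)), torch.randn(4, 1))
    loss.backward()

# state_dict() holds parameters (and buffers) only, not the .grad tensors.
state_dict_keys = list(model.state_dict().keys())

pending_grads = {
    name: p.grad.clone()
    for name, p in model.named_parameters()
    if p.grad is not None
}
buffer = io.BytesIO()
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "pending_grads": pending_grads,   # hypothetical extra entry
        "accumulation_step": 2,
    },
    buffer,
)
buffer.seek(0)

# After a restart: rebuild the model and put the pending gradients back.
checkpoint = torch.load(buffer)
resumed = nn.Linear(10, 1)
resumed.load_state_dict(checkpoint["model_state_dict"])
for name, p in resumed.named_parameters():
    if name in checkpoint["pending_grads"]:
        p.grad = checkpoint["pending_grads"][name].clone()
accumulation_step = checkpoint["accumulation_step"]
```

In practice it is usually simpler to checkpoint only right after `optimizer.step()` and `optimizer.zero_grad()`, when there is nothing pending and the counter is 0.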
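Continue training: once the model, optimizer state and accumulation counter are loaded, carry on with the training loop.
python
for epoch in range(starting_epoch, num_epochs):
    ...
    for batch_idx, batch in enumerate(dataloader):
        ...
        loss.backward()
        if (batch_idx + 1) % accumulation_steps == 0:  # update the parameters once every accumulation_steps batches
            optimizer.step()
            optimizer.zero_grad()
This way, even with gradient accumulation, an interrupted run can resume from where it stopped instead of starting over.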
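As a side note on the accumulation loop itself: with a mean-reduced loss, a common convention is to divide each micro-batch loss by `accumulation_steps`, so that the accumulated gradient matches what one large batch would give. The sketch below checks this on made-up data; it assumes equally sized micro-batches and `nn.MSELoss` with its default mean reduction.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)
accumulation_steps = 4

model_full = nn.Linear(10, 1)
model_accum = copy.deepcopy(model_full)   # identical starting weights
criterion = nn.MSELoss()                  # default reduction is 'mean'

# One backward pass over the full batch of 16 samples.
criterion(model_full(inputs), targets).backward()

# Four micro-batches of 4 samples each, with the loss scaled down.
for x_chunk, y_chunk in zip(inputs.chunk(accumulation_steps),
                            targets.chunk(accumulation_steps)):
    loss = criterion(model_accum(x_chunk), y_chunk) / accumulation_steps
    loss.backward()   # gradients add up in .grad across iterations

grads_match = torch.allclose(
    model_full.weight.grad, model_accum.weight.grad, atol=1e-6
)
```

Without the division, the accumulated gradient would be `accumulation_steps` times larger, which acts like a higher learning rate for optimizers such as plain SGD.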




User
Do you have a detailed code example?
ChatGPT
Sure. Here is a more detailed example that shows how to save and resume training when using gradient accumulation.

python
import torch
import torch.nn as nn
import torch.optim as optim

# A simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Initialize the model, optimizer and loss function
model = SimpleModel()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Number of batches to accumulate before each optimizer step
accumulation_steps = 4

# Save the training state
def save_checkpoint(epoch, model, optimizer, accumulation_step, filename='checkpoint.pth'):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'accumulation_step': accumulation_step
    }, filename)

# Load the training state into the global model and optimizer
def load_checkpoint(filename='checkpoint.pth'):
    checkpoint = torch.load(filename)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['accumulation_step']

# Toy dataset
data = torch.randn(100, 10)
labels = torch.randn(100, 1)

# Start training
starting_epoch = 0
accumulation_step = 0
num_epochs = 5

# To resume from a checkpoint (this restarts the saved epoch from its first batch):
# starting_epoch, accumulation_step = load_checkpoint()

for epoch in range(starting_epoch, num_epochs):
    for batch_idx, (input_data, target) in enumerate(zip(data, labels)):
        outputs = model(input_data)
        loss = criterion(outputs, target)
        loss.backward()

        if (accumulation_step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            accumulation_step = 0
        else:
            accumulation_step += 1

        # Save a checkpoint after every 20 batches. 20 is a multiple of
        # accumulation_steps, so no partially accumulated gradients are pending.
        if (batch_idx + 1) % 20 == 0:
            save_checkpoint(epoch, model, optimizer, accumulation_step)

# With the code above, an interrupted run can be resumed by loading the checkpoint.
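The example above defines a simple model and optimizer, uses MSE as the loss function, and simulates a small dataset. During training the parameters are updated once per accumulation cycle, and a checkpoint is written every 20 batches. If training is interrupted, load the checkpoint and continue.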
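One gap in the example is that a resumed run starts the saved epoch again from its first batch, so some batches are seen twice. A possible refinement is to also record the index of the last completed batch and skip ahead on resume. The sketch below shows only the bookkeeping, in plain Python; `run`, `start_batch` and `stop_after` are hypothetical names, and it assumes the batch order is the same after a restart (no shuffling, or shuffling with a restored seed).

```python
from itertools import islice


def batches_for_epoch(dataset, batch_size):
    """Yield consecutive slices of the dataset in a fixed order."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


def run(dataset, batch_size, num_epochs, start_epoch=0, start_batch=0, stop_after=None):
    """Return the (epoch, batch_idx) pairs processed, optionally stopping early."""
    seen = []
    for epoch in range(start_epoch, num_epochs):
        # Only the first resumed epoch skips batches that were already done.
        skip = start_batch if epoch == start_epoch else 0
        batches = enumerate(batches_for_epoch(dataset, batch_size))
        for batch_idx, batch in islice(batches, skip, None):
            seen.append((epoch, batch_idx))   # a real loop would train here
            if stop_after is not None and len(seen) == stop_after:
                return seen                   # simulate an interruption
    return seen


dataset = list(range(20))
full = run(dataset, batch_size=5, num_epochs=2)

# Interrupt after 3 batches, then resume right after the last completed one.
first = run(dataset, batch_size=5, num_epochs=2, stop_after=3)
last_epoch, last_batch = first[-1]
rest = run(dataset, batch_size=5, num_epochs=2,
           start_epoch=last_epoch, start_batch=last_batch + 1)
```

Here `first + rest` covers exactly the same batches as the uninterrupted `full` run. In a real checkpoint, `last_epoch` and `last_batch` would be stored next to the model and optimizer state.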




User
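What if I also call lr_scheduler.step()?
ChatGPT
If you use a learning rate scheduler (lr_scheduler), its state must also be saved in every checkpoint and restored on resume. A scheduler keeps internal state tied to the epoch or step count, so resuming requires putting it back at the right position.

Here is how to extend the previous example with an lr_scheduler:

Initialize the lr_scheduler:
python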
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
Save the lr_scheduler state when saving a checkpoint:
python
def save_checkpoint(epoch, model, optimizer, accumulation_step, scheduler, filename='checkpoint.pth'):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'accumulation_step': accumulation_step
    }, filename)
Restore the lr_scheduler state when loading a checkpoint:
python
def load_checkpoint(filename='checkpoint.pth'):
    checkpoint = torch.load(filename)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    return checkpoint['epoch'], checkpoint['accumulation_step']
Use the lr_scheduler in the training loop:
python
for epoch in range(starting_epoch, num_epochs):
    for batch_idx, (input_data, target) in enumerate(zip(data, labels)):
        outputs = model(input_data)
        loss = nn.MSELoss()(outputs, target)
        loss.backward()

        if (accumulation_step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            accumulation_step = 0
        else:
            accumulation_step += 1

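        # Save a checkpoint after every 20 batches. 20 is a multiple of
        # accumulation_steps, so no partially accumulated gradients are pending.
        if (batch_idx + 1) % 20 == 0:
            save_checkpoint(epoch, model, optimizer, accumulation_step, scheduler)

    # Update the learning rate once per epoch
    scheduler.step()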
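Note: call scheduler.step() once at the end of every epoch. ReduceLROnPlateau is the exception in that its step() takes the monitored metric, for example scheduler.step(val_loss), and is usually called after each validation pass.

This way, even if training is interrupted, you can resume from where you stopped with the learning rate scheduler in the same state it was in at the interruption.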
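For the ReduceLROnPlateau case mentioned in the note, the call pattern looks roughly like the sketch below. The validation losses are made-up numbers chosen so that the loss never improves, and `factor` and `patience` are arbitrary illustrative settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2
)

# Made-up validation losses: the metric stays flat after the first epoch.
val_losses = [1.0, 1.0, 1.0, 1.0]

lrs = []
for val_loss in val_losses:
    # ... training and validation for one epoch would go here ...
    scheduler.step(val_loss)   # the scheduler needs the monitored metric
    lrs.append(optimizer.param_groups[0]["lr"])
```

The learning rate stays at 0.1 for the first three epochs and is cut by `factor` on the fourth, once the number of epochs without improvement exceeds `patience`. Like the other schedulers, ReduceLROnPlateau offers `state_dict()` and `load_state_dict()`, so it fits into the same checkpoint code.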




User
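How do I resume when scheduler.step() is tied to steps?
ChatGPT
If scheduler.step() is called per step (batch) rather than per epoch, resuming is slightly more involved. You additionally record the total number of completed steps when saving and loading checkpoints, and make sure the scheduler continues from the step at which training stopped.

Initialize the lr_scheduler:
python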
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
Define a variable that tracks the total number of steps, and store it when saving a checkpoint:
python
total_steps = 0

def save_checkpoint(epoch, model, optimizer, accumulation_step, scheduler, total_steps, filename='checkpoint.pth'):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'accumulation_step': accumulation_step,
        'total_steps': total_steps
    }, filename)
Restore the lr_scheduler state and the total step count when loading a checkpoint:
python
def load_checkpoint(filename='checkpoint.pth'):
    checkpoint = torch.load(filename)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    return checkpoint['epoch'], checkpoint['accumulation_step'], checkpoint['total_steps']
In the training loop, increment the total step count and call scheduler.step() after each optimizer update:
python
# Load the checkpoint
starting_epoch, accumulation_step, total_steps = load_checkpoint()

for epoch in range(starting_epoch, num_epochs):
    for batch_idx, (input_data, target) in enumerate(zip(data, labels)):
        outputs = model(input_data)
        loss = nn.MSELoss()(outputs, target)
        loss.backward()

        if (accumulation_step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

            # Count the completed optimizer step and advance the scheduler
            total_steps += 1
            scheduler.step()

            accumulation_step = 0
        else:
            accumulation_step += 1

        # Save a checkpoint after every 20 batches. 20 is a multiple of
        # accumulation_steps, so no partially accumulated gradients are pending.
        if (batch_idx + 1) % 20 == 0:
            save_checkpoint(epoch, model, optimizer, accumulation_step, scheduler, total_steps)
Either way, whether scheduler.step() is tied to epochs or to steps, training can resume from where it stopped with the learning rate scheduler in the same state as at the interruption.




User
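It looks like total_steps above isn't actually used for anything?
ChatGPT
Sorry, you're right. In the code above I stored total_steps in save_checkpoint, but nothing used it after loading to keep the scheduler in sync with the actual training step.

If the scheduler is updated by step count (for example, the learning rate is adjusted once every N steps), you do need to know which step to continue from.

Here is a way to put total_steps to use so that the scheduler resumes correctly after loading a checkpoint:

Initialize total_steps before training starts:
python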
total_steps = 0
Increment total_steps on every optimizer update and call scheduler.step() when needed:
python
for epoch in range(starting_epoch, num_epochs):
    for batch_idx, (input_data, target) in enumerate(zip(data, labels)):
        outputs = model(input_data)
        loss = nn.MSELoss()(outputs, target)
        loss.backward()

        if (accumulation_step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

            # Count the completed optimizer step
            total_steps += 1

            # Decide from the total step count whether to call scheduler.step().
            # N is an interval you must define yourself, e.g. adjust every N steps.
            if total_steps % N == 0:
                scheduler.step()

            accumulation_step = 0
        else:
            accumulation_step += 1

        # Save a checkpoint after every 20 batches. 20 is a multiple of
        # accumulation_steps, so no partially accumulated gradients are pending.
        if (batch_idx + 1) % 20 == 0:
            save_checkpoint(epoch, model, optimizer, accumulation_step, scheduler, total_steps)
When loading from a checkpoint, restore total_steps from the saved value:
python
def load_checkpoint(filename='checkpoint.pth'):
    checkpoint = torch.load(filename)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    return checkpoint['epoch'], checkpoint['accumulation_step'], checkpoint['total_steps']

starting_epoch, accumulation_step, total_steps = load_checkpoint()
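This way total_steps records the number of completed steps, so the total_steps % N check continues from the right place after a resume. The scheduler's own step counter (last_epoch) is already part of scheduler.state_dict(), so it is restored by load_state_dict() regardless of total_steps.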
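To see that a step-wise scheduler picks up at the right position from its own state dict, one can compare the learning rates of an interrupted and an uninterrupted run. This is a minimal sketch: `make` and `advance` are hypothetical helpers, `copy.deepcopy` stands in for writing the state dicts to disk with `torch.save`, and the `StepLR` settings are arbitrary.

```python
import copy

import torch
import torch.nn as nn


def make():
    """Build a fresh optimizer and a per-step StepLR scheduler."""
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
    return optimizer, scheduler


def advance(optimizer, scheduler, n):
    """Take n optimizer/scheduler steps and record the learning rate after each."""
    lrs = []
    for _ in range(n):
        optimizer.step()    # no gradients here, so this leaves the weights alone
        scheduler.step()
        lrs.append(scheduler.get_last_lr()[0])
    return lrs


# Uninterrupted reference: 8 steps.
ref_opt, ref_sched = make()
reference = advance(ref_opt, ref_sched, 8)

# Interrupted run: 5 steps, snapshot the state, then continue with new objects.
opt_a, sched_a = make()
before = advance(opt_a, sched_a, 5)
saved = {
    "optimizer_state_dict": copy.deepcopy(opt_a.state_dict()),
    "scheduler_state_dict": copy.deepcopy(sched_a.state_dict()),
}

opt_b, sched_b = make()
opt_b.load_state_dict(saved["optimizer_state_dict"])
sched_b.load_state_dict(saved["scheduler_state_dict"])
after = advance(opt_b, sched_b, 3)
```

The resumed schedule `before + after` matches `reference`, and `sched_b.last_epoch` ends at 8. Note that the optimizer state is loaded too, because the current learning rate is stored in the optimizer's parameter groups.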




User
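After using the approach above, the loss is very different from what it was before the interruption
ChatGPT
If the loss after resuming from a checkpoint differs noticeably from before the interruption, the likely causes are:

Randomness: many operations in deep learning are random, for example weight initialization, data augmentation and Dropout. If data loading involves any randomness (random augmentation or shuffling), the model may see different samples each time training is restarted.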

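Batch Normalization: BatchNorm layers maintain a running mean and variance during training. These are buffers in model.state_dict(), so they are restored together with the model; they cause a discontinuity only if the model state is loaded incompletely.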

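Checkpoint loading problems: make sure the model state, optimizer state and learning rate scheduler state are all loaded correctly. Missing any of them can change the loss.

Gradient accumulation: if you use gradient accumulation, make sure the accumulation state is restored correctly.

To address this, you can take the following steps:

Fix the random seeds: set all random seeds before training starts, including those of PyTorch, NumPy and Python.

python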
import random
import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)  # seeds the CPU and all CUDA devices
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
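Save checkpoints more often: that way you lose less progress when training is interrupted.

Double-check checkpoint saving and loading: make sure the model, optimizer and scheduler states are all saved and loaded correctly.

Account for the accumulation state: make sure the gradient accumulation state is saved and restored correctly.

Keep other state consistent: for example, if you rely on a particular data split or data loading order, make sure it is unchanged when training restarts.

Batch Normalization: if you use Batch Normalization and suspect it is the problem, check that the running mean and variance of the BatchNorm layers are actually present in the saved state_dict and loaded back.

If the problem persists after all of the above, simplify it: test with a small dataset and a simple model to see whether saving and resuming works, then add complexity step by step to pinpoint the cause.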
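Fixing the seed at start-up only makes a fresh run reproducible. After a restart the generators begin from the seed again, not from where the interrupted run had got to, so Dropout masks and shuffling can still differ. One option is to store the generator states in the checkpoint as well. The sketch below uses hypothetical helper names `get_rng_state` and `set_rng_state`; NumPy's generator could be handled the same way with `np.random.get_state()` and `np.random.set_state()`.

```python
import random

import torch


def get_rng_state():
    """Collect the current Python and PyTorch generator states."""
    state = {"python": random.getstate(), "torch": torch.get_rng_state()}
    if torch.cuda.is_available():
        state["cuda"] = torch.cuda.get_rng_state_all()
    return state


def set_rng_state(state):
    """Put the generators back into a previously collected state."""
    random.setstate(state["python"])
    torch.set_rng_state(state["torch"])
    if "cuda" in state and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(state["cuda"])


random.seed(0)
torch.manual_seed(0)
_ = torch.rand(3)              # some random numbers consumed by "training"

saved = get_rng_state()        # this would go into the checkpoint
expected_torch = torch.rand(3)
expected_py = random.random()

# Simulate a restart that leaves the generators in a different state.
random.seed(123)
torch.manual_seed(123)

set_rng_state(saved)
resumed_torch = torch.rand(3)
resumed_py = random.random()
```

After `set_rng_state(saved)`, the resumed draws equal the ones the original run would have produced next.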




User
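I'm using BERT for NER. Does it use Batch Normalization?
ChatGPT
BERT (Bidirectional Encoder Representations from Transformers) does not use Batch Normalization in its original architecture. BERT consists mainly of stacked Transformer layers, which use Layer Normalization instead.

The main difference between Layer Normalization and Batch Normalization is the dimension over which they normalize:

Batch Normalization normalizes each feature across all samples in a batch.
Layer Normalization normalizes all features of a single sample.
When you use BERT for NER (named entity recognition), the model contains no Batch Normalization unless you explicitly added such layers yourself. Most pretrained BERT models and implementations (for example HuggingFace's Transformers library) follow the architecture of the original BERT paper and use Layer Normalization.

So if your loss changes noticeably after loading a checkpoint, Batch Normalization is probably not the cause. Check other factors instead, such as model initialization, weight loading and optimizer state.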
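The difference between the two normalizations, and a quick check for whether a model contains any BatchNorm layers at all, can be sketched as follows. `has_batchnorm` is a hypothetical helper, and `encoder_like` is a toy stand-in rather than a real BERT model.

```python
import torch
import torch.nn as nn


def has_batchnorm(model):
    """Return True if any submodule is a BatchNorm layer."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    return any(isinstance(m, bn_types) for m in model.modules())


torch.manual_seed(0)
x = torch.randn(8, 16)             # 8 samples, 16 features

layer_norm = nn.LayerNorm(16)
batch_norm = nn.BatchNorm1d(16)    # modules start in training mode

ln_out = layer_norm(x)
bn_out = batch_norm(x)

ln_row_means = ln_out.mean(dim=1)  # one mean per sample, close to 0
bn_col_means = bn_out.mean(dim=0)  # one mean per feature, close to 0

# BatchNorm's running statistics are buffers and therefore part of state_dict().
bn_state_keys = list(batch_norm.state_dict().keys())

# Toy stand-in for a Transformer block: Linear followed by LayerNorm.
encoder_like = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))
```

Running `has_batchnorm` on your own NER model is a direct way to confirm that BatchNorm can be ruled out. The `bn_state_keys` list also shows that `running_mean` and `running_var` are saved with the model state.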