1. Problem when using apex:
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
Solution:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
torch.cuda.set_device(device)  # add this line
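For context, a minimal sketch of where this call typically goes in a multi-process setup; the --local_rank argument is an assumption about the launcher (torch.distributed.launch passes it by default) and is not part of the original snippet:

import argparse
import torch

# parse the per-process rank (the "--local_rank" name is an assumption)
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# bind this process to one GPU *before* building the model or calling
# amp.initialize, so later CUDA allocations land on the right device
torch.cuda.set_device(args.local_rank)
device = torch.device(f"cuda:{args.local_rank}")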
2. Warning when the model is wrapped in DataParallel too early:
Incoming model is an instance of torch.nn.parallel.DataParallel. Parallel wrappers should only be applied to the model(s) AFTER the model(s) have been returned from amp.initialize.
Solution: apply the distributed wrapper (DistributedDataParallel) only after calling amp.initialize():
model, optimizer = amp.initialize(model_quest_bert_LSTM.to(device), optimizer, opt_level="O1")  # "O1"/"O2": the letter O, not the digit zero
if torch.cuda.device_count() > 1:  # use multiple GPUs
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # model = nn.DataParallel(model_quest_bert_LSTM)
    # model_quest_bert_LSTM = model_quest_bert_LSTM.cuda(device)
    model = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)  # distributed training
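Note that DistributedDataParallel also requires the default process group to be initialized first. A minimal, self-contained sketch of the full ordering; the "nccl" backend, the env:// init method, and the stand-in model are assumptions about the setup, not part of the original code:

import torch
import torch.distributed as dist
import torch.nn as nn
from apex import amp

# one-time process-group setup; DDP needs this before wrapping the model
# ("nccl" and env:// are assumptions about how the job is launched)
dist.init_process_group(backend="nccl", init_method="env://")

model = nn.Linear(10, 2).cuda()  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# order matters: amp.initialize first, then the DDP wrapper
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)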
3. Learning to use apex; two reference articles:
"Essential PyTorch tools | Nothing beats speed: mixed-precision acceleration with Apex"
"A detailed guide to installing and using Apex (a PyTorch-based mixed-precision training accelerator)"
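The core amp pattern those articles cover, as a minimal sketch (the model and data here are stand-ins): amp.initialize patches the model and optimizer, and the backward pass goes through amp.scale_loss so fp16 gradients do not underflow.

import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(10, 2).cuda()  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# patch model and optimizer for mixed precision ("O1" = conservative mode)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 10).cuda()
y = torch.randint(0, 2, (4,)).cuda()

loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
# scale the loss before backprop so small fp16 gradients survive
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()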
4. amp.initialize may only be called once; pass all models and optimizers in that single call. The accepted call forms:
model, optim = amp.initialize(model, optim,...)
model, [optim0, optim1] = amp.initialize(model, [optim0, optim1],...)
[model0, model1], optim = amp.initialize([model0, model1], optim,...)
[model0, model1], [optim0, optim1] = amp.initialize([model0, model1], [optim0, optim1],...)
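For example, with two models that each have their own optimizer (a hypothetical generator/discriminator pair, not from the original), everything goes through one call; calling amp.initialize once per model is not supported:

import torch
import torch.nn as nn
from apex import amp

# hypothetical pair of models, each with its own optimizer
gen = nn.Linear(10, 10).cuda()
disc = nn.Linear(10, 1).cuda()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

# one amp.initialize call for everything, never one call per model
[gen, disc], [opt_g, opt_d] = amp.initialize([gen, disc], [opt_g, opt_d], opt_level="O1")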
5. Problem when using an LSTM:
RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
Before calling the LSTM, call:
self.lstm.flatten_parameters()  # re-compact the RNN weights into one contiguous chunk
output, _ = self.lstm(sequence_output)
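In context, a minimal sketch of where the call sits inside a module's forward(); the class name and layer sizes are hypothetical:

import torch
import torch.nn as nn

class BertLSTMHead(nn.Module):  # hypothetical wrapper module
    def __init__(self, hidden_size=768):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)

    def forward(self, sequence_output):
        # re-compact weights: they can become non-contiguous after
        # DataParallel replication or moving the model between devices
        self.lstm.flatten_parameters()
        output, _ = self.lstm(sequence_output)
        return output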