Running kohya_ss with accelerate's DeepSpeed backend for distributed training on two GPUs fails with the following error:
Traceback (most recent call last):
  File "/root/codes/kohya_ss/train_db.py", line 369, in <module>
    train(args)
  File "/root/codes/kohya_ss/train_db.py", line 167, in train
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1090, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1284, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1533, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 527, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 539, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: Parameter containing:
tensor([[ 0.0319,  0.0228,  0.0857,  ...,  0.0231,  0.0391, -0.0423],
        [ 0.0115, -0.0042, -0.0543,  ...,  0.0235, -0.0141,  0.0138],
        [-0.0026, -0.0279, -0.0614,  ..., -0.0201, -0.0490, -0.0111],
        ...,
        [-0.0074, -0.0169, -0.0187,  ...,  0.0189, -0.0353, -0.0099],
        [ 0.0367, -0.0019, -0.0233,  ..., -0.0576, -0.0236,  0.0201],
        [ 0.0363, -0.0916,  0.0092,  ...,  0.0130,  0.0254, -0.0237]],
       device='cuda:1', requires_grad=True)
Looking at the DeepSpeed source in stage_1_and_2.py, the function containing the failing line self.param_names[lp] is:
def _create_param_mapping(self):
    param_mapping = []
    for i, _ in enumerate(self.optimizer.param_groups):
        param_mapping_per_group = OrderedDict()
        for lp in self.bit16_groups[i]:
            if lp._hp_mapping is not None:
                lp_name = self.param_names[lp]  # KeyError is raised here
                param_mapping_per_group[
                    lp_name] = lp._hp_mapping.get_hp_fragment_address()
        param_mapping.append(param_mapping_per_group)
    return param_mapping
After some debugging, we found:
self.bit16_groups is a list of length 1, and self.bit16_groups[0] is a list containing many tensors.
self.param_names is a dict whose keys are tensors and whose values are the tensors' names.
The former holds roughly 800+ tensors, while the latter has only about 100+ entries. Since the two do not match, many tensors in bit16_groups have no entry in param_names, and looking them up raises the KeyError.
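To confirm this, a temporary debug print can be dropped in at the top of _create_param_mapping (a minimal sketch; self.bit16_groups and self.param_names are the attributes from the DeepSpeed source quoted above, and the counts in the comments are the ones observed in this run):

# temporary debug print at the top of _create_param_mapping (remove afterwards)
total_bit16 = sum(len(group) for group in self.bit16_groups)
print("bit16 params:", total_bit16)             # ~800+ in this run
print("named params:", len(self.param_names))   # ~100+ in this run
missing = [p for group in self.bit16_groups
           for p in group if p not in self.param_names]
print("params without a name entry:", len(missing))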
After more debugging, we printed the number of parameters of unet and text_encoder at a suitable point in train_db.py and found that they have roughly 600+ and 100+ parameter tensors respectively, which together add up to exactly the 800+ above. So we have good reason to suspect that, during this call, self.bit16_groups contains the tensors of both unet and text_encoder, while self.param_names contains only those of text_encoder. The print statement used:
print("len trainable_params: unet and text_encoder", len(list(unet.parameters())), len(list(text_encoder.parameters())))
Looking at train_db.py, the call into accelerate hands over all of the models, the optimizer, and the rest in one go:
unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler)
The prepare function, once its if/else dispatch detects the DeepSpeed backend, simply calls _prepare_deepspeed and forwards all of those arguments:
if self.distributed_type == DistributedType.DEEPSPEED:
    result = self._prepare_deepspeed(*args)
The relevant part of _prepare_deepspeed is shown below; I added two lines myself for debugging. result holds the whole bundle of objects that was passed in at the start.
# added for debugging
result_strs = [arg.__class__.__name__ for arg in result]
print("_prepare_deepspeed result", result_strs)

model = None
optimizer = None
scheduler = None
for obj in result:
    if isinstance(obj, torch.nn.Module):
        model = obj
    elif isinstance(obj, (torch.optim.Optimizer, DummyOptim)):
        optimizer = obj
    elif (isinstance(obj, (LRScheduler, DummyScheduler))) or (
        type(obj).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
    ):
        scheduler = obj
...
kwargs = dict(model=model, config_params=self.deepspeed_config)
...
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
At this point the cause is clear: two of the objects passed in are of type torch.nn.Module, unet and text_encoder, but the for loop assumes there is only a single model, so each match overwrites the previous one. model ends up being text_encoder, unet is silently dropped, and that is exactly the parameter mismatch we observed.
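The overwriting behaviour is easy to reproduce in isolation (toy modules standing in for the real UNet and text encoder; none of this is kohya_ss or accelerate code):

import torch

unet = torch.nn.Linear(4, 4)          # stand-in for the real UNet
text_encoder = torch.nn.Linear(4, 4)  # stand-in for the real text encoder
result = (unet, text_encoder)

model = None
for obj in result:
    if isinstance(obj, torch.nn.Module):
        model = obj                    # every match overwrites the previous one

print(model is text_encoder)  # True: only the last nn.Module survives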
Root cause
So the KeyError ultimately comes from train_db.py passing multiple models into prepare, without accounting for the fact that the current DeepSpeed integration in accelerate expects exactly one model argument. Other users have run into the same problem: Passing multiple models with DeepSpeed will fail
Workaround
The workaround is to make sure that unet is the only torch.nn.Module argument passed into prepare, as shown here:
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler)
Then remember to set train_text_encoder = False so that train_db.py does not train the text_encoder. With that change, DeepSpeed distributed training runs without the error.
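Putting the two changes together, the relevant part of train_db.py would look roughly like the sketch below. This is only a sketch under the assumption that the variable names match the snippets above; the surrounding code differs between kohya_ss versions, and the AdamW call is just a placeholder for however the script actually builds its optimizer. The important point is that exactly one nn.Module, and an optimizer whose parameters all belong to that module, reach accelerator.prepare; otherwise the optimizer's param groups would again contain tensors that DeepSpeed cannot map back to the single model.

# sketch only: names follow the snippets above, not any specific kohya_ss version
train_text_encoder = False              # skip training the text encoder
text_encoder.requires_grad_(False)      # optional: explicitly freeze its weights
text_encoder.eval()

# build the optimizer from unet's parameters only, so the model handed to
# DeepSpeed and the optimizer's param groups describe the same set of tensors
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)  # placeholder optimizer

unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler)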