kohya_ss distributed training with accelerate's DeepSpeed fails with a KeyError

Running kohya_ss with accelerate's DeepSpeed backend for distributed training on two GPUs fails with the following error:

Traceback (most recent call last):
  File "/root/codes/kohya_ss/train_db.py", line 369, in <module>
    train(args)
  File "/root/codes/kohya_ss/train_db.py", line 167, in train
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1090, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1284, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1533, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 527, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 539, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: Parameter containing:
tensor([[ 0.0319,  0.0228,  0.0857,  ...,  0.0231,  0.0391, -0.0423],
        [ 0.0115, -0.0042, -0.0543,  ...,  0.0235, -0.0141,  0.0138],
        [-0.0026, -0.0279, -0.0614,  ..., -0.0201, -0.0490, -0.0111],
        ...,
        [-0.0074, -0.0169, -0.0187,  ...,  0.0189, -0.0353, -0.0099],
        [ 0.0367, -0.0019, -0.0233,  ..., -0.0576, -0.0236,  0.0201],
        [ 0.0363, -0.0916,  0.0092,  ...,  0.0130,  0.0254, -0.0237]],
       device='cuda:1', requires_grad=True)

Looking up the DeepSpeed source (deepspeed/runtime/zero/stage_1_and_2.py), the function containing the failing self.param_names[lp] lookup from the stack trace is:

    def _create_param_mapping(self):
        param_mapping = []
        for i, _ in enumerate(self.optimizer.param_groups):
            param_mapping_per_group = OrderedDict()
            for lp in self.bit16_groups[i]:
                if lp._hp_mapping is not None:
                    lp_name = self.param_names[lp]
                    param_mapping_per_group[lp_name] = lp._hp_mapping.get_hp_fragment_address()
            param_mapping.append(param_mapping_per_group)

        return param_mapping
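
For context on where self.param_names comes from: DeepSpeedEngine builds this lookup table from the named parameters of the single model object handed to deepspeed.initialize(). The line below is only a paraphrase of that construction, not a verbatim quote, and the exact code may differ between DeepSpeed versions:

```python
# Hedged paraphrase of DeepSpeedEngine internals (not verbatim): the name table only
# covers the parameters of the one `model` passed to deepspeed.initialize().
param_names = {param: name for name, param in model.named_parameters()}
```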

After some debugging, we found:

  1. self.bit16_groups is a list of length 1, and self.bit16_groups[0] is a list containing many tensors.
  2. self.param_names is a dict whose keys are tensors and whose values are the tensors' names.

The former holds roughly 800+ tensors while the latter has only 100+ entries. Because the two do not match, many tensors from bit16_groups[0] simply cannot be found when used as keys into param_names.
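
For reference, the mismatch can be confirmed with a couple of throwaway prints dropped into _create_param_mapping (a debugging sketch only, reusing the variable names from the DeepSpeed source quoted above):

```python
# Temporary debugging prints inside _create_param_mapping (remove after inspecting):
print("len(self.bit16_groups[0]):", len(self.bit16_groups[0]))  # ~800+ tensors in our run
print("len(self.param_names):", len(self.param_names))          # ~100+ entries in our run
missing = [lp for lp in self.bit16_groups[0] if lp not in self.param_names]
print("tensors missing from param_names:", len(missing))
```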

After more debugging, we printed the parameter counts of unet and text_encoder at a suitable point in train_db.py and found that they contain roughly 600+ and 100+ parameters respectively, which together add up to the 800+. So we have good reason to suspect that self.bit16_groups contains every tensor of both unet and text_encoder, while self.param_names only contains the tensors of text_encoder.

print("len trainable_params: unet and text_encoder", len(list(unet.parameters())), len(list(text_encoder.parameters())))

Looking at train_db.py, when it calls accelerate it throws all of the models, the optimizer, and everything else straight into prepare():

    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        unet, text_encoder, optimizer, train_dataloader, lr_scheduler)

Once its if-else dispatch detects the DeepSpeed backend, prepare() simply calls _prepare_deepspeed and forwards all of those arguments along:

    if self.distributed_type == DistributedType.DEEPSPEED:
        result = self._prepare_deepspeed(*args)

The relevant part of _prepare_deepspeed is shown below; I added the first two lines myself for debugging. result contains the whole pile of arguments passed in at the start.

    # code added for debugging
    result_strs = [arg.__class__.__name__ for arg in result]
    print("_prepare_deepspeed result", result_strs)

    model = None
    optimizer = None
    scheduler = None
    for obj in result:
        if isinstance(obj, torch.nn.Module):
            model = obj
        elif isinstance(obj, (torch.optim.Optimizer, DummyOptim)):
            optimizer = obj
        elif (isinstance(obj, (LRScheduler, DummyScheduler))) or (
            type(obj).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
        ):
            scheduler = obj
    ...
    kwargs = dict(model=model, config_params=self.deepspeed_config)
    ...
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)

At this point the cause is clear: two of the objects passed in are torch.nn.Module instances, unet and text_encoder, but the for loop assumes there is only one model, so each nn.Module it encounters overwrites the previous one. model ends up set to text_encoder while unet is silently dropped, which produces exactly the mismatched parameter sets we saw above.
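
A tiny self-contained illustration of that overwrite behaviour (the two Linear layers stand in for unet and text_encoder, in the order they are passed to prepare()):

```python
import torch

objs = [torch.nn.Linear(4, 4), torch.nn.Linear(8, 8)]  # stand-ins for unet and text_encoder
model = None
for obj in objs:
    if isinstance(obj, torch.nn.Module):
        model = obj  # each nn.Module overwrites the previous one; only the last survives

print(model)  # Linear(in_features=8, ...) -- the first ("unet") stand-in has been dropped
```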

Root cause

So the root cause is that the kohya_ss author did not fully account for how _prepare_deepspeed works, namely that the current DeepSpeed integration only accepts a single model argument, which is what ultimately triggers the KeyError. Other users have run into the same problem: Passing multiple models with DeepSpeed will fail

Workaround

The workaround is to make unet the only torch.nn.Module passed into prepare(), as shown below:

    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        unet, optimizer, train_dataloader, lr_scheduler)

Then remember to add a line setting train_text_encoder = False so that train_db.py does not try to train the text encoder. With that change, distributed training with DeepSpeed no longer raises the error.
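
If you do still want to train the text encoder, one alternative that is often suggested for the "multiple models with DeepSpeed" limitation is to wrap both models in a single nn.Module so that prepare() only ever sees one model. The sketch below illustrates the idea; the UNetWithTextEncoder wrapper is hypothetical and untested here, not part of kohya_ss:

```python
import torch


class UNetWithTextEncoder(torch.nn.Module):  # hypothetical wrapper, not part of kohya_ss
    def __init__(self, unet, text_encoder):
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder


wrapped = UNetWithTextEncoder(unet, text_encoder)
wrapped, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    wrapped, optimizer, train_dataloader, lr_scheduler)
# In the training loop, access the submodules as wrapped.unet / wrapped.text_encoder.
```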

### Main differences between Accelerate and DeepSpeed for accelerating training

#### Feature comparison

Accelerate offers a simple way to speed up distributed training of PyTorch models without modifying the underlying model code[^1]. The library automatically handles the intricate details of multi-GPU, TPU, and mixed-precision training setups.

DeepSpeed, by contrast, focuses on efficient training and inference of very large deep-learning models, significantly reducing memory usage and communication overhead while maintaining high throughput[^2].

#### Ease of use

For researchers who want to enable distributed training quickly without studying the underlying mechanics, Accelerate is the friendlier choice; it simplifies configuration through a small API and supports switching seamlessly between hardware platforms[^3].

DeepSpeed is somewhat more involved to configure, but for workloads chasing peak performance it offers finer-grained control and a richer feature set, such as ZeRO (Zero Redundancy Optimizer)[^4].

#### Performance

For neural networks with extremely large parameter counts (billions to trillions), DeepSpeed's zero-redundancy optimizer and other techniques deliver better resource utilization and larger speedups[^5].

For routine experiments on small to medium-sized datasets, both libraries provide good acceleration. Since Accelerate was designed for broad applicability and compatibility from the start, it is usually easier to integrate into existing projects in such cases[^6].

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(batch['input_ids'], labels=batch['labels'])
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

```python
import deepspeed

engine, _, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    training_data=train_dataset
)

for epoch in range(args.num_train_epochs):
    engine.train()
    for step, batch in enumerate(train_dataloader):
        loss = engine(batch)
        engine.backward(loss)
        engine.step()
```