Single-Machine Multi-GPU Configuration
To make full use of the available compute resources and improve training efficiency, we use the single-machine multi-GPU environment provided by the laboratory.
Single-machine multi-GPU training means using several GPUs on one machine to parallelize computation or the training of deep learning models. Because the GPUs process data simultaneously, this significantly shortens training time and raises hardware utilization.
Under a multi-GPU setup we usually start with data parallelism, i.e. nn.DataParallel().
A demo of the overall workflow is shown below:
import torch
import torch.nn as nn

# Define the model (MyModel and dataloader are assumed to be defined elsewhere)
model = MyModel()
# Wrap the model for data parallelism across all visible GPUs
model = nn.DataParallel(model)
# Move the model to GPU
model = model.to('cuda')

# Example loss function and optimizer (any criterion/optimizer can be used here)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Training loop
for data in dataloader:
    inputs, labels = data
    inputs, labels = inputs.to('cuda'), labels.to('cuda')
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
As shown above, nn.DataParallel() offers essentially one-line data parallelism, which makes it convenient to adopt a multi-GPU training strategy.
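For larger models, the PyTorch documentation recommends torch.nn.parallel.DistributedDataParallel (DDP), which is also what the Trainer path discussed below ultimately uses. The following is only a minimal single-node sketch for comparison; demo_worker, the toy nn.Linear model, and the random batch are illustrative placeholders rather than part of our actual code:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_worker(rank, world_size):
    # One process per GPU; initialize the default process group
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 2).to(rank)       # toy model standing in for MyModel
    model = DDP(model, device_ids=[rank])   # gradients are all-reduced across ranks
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    inputs = torch.randn(8, 10, device=rank)  # toy batch standing in for the dataloader
    labels = torch.randn(8, 2, device=rank)
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(demo_worker, args=(world_size,), nprocs=world_size)

Unlike nn.DataParallel, DDP runs one process per GPU and synchronizes gradients with all-reduce, which avoids the single-process bottleneck of DataParallel.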
In the code, the Trainer class defines the training loop in _inner_training_loop:
def _inner_training_loop(
    self, batch_size=None, args=None, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None
):
    ...
    model = self._wrap_model(self.model_wrapped)
    ...
Distributed training is implemented through the self._wrap_model() method:
# Distributed training (should be after apex fp16 initialization)
# Distributed training using PyTorch FSDP
if self.is_fsdp_xla_enabled:
    try:
        from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP
        from torch_xla.distributed.fsdp import checkpoint_module
        from torch_xla.distributed.fsdp.wrap import (
            size_based_auto_wrap_policy,
            transformer_auto_wrap_policy,
        )
    except ImportError:
        raise ImportError("Missing XLA FSDP related module; please make sure to use torch-xla >= 2.0.")

    auto_wrap_policy = None
    auto_wrapper_callable = None
    default_transformer_cls_names_to_wrap = getattr(model, "_no_split_modules", None)
    fsdp_transformer_layer_cls_to_wrap = self.args.fsdp_config.get(
        "transformer_layer_cls_to_wrap", default_transformer_cls_names_to_wrap
    )

    if self.args.fsdp_config["min_num_params"] > 0:
        auto_wrap_policy = functools.partial(
            size_based_auto_wrap_policy, min_num_params=self.args.fsdp_config["min_num_params"]
        )
    elif fsdp_transformer_layer_cls_to_wrap is not None:
        transformer_cls_to_wrap = set()
        for layer_class in fsdp_transformer_layer_cls_to_wrap:
            transformer_cls = get_module_class_from_name(model, layer_class)
            if transformer_cls is None:
                raise Exception("Could not find the transformer layer class to wrap in the model.")
            else:
                transformer_cls_to_wrap.add(transformer_cls)

        auto_wrap_policy = functools.partial(
            transformer_auto_wrap_policy,
            # Transformer layer class to wrap
            transformer_layer_cls=transformer_cls_to_wrap,
        )
    fsdp_kwargs = self.args.xla_fsdp_config
    if self.args.fsdp_config["xla_fsdp_grad_ckpt"]:
        # Apply gradient checkpointing to auto-wrapped sub-modules if specified
        def auto_wrapper_callable(m, *args, **kwargs):
            return FSDP(checkpoint_module(m), *args, **kwargs)

    # Wrap the base model with an outer FSDP wrapper
    self.model = model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        auto_wrapper_callable=auto_wrapper_callable,
        **fsdp_kwargs,
    )

    # Patch `xm.optimizer_step` should not reduce gradients in this case,
    # as FSDP does not need gradient reduction over sharded parameters.
    def patched_optimizer_step(optimizer, barrier=False, optimizer_args={}):
        loss = optimizer.step(**optimizer_args)
        if barrier:
            xm.mark_step()
        return loss

    xm.optimizer_step = patched_optimizer_step
elif is_sagemaker_dp_enabled():
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[int(os.getenv("SMDATAPARALLEL_LOCAL_RANK"))]
    )
elif self.args.parallel_mode == ParallelMode.DISTRIBUTED:
    if is_torch_neuroncore_available():
        return model
    kwargs = {}
    if self.args.ddp_find_unused_parameters is not None:
        kwargs["find_unused_parameters"] = self.args.ddp_find_unused_parameters
    elif isinstance(model, PreTrainedModel):
        # find_unused_parameters breaks checkpointing as per
        # https://github.com/huggingface/transformers/pull/4659#issuecomment-643356021
        kwargs["find_unused_parameters"] = not model.is_gradient_checkpointing
    else:
        kwargs["find_unused_parameters"] = True

    if self.args.ddp_bucket_cap_mb is not None:
        kwargs["bucket_cap_mb"] = self.args.ddp_bucket_cap_mb

    if self.args.ddp_broadcast_buffers is not None:
        kwargs["broadcast_buffers"] = self.args.ddp_broadcast_buffers

    self.accelerator.ddp_handler = DistributedDataParallelKwargs(**kwargs)
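In practice this branch is reached simply by using the Trainer API. A minimal sketch is given below; model, train_dataset and the concrete argument values are placeholders standing in for our project code, not fixed requirements:

from transformers import Trainer, TrainingArguments

# model and train_dataset are assumed to be defined elsewhere in the project
training_args = TrainingArguments(
    output_dir="./output",              # placeholder output path
    per_device_train_batch_size=8,      # batch size per GPU process
    num_train_epochs=3,
    ddp_find_unused_parameters=False,   # forwarded to DistributedDataParallel by _wrap_model
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train() enters _inner_training_loop, which in turn calls self._wrap_model(...)
trainer.train()

When such a script is started via torch.distributed.launch as described next, args.parallel_mode becomes ParallelMode.DISTRIBUTED and the DDP-related kwargs above take effect.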
At launch time we use the following command and pass in the corresponding configuration parameters:
python -m torch.distributed.launch $DISTRIBUTED_ARGS
where the distributed arguments are defined as:
# Reuse the master address / rank information from the environment if it is already set;
# otherwise fall back to a local single-node default.
if [ $MASTER_ADDR ]; then
    echo $MASTER_ADDR
    echo $MASTER_PORT
    echo $WORLD_SIZE
    echo $RANK
else
    MASTER_ADDR=127.0.0.1
    MASTER_PORT=29500
    WORLD_SIZE=1
    RANK=0
fi

# --nproc_per_node is the number of processes (one per GPU) started on this node
DISTRIBUTED_ARGS="--nproc_per_node 1 \
                  --nnodes ${WORLD_SIZE} \
                  --node_rank ${RANK} \
                  --master_addr ${MASTER_ADDR} \
                  --master_port ${MASTER_PORT}"
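For reference, the fully expanded command on a single node with 4 GPUs would look roughly as follows; train.py is a hypothetical stand-in for the actual training script (not named above), and --nproc_per_node should be set to the number of GPUs actually used (the template above uses 1):

# hypothetical expansion of the launch command for 4 GPUs on one node
python -m torch.distributed.launch --nproc_per_node 4 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 29500 \
    train.py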