【Year-0 Grad Diary】23.12.30

Nice, knocked out (copied together) my thesis proposal in a few hours, feels great. Sent it to my advisor to look over hhh, hoping to get it submitted to the system today. The sooner it's in, the sooner it's done. At least it's finished before the new year, hehe.

Off to grab some food; I'll come back to the distributed training stuff this afternoon.


In the evening I went through the distributed training code and things are a lot clearer now.

And my advisor still hasn't looked at my report, and the submission system is under maintenance these days anyway.

And I found the article I was reading yesterday: Framework(二):分布式训练 - 知乎 (zhihu.com). It's pretty comprehensive.

First off, as I said yesterday, the key points are the arguments that have to be passed into the program: nnodes (number of nodes), node_rank (this node's rank), nproc_per_node (number of processes per node), master_addr (master node IP address), and master_port (communication port). The other important parameters that do not need to be passed in are: local_rank (the rank of a process within its node), rank (the rank of a process across all nodes), and world_size (the total number of processes). These the program obtains and passes around internally on its own.
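To make that concrete, here is a minimal sketch of my own (not from the article) of how a script started by torch.distributed.launch can see those internal values: the launcher exports them as environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), and after initialization the same numbers are available through the torch.distributed API.

import os
import torch.distributed as dist

# Environment variables set by the launcher for every process it spawns;
# init_process_group with init_method='env://' reads these itself.
rank = int(os.environ["RANK"])              # rank of this process across all nodes
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes
master_addr = os.environ["MASTER_ADDR"]     # master node IP address
master_port = os.environ["MASTER_PORT"]     # communication port

dist.init_process_group(backend="nccl")
assert dist.get_rank() == rank
assert dist.get_world_size() == world_size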

Let's look at data parallelism first; I haven't gotten to model parallelism yet, another day.

In data parallelism, every GPU ends up with the same model parameters, and the key question is how the model gets distributed onto the different GPUs (also called devices).

Launching the run uses torch.distributed.launch, so let's first look at this module's docstring.

``torch.distributed.launch`` is a module that spawns up multiple distributed
training processes on each of the training nodes.

The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. The utility can be used for either
CPU training or GPU training. If the utility is used for GPU training,
each distributed process will be operating on a single GPU. This can achieve
well-improved single-node training performance. It can also be used in
multi-node distributed training, by spawning up multiple processes on each node
for well-improved multi-node distributed training performance as well.
This will especially be beneficial for systems with multiple Infiniband
interfaces that have direct-GPU support, since all of them can be utilized for
aggregated communication bandwidth.

In both cases of single-node distributed training or multi-node distributed
training, this utility will launch the given number of processes per node
(``--nproc-per-node``). If used for GPU training, this number needs to be less
or equal to the number of GPUs on the current system (``nproc_per_node``),
and each process will be operating on a single GPU from *GPU 0 to
GPU (nproc_per_node - 1)*.

The takeaway: it works for CPU or GPU training, single-node multi-GPU, and multi-node multi-GPU.

Next, let's see how to use it (i.e. how to actually launch a run).

1. Single-Node multi-process distributed training

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
               arguments of your training script)

2. Multi-Node multi-process distributed training: (e.g. two nodes)

Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node-rank=0 --master-addr="192.168.1.1"
               --master-port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

Node 2:

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node-rank=1 --master-addr="192.168.1.1"
               --master-port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

So you just pass the arguments listed earlier to the launcher, and the extra arguments after YOUR_TRAINING_SCRIPT.py are parsed inside training_script.py with argparse. A couple of notes on this part: (1) python -m means the module that follows is run as a script, i.e. its __name__ == "__main__"; (2) arguments you don't need can simply be left out.
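For example (a made-up setup, not from the article), a single node with 4 GPUs and a hypothetical train.py could be launched as

    python -m torch.distributed.launch --nproc-per-node=4 train.py --batch-size 32

where --batch-size is just an ordinary argument of train.py itself, and nnodes, node_rank, master_addr, master_port are all omitted because the single-node defaults are enough.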

Next, a few points to watch out for when using it, which are really the key parts.

1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently only achieves the best performance using
the NCCL distributed backend. Thus NCCL backend is the recommended backend to
use for GPU training.

First: NCCL is the recommended communication backend, it gives the best performance for GPU training.

2. In your training program, you must parse the command-line argument:
``--local-rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:

Parsing the local_rank argument

    >>> # xdoctest: +SKIP
    >>> import argparse
    >>> parser = argparse.ArgumentParser()
    >>> parser.add_argument("--local-rank", type=int)
    >>> args = parser.parse_args()

Set your device to local rank using either

    >>> torch.cuda.set_device(args.local_rank)  # before your code runs

or

    >>> with torch.cuda.device(args.local_rank):
    >>>    # your code to run
    >>>    ...

Second: training_script.py must accept the --local-rank argument, because at launch time the module automatically assigns each process (one per GPU) a local_rank and passes it in via --local-rank; the model is then placed on the GPU matching that local_rank. For example, with 2 machines and 8 GPUs, 4 GPUs per machine, the assignment is: node_0 (gpu0, gpu1, gpu2, gpu3), node_1 (gpu0, gpu1, gpu2, gpu3). As for putting the model on the device, besides the two methods shown above you can also use model.to(device), as sketched below.
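A minimal sketch of that to(device) variant, assuming args.local_rank has been parsed as in the snippet above (Mymodel is just a placeholder for whatever model class you use, same name as in the summary script further down):

import torch

device = torch.device(f"cuda:{args.local_rank}")   # this process's own GPU
model = Mymodel()                                   # placeholder model class
model = model.to(device)                            # move this copy of the model onto that GPU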

3. In your training program, you are supposed to call the following function
at the beginning to start the distributed backend. It is strongly recommended
that ``init_method=env://``. Other init methods (e.g. ``tcp://``) may work,
but ``env://`` is the one that is officially supported by this module.

    >>> torch.distributed.init_process_group(backend='YOUR BACKEND',
    >>>                                      init_method='env://')

Third: training_script.py has to call this line to initialize the distributed backend, with backend='nccl'.
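Concretely, that is just the quoted call with the backend filled in as NCCL:

import torch.distributed as dist

dist.init_process_group(backend='nccl', init_method='env://')  # reads MASTER_ADDR / MASTER_PORT etc. from the environment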

4. In your training program, you can either use regular distributed functions
or use :func:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use
:func:`torch.nn.parallel.DistributedDataParallel` module,
here is how to configure it.

    >>> model = torch.nn.parallel.DistributedDataParallel(model,
    >>>                                                   device_ids=[args.local_rank],
    >>>                                                   output_device=args.local_rank)

Please ensure that ``device_ids`` argument is set to be the only GPU device id
that your code will be operating on. This is generally the local rank of the
process. In other words, the ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility

Fourth: training_script.py uses this line to wrap the model for distributed training.

5. Another way to pass ``local_rank`` to the subprocesses is via the environment variable
``LOCAL_RANK``. This behavior is enabled when you launch the script with
``--use-env=True``. You must adjust the subprocess example above to replace
``args.local_rank`` with ``os.environ['LOCAL_RANK']``; the launcher
will not pass ``--local-rank`` when you specify this flag.

.. warning::

    ``local_rank`` is NOT globally unique: it is only unique per process
    on a machine.  Thus, don't use it to decide if you should, e.g.,
    write to a networked filesystem.  See
    https://github.com/pytorch/pytorch/issues/12042 for an example of
    how things can go wrong if you don't do this correctly.

Fifth: this is just another way of getting local_rank, from the LOCAL_RANK environment variable when launching with --use-env=True. I don't fully get it yet, probably not critical for now.
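Still, a tiny sketch of what it seems to mean (my own, assuming the script was launched with --use-env=True, so --local-rank is not passed on the command line):

import os
import torch

local_rank = int(os.environ['LOCAL_RANK'])  # set by the launcher instead of the --local-rank argument
torch.cuda.set_device(local_rank)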

That covers how torch.distributed.launch is used.

Next, let's look at its actual code.

def main(args=None):
    warnings.warn(
        "The module torch.distributed.launch is deprecated\n"
        "and will be removed in future. Use torchrun.\n"
        "Note that --use-env is set by default in torchrun.\n"
        "If your script expects `--local-rank` argument to be set, please\n"
        "change it to read from `os.environ['LOCAL_RANK']` instead. See \n"
        "https://pytorch.org/docs/stable/distributed.html#launch-utility for \n"
        "further instructions\n",
        FutureWarning,
    )
    args = parse_args(args)
    launch(args)

Of course the module contains other stuff too, but this is the important part.

args = parse_args(args) is what defines and parses the arguments like nnodes, and launch(args) then actually starts everything; the details of that aren't important here, too complicated.

So, putting it all together, here's what training_script.py should contain:

import argparse

import torch
import torch.distributed as dist

# local_rank: passed in by torch.distributed.launch as --local-rank
parser = argparse.ArgumentParser(description='training')
parser.add_argument('--local-rank', type=int, help='local rank for dist')
args = parser.parse_args()
local_rank = args.local_rank

# nccl initialize
dist.init_process_group(backend='nccl')

# world_size: total number of processes, available once the process group is up
# (on a single node this equals torch.cuda.device_count())
world_size = dist.get_world_size()

# load the model onto this process's GPU and wrap it with DDP
torch.cuda.set_device(local_rank)
model = Mymodel().to(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[local_rank],
                                                  output_device=local_rank)

And then this model is ready to be used for training.

Actually I'm getting DDP and to(device) a bit mixed up. The way I understand it, DDP takes care of the communication between the GPUs, keeping the parameters, gradients and so on consistent across them, roughly that idea. Without DDP, with only to(device), each GPU just runs on its own: the model and each GPU's slice of the data are simply placed on that GPU and each GPU computes by itself, with nothing kept in sync.
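To convince myself, a rough sketch of one training step, continuing from the script above (my own sketch; criterion is a placeholder and loader is the DistributedSampler-based loader defined just below). With the DDP wrapper, backward() also all-reduces (averages) the gradients across all processes, so every GPU applies the identical update; with a plain to(device) model the same loop still runs, but each process keeps its own gradients and the copies drift apart.

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for x, y in loader:
    x, y = x.to(local_rank), y.to(local_rank)   # this process's shard of the batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)               # forward pass on this GPU only
    loss.backward()                             # DDP's hooks all-reduce the gradients here
    optimizer.step()                            # so every process applies the same update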

Next, a word on the dataset side, which uses DistributedSampler.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dataset = load_datasets()
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset=dataset, sampler=sampler, ...)

A dataloader created this way has distributed sampling built in. Taking single-node multi-GPU as the example, if the current environment has N GPUs, the whole dataset gets split into N shards and each GPU picks up only its own shard.
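One extra detail from the docs (not from the article): DistributedSampler seeds its shuffle with the epoch number, so you are supposed to call set_epoch at the start of every epoch, otherwise each epoch reuses the same ordering. Roughly:

for epoch in range(num_epochs):       # num_epochs is a placeholder
    sampler.set_epoch(epoch)          # re-seed so each epoch shuffles differently
    for x, y in loader:
        ...                           # normal training step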

And that's about it for the rest; the optimizer and so on don't need any distributed-specific setup.

So the main things under data parallelism are: how one model gets loaded onto each GPU, how the dataset gets loaded, and how the whole thing gets launched.


Ugh, today was tiring. We also played board games in the lab tonight, the werewolf kind where everyone bluffs everyone. I find it really hard to play; I'm probably not cut out for that kind of game. I still prefer mahjong hhh, I'm a one-man faction, I just want to win the whole table hhh.

Going out tomorrow, the new year's here, time to properly relax.


After the new year:

My own paper!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
