PyTorch: Easily Control Multiple CPUs/GPUs/TPUs with Accelerate
Preface
CPU? GPU? TPU?
Too many kinds of compute device, too much confusion? Rewriting chunks of code every time the environment changes? Not sure how to drive multiple CPUs/GPUs/TPUs, or just want an easier way to do it? OK! OK! OK!
Hugging Face's Accelerate library solves all of these problems: with just a few changed lines, the same training script automatically adapts to whatever compute devices are available.
Here is how to use it.
Official example
First, a rough look at the shape of a migration: remove the old .to(device) code, bring in Accelerator, and route the model, optimizer, data, and loss.backward() through it:
import torch
import torch.nn.functional as F
from datasets import load_dataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())

dataset = load_dataset('my_dataset')
data = torch.utils.data.DataLoader(dataset, shuffle=True)

model, optimizer, data = accelerator.prepare(model, optimizer, data)

model.train()
for epoch in range(10):
    for source, targets in data:
        optimizer.zero_grad()

        output = model(source)
        loss = F.cross_entropy(output, targets)

        accelerator.backward(loss)

        optimizer.step()
Controlling multiple CPUs/GPUs/TPUs from a single program
In short:
For a single compute device, small edits like the example above are all you need. With multiple devices (e.g. several GPUs) a few extra things need handling, so below is a complete PyTorch training example.
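If you ever need a device handle yourself (say, to move a stray tensor), accelerator.device replaces the usual manual selection. A minimal sketch:

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Before Accelerate you would write:
#   device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Accelerator picks the right device for the current process instead.
print(accelerator.device)                     # e.g. cuda:0, cuda:1, xla:0, or cpu
x = torch.ones(2, 2).to(accelerator.device)   # manual moves still work if needed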
设备环境
Check the current GPUs (two Tesla T4 cards) with $ nvidia-smi
Thu Apr 27 10:53:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
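The same information is visible from Python with plain torch calls (nothing Accelerate-specific):

import torch

print(torch.cuda.device_count())      # 2
print(torch.cuda.get_device_name(0))  # Tesla T4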
Install or upgrade Accelerate: $ pip install --upgrade accelerate (inside a notebook: !pip install --upgrade accelerate)
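A quick way to confirm the installation (Accelerate also ships an accelerate env command that dumps the full environment):

import accelerate
print(accelerate.__version__)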
Imports
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor, Compose
import torchvision.datasets as datasets
from accelerate import Accelerator
from accelerate import notebook_launcher
Load the data: FashionMNIST
train_data = datasets.FashionMNIST(
    root="./data",
    train=True,
    download=True,
    transform=Compose([ToTensor()])
)

test_data = datasets.FashionMNIST(
    root="./data",
    train=False,
    download=True,
    transform=Compose([ToTensor()])
)

print(train_data.data.shape)
print(test_data.data.shape)
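For reference, FashionMNIST contains 60,000 training images and 10,000 test images, each a 28x28 grayscale image, so the two prints should show:

torch.Size([60000, 28, 28])
torch.Size([10000, 28, 28])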
Create a simple CNN model
class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()
        self.module1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.module2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.flatten = nn.Flatten()
        self.linear1 = nn.Linear(7 * 7 * 64, 64)
        self.linear2 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.module1(x)
        out = self.module2(out)
        out = self.flatten(out)
        out = self.linear1(out)
        out = self.relu(out)
        out = self.linear2(out)
        return out
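A quick shape sanity check (a sketch: the two 2x2 max-pools shrink 28x28 to 7x7, which is where the 7 * 7 * 64 input of linear1 comes from):

model = CNNModel()
dummy = torch.randn(1, 1, 28, 28)   # one fake FashionMNIST image
print(model(dummy).shape)           # torch.Size([1, 10]), one logit per class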
Training function - training only
Pay attention to the accelerator-related lines. For multi-device training, the code after the for epoch in range(epoch_num): loop is essential: it synchronizes all processes, unwraps the model, and saves it once.
def training_function():
    epoch_num = 4
    batch_size = 64
    learning_rate = 0.005

    train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

    model = CNNModel()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # prepare() wraps the model and shards the dataloaders across devices
    accelerator = Accelerator()
    model, optimizer, train_loader, val_loader = accelerator.prepare(model, optimizer, train_loader, val_loader)

    for epoch in range(epoch_num):
        model.train()
        for i, (X_train, y_train) in enumerate(train_loader):
            out = model(X_train)
            loss = criterion(out, y_train)

            optimizer.zero_grad()
            accelerator.backward(loss)   # instead of loss.backward()
            optimizer.step()

            if (i + 1) % 100 == 0:
                print(f"{accelerator.device} Train... [epoch {epoch + 1}/{epoch_num}, step {i + 1}/{len(train_loader)}]\t[loss {loss.item()}]")

    # Essential in multi-device runs: sync all processes, strip the
    # distributed wrapper, then save (only the main process writes).
    accelerator.wait_for_everyone()
    model = accelerator.unwrap_model(model)
    accelerator.save(model, "model.pth")
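A variant sketch, if you prefer checkpointing only the weights rather than pickling the whole module (the filename here is illustrative): unwrap first, then save the state_dict. accelerator.save writes only on the main process, so every worker can call it safely.

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), "model_weights.pth")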
Training function - training and validation
Compared with the previous code, this version adds validation. Validation is the tricky part of multi-device training: each device evaluates only its shard of the data, so the per-device results have to be gathered and merged before computing metrics.
def training_function():
    epoch_num = 4
    batch_size = 64
    learning_rate = 0.005

    train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

    model = CNNModel()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    accelerator = Accelerator()
    model, optimizer, train_loader, val_loader = accelerator.prepare(model, optimizer, train_loader, val_loader)

    for epoch in range(epoch_num):
        model.train()
        for i, (X_train, y_train) in enumerate(train_loader):
            out = model(X_train)
            loss = criterion(out, y_train)

            optimizer.zero_grad()
            accelerator.backward(loss)
            optimizer.step()

            if (i + 1) % 100 == 0:
                print(f"{accelerator.device} Train... [epoch {epoch + 1}/{epoch_num}, step {i + 1}/{len(train_loader)}]\t[loss {loss.item()}]")

        # Validation: each process only sees its own shard of val_loader,
        # so predictions and labels must be gathered before computing metrics.
        model.eval()
        correct, total = 0, 0
        for X_val, y_val in val_loader:
            with torch.no_grad():
                output = model(X_val)
                _, pred = torch.max(output, 1)
            # gather_for_metrics collects results from every process and
            # drops the duplicate samples that pad the last uneven batch
            pred, y_val = accelerator.gather_for_metrics((pred, y_val))
            total += y_val.size(0)
            correct += (pred == y_val).sum()
        # accelerator.print only prints on the main process
        accelerator.print(f'epoch {epoch + 1}/{epoch_num}, accuracy = {100 * (correct.item() / total):.2f}')

    accelerator.wait_for_everyone()
    model = accelerator.unwrap_model(model)
    accelerator.save(model, "model.pth")
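Since accelerator.save defers to torch.save under the hood, loading the checkpoint back later is plain PyTorch; a sketch:

import torch

model = torch.load("model.pth")   # the full CNNModel module was saved
# On PyTorch 2.6+ the default changed; you may need:
#   model = torch.load("model.pth", weights_only=False)
model.eval()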
Training
If you are training locally, just call the training_function defined above at the end of your script, and then start the training script from the command line with $ accelerate launch example.py.

training_function()
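Putting it together, example.py would end with an entry point like this (a sketch of the assumed script layout: the data loading, CNNModel, and training_function definitions sit in the same file):

# example.py (sketch)
if __name__ == "__main__":
    training_function()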
If you are on Kaggle/Colab, use notebook_launcher to start training instead:

notebook_launcher(training_function, num_processes=2)
Launching training on 2 GPUs.
cuda:0 Train... [epoch 1/4, step 100/469] [loss 0.43843933939933777]
cuda:1 Train... [epoch 1/4, step 100/469] [loss 0.5267877578735352]
cuda:0 Train... [epoch 1/4, step 200/469] [loss 0.39918822050094604]cuda:1 Train... [epoch 1/4, step 200/469] [loss 0.2748252749443054]
cuda:1 Train... [epoch 1/4, step 300/469] [loss 0.54105544090271]cuda:0 Train... [epoch 1/4, step 300/469] [loss 0.34716445207595825]
cuda:1 Train... [epoch 1/4, step 400/469] [loss 0.2694844901561737]
cuda:0 Train... [epoch 1/4, step 400/469] [loss 0.4343942701816559]
epoch 1/4, accuracy = 88.49
cuda:0 Train... [epoch 2/4, step 100/469] [loss 0.19695354998111725]
cuda:1 Train... [epoch 2/4, step 100/469] [loss 0.2911057770252228]
cuda:0 Train... [epoch 2/4, step 200/469] [loss 0.2948791980743408]
cuda:1 Train... [epoch 2/4, step 200/469] [loss 0.292676717042923]
cuda:0 Train... [epoch 2/4, step 300/469] [loss 0.222089946269989]
cuda:1 Train... [epoch 2/4, step 300/469] [loss 0.28814008831977844]
cuda:0 Train... [epoch 2/4, step 400/469] [loss 0.3431250751018524]
cuda:1 Train... [epoch 2/4, step 400/469] [loss 0.2546379864215851]
epoch 2/4, accuracy = 87.31
cuda:1 Train... [epoch 3/4, step 100/469] [loss 0.24118559062480927]cuda:0 Train... [epoch 3/4, step 100/469] [loss 0.363821804523468]
cuda:0 Train... [epoch 3/4, step 200/469] [loss 0.36783623695373535]
cuda:1 Train... [epoch 3/4, step 200/469] [loss 0.18346744775772095]
cuda:0 Train... [epoch 3/4, step 300/469] [loss 0.23459288477897644]
cuda:1 Train... [epoch 3/4, step 300/469] [loss 0.2887689769268036]
cuda:0 Train... [epoch 3/4, step 400/469] [loss 0.3079166114330292]
cuda:1 Train... [epoch 3/4, step 400/469] [loss 0.18255220353603363]
epoch 3/4, accuracy = 88.46
cuda:1 Train... [epoch 4/4, step 100/469] [loss 0.27428603172302246]
cuda:0 Train... [epoch 4/4, step 100/469] [loss 0.17705145478248596]
cuda:1 Train... [epoch 4/4, step 200/469] [loss 0.2811894416809082]
cuda:0 Train... [epoch 4/4, step 200/469] [loss 0.22682836651802063]
cuda:0 Train... [epoch 4/4, step 300/469] [loss 0.2291710525751114]
cuda:1 Train... [epoch 4/4, step 300/469] [loss 0.32024848461151123]
cuda:0 Train... [epoch 4/4, step 400/469] [loss 0.24648766219615936]
cuda:1 Train... [epoch 4/4, step 400/469] [loss 0.0805584192276001]
epoch 4/4, accuracy = 89.38
For comparison, a second run in a different environment (note the xla:0 device names in the log, i.e. an XLA backend) produced:
Launching training on CPU.
xla:0 Train... [epoch 1/4, step 100/938] [loss 0.6051161289215088]
xla:0 Train... [epoch 1/4, step 200/938] [loss 0.27442359924316406]
xla:0 Train... [epoch 1/4, step 300/938] [loss 0.557417631149292]
xla:0 Train... [epoch 1/4, step 400/938] [loss 0.1840067058801651]
xla:0 Train... [epoch 1/4, step 500/938] [loss 0.5252436399459839]
xla:0 Train... [epoch 1/4, step 600/938] [loss 0.2718536853790283]
xla:0 Train... [epoch 1/4, step 700/938] [loss 0.2763175368309021]
xla:0 Train... [epoch 1/4, step 800/938] [loss 0.39897507429122925]
xla:0 Train... [epoch 1/4, step 900/938] [loss 0.28720396757125854]
epoch = 0, accuracy = 86.36
xla:0 Train... [epoch 2/4, step 100/938] [loss 0.24496735632419586]
xla:0 Train... [epoch 2/4, step 200/938] [loss 0.37713131308555603]
xla:0 Train... [epoch 2/4, step 300/938] [loss 0.3106330633163452]
xla:0 Train... [epoch 2/4, step 400/938] [loss 0.40438592433929443]
xla:0 Train... [epoch 2/4, step 500/938] [loss 0.38303741812705994]
xla:0 Train... [epoch 2/4, step 600/938] [loss 0.39199298620224]
xla:0 Train... [epoch 2/4, step 700/938] [loss 0.38932573795318604]
xla:0 Train... [epoch 2/4, step 800/938] [loss 0.26298171281814575]
xla:0 Train... [epoch 2/4, step 900/938] [loss 0.21517205238342285]
epoch = 1, accuracy = 90.07
xla:0 Train... [epoch 3/4, step 100/938] [loss 0.366019606590271]
xla:0 Train... [epoch 3/4, step 200/938] [loss 0.27360212802886963]
xla:0 Train... [epoch 3/4, step 300/938] [loss 0.2014923095703125]
xla:0 Train... [epoch 3/4, step 400/938] [loss 0.21998485922813416]
xla:0 Train... [epoch 3/4, step 500/938] [loss 0.28129786252975464]
xla:0 Train... [epoch 3/4, step 600/938] [loss 0.42534705996513367]
xla:0 Train... [epoch 3/4, step 700/938] [loss 0.22158119082450867]
xla:0 Train... [epoch 3/4, step 800/938] [loss 0.359947144985199]
xla:0 Train... [epoch 3/4, step 900/938] [loss 0.3221997022628784]
epoch = 2, accuracy = 90.36
xla:0 Train... [epoch 4/4, step 100/938] [loss 0.2814193069934845]
xla:0 Train... [epoch 4/4, step 200/938] [loss 0.16465164721012115]
xla:0 Train... [epoch 4/4, step 300/938] [loss 0.2897304892539978]
xla:0 Train... [epoch 4/4, step 400/938] [loss 0.13403896987438202]
xla:0 Train... [epoch 4/4, step 500/938] [loss 0.1135573536157608]
xla:0 Train... [epoch 4/4, step 600/938] [loss 0.14964193105697632]
xla:0 Train... [epoch 4/4, step 700/938] [loss 0.20239461958408356]
xla:0 Train... [epoch 4/4, step 800/938] [loss 0.23625142872333527]
xla:0 Train... [epoch 4/4, step 900/938] [loss 0.3418393135070801]
epoch = 3, accuracy = 90.11
Controlling multiple CPUs/GPUs/TPUs across multiple servers and processes
For the details, see the official examples. They cover:
multiple processes controlling multiple compute devices within a single server
multiple processes across multiple servers controlling multiple compute devices
Once your code is written, first run $ accelerate config on each server to generate its configuration file. A sample session:
(huggingface) PS C:\Users\alion\temp> accelerate config
------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
------------------------------------------------------------------------
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 192.168.101
What is the port you will use to communicate with the main process? 12345
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Do you wish to optimize your script with torch dynamo? [yes/NO]: no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]: 2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]: 0
------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at C:\Users\alion/.cache\huggingface\accelerate\default_config.yaml
Finally, start the training script on each server: $ accelerate launch example.py
(If you are running multiple processes on a single server, you only need to launch on that one machine.)
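Alternatively, the same settings can be passed as command-line flags instead of a config file (a sketch; replace <MAIN_NODE_IP> with the real address and adjust --machine_rank on each node):

$ accelerate launch --multi_gpu --num_machines 2 --machine_rank 0 \
      --main_process_ip <MAIN_NODE_IP> --main_process_port 12345 \
      --num_processes 2 example.py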