from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live
# 以t5-small为例
model = AutoModel.from_pretrained('t5-small)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
输出结果:
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 60M total params, 16M largest layer params.
per CPU | per GPU | Options
1.52GB | 0.06GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
1.52GB | 0.06GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
1.35GB | 0.17GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
1.35GB | 0.17GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
0.09GB | 1.08GB | offload_param=none, offload_optimizer=none, zero_init=1
0.34GB | 1.08GB | offload_param=none, offload_optimizer=none, zero_init=0
ZeRO Stage 1
: 划分optimizer states;
ZeRO Stage 2
: 划分gradient;
ZeRO Stage 3
:划分parameter。