DeepSpeed training log explained: multi-node, multi-GPU

Explain in detail what each part of the following DeepSpeed training log means:


gpu004: Time to load cpu_adam op: 2.638178586959839 seconds
gpu009: Loading extension module cpu_adam...
gpu009: Time to load cpu_adam op: 2.6063992977142334 seconds
gpu009: Loading extension module cpu_adam...
gpu009: Time to load cpu_adam op: 2.6627159118652344 seconds
gpu004: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu004: Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
gpu009: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu009: Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
gpu009: [2024-07-20 21:30:05,202] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
gpu009: [2024-07-20 21:30:05,202] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
gpu009: [2024-07-20 21:30:05,394] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
gpu009: [2024-07-20 21:30:05,394] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
gpu009: [2024-07-20 21:30:05,394] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
gpu009: [2024-07-20 21:30:05,394] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
gpu009: [2024-07-20 21:30:05,635] [INFO] [utils.py:800:see_memory_usage] Stage 3 initialize beginning
gpu009: [2024-07-20 21:30:05,636] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 6.54 GB CA 4.14 GB Max_CA 7 GB
gpu009: [2024-07-20 21:30:05,636] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.03 GB, percent = 16.3%
gpu009: [2024-07-20 21:30:05,654] [INFO] [stage3.py:130:__init__] Reduce bucket size 12845056
gpu009: [2024-07-20 21:30:05,654] [INFO] [stage3.py:131:__init__] Prefetch bucket size 11560550
gpu009: [2024-07-20 21:30:05,894] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
gpu009: [2024-07-20 21:30:05,894] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:05,894] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.08 GB, percent = 16.3%
gpu009: Parameter Offload: Total persistent parameters: 433664 in 169 params
gpu009: [2024-07-20 21:30:06,790] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
gpu009: [2024-07-20 21:30:06,791] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:06,791] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.14 GB, percent = 16.3%
gpu009: [2024-07-20 21:30:07,128] [INFO] [utils.py:800:see_memory_usage] Before creating fp16 partitions
gpu009: [2024-07-20 21:30:07,129] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:07,129] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.14 GB, percent = 16.3%
gpu009: [2024-07-20 21:30:34,929] [INFO] [utils.py:800:see_memory_usage] After creating fp16 partitions: 1
gpu009: [2024-07-20 21:30:34,937] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:34,938] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 174.31 GB, percent = 17.3%
gpu009: [2024-07-20 21:30:35,363] [INFO] [utils.py:800:see_memory_usage] Before creating fp32 partitions
gpu009: [2024-07-20 21:30:35,363] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:35,364] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 173.36 GB, percent = 17.2%
gpu009: [2024-07-20 21:30:36,000] [INFO] [utils.py:800:see_memory_usage] After creating fp32 partitions
gpu009: [2024-07-20 21:30:36,001] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:36,001] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 174.3 GB, percent = 17.3%
gpu009: [2024-07-20 21:30:38,669] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
gpu009: [2024-07-20 21:30:38,669] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:38,670] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 205.0 GB, percent = 20.3%
gpu009: [2024-07-20 21:30:39,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | init_optimizer_state: 1152.36
gpu009: [2024-07-20 21:30:40,354] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
gpu009: [2024-07-20 21:30:40,355] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:40,355] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 199.34 GB, percent = 19.8%
gpu009: [2024-07-20 21:30:40,355] [INFO] [stage3.py:487:_setup_for_real_optimizer] optimizer state initialized
gpu004: [INFO|trainer.py:2078] 2024-07-20 21:30:42,543 >> ***** Running training *****
gpu004: [INFO|trainer.py:2079] 2024-07-20 21:30:42,544 >> Num examples = 19,200,000
gpu004: [INFO|trainer.py:2080] 2024-07-20 21:30:42,544 >> Num Epochs = 9,223,372,036,854,775,807
gpu004: [INFO|trainer.py:2081] 2024-07-20 21:30:42,544 >> Instantaneous batch size per device = 8
gpu004: [INFO|trainer.py:2084] 2024-07-20 21:30:42,544 >> Total train batch size (w. parallel, distributed & accumulation) = 128
gpu004: [INFO|trainer.py:2085] 2024-07-20 21:30:42,544 >> Gradient Accumulation steps = 1
gpu004: [INFO|trainer.py:2086] 2024-07-20 21:30:42,544 >> Total optimization steps = 150,000
gpu004: [INFO|trainer.py:2087] 2024-07-20 21:30:42,564 >> Number of trainable parameters = 5,454,133,248
gpu009: [2024-07-20 21:30:42,868] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
gpu009: [2024-07-20 21:30:42,868] [INFO] [utils.py:801:see_memory_usage] MA 1.77 GB Max_MA 2.05 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:42,868] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 207.16 GB, percent = 20.6%
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7efb65fb1250>
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05], mom=[[0.9, 0.999]]
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] activation_checkpointing_config {
gpu009: "partition_activations": false,
gpu009: "contiguous_memory_optimization": false,
gpu009: "cpu_checkpointing": false,
gpu009: "number_checkpoints": null,
gpu009: "synchronize_checkpoint_boundary": false,
gpu009: "profile": false
gpu009: }
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] amp_enabled .................. False
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] amp_params ................... False
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] autotuning_config ............ {
gpu009: "enabled": false,
gpu009: "start_step": null,
gpu009: "end_step": null,
gpu009: "metric_path": null,
gpu009: "arg_mappings": null,
gpu009: "metric": "throughput",
gpu009: "model_info": null,
gpu009: "results_dir": "autotuning_results",
gpu009: "exps_dir": "autotuning_exps",
gpu009: "overwrite": true,
gpu009: "fast": true,
gpu009: "start_profile_step": 3,
gpu009: "end_profile_step": 5,
gpu009: "tuner_type": "gridsearch",
gpu009: "tuner_early_stopping": 5,
gpu009: "tuner_num_trials": 50,
gpu009: "model_info_path": null,
gpu009: "mp_size": 1,
gpu009: "max_train_batch_size": null,
gpu009: "min_train_batch_size": 1,
gpu009: "max_train_micro_batch_size_per_gpu": 1.024000e+03,
gpu009: "min_train_micro_batch_size_per_gpu": 1,
gpu009: "num_tuning_micro_batch_sizes": 3
gpu009: }

Below is a detailed breakdown of the DeepSpeed training log you provided:

1. Loading the CPU Adam optimizer

gpu004: Time to load cpu_adam op: 2.638178586959839 seconds
gpu009: Loading extension module cpu_adam...
gpu009: Time to load cpu_adam op: 2.6063992977142334 seconds
  • Meaning: These lines show how long each process on nodes gpu004 and gpu009 took to build and load the cpu_adam extension module, the CPU implementation of Adam that DeepSpeed uses when optimizer states are offloaded to host memory (a config sketch that triggers this follows below). Each rank prints its own message, which is why gpu009 appears more than once.
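
The cpu_adam op is only built when optimizer offloading to the CPU is enabled. Below is a minimal sketch, written as a Python dict, of the kind of DeepSpeed config that leads to these messages; the actual config file is not shown in the log, so the values here are illustrative:

# Sketch: ZeRO-3 config with CPU optimizer offload. This is what causes
# DeepSpeed to build/load the cpu_adam op and pick DeepSpeedCPUAdam.
ds_config = {
    "train_batch_size": 128,    # matches the trainer log further down
    "bf16": {"enabled": True},  # the log creates a torch.bfloat16 ZeRO-3 optimizer
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-5, "betas": [0.9, 0.999], "weight_decay": 0.0},
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}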

2. Adam optimizer configuration

gpu004: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu004: Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
  • Meaning: An Adam optimizer instance is created, using AVX2 vector instructions for acceleration (see the construction sketch below). The configuration parameters are:
    • alpha: the learning rate (1e-5 here)
    • betas: exponential decay rates for the first- and second-moment estimates
    • weight_decay: weight decay (L2 regularization), disabled here
    • adam_w=1: decoupled weight decay, i.e. the AdamW variant
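
DeepSpeed constructs this optimizer internally, but the logged values map directly onto DeepSpeedCPUAdam's constructor. A hand-written equivalent, as a sketch (model is a placeholder for your module):

from deepspeed.ops.adam import DeepSpeedCPUAdam

# Sketch only: in the real run DeepSpeed creates this from the config.
optimizer = DeepSpeedCPUAdam(
    model.parameters(),   # placeholder torch.nn.Module
    lr=1e-5,              # "alpha" in the log
    betas=(0.9, 0.999),
    weight_decay=0.0,
    adamw_mode=True,      # "adam_w=1" in the log
)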

3. DeepSpeed optimizer initialization

gpu009: [INFO] ... Using DeepSpeed Optimizer param name adamw as basic optimizer
gpu009: [INFO] ... DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
  • Meaning: The config names adamw as the basic optimizer; because optimizer offload to the CPU is enabled, DeepSpeed instantiates it as DeepSpeedCPUAdam. These messages are emitted from inside deepspeed.initialize(), sketched below.
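
For context, a minimal sketch of the call that kicks off this whole initialization sequence (model and ds_config are placeholders):

import deepspeed

# Sketch: initialize() builds the basic optimizer, wraps it in the ZeRO
# optimizer, and constructs the LR scheduler, emitting the log lines above.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,                          # placeholder module
    model_parameters=model.parameters(),
    config=ds_config,                     # e.g. the dict sketched in section 1
)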

4. ZeRO optimizer support check

gpu009: [INFO] ... Checking ZeRO support for optimizer=DeepSpeedCPUAdam
  • Meaning: DeepSpeed checks whether DeepSpeedCPUAdam is one of the optimizers it supports wrapping with ZeRO (the Zero Redundancy Optimizer), which shards optimizer state across data-parallel ranks.

5. ZeRO Stage 3 initialization

gpu009: [INFO] ... Creating fp16 ZeRO stage 3 optimizer
gpu009: [INFO] ... Stage 3 initialize beginning
  • Meaning: Initialization of the ZeRO Stage 3 optimizer begins. Stage 3 partitions parameters, gradients, and optimizer states across all ranks so that each GPU holds only a shard, reducing per-device memory pressure. Note that the "fp16" in the first message is a fixed log string; the next line in the raw log shows the dtype actually in use, torch.bfloat16. The nearby "Reduce bucket size" and "Prefetch bucket size" lines report the communication bucket sizes (see the arithmetic sketch below).
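
Incidentally, the bucket sizes 12845056 and 11560550 are consistent with the Hugging Face integration's "auto" values, which derive both from the model's hidden size. Assuming that is what happened here, the hidden size can be recovered:

# Assumption: bucket sizes were left on "auto", in which case the HF/DeepSpeed
# integration sets them to hidden_size**2 and 0.9 * hidden_size**2.
hidden_size = 3584
reduce_bucket_size = hidden_size ** 2               # 12845056, matches the log
prefetch_bucket_size = int(0.9 * hidden_size ** 2)  # 11560550, matches the log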

6. GPU memory usage

gpu009: MA 1.75 GB Max_MA 6.54 GB CA 4.14 GB Max_CA 7 GB
  • Meaning: GPU memory usage as reported by DeepSpeed's see_memory_usage (a readout sketch follows below):
    • MA: memory currently allocated on the GPU
    • Max_MA: peak allocated memory
    • CA: memory currently cached (reserved) by the PyTorch allocator
    • Max_CA: peak cached (reserved) memory
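
These counters come from PyTorch's CUDA allocator. A small sketch that prints the same style of line (the helper name report_gpu_memory is made up for illustration):

import torch

def report_gpu_memory(tag: str) -> None:
    # Sketch of how MA / Max_MA / CA / Max_CA values can be read out,
    # converted to GiB, in the spirit of DeepSpeed's see_memory_usage.
    ma = torch.cuda.memory_allocated() / 2**30
    max_ma = torch.cuda.max_memory_allocated() / 2**30
    ca = torch.cuda.memory_reserved() / 2**30
    max_ca = torch.cuda.max_memory_reserved() / 2**30
    print(f"{tag} MA {ma:.2f} GB Max_MA {max_ma:.2f} GB CA {ca:.2f} GB Max_CA {max_ca:.2f} GB")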

7. Training run parameters

gpu004: >> ***** Running training *****
gpu004: >>   Num examples = 19,200,000
gpu004: >>   Num Epochs = 9,223,372,036,854,775,807
  • Meaning: Training starts, and the trainer prints the dataset size, batch sizes, and step budget. The enormous "Num Epochs" is not a configuration error: 9,223,372,036,854,775,807 is 2^63 − 1 (Python's sys.maxsize), which the Hugging Face Trainer reports when training is bounded by max_steps on a streaming/iterable dataset of unknown length. The batch-size arithmetic is checked below.
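
A quick check of the logged numbers (plain arithmetic, assuming the standard formula total = per-device batch × gradient-accumulation steps × number of GPUs):

import sys
print(sys.maxsize)        # 9223372036854775807 -> the "Num Epochs" in the log

per_device_batch = 8      # "Instantaneous batch size per device"
grad_accum_steps = 1      # "Gradient Accumulation steps"
total_batch = 128         # "Total train batch size"
world_size = total_batch // (per_device_batch * grad_accum_steps)
print(world_size)         # 16 GPUs, e.g. 2 nodes x 8 GPUs (gpu004 and gpu009)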

8. Optimizer state initialization

gpu009: [INFO] ... time (ms) | init_optimizer_state: 1152.36
  • Meaning: The time spent initializing the optimizer states (Adam's momentum and variance buffers plus the fp32 master weights), about 1.15 s here. Because the optimizer is offloaded, these buffers live in host RAM, which is why the surrounding see_memory_usage lines show CPU virtual memory growing by roughly 30 GB on this node while GPU memory stays flat; a rough size estimate follows below.
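
A back-of-the-envelope estimate of that host-memory footprint, under the assumptions that the 16 ranks are spread over 2 nodes and that the offloaded optimizer keeps an fp32 master copy plus two Adam moment buffers per parameter (12 bytes/param), sharded by ZeRO-3:

n_params = 5_454_133_248        # "Number of trainable parameters" in the log
bytes_per_param = 4 + 4 + 4     # fp32 master copy + Adam momentum + variance
total_gib = n_params * bytes_per_param / 2**30  # ~61 GiB across all ranks
per_node_gib = total_gib / 2    # ~30.5 GiB per node (assumption: 2 nodes)
print(round(per_node_gib, 1))   # close to the ~30 GB CPU-memory growth seen above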

9. Final optimizer and LR scheduler

gpu009: [INFO] ... DeepSpeed Final Optimizer = adamw
gpu009: [INFO] ... using configured LR scheduler = WarmupDecayLR
  • Meaning: The final ZeRO-wrapped optimizer is adamw, and the configured learning-rate scheduler is WarmupDecayLR, which ramps the learning rate up during a warmup phase and then decays it over the total number of steps (a config sketch follows below).
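
A sketch of the scheduler section of a config that would produce this message; warmup_max_lr matches the lr seen earlier and total_num_steps matches the trainer's step count, while warmup_num_steps is purely illustrative:

# Sketch: WarmupDecayLR config (Python-dict form of the JSON config file).
scheduler_config = {
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-5,       # matches alpha/lr in the log
            "warmup_num_steps": 1000,    # illustrative value
            "total_num_steps": 150_000,  # matches "Total optimization steps"
        },
    }
}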

10. CPU memory usage changes

gpu009: ... CPU Virtual Memory: used = 207.16 GB, percent = 20.6%
  • Meaning: Host (CPU) virtual memory usage on the node, in absolute terms and as a percentage of total RAM. With ZeRO-3 plus CPU offload this is where the sharded optimizer state lives, so it is worth monitoring alongside GPU memory (a readout sketch follows below).
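
On the CPU side, see_memory_usage reads these numbers from psutil. A minimal sketch:

import psutil

# Sketch: the CPU half of a see_memory_usage-style report.
vm = psutil.virtual_memory()
print(f"CPU Virtual Memory: used = {vm.used / 2**30:.2f} GB, percent = {vm.percent}%")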

Summary

These log messages give a detailed view of a DeepSpeed training run: optimizer loading and initialization, GPU and CPU memory usage at each stage, and the overall training configuration. Reading them makes it possible to monitor training progress and resource consumption and to tune the setup accordingly.
