Explain this DeepSpeed training log in detail — what does each part mean?
gpu004: Time to load cpu_adam op: 2.638178586959839 seconds
gpu009: Loading extension module cpu_adam...
gpu009: Time to load cpu_adam op: 2.6063992977142334 seconds
gpu009: Loading extension module cpu_adam...
gpu009: Time to load cpu_adam op: 2.6627159118652344 seconds
gpu004: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu004: Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
gpu009: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu009: Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
gpu009: [2024-07-20 21:30:05,202] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
gpu009: [2024-07-20 21:30:05,202] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
gpu009: [2024-07-20 21:30:05,394] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
gpu009: [2024-07-20 21:30:05,394] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
gpu009: [2024-07-20 21:30:05,394] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
gpu009: [2024-07-20 21:30:05,394] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
gpu009: [2024-07-20 21:30:05,635] [INFO] [utils.py:800:see_memory_usage] Stage 3 initialize beginning
gpu009: [2024-07-20 21:30:05,636] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 6.54 GB CA 4.14 GB Max_CA 7 GB
gpu009: [2024-07-20 21:30:05,636] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.03 GB, percent = 16.3%
gpu009: [2024-07-20 21:30:05,654] [INFO] [stage3.py:130:init] Reduce bucket size 12845056
gpu009: [2024-07-20 21:30:05,654] [INFO] [stage3.py:131:init] Prefetch bucket size 11560550
gpu009: [2024-07-20 21:30:05,894] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
gpu009: [2024-07-20 21:30:05,894] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:05,894] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.08 GB, percent = 16.3%
gpu009: Parameter Offload: Total persistent parameters: 433664 in 169 params
gpu009: [2024-07-20 21:30:06,790] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
gpu009: [2024-07-20 21:30:06,791] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:06,791] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.14 GB, percent = 16.3%
gpu009: [2024-07-20 21:30:07,128] [INFO] [utils.py:800:see_memory_usage] Before creating fp16 partitions
gpu009: [2024-07-20 21:30:07,129] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:07,129] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 164.14 GB, percent = 16.3%
gpu009: [2024-07-20 21:30:34,929] [INFO] [utils.py:800:see_memory_usage] After creating fp16 partitions: 1
gpu009: [2024-07-20 21:30:34,937] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:34,938] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 174.31 GB, percent = 17.3%
gpu009: [2024-07-20 21:30:35,363] [INFO] [utils.py:800:see_memory_usage] Before creating fp32 partitions
gpu009: [2024-07-20 21:30:35,363] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:35,364] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 173.36 GB, percent = 17.2%
gpu009: [2024-07-20 21:30:36,000] [INFO] [utils.py:800:see_memory_usage] After creating fp32 partitions
gpu009: [2024-07-20 21:30:36,001] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:36,001] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 174.3 GB, percent = 17.3%
gpu009: [2024-07-20 21:30:38,669] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
gpu009: [2024-07-20 21:30:38,669] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:38,670] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 205.0 GB, percent = 20.3%
gpu009: [2024-07-20 21:30:39,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | init_optimizer_state: 1152.36
gpu009: [2024-07-20 21:30:40,354] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
gpu009: [2024-07-20 21:30:40,355] [INFO] [utils.py:801:see_memory_usage] MA 1.75 GB Max_MA 1.75 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:40,355] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 199.34 GB, percent = 19.8%
gpu009: [2024-07-20 21:30:40,355] [INFO] [stage3.py:487:_setup_for_real_optimizer] optimizer state initialized
gpu004: [INFO|trainer.py:2078] 2024-07-20 21:30:42,543 >> ***** Running training *****
gpu004: [INFO|trainer.py:2079] 2024-07-20 21:30:42,544 >> Num examples = 19,200,000
gpu004: [INFO|trainer.py:2080] 2024-07-20 21:30:42,544 >> Num Epochs = 9,223,372,036,854,775,807
gpu004: [INFO|trainer.py:2081] 2024-07-20 21:30:42,544 >> Instantaneous batch size per device = 8
gpu004: [INFO|trainer.py:2084] 2024-07-20 21:30:42,544 >> Total train batch size (w. parallel, distributed & accumulation) = 128
gpu004: [INFO|trainer.py:2085] 2024-07-20 21:30:42,544 >> Gradient Accumulation steps = 1
gpu004: [INFO|trainer.py:2086] 2024-07-20 21:30:42,544 >> Total optimization steps = 150,000
gpu004: [INFO|trainer.py:2087] 2024-07-20 21:30:42,564 >> Number of trainable parameters = 5,454,133,248
gpu009: [2024-07-20 21:30:42,868] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
gpu009: [2024-07-20 21:30:42,868] [INFO] [utils.py:801:see_memory_usage] MA 1.77 GB Max_MA 2.05 GB CA 4.14 GB Max_CA 4 GB
gpu009: [2024-07-20 21:30:42,868] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 207.16 GB, percent = 20.6%
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7efb65fb1250>
gpu009: [2024-07-20 21:30:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05], mom=[[0.9, 0.999]]
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] activation_checkpointing_config {
gpu009: "partition_activations": false,
gpu009: "contiguous_memory_optimization": false,
gpu009: "cpu_checkpointing": false,
gpu009: "number_checkpoints": null,
gpu009: "synchronize_checkpoint_boundary": false,
gpu009: "profile": false
gpu009: }
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] amp_enabled .................. False
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] amp_params ................... False
gpu009: [2024-07-20 21:30:42,881] [INFO] [config.py:991:print] autotuning_config ............ {
gpu009: "enabled": false,
gpu009: "start_step": null,
gpu009: "end_step": null,
gpu009: "metric_path": null,
gpu009: "arg_mappings": null,
gpu009: "metric": "throughput",
gpu009: "model_info": null,
gpu009: "results_dir": "autotuning_results",
gpu009: "exps_dir": "autotuning_exps",
gpu009: "overwrite": true,
gpu009: "fast": true,
gpu009: "start_profile_step": 3,
gpu009: "end_profile_step": 5,
gpu009: "tuner_type": "gridsearch",
gpu009: "tuner_early_stopping": 5,
gpu009: "tuner_num_trials": 50,
gpu009: "model_info_path": null,
gpu009: "mp_size": 1,
gpu009: "max_train_batch_size": null,
gpu009: "min_train_batch_size": 1,
gpu009: "max_train_micro_batch_size_per_gpu": 1.024000e+03,
gpu009: "min_train_micro_batch_size_per_gpu": 1,
gpu009: "num_tuning_micro_batch_sizes": 3
gpu009: }
Here is a detailed breakdown of the DeepSpeed training log you provided:

1. Loading the CPU Adam optimizer

```
gpu004: Time to load cpu_adam op: 2.638178586959839 seconds
gpu009: Loading extension module cpu_adam...
gpu009: Time to load cpu_adam op: 2.6063992977142334 seconds
```

- Meaning: DeepSpeed JIT-compiles its CPU Adam kernel (the `cpu_adam` C++ extension) the first time it is needed and caches the build under `~/.cache/torch_extensions`; these lines show each rank on gpu004 and gpu009 loading that extension, taking about 2.6 seconds. The extension is needed because the optimizer is offloaded to the CPU (see the ZeRO configuration below).
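The same load can be triggered by hand. A minimal sketch using DeepSpeed's op-builder API (the timing wrapper is ours, mirroring the log line):

```python
# Minimal sketch: manually loading the same cpu_adam extension the log
# times above. The first call JIT-compiles with ninja; subsequent loads
# (like the ~2.6 s seen here) reuse the cached build.
import time
from deepspeed.ops.op_builder import CPUAdamBuilder

start = time.time()
cpu_adam_module = CPUAdamBuilder().load()  # compiles on first use, then cached
print(f"Time to load cpu_adam op: {time.time() - start} seconds")
```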
2. Adam optimizer configuration

```
gpu004: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu004: Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
```

- Meaning: the CPU Adam optimizer was created and will use AVX2 SIMD instructions (the best vector instruction set detected on this CPU). The configuration parameters are:
  - alpha: the learning rate (1e-5 here)
  - betas: the exponential decay rates of the first- and second-moment estimates
  - weight_decay: weight decay (L2-style regularization), 0 here
  - adam_w=1: AdamW mode, i.e. decoupled weight decay rather than classic L2 regularization
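For reference, here is a pure-PyTorch sketch of a single AdamW update (illustrative only — DeepSpeed's actual kernel is vectorized C++), showing what alpha, betas, and weight_decay control:

```python
import torch

def adamw_step(p, grad, m, v, step, alpha=1e-5, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One AdamW update, in place on parameter p and moment buffers m, v."""
    b1, b2 = betas
    m.mul_(b1).add_(grad, alpha=1 - b1)            # first moment: EMA of gradients
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)  # second moment: EMA of grad^2
    m_hat = m / (1 - b1 ** step)                   # bias correction (step >= 1)
    v_hat = v / (1 - b2 ** step)
    p.mul_(1 - alpha * weight_decay)               # adam_w=1: decoupled weight decay
    p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-alpha)
```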
3. DeepSpeed optimizer initialization

```
gpu009: [INFO] ... Using DeepSpeed Optimizer param name adamw as basic optimizer
gpu009: [INFO] ... DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
```

- Meaning: the config requests AdamW as the basic optimizer; because optimizer state is offloaded to the CPU, DeepSpeed instantiates it as DeepSpeedCPUAdam. The "Removing param_group that has no 'params'" line simply means an empty parameter group was dropped.
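A hedged sketch of the DeepSpeed config that would produce these lines — values that appear in the log are used directly, everything else is a plausible assumption rather than something read from the actual job:

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 8,         # matches "batch size per device"
    "bf16": {"enabled": True},                   # matches the torch.bfloat16 line
    "optimizer": {
        "type": "AdamW",                         # -> "param name adamw as basic optimizer"
        "params": {"lr": 1e-5, "betas": [0.9, 0.999], "weight_decay": 0.0},
    },
    "zero_optimization": {
        "stage": 3,                              # -> "ZeRO stage 3 optimizer"
        "offload_optimizer": {"device": "cpu"},  # -> DeepSpeedCPUAdam
    },
}
```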
4. ZeRO optimizer support check

```
gpu009: [INFO] ... Checking ZeRO support for optimizer=DeepSpeedCPUAdam
```

- Meaning: DeepSpeed verifies that DeepSpeedCPUAdam is among the optimizers known to work with ZeRO (Zero Redundancy Optimizer) state partitioning.
5. ZeRO Stage 3 initialization

```
gpu009: [INFO] ... Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
gpu009: [INFO] ... Creating torch.bfloat16 ZeRO stage 3 optimizer
gpu009: [INFO] ... Stage 3 initialize beginning
```

- Meaning: ZeRO Stage 3 initialization begins. Stage 3 partitions parameters, gradients, and optimizer states across the data-parallel ranks to reduce per-GPU memory. The "fp16" in the first message is generic wording in that log string; the second line shows the dtype actually used for training is torch.bfloat16. MiCS and hierarchical parameter gathering are disabled.
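All of this is kicked off by a single call. A minimal sketch, with a stand-in model and the `ds_config` dict sketched in section 3 (both are assumptions, not taken from the actual job):

```python
import torch.nn as nn
import deepspeed

model = nn.Linear(1024, 1024)  # stand-in for the real 5.45B-parameter model

# Returns the wrapped engine plus the (ZeRO-wrapped) optimizer and scheduler;
# everything logged above happens inside this call.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```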
6. Memory usage

```
gpu009: MA 1.75 GB Max_MA 6.54 GB CA 4.14 GB Max_CA 7 GB
gpu009: CPU Virtual Memory: used = 164.03 GB, percent = 16.3%
```

- Meaning: DeepSpeed's see_memory_usage helper prints GPU memory counters at each initialization milestone:
  - MA: memory currently allocated by tensors (torch.cuda.memory_allocated)
  - Max_MA: peak allocated memory (torch.cuda.max_memory_allocated)
  - CA: memory reserved/cached by PyTorch's allocator (torch.cuda.memory_reserved)
  - Max_CA: peak reserved memory (torch.cuda.max_memory_reserved)
  The accompanying line reports host-side virtual memory usage, which matters here because ZeRO offload keeps optimizer state in CPU RAM.
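A small sketch reproducing this readout from the same PyTorch and psutil counters that see_memory_usage reads:

```python
import torch
import psutil

GB = 1024 ** 3
# GPU-side counters for the current device:
print(f"MA {torch.cuda.memory_allocated() / GB:.2f} GB "
      f"Max_MA {torch.cuda.max_memory_allocated() / GB:.2f} GB "
      f"CA {torch.cuda.memory_reserved() / GB:.2f} GB "
      f"Max_CA {torch.cuda.max_memory_reserved() / GB:.2f} GB")
# Host-side virtual memory, as in the "CPU Virtual Memory" line:
vm = psutil.virtual_memory()
print(f"CPU Virtual Memory: used = {vm.used / GB:.2f} GB, percent = {vm.percent}%")
```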
7. Training run parameters

```
gpu004: >> ***** Running training *****
gpu004: >> Num examples = 19,200,000
gpu004: >> Num Epochs = 9,223,372,036,854,775,807
gpu004: >> Instantaneous batch size per device = 8
gpu004: >> Total train batch size (w. parallel, distributed & accumulation) = 128
gpu004: >> Total optimization steps = 150,000
gpu004: >> Number of trainable parameters = 5,454,133,248
```

- Meaning: the Hugging Face Trainer starts the training loop: 19.2M examples, 150,000 optimizer steps, a ~5.45B-parameter model. "Num Epochs" of 9,223,372,036,854,775,807 (2^63 − 1, i.e. sys.maxsize) is not a configuration error: the Trainer uses this sentinel when the training dataset is an IterableDataset (e.g. a streaming dataset) with unknown length, so the run length is governed entirely by max_steps. The total batch size of 128 = 8 per device × 1 gradient-accumulation step × 16 data-parallel processes, implying 16 GPUs.
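A quick arithmetic check of those numbers (the world size is inferred, not printed in this excerpt):

```python
import sys

per_device_batch = 8
grad_accum = 1
total_batch = 128
world_size = total_batch // (per_device_batch * grad_accum)
print(world_size)  # -> 16 data-parallel processes / GPUs

# "Num Epochs" is just the int64 sentinel used for length-less
# (streaming) datasets:
assert 9_223_372_036_854_775_807 == sys.maxsize
```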
8. Optimizer state initialization

```
gpu009: [INFO] ... time (ms) | init_optimizer_state: 1152.36
```

- Meaning: allocating and initializing the optimizer state (Adam's fp32 momentum and variance buffers, which live in host RAM because of CPU offload) took about 1.15 seconds, consistent with the growth in CPU memory shown by the surrounding see_memory_usage lines.
9. Final optimizer and learning-rate scheduler

```
gpu009: [INFO] ... DeepSpeed Final Optimizer = adamw
gpu009: [INFO] ... using configured LR scheduler = WarmupDecayLR
```

- Meaning: the final optimizer is AdamW (wrapped by the ZeRO Stage 3 engine), and the learning-rate scheduler is WarmupDecayLR, which warms the LR up to its peak and then decays it over the total number of training steps. The "step=0, skipped=0, lr=[1e-05]" line confirms the starting learning rate.
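A hedged sketch of the scheduler block that would produce this line. Only total_num_steps (150,000) and the 1e-5 peak LR are visible in the log; the warmup values below are assumptions for illustration:

```python
# Extends the ds_config dict sketched in section 3.
ds_config["scheduler"] = {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": 0,          # assumption
        "warmup_max_lr": 1e-5,       # matches lr=[1e-05] in the log
        "warmup_num_steps": 1000,    # assumption
        "total_num_steps": 150000,   # matches "Total optimization steps"
    },
}
```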
10. Memory growth during initialization

```
gpu009: ... CPU Virtual Memory: used = 207.16 GB, percent = 20.6%
```

- Meaning: host virtual-memory usage, as an absolute amount and as a percentage of total system RAM (207.16 GB at 20.6% implies roughly 1 TB of RAM on the node). Across the log it grows from ~164 GB to ~207 GB as ZeRO-3 offload materializes the bf16 and fp32 parameter partitions and the optimizer state in CPU memory.
Summary

These log lines give a step-by-step view of DeepSpeed start-up: building and loading the cpu_adam kernel, creating the AdamW/DeepSpeedCPUAdam optimizer, initializing ZeRO Stage 3 with CPU offload, the GPU and CPU memory footprint at each milestone, and finally the Hugging Face Trainer's run parameters. Watching these numbers — especially MA/CA on the GPU and CPU virtual memory — is the main way to monitor resource usage and tune the offload configuration.