During fine-tuning, the run completes all 3001 steps and then starts over and runs again — what is going on? Any advice would be appreciated.
The parameters are as follows:
# Configure the training arguments
args = TrainingArguments(
    output_dir="./sijicaidan0806/GLM4",
    max_steps=3001,                  # maximum number of training steps
    per_device_train_batch_size=1,   # training batch size per device
    dataloader_num_workers=16,       # number of data-loader worker processes
    remove_unused_columns=False,     # whether to drop unused dataset columns
    gradient_accumulation_steps=32,
    logging_steps=200,
    num_train_epochs=10,
    save_steps=200,
    learning_rate=1e-5,
    save_on_each_node=True,
    gradient_checkpointing=True,
)
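One detail worth noting about this config: when `max_steps > 0`, the Hugging Face `Trainer` runs for exactly that many optimizer steps and `num_train_epochs` is ignored, which is why the logs below report `epoch: 61.68` rather than stopping at 10. A minimal arithmetic sketch, assuming a dataset of roughly 1,557 samples (that size is inferred from the logged epoch values, it is not stated in the post):

```python
# Sketch: why the log ends at epoch ~61.68 even though num_train_epochs=10.
# With max_steps set, the Trainer runs exactly max_steps optimizer steps.
max_steps = 3001
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
dataset_size = 1557  # assumption, back-calculated from the logged epoch counts

# Each optimizer step consumes batch_size * grad_accum samples.
samples_seen = max_steps * per_device_train_batch_size * gradient_accumulation_steps
epochs = samples_seen / dataset_size
print(samples_seen, round(epochs, 2))  # 96032 samples, ~61.68 epochs
```

The same arithmetic matches the per-log-line epochs too: 200 steps × 32 accumulated samples ≈ 6400 samples ≈ 4.11 epochs, exactly as the first log line shows.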
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
{'loss': 0.7606, 'learning_rate': 9.333555481506165e-06, 'epoch': 4.11}
{'loss': 0.027, 'learning_rate': 8.66711096301233e-06, 'epoch': 8.22}
{'loss': 0.005, 'learning_rate': 8.000666444518494e-06, 'epoch': 12.33}
{'loss': 0.0016, 'learning_rate': 7.33422192602466e-06, 'epoch': 16.44}
{'loss': 0.0003, 'learning_rate': 6.6677774075308235e-06, 'epoch': 20.55}
{'loss': 0.0, 'learning_rate': 6.001332889036988e-06, 'epoch': 24.66}
{'loss': 0.0, 'learning_rate': 5.3348883705431534e-06, 'epoch': 28.77}
{'loss': 0.0, 'learning_rate': 4.668443852049317e-06, 'epoch': 32.88}
{'loss': 0.0, 'learning_rate': 4.001999333555482e-06, 'epoch': 36.99}
{'loss': 0.0, 'learning_rate': 3.335554815061646e-06, 'epoch': 41.1}
{'loss': 0.0, 'learning_rate': 2.669110296567811e-06, 'epoch': 45.22}
{'loss': 0.0, 'learning_rate': 2.0026657780739757e-06, 'epoch': 49.33}
{'loss': 0.0, 'learning_rate': 1.33622125958014e-06, 'epoch': 53.44}
{'loss': 0.0, 'learning_rate': 6.697767410863045e-07, 'epoch': 57.55}
{'loss': 0.0, 'learning_rate': 3.332222592469177e-09, 'epoch': 61.66}
{'train_runtime': 15886.9613, 'train_samples_per_second': 6.045, 'train_steps_per_second': 0.189, 'train_loss': 0.052959544249882896, 'epoch': 61.68}
100%|██████████| 3001/3001 [4:24:46<00:00, 5.29s/it]
{'loss': 0.0, 'learning_rate': 9.333555481506165e-06, 'epoch': 4.11}
7%|▋ | 200/3001 [17:35<4:07:02, 5.29s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-200 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 8.66711096301233e-06, 'epoch': 8.22}
13%|█▎ | 400/3001 [35:10<3:47:52, 5.26s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-400 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 8.000666444518494e-06, 'epoch': 12.33}
20%|█▉ | 600/3001 [52:48<3:32:30, 5.31s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-600 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0006, 'learning_rate': 7.33422192602466e-06, 'epoch': 16.44}
27%|██▋ | 800/3001 [1:10:26<3:13:37, 5.28s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-800 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0115, 'learning_rate': 6.6677774075308235e-06, 'epoch': 20.55}
33%|███▎ | 1000/3001 [1:28:06<2:56:51, 5.30s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0014, 'learning_rate': 6.001332889036988e-06, 'epoch': 24.66}
40%|███▉ | 1200/3001 [1:45:44<2:38:40, 5.29s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-1200 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 5.3348883705431534e-06, 'epoch': 28.77}
47%|████▋ | 1400/3001 [2:03:26<2:21:08, 5.29s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-1400 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 4.668443852049317e-06, 'epoch': 32.88}
53%|█████▎ | 1600/3001 [2:21:02<2:03:10, 5.28s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-1600 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 4.001999333555482e-06, 'epoch': 36.99}
60%|█████▉ | 1800/3001 [2:38:40<1:44:47, 5.23s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-1800 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 3.335554815061646e-06, 'epoch': 41.1}
67%|██████▋ | 2000/3001 [2:56:21<1:28:51, 5.33s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-2000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 2.669110296567811e-06, 'epoch': 45.22}
73%|███████▎ | 2200/3001 [3:14:01<1:10:20, 5.27s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-2200 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 2.0026657780739757e-06, 'epoch': 49.33}
80%|███████▉ | 2400/3001 [3:31:38<53:23, 5.33s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-2400 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 1.33622125958014e-06, 'epoch': 53.44}
87%|████████▋ | 2600/3001 [3:49:16<35:23, 5.30s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-2600 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 6.697767410863045e-07, 'epoch': 57.55}
93%|█████████▎| 2800/3001 [4:06:55<17:45, 5.30s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-2800 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'loss': 0.0, 'learning_rate': 3.332222592469177e-09, 'epoch': 61.66}
100%|█████████▉| 3000/3001 [4:24:35<00:05, 5.30s/it]Checkpoint destination directory ./sijicaidan0806/GLM4/checkpoint-3000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
{'train_runtime': 15881.1741, 'train_samples_per_second': 6.047, 'train_steps_per_second': 0.189, 'train_loss': 0.0009113291187475037, 'epoch': 61.68}
100%|██████████| 3001/3001 [4:24:41<00:00, 5.29s/it]
trainer.train()
TrainOutput(global_step=3001, training_loss=0.0009113291187475037, metrics={'train_runtime': 15881.1741, 'train_samples_per_second': 6.047, 'train_steps_per_second': 0.189, 'train_loss': 0.0009113291187475037, 'epoch': 61.68})
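The repeated "Checkpoint destination directory ... already exists and is non-empty" warnings in the second run happen because it writes into the same `output_dir` as the first run, so every `checkpoint-N` directory it tries to save already exists. A minimal config sketch (the timestamped directory name is my own suggestion, not from the original post) that gives each run a fresh output directory:

```python
# Sketch: avoid overwriting a previous run's checkpoints by giving each
# run its own output directory, e.g. with a timestamp suffix.
import time
from transformers import TrainingArguments

run_dir = f"./sijicaidan0806/GLM4-{time.strftime('%Y%m%d-%H%M%S')}"
args = TrainingArguments(
    output_dir=run_dir,  # fresh directory per run, so checkpoint-200 etc. don't collide
    max_steps=3001,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    save_steps=200,
    learning_rate=1e-5,
)
```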
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 10/10 [00:03<00:00, 2.94it/s]