1. Example: Running training + Running Evaluation parameters:
***** Running training *****
[INFO|trainer.py:2079] 2025-01-09 11:53:02,473 >> Num examples = 6,668
[INFO|trainer.py:2080] 2025-01-09 11:53:02,473 >> Num Epochs = 3
[INFO|trainer.py:2081] 2025-01-09 11:53:02,473 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2084] 2025-01-09 11:53:02,473 >> Total train batch size (w. parallel, distributed& accumulation) = 4
[INFO|trainer.py:2085] 2025-01-09 11:53:02,473 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2086] 2025-01-09 11:53:02,473 >> Total optimization steps = 5,001
[INFO|trainer.py:2087] 2025-01-09 11:53:02,475 >> Number of trainable parameters = 14,823,424
***** Running Evaluation *****
[INFO|trainer.py:3721] 2025-01-09 11:58:14,776 >> Num examples = 1667
[INFO|trainer.py:3724] 2025-01-09 11:58:14,776 >> Batch size = 2
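Before walking through each field, here is a minimal sketch of a TrainingArguments configuration that would produce these numbers. Only the numeric hyperparameters come from the log; output_dir is a placeholder, not the original author's path.

```python
from transformers import TrainingArguments

# Sketch of settings consistent with the log above; output_dir is a placeholder.
args = TrainingArguments(
    output_dir="./output",           # placeholder path
    num_train_epochs=3,              # "Num Epochs = 3"
    per_device_train_batch_size=2,   # "Instantaneous batch size per device = 2"
    gradient_accumulation_steps=2,   # "Gradient Accumulation steps = 2"
    per_device_eval_batch_size=2,    # "Batch size = 2" in the evaluation log
)
```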
2. Running training
2.1. Num examples = 6,668
The number of training samples is still 6,668.
2.2. Num Epochs = 3
The number of epochs is now 3, meaning the model will make 3 passes over these 6,668 samples. Compared with the earlier 10-epoch run, the training schedule is shorter.
2.3. Instantaneous batch size per device = 2
The per-device batch size has been reduced to 2, so each training batch holds 2 samples, likely to cut memory consumption or to fit smaller hardware.
2.4. Total train batch size = 4
The total train batch size is 4. This is the effective batch size per optimizer step, aggregated across all parallel/distributed devices and gradient accumulation; the breakdown is given in section 2.6 below.
2.5. Gradient Accumulation steps = 2
Gradient accumulation remains at 2 steps, meaning gradients are accumulated over 2 batches before a single parameter update is performed, as sketched below.
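As a minimal illustrative sketch in plain PyTorch (not the Trainer's internal implementation), gradient accumulation looks like this; the toy linear model and random data are stand-ins:

```python
import torch

# Toy illustration: gradients from 2 micro-batches are summed
# before a single optimizer step is taken.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 2

micro_batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(4)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()  # scale so the accumulated sum acts as a mean
    if step % accumulation_steps == 0:
        optimizer.step()        # one optimization step per 2 micro-batches
        optimizer.zero_grad()
```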
2.6. Total optimization steps = 5,001
The total number of optimization steps is 5,001, which follows from the sample count and batch size: batches per epoch = 6,668 (total samples) / 4 (total train batch size) = 1,667, and 1,667 × 3 epochs = 5,001 steps, matching the logged total. The total train batch size of 4 = 2 (instantaneous batch size per device) × 2 (gradient accumulation steps), which implies training ran on a single device.
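The same arithmetic as a quick check; the single-device assumption is mine, since the log does not state the device count explicitly:

```python
num_examples = 6_668
per_device_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1  # assumption: not stated in the log
num_epochs = 3

total_train_batch_size = per_device_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = num_examples // total_train_batch_size
total_optimization_steps = steps_per_epoch * num_epochs
print(total_train_batch_size, steps_per_epoch, total_optimization_steps)  # 4 1667 5001
```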
2.7. Number of trainable parameters = 14,823,424
The model's trainable parameter count is still 14,823,424.
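This count can be reproduced for any PyTorch model with a standard one-liner; the Linear layer below is a generic stand-in, not the model from the log:

```python
import torch

# Generic sketch: count trainable (requires_grad) parameters of a PyTorch model.
model = torch.nn.Linear(10, 1)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of trainable parameters = {trainable:,}")
```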
3. Running Evaluation
3.1. Num examples = 1,667
The evaluation set contains 1,667 samples. The evaluation set is typically smaller than the training set.
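Incidentally, 6,668 + 1,667 = 8,335 and 1,667 is exactly 20% of that total, so the counts are consistent with an 80/20 train/eval split. Whether the original author split this way is my inference; the sketch below only shows that such a split reproduces the numbers:

```python
from datasets import Dataset

# Hypothetical reconstruction: an 80/20 split of 8,335 examples
# yields exactly the sample counts seen in the two logs.
dataset = Dataset.from_dict({"text": [f"example {i}" for i in range(8_335)]})
splits = dataset.train_test_split(test_size=0.2, seed=42)
print(len(splits["train"]), len(splits["test"]))  # 6668 1667
```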
3.2. Batch size = 2
During evaluation the batch size is 2, matching the per-device training batch size, so evaluation also proceeds in small batches.
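As a final sanity check (my arithmetic, not stated in the log), 1,667 examples at batch size 2 means evaluation runs 834 forward passes, with the last batch holding a single example:

```python
import math

eval_examples = 1_667
eval_batch_size = 2
print(math.ceil(eval_examples / eval_batch_size))  # 834 batches; the last holds 1 example
```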