Problem
I tried to train this project on my own data:
https://github.com/yangjianxin1/GPT2-chitchat-directml
The loss comes down fine during the first training run, but when I continue training from pretrained_model it no longer decreases, and the loss values shown are completely different from where the first run ended.
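One common cause of this symptom (not confirmed for this repo) is that resuming restores only the model weights, while the optimizer and warmup LR scheduler are re-created from scratch, so the run effectively restarts. A minimal sketch of saving and restoring the full training state in PyTorch; the toy model, file name `checkpoint.pt`, and epoch number are placeholders:

```python
import torch
import torch.nn as nn

# Toy model standing in for the GPT-2 model; hyperparameters mirror the args
# in the log below (lr=2.6e-05, eps=1e-09, warmup_steps=4).
model = nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2.6e-5, eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 4)  # linear warmup over 4 steps
)

# Save everything needed to resume, not just the model weights.
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "epoch": 301,
}, "checkpoint.pt")

# Resume: restore all three state dicts so the optimizer moments survive
# and the LR schedule does not restart its warmup from step 0.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
start_epoch = ckpt["epoch"] + 1
```

If only `model.state_dict()` was saved, the resumed run's first LRs (1.3e-05, then 2.6e-05 in the log below) will show the warmup repeating from scratch.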
Environment
1. python 3.10.6
2. torch-directml 0.1.13.1.dev230301
Training preparation
- Data
F:\python\python310\projects\GPT2-chitchat-directml>…\python preprocess.py --train_path data/my/train.txt --save_path data/my/train.pkl
2023-06-15 10:22:46,416 - INFO - preprocessing data,data path:data/my/train.txt, save path:data/my/train.pkl
2023-06-15 10:22:46,426 - INFO - there are 24 dialogue in dataset
100%|█████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 800.79it/s]
2023-06-15 10:22:46,556 - INFO - finish preprocessing data,the result is stored in data/my/train.pkl
2023-06-15 10:22:46,556 - INFO - mean of dialogue len:42.583333333333336,median of dialogue len:39.5,max len:80
train.json
Training log
Loss at the end of the first training run
2023-06-12 15:41:35,439 - INFO - saving current best model for epoch 301
2023-06-12 15:41:35,953 - INFO - saving current best model for epoch 301
2023-06-12 15:41:37,036 - INFO - batch 2 of epoch 302, loss 1.7703745365142822, batch_acc 0.9117647058823529, lr [1.817572524174725e-05]
2023-06-12 15:41:37,462 - INFO - batch 4 of epoch 302, loss 2.0290677547454834, batch_acc 0.8860759493670886, lr [1.817139046348783e-05]
2023-06-12 15:41:37,869 - INFO - batch 6 of epoch 302, loss 0.7335497736930847, batch_acc 0.9285714285714286, lr [1.816705568522841e-05]
Continuing training from the pretrained model
At the start
2023-06-12 16:22:05,439 - INFO - use GPU privateuseone:0 to train
2023-06-12 16:22:05,443 - INFO - number of model parameters: 4975920
2023-06-12 16:22:05,444 - INFO - args:Namespace(device=device(type='privateuseone', index=0), no_cuda=False, vocab_path='vocab/vocab.txt', model_config='data/my/config.json', train_path='data/my/train.pkl', max_len=150, log_path='data/train.log', log=True, ignore_index=-100, epochs=1000, batch_size=1, gpu0_bsz=10, lr=2.6e-05, eps=1e-09, log_step=2, gradient_accumulation_steps=1, max_grad_norm=2.0, save_model_path='data/my/', pretrained_model='data/my/', seed=None, num_workers=0, patience=0, warmup_steps=4, val_num=12, cuda=True, sep_id=102, pad_id=0, cls_id=101)
2023-06-12 16:22:05,445 - INFO - loading training dataset and validating dataset
2023-06-12 16:22:05,494 - INFO - starting training
2023-06-12 16:22:06,015 - INFO - batch 2 of epoch 1, loss 7.603002071380615, batch_acc 1.0, lr [1.3e-05]
2023-06-12 16:22:06,345 - INFO - batch 4 of epoch 1, loss 7.671083450317383, batch_acc 0.96875, lr [2.6e-05]
2023-06-12 16:22:06,535 - INFO - batch 6 of epoch 1, loss 7.687896728515625, batch_acc 0.9555555555555556, lr [2.599566522174058e-05]
2023-06-12 16:22:06,737 - INFO - batch 8 of epoch 1, loss 7.656625747680664, batch_acc 0.9512195121951219, lr [2.5991330443481157e-05]
2023-06-12 16:22:06,863 - INFO - batch 10 of epoch 1, loss 7.5464067459106445, batch_acc 0.9230769230769231, lr [2.598699566522174e-05]
2023-06-12 16:22:07,021 - INFO - batch 12 of epoch 1, loss 7.670943260192871, batch_acc 0.9367088607594937, lr [2.598266088696232e-05]
2023-06-12 16:22:07,022 - INFO - epoch 1: loss 7.630055824915568, predict_acc 0.955193482688391
2023-06-12 16:22:07,553 - INFO - validate epoch 1: loss 7.547492941220601
2023-06-12 16:22:07,553 - INFO - saving current best model for epoch 1
2023-06-12 16:22:08,042 - INFO - saving current best model for epoch 1
After that, the loss just hovers around 7.xxx and never comes down.
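Since the resumed run starts around loss 7.6 — roughly what a freshly initialized model produces — it is worth verifying that the checkpoint weights are actually loaded before the first optimizer step. A quick check (toy model as a stand-in for GPT-2; the path is illustrative) is to compare the parameters after loading against the fresh initialization:

```python
import torch
import torch.nn as nn

# Stand-in for the GPT-2 model; the same check applies to the real model.
model = nn.Linear(8, 8)
before = {k: v.clone() for k, v in model.state_dict().items()}

# Simulate a previously saved checkpoint with different (trained) weights.
trained = nn.Linear(8, 8)
torch.save(trained.state_dict(), "pretrained.pt")

model.load_state_dict(torch.load("pretrained.pt"))

# If loading actually took effect, at least one parameter tensor must
# differ from the fresh random initialization.
changed = any(not torch.equal(before[k], v) for k, v in model.state_dict().items())
print(changed)  # expect True
```

If the parameters are unchanged after `load_state_dict` (or the initial validation loss does not match the ~1–2 range where the first run ended), the `pretrained_model` directory is not being picked up, and the "continued" run is in fact training from scratch.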