Questions to keep in mind:
- Is the effect of changing the batch size good or bad?
- How can the impact of changing the batch size be removed?
Background:
I have recently been working on a text classification task. Once the categories were fixed, annotators started labeling texts. As everyone knows, labeled data is expensive, so we wanted to use as little labeled data per class as possible. Today's example has four classes: {0: '面试' (interview), 1: '工作' (job), 2: '达人' (influencer), 3: '其他' (other)}, and we tried training a BERT classifier with 500 samples per class.
Previously, when doing text classification, we had at least 10k labeled samples per class. With roughly 1M samples, training was slow, so I spent a while on speeding it up: sequence length 160, batchsize=256, accelerate_step = 4 (gradient accumulation), training on four GPUs in parallel.
Now, training BERT on just 2,000 samples and checking its predictions, let's see how small a dataset for fine-tuning BERT can get. 20% of the data is held out for eval, which leaves only 1,600 training samples. With an effective batch of 256*4=1024 samples, one epoch is just two batches, i.e. two gradient updates. A model that normally finishes training in 10 epochs would get only about 20 gradient updates in total, which is clearly not enough to reach an optimum. What to do?
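The update-count arithmetic above can be sanity-checked in a few lines. This is a minimal sketch; `updates_per_run` and its parameter names are illustrative, not taken from the actual training script:

```python
import math

def updates_per_run(n_train, micro_batch, accum_steps, epochs):
    """Optimizer steps over the whole run (a partial final batch still counts)."""
    effective_batch = micro_batch * accum_steps            # e.g. 256 * 4 = 1024
    steps_per_epoch = math.ceil(n_train / effective_batch) # 1600 / 1024 -> 2
    return steps_per_epoch * epochs

# The setup from this post: 1600 training samples, batchsize=256,
# accelerate_step=4, 10 epochs -> only 20 gradient updates in total.
print(updates_per_run(1600, 256, 4, 10))   # 20

# Dropping accumulation and halving the batch (batchsize=128, accelerate_step=1)
# gives 13 steps per epoch instead of 2.
print(updates_per_run(1600, 128, 1, 10))   # 130
```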
0. Training results with batchsize=256.
Iter: 0, Train Loss: 1.4, Train Acc: 26.17%, Val Loss: 1.4, Val Acc: 32.25%, Time: 0:00:15 *, train rec: 26.17%, train pre: 26.17%, Val rec: 32.25%, Val pre: 32.25% train f1: 26.17%, Val f1: 32.25%,
Epoch [5/50]
Iter: 30, Train Loss: 0.63, Train Acc: 78.52%, Val Loss: 0.64, Val Acc: 77.96%, Time: 0:00:48 *, train rec: 78.52%, train pre: 78.52%, Val rec: 77.96%, Val pre: 77.96% train f1: 78.52%, Val f1: 77.96%,
Epoch [9/50]
Iter: 60, Train Loss: 0.21, Train Acc: 93.75%, Val Loss: 0.58, Val Acc: 81.21%, Time: 0:01:22 *, train rec: 93.75%, train pre: 93.75%, Val rec: 81.21%, Val pre: 81.21% train f1: 93.75%, Val f1: 81.21%,
Epoch [13/50]
Iter: 90, Train Loss: 0.071, Train Acc: 98.93%, Val Loss: 0.71, Val Acc: 80.05%, Time: 0:01:55 , train rec: 98.93%, train pre: 98.93%, Val rec: 80.05%, Val pre: 80.05% train f1: 98.93%, Val f1: 80.05%,
Epoch [18/50]
Iter: 120, Train Loss: 0.021, Train Acc: 99.22%, Val Loss: 0.95, Val Acc: 79.58%, Time: 0:02:27 , train rec: 99.22%, train pre: 99.22%, Val rec: 79.58%, Val pre: 79.58% train f1: 99.22%, Val f1: 79.58%,
Epoch [22/50]
Iter: 150, Train Loss: 0.02, Train Acc: 99.61%, Val Loss: 0.99, Val Acc: 79.35%, Time: 0:02:59 , train rec: 99.61%, train pre: 99.61%, Val rec: 79.35%, Val pre: 79.35% train f1: 99.61%, Val f1: 79.35%,
Epoch [26/50]
Iter: 180, Train Loss: 0.0056, Train Acc: 100.00%, Val Loss: 0.99, Val Acc: 80.28%, Time: 0:03:31 , train rec: 100.00%, train pre: 100.00%, Val rec: 80.28%, Val pre: 80.28% train f1: 100.00%, Val f1: 80.28%,
Epoch [31/50]
Iter: 210, Train Loss: 0.049, Train Acc: 98.05%, Val Loss: 0.98, Val Acc: 80.05%, Time: 0:04:04 , train rec: 98.05%, train pre: 98.05%, Val rec: 80.05%, Val pre: 80.05% train f1: 98.05%, Val f1: 80.05%,
Epoch [35/50]
Iter: 240, Train Loss: 0.0056, Train Acc: 100.00%, Val Loss: 1.1, Val Acc: 80.97%, Time: 0:04:36 , train rec: 100.00%, train pre: 100.00%, Val rec: 80.97%, Val pre: 80.97% train f1: 100.00%, Val f1: 80.97%,
Epoch [39/50]
Iter: 270, Train Loss: 0.0029, Train Acc: 100.00%, Val Loss: 1.1, Val Acc: 80.74%, Time: 0:05:08 , train rec: 100.00%, train pre: 100.00%, Val rec: 80.74%, Val pre: 80.74% train f1: 100.00%, Val f1: 80.74%,
Epoch [43/50]
Iter: 300, Train Loss: 0.0024, Train Acc: 100.00%, Val Loss: 1.1, Val Acc: 81.21%, Time: 0:05:40 , train rec: 100.00%, train pre: 100.00%, Val rec: 81.21%, Val pre: 81.21% train f1: 100.00%, Val f1: 81.21%,
Epoch [48/50]
Iter: 330, Train Loss: 0.0069, Train Acc: 99.61%, Val Loss: 1.2, Val Acc: 82.13%, Time: 0:06:13 , train rec: 99.61%, train pre: 99.61%, Val rec: 82.13%, Val pre: 82.13% train f1: 99.61%, Val f1: 82.13%,
Using this model to predict a batch of new texts gives the following output (each row: top-3 predicted labels with their probabilities), followed by a precision evaluation:
1 0.721157074 3 0.134897843 0 0.098881446
1 0.921663225 0 0.047866516 2 0.022388503
0 0.656914771 1 0.317423075 2 0.014724847
1 0.541134119 0 0.387974262 2 0.052128702
3 0.858255506 1 0.090887442 0 0.026639875
1 0.874167085 2 0.078069404 0 0.036619335
2 0.821369648 0 0.087310754 1 0.079726078
1 0.969043016 2 0.016877973 0 0.007908464
3 0.792807579 2 0.183794722 0 0.014929529
1 0.887573242 0 0.074632719 2 0.028010886
2 0.797808886 1 0.117950708 3 0.064865828
1 0.969324052 0 0.016529299 2 0.009995425
1 0.948564649 2 0.028399421 0 0.020106293
2 0.521256983 1 0.366217256 0 0.093916409
2 0.910155952 3 0.054033779 0 0.028991029
2 0.934165776 3 0.05176514 1 0.008472801
1 0.542768061 0 0.412951201 2 0.024075799
3 0.833275795 2 0.07818792 1 0.076510578
1 0.784907579 2 0.168464512 0 0.036980517
1 0.706942856 2 0.220705017 0 0.062249899
Prediction results:
precision recall f1-score support
谈面试 0.52 0.96 0.68 74
找工作 0.64 0.89 0.75 220
打工人 0.56 0.71 0.63 174
其他 0.88 0.28 0.42 282
accuracy 0.63 750
macro avg 0.65 0.71 0.62 750
weighted avg 0.70 0.63 0.59 750
Confusion matrix 1 (one-vs-rest 2×2 per class, [[TN, FP], [FN, TP]]):
[[[611 65]
[ 3 71]]
[[422 108]
[ 24 196]]
[[480 96]
[ 50 124]]
[[457 11]
[203 79]]]
Confusion matrix 2 (4×4, rows = true class, columns = predicted class):
[[ 71 1 1 1]
[ 11 196 9 4]
[ 10 34 124 6]
[ 44 73 86 79]]
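For reference, the two matrix formats printed above are two views of the same predictions: a 4×4 matrix over all classes, and a 2×2 one-vs-rest matrix per class (the layout sklearn's `multilabel_confusion_matrix` uses). A dependency-free sketch of how both are derived, using toy labels rather than the real eval data:

```python
# Two views of the same predictions: a 4x4 confusion matrix (rows = true
# class, columns = predicted class) and per-class one-vs-rest 2x2 matrices.
# The toy labels below are illustrative only, not the eval data of this post.

def confusion_matrix(y_true, y_pred, n_classes):
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def one_vs_rest(cm, k):
    """2x2 matrix [[TN, FP], [FN, TP]] for class k, derived from the 4x4."""
    n = len(cm)
    tp = cm[k][k]
    fn = sum(cm[k][j] for j in range(n)) - tp   # true k, predicted elsewhere
    fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true elsewhere
    tn = sum(sum(row) for row in cm) - tp - fn - fp
    return [[tn, fp], [fn, tp]]

y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 0]
cm = confusion_matrix(y_true, y_pred, 4)
print(cm)                  # 4x4, like "confusion matrix 2"
print(one_vs_rest(cm, 0))  # [[5, 1], [1, 1]], like "confusion matrix 1"
```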
1. Reducing the batchsize.
batchsize=128, accelerate_step = 1, i.e. no gradient-accumulation speedup, training for 50 epochs. This gives more gradient updates, so the model parameters at least get properly shaped by this data. But a new problem appears: overfitting. BERT has 12 layers with hidden size 768; with this few samples it simply memorizes them all. Look at the printed accuracy and recall:
Results at 50 epochs: clear overfitting — training accuracy approaches 100% while validation accuracy is only about 80%. The pretrained model used is 'bert-base-chinese'.
Iter: 0, Train Loss: 1.4, Train Acc: 30.47%, Val Loss: 1.4, Val Acc: 29.47%, Time: 0:00:14 *, train rec: 30.47%, train pre: 30.47%, Val rec: 29.47%, Val pre: 29.47% train f1: 30.47%, Val f1: 29.47%,
Iter: 500, Train Loss: 0.0086, Train Acc: 99.22%, Val Loss: 1.2, Val Acc: 80.74%, Time: 0:05:28 *, train rec: 99.22%, train pre: 99.22%, Val rec: 80.74%, Val pre: 80.74% train f1: 99.22%, Val f1: 80.74%,
After only 10 epochs, the model is already overfitting:
Iter: 0, Train Loss: 1.4, Train Acc: 30.47%, Val Loss: 1.4, Val Acc: 29.47%, Time: 0:00:14 *, train rec: 30.47%, train pre: 30.47%, Val rec: 29.47%, Val pre: 29.47% train f1: 30.47%, Val f1: 29.47%,
Iter: 50, Train Loss: 0.46, Train Acc: 85.16%, Val Loss: 0.57, Val Acc: 79.35%, Time: 0:00:48 *, train rec: 85.16%, train pre: 85.16%, Val rec: 79.35%, Val pre: 79.35% train f1: 85.16%, Val f1: 79.35%,
Iter: 100, Train Loss: 0.072, Train Acc: 98.44%, Val Loss: 0.6, Val Acc: 80.97%, Time: 0:01:19 , train rec: 98.44%, train pre: 98.44%, Val rec: 80.97%, Val pre: 80.97% train f1: 98.44%, Val f1: 80.97%,
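A quick aside on what dropping accelerate_step changes: gradient accumulation with accelerate_step=4 is, up to floating-point rounding, equivalent to a single update on the concatenated 1024-sample batch — accumulate each micro-batch's mean gradient scaled by 1/accum_steps, then step once. A framework-free sketch of that equivalence, using scalar stand-ins for real per-parameter gradients (all names here are illustrative, not from the training script):

```python
# Gradient accumulation vs. one big batch: summing micro-batch mean
# gradients scaled by 1/accum_steps equals the mean gradient over the
# full effective batch.

def mean_grad(samples):
    # pretend each sample's "gradient" equals the sample value
    return sum(samples) / len(samples)

data = list(range(1024))          # one effective batch of 1024 "samples"
accum_steps = 4
micro = 1024 // accum_steps       # micro-batch of 256, as with batchsize=256

# accumulate scaled micro-batch gradients, step once at the end
accumulated = 0.0
for s in range(accum_steps):
    chunk = data[s * micro:(s + 1) * micro]
    accumulated += mean_grad(chunk) / accum_steps

big_batch = mean_grad(data)       # single update on the whole 1024
print(accumulated, big_batch)     # 511.5 511.5
```

So accumulation changes the speed/memory trade-off, not the update itself — which is exactly why turning it off (and shrinking the batch) is what increases the number of optimizer steps here.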
Using this model to predict new texts yields:
1 0.993288338 3 0.00518999 2 0.001227561
1 0.99964416 0 0.000162596 2 0.000121725
0 0.662603736 1 0.336177588 2 0.000771599
1 0.701853752 2 0.295151234 0 0.002012548
3 0.91148746 1 0.087131523 0 0.000734749
1 0.969081044 2 0.030312186 0 0.000391913
2 0.999643564 0 0.000173132 1 0.000109302
1 0.999555051 2 0.000276813 3 9.5203E-05
2 0.999746382 1 9.90427E-05 3 9.14592E-05
1 0.99956125 0 0.000187406 2 0.000134702
2 0.999686599 1 0.000192517 3 7.34718E-05
1 0.999665618 2 0.000132147 0 0.000131814
1 0.999628305 2 0.000194056 0 9.89335E-05
2 0.993654847 1 0.005988611 0 0.000232039
2 0.999747694 1 0.000105971 0 8.71415E-05
2 0.999771178 1 0.000107796 3 6.05386E-05
1 0.986123919 0 0.01332451 2 0.000376619
3 0.986136317 1 0.012807772 2 0.000610389
1 0.993743122 2 0.005912784 3 0.000224714
2 0.743568361 1 0.254828185 3 0.001034564
Prediction results:
precision recall f1-score support
谈面试 0.63 0.96 0.76 74
找工作 0.63 0.92 0.75 220
打工人 0.51 0.80 0.62 174
其他 0.98 0.14 0.25 282
accuracy 0.60 750
macro avg 0.69 0.71 0.59 750
weighted avg 0.73 0.60 0.53 750
Confusion matrix 1 (one-vs-rest 2×2 per class, [[TN, FP], [FN, TP]]):
[[[635 41]
[ 3 71]]
[[409 121]
[ 17 203]]
[[442 134]
[ 35 139]]
[[467 1]
[242 40]]]
Confusion matrix 2 (4×4, rows = true class, columns = predicted class):
[[ 71 1 2 0]
[ 7 203 10 0]
[ 2 32 139 1]
[ 32 88 122 40]]
2. A two-layer BERT model.
Use a distilled, shrunken BERT: "uer/chinese_roberta_L-2_H-128". This model has 2 layers and hidden size 128, so the pretrained model is much smaller. Trained for 50 epochs, with no obvious signs of overfitting. This is the printed training info:
Iter: 0, Train Loss: 1.4, Train Acc: 33.59%, Val Loss: 1.4, Val Acc: 27.84%, Time: 0:00:12 *, train rec: 33.59%, train pre: 33.59%, Val rec: 27.84%, Val pre: 27.84% train f1: 33.59%, Val f1: 27.84%,
Iter: 50, Train Loss: 1.4, Train Acc: 42.97%, Val Loss: 1.4, Val Acc: 43.62%, Time: 0:00:15 *, train rec: 42.97%, train pre: 42.97%, Val rec: 43.62%, Val pre: 43.62% train f1: 42.97%, Val f1: 43.62%,
Iter: 100, Train Loss: 1.3, Train Acc: 53.91%, Val Loss: 1.3, Val Acc: 58.47%, Time: 0:00:18 *, train rec: 53.91%, train pre: 53.91%, Val rec: 58.47%, Val pre: 58.47% train f1: 53.91%, Val f1: 58.47%,
Iter: 150, Train Loss: 1.2, Train Acc: 70.31%, Val Loss: 1.2, Val Acc: 66.59%, Time: 0:00:21 *, train rec: 70.31%, train pre: 70.31%, Val rec: 66.59%, Val pre: 66.59% train f1: 70.31%, Val f1: 66.59%,
Iter: 200, Train Loss: 1.1, Train Acc: 72.66%, Val Loss: 1.1, Val Acc: 69.14%, Time: 0:00:24 *, train rec: 72.66%, train pre: 72.66%, Val rec: 69.14%, Val pre: 69.14% train f1: 72.66%, Val f1: 69.14%,
Iter: 250, Train Loss: 0.99, Train Acc: 78.12%, Val Loss: 1.0, Val Acc: 70.30%, Time: 0:00:27 *, train rec: 78.12%, train pre: 78.12%, Val rec: 70.30%, Val pre: 70.30% train f1: 78.12%, Val f1: 70.30%,
Iter: 300, Train Loss: 0.92, Train Acc: 78.12%, Val Loss: 0.93, Val Acc: 70.53%, Time: 0:00:31 *, train rec: 78.12%, train pre: 78.12%, Val rec: 70.53%, Val pre: 70.53% train f1: 78.12%, Val f1: 70.53%,
Iter: 350, Train Loss: 0.89, Train Acc: 75.00%, Val Loss: 0.88, Val Acc: 71.93%, Time: 0:00:34 *, train rec: 75.00%, train pre: 75.00%, Val rec: 71.93%, Val pre: 71.93% train f1: 75.00%, Val f1: 71.93%,
Iter: 400, Train Loss: 0.79, Train Acc: 77.34%, Val Loss: 0.83, Val Acc: 72.62%, Time: 0:00:37 *, train rec: 77.34%, train pre: 77.34%, Val rec: 72.62%, Val pre: 72.62% train f1: 77.34%, Val f1: 72.62%,
Iter: 450, Train Loss: 0.78, Train Acc: 75.78%, Val Loss: 0.8, Val Acc: 72.39%, Time: 0:00:40 *, train rec: 75.78%, train pre: 75.78%, Val rec: 72.39%, Val pre: 72.39% train f1: 75.78%, Val f1: 72.39%,
Iter: 500, Train Loss: 0.72, Train Acc: 82.03%, Val Loss: 0.76, Val Acc: 73.78%, Time: 0:00:43 *, train rec: 82.03%, train pre: 82.03%, Val rec: 73.78%, Val pre: 73.78% train f1: 82.03%, Val f1: 73.78%,
Iter: 550, Train Loss: 0.78, Train Acc: 71.88%, Val Loss: 0.73, Val Acc: 74.25%, Time: 0:00:47 *, train rec: 71.88%, train pre: 71.88%, Val rec: 74.25%, Val pre: 74.25% train f1: 71.88%, Val f1: 74.25%,
Iter: 600, Train Loss: 0.58, Train Acc: 85.94%, Val Loss: 0.71, Val Acc: 74.94%, Time: 0:00:50 *, train rec: 85.94%, train pre: 85.94%, Val rec: 74.94%, Val pre: 74.94% train f1: 85.94%, Val f1: 74.94%,
Iter: 650, Train Loss: 0.68, Train Acc: 79.69%, Val Loss: 0.69, Val Acc: 76.10%, Time: 0:00:53 *, train rec: 79.69%, train pre: 79.69%, Val rec: 76.10%, Val pre: 76.10% train f1: 79.69%, Val f1: 76.10%,
Using this model to predict new texts yields:
1 0.492245287 3 0.357987851 0 0.093523517
1 0.679030061 3 0.15818736 0 0.109049782
0 0.793055415 1 0.093809418 2 0.069567055
1 0.608645439 2 0.186831772 0 0.142540157
1 0.653128266 3 0.200326622 0 0.079325967
1 0.424431771 3 0.366052866 2 0.125655115
2 0.681468666 1 0.165153056 3 0.084384508
1 0.633563757 2 0.184436008 0 0.125692308
3 0.743090272 1 0.107794032 2 0.091790773
1 0.714027822 0 0.115667157 2 0.103216507
1 0.488106728 2 0.250539482 3 0.209825307
1 0.68540746 2 0.12569724 0 0.114000656
1 0.702560306 0 0.125516459 2 0.110100329
1 0.514868975 2 0.291460186 0 0.138665304
2 0.581373572 1 0.228464305 0 0.098014489
2 0.748660624 3 0.101370245 0 0.081504814
1 0.67431432 0 0.140163451 3 0.109401464
3 0.720237732 1 0.135314092 2 0.098193653
1 0.561341882 2 0.223991126 3 0.148258299
1 0.610599279 3 0.163577497 2 0.158652917
Classification results of the two-layer model:
precision recall f1-score support
谈面试 0.61 0.89 0.73 74
找工作 0.56 0.85 0.68 220
打工人 0.51 0.74 0.60 174
其他 0.71 0.14 0.23 282
accuracy 0.56 750
macro avg 0.60 0.66 0.56 750
weighted avg 0.61 0.56 0.50 750
Confusion matrix 1 (one-vs-rest 2×2 per class, [[TN, FP], [FN, TP]]):
[[[634 42]
[ 8 66]]
[[384 146]
[ 33 187]]
[[451 125]
[ 45 129]]
[[452 16]
[243 39]]]
Confusion matrix 2 (4×4, rows = true class, columns = predicted class):
[[ 66 5 2 1]
[ 9 187 16 8]
[ 7 31 129 7]
[ 26 110 107 39]]
From the predictions above, two takeaways about model quality:
1. The large-batchsize run predicts worse than the small-batchsize run.
2. The lightweight pretrained model (2 layers, hidden size 128) also predicts slightly worse than the heavyweight one (12 layers, hidden size 768).