The paper investigates different fine-tuning methods of BERT for the text classification task and provides a general solution for BERT fine-tuning. The main experimental findings are:
1) The top layer of BERT is more useful for text classification;
2) With an appropriate layer-wise decreasing learning rate, BERT can overcome the catastrophic forgetting problem (see the sketch after this list);
3) Within-task and in-domain further pre-training can significantly boost its performance;
4) A preceding multi-task fine-tuning step also helps single-task fine-tuning, but its benefit is smaller than that of further pre-training;
5) BERT can improve performance on tasks with only a small amount of training data.
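As a rough illustration of finding 2, the sketch below builds per-layer optimizer parameter groups whose learning rates decay from the top encoder layer downward. It assumes a HuggingFace `transformers` BERT model; the values of `base_lr` and `decay_factor` are illustrative choices, not prescriptions from the paper.

```python
import torch
from transformers import BertForSequenceClassification

# Illustrative setup: 12-layer bert-base, binary classification head.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

base_lr = 2e-5       # learning rate for the top encoder layer and the head
decay_factor = 0.95  # each lower layer gets its lr multiplied by this factor

param_groups = []

# Classifier head and pooler sit above the encoder: use the base learning rate.
param_groups.append({
    "params": list(model.classifier.parameters())
              + list(model.bert.pooler.parameters()),
    "lr": base_lr,
})

# Encoder layers: layer 11 (top) gets base_lr, layer 10 gets base_lr * 0.95, ...
num_layers = model.config.num_hidden_layers
for layer_idx in range(num_layers - 1, -1, -1):
    lr = base_lr * (decay_factor ** (num_layers - 1 - layer_idx))
    param_groups.append({
        "params": model.bert.encoder.layer[layer_idx].parameters(),
        "lr": lr,
    })

# Embeddings sit below all encoder layers and receive the smallest rate.
param_groups.append({
    "params": model.bert.embeddings.parameters(),
    "lr": base_lr * (decay_factor ** num_layers),
})

optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
```

The intuition behind the decay: lower layers encode more general features, so giving them smaller learning rates keeps them close to their pre-trained values and reduces catastrophic forgetting.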
The paper draws heavily on the ideas of ULMFiT and designs a series of fine-tune and pre-train strategies, which can be grouped by the scope of the corpus used (a pipeline sketch for the pre-train + fine-tune setting follows this list):
(1) direct task-specific fine-tuning;
(2) pre-training on an In-Domain corpus + fine-tuning;
(3) pre-training on an In-Domain corpus + multi-task fine-tuning;
(4) pre-training on an In-Out-Domain corpus + fine-tuning;
(5) pre-training on an In-Out-Domain corpus + multi-task fine-tuning.
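A minimal sketch of the pre-train + fine-tune setting (strategies (2) and (4)): further pre-train BERT with the masked-LM objective on an unlabeled domain corpus, then fine-tune the resulting checkpoint on the labeled classification task. The path `unlabeled_corpus.txt` and all hyperparameters are placeholders, and the snippet assumes the HuggingFace `transformers` and `datasets` libraries rather than the paper's original code.

```python
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertForSequenceClassification,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Stage 1: further pre-training (masked LM) on the within-task / in-domain corpus.
# "unlabeled_corpus.txt" is a placeholder for one sentence or document per line.
corpus = load_dataset("text", data_files={"train": "unlabeled_corpus.txt"})
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="bert-further-pretrained", num_train_epochs=1),
    train_dataset=corpus["train"],
    data_collator=collator,
).train()
mlm_model.save_pretrained("bert-further-pretrained")
tokenizer.save_pretrained("bert-further-pretrained")

# Stage 2: fine-tune the further pre-trained checkpoint on the labeled task
# (the classification head is newly initialized on top of the adapted encoder).
clf_model = BertForSequenceClassification.from_pretrained(
    "bert-further-pretrained", num_labels=2
)
# ... continue with the usual classification fine-tuning loop on labeled data.
```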