When Do You Need Billions of Words of Pretraining Data? [Zhang 2020]
Excellent paper. Only after finishing this summary did I realize that the discussion section is full of easter eggs, each with real depth!
Core research question: What exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data?
We adopt four probing methods: 1) classifier probing, 2) information-theoretic probing, 3) unsupervised relative acceptability judgment, and 4) fine-tuning on NLU tasks. We draw learning curves that track the growth of these measures of linguistic ability with respect to pretraining data volume.
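To make the learning-curve idea concrete, here is a minimal sketch of how one might locate a curve's "point of fastest growth" over log-spaced pretraining volumes. The accuracy values are hypothetical placeholders, not the paper's numbers; only the computation itself is the point.

```python
import numpy as np

# Log-spaced pretraining-data volumes (in words), spanning the
# 1M-to-30B range discussed in the paper.
volumes = np.array([1e6, 1e7, 1e8, 1e9, 3e10])

# Hypothetical probe accuracies at each volume (illustrative only):
# a typical saturating learning curve.
accuracy = np.array([0.62, 0.74, 0.81, 0.83, 0.84])

# Growth rate with respect to log10(data volume); the curve's
# "point of fastest growth" is where this discrete derivative peaks.
log_vol = np.log10(volumes)
growth = np.diff(accuracy) / np.diff(log_vol)
i = np.argmax(growth)
fastest = volumes[i]  # left edge of the steepest segment

print(f"fastest growth between {fastest:.0e} and {volumes[i + 1]:.0e} words")
```

With curves like this drawn per skill (syntax, semantics, commonsense), the comparison in the paper reduces to comparing where each curve's growth peaks and where it saturates.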
The syntactic learning curves rise slightly earlier than the semantic ones. The commonsense curve (for Winograd coreference only) clearly rises far later. Considering that semantics permeates language and meaningless structures do not exist, the most one can say here is that "the skills targeted by syntactic probing can be mastered with the least data, compared to the other skills of interest".
Classifier Probing
- Most of the feature learning occurs with <100M words of pretraining data, and most learning curves reach their point of fastest growth around 1M words. The most notable exception to this pattern is the Winograd task, which only rises significantly between 1B and 30B pretraining words; we attribute this to the Winograd task being designed to test commonsense knowledge.
- The linguistic knowledge of RoBERTa pretrained on 100M words is already very close to that of RoBERTa pretrained on 30B words.
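A minimal sketch of what classifier probing means operationally: freeze the encoder's representations and train only a small (here linear) classifier on top, then read probe accuracy as a measure of how much of the probed property the features already encode. Synthetic vectors stand in for frozen RoBERTa activations, and the probe is a hand-rolled logistic regression in NumPy; none of this is the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen LM representations: two classes whose
# means differ, mimicking features that linearly encode some property.
n, d = 400, 16
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 0.8  # shift class-1 features so the property is decodable

# Linear probe: logistic regression trained by gradient descent.
# The "encoder" (X) is never updated; only w and b are learned.
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
    grad_w = X.T @ (p - y) / n              # cross-entropy gradient in w
    grad_b = np.mean(p - y)                 # cross-entropy gradient in b
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

probe_acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"probe accuracy: {probe_acc:.2f}")
```

Repeating this with representations from models pretrained on 1M, 10M, ..., 30B words (and plotting probe accuracy against volume) is what produces the learning curves summarized above.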