[Paper Summary] When Do You Need Billions of Words of Pretraining Data? [Zhang 2020]


Really good paper. Only after finishing this summary did I realize that the discussion sections are all easter eggs, each with real depth!

Core research question: What exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data?


We adopt four probing methods: 1) classifier probing, 2) info-theoretic probing, 3) unsupervised relative acceptability judgment, and 4) fine-tuning on NLU tasks. We draw learning curves that track the growth of these measures of linguistic ability with respect to pretraining data volume.
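For concreteness, here is a minimal sketch of how one point on such a learning curve could be computed with classifier probing: freeze an encoder pretrained on a given data volume, fit a small classifier on its frozen representations, and record the probe's test accuracy. The checkpoint identifiers, the `encode_sentences` helper, and the mean-pooled sentence features are hypothetical placeholders, not the paper's exact setup.

```python
# Sketch: one learning-curve point per pretraining-data volume via a linear probe
# on frozen encoder features. Checkpoint names below are illustrative only.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def encode_sentences(model_name, sentences, device="cpu"):
    """Mean-pool the last hidden layer of a frozen pretrained encoder."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    feats = []
    with torch.no_grad():
        for s in sentences:
            ids = tok(s, return_tensors="pt", truncation=True).to(device)
            h = model(**ids).last_hidden_state        # (1, seq_len, hidden)
            feats.append(h.mean(dim=1).squeeze(0).cpu().numpy())
    return np.stack(feats)

def probe_accuracy(model_name, train_x, train_y, test_x, test_y):
    """Fit a linear probe on frozen features; return its test accuracy."""
    X_tr = encode_sentences(model_name, train_x)
    X_te = encode_sentences(model_name, test_x)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, train_y)
    return accuracy_score(test_y, clf.predict(X_te))

# One frozen checkpoint per pretraining-data volume (names are hypothetical):
checkpoints = {"1M": "org/roberta-1M", "10M": "org/roberta-10M", "100M": "org/roberta-100M"}
# curve = {vol: probe_accuracy(name, tr_x, tr_y, te_x, te_y)
#          for vol, name in checkpoints.items()}
```

Plotting the resulting accuracies against the log of the pretraining word count gives the kind of learning curve discussed below.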


The syntactic learning curves rise slightly earlier than the semantic ones. The commonsense curve (for Winograd coreference only) clearly rises far later. Given that semantics permeates language and meaningless structures do not exist, the most we can say here is that the skills targeted by syntactic probing can be mastered with the least data among the skills of interest.


Classifier Probing
  • Most of the feature learning occurs with <100M words of pretraining data. Most learning curves reach the point of fastest growth around 1M words. The most notable exception to this pattern is the Winograd task, which only rises significantly between 1B and 30B pretraining words; we think this is because the Winograd task is designed to test commonsense knowledge.
  • Linguistic knowledge of RoBERTa pretrained on 100M words is already very close to that pretrained on 30B words.

Info-theoretic Probing (MDL online code version)
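The online-code MDL measure can be sketched roughly as follows: the probing data is revealed in growing blocks, the probe is retrained on everything seen so far, and the codelength is the total number of bits needed to encode each new block's labels under the previously trained probe (the first block is paid for with a uniform code over the K classes). Compression is the uniform codelength divided by this online codelength. The logistic-regression probe, block fractions, and feature matrix below are illustrative stand-ins, not the paper's exact configuration.

```python
# Rough sketch of the online-code MDL estimate on frozen features.
# Assumes X is an (n, d) array of frozen representations, y holds integer
# labels, and every class appears in the first block; the probe and the
# block fractions are illustrative choices, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, num_classes,
                      fractions=(0.001, 0.002, 0.004, 0.008, 0.016,
                                 0.032, 0.0625, 0.125, 0.25, 0.5, 1.0)):
    """Return (online codelength in bits, compression vs. a uniform code)."""
    n = len(y)
    cuts = sorted({max(1, int(f * n)) for f in fractions} | {n})
    # First block: transmitted with the uniform code, log2(K) bits per label.
    total_bits = cuts[0] * np.log2(num_classes)
    for t_prev, t_next in zip(cuts[:-1], cuts[1:]):
        # Train the probe on all labels revealed so far ...
        clf = LogisticRegression(max_iter=1000).fit(X[:t_prev], y[:t_prev])
        # ... then pay the cross-entropy (in bits) of the next block under it.
        probs = clf.predict_proba(X[t_prev:t_next])
        col = {c: j for j, c in enumerate(clf.classes_)}  # label -> proba column
        idx = np.array([col[label] for label in y[t_prev:t_next]])
        total_bits += -np.log2(probs[np.arange(len(idx)), idx] + 1e-12).sum()
    uniform_bits = n * np.log2(num_classes)
    return total_bits, uniform_bits / total_bits
```

A higher compression ratio means the frozen representations make the labels easier to encode, i.e. the probed feature is more readily extractable from that checkpoint.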