When Do You Need Billions of Words of Pretraining Data? [Zhang 2020]
Excellent paper. Only after finishing this summary did I realize that the discussion section is full of easter eggs, each with real depth!
Core research question: What exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data?
We adopt four probing methods: 1) classifier probing, 2) information-theoretic probing, 3) unsupervised relative acceptability judgment, and 4) fine-tuning on NLU tasks. We draw learning curves that track the growth of these measures of linguistic ability with respect to pretraining data volume.
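To make the learning-curve idea concrete, here is a minimal sketch of how one might locate a curve's "point of fastest growth" over log-spaced pretraining volumes. The accuracy values are hypothetical placeholders, not the paper's numbers; only the computation itself is the point.

```python
import numpy as np

# Log-spaced pretraining-data volumes (in words), spanning the
# 1M-to-30B range discussed in the paper.
volumes = np.array([1e6, 1e7, 1e8, 1e9, 3e10])

# Hypothetical probe accuracies at each volume (illustrative only):
# a typical saturating learning curve.
accuracy = np.array([0.62, 0.74, 0.81, 0.83, 0.84])

# Growth rate with respect to log10(data volume); the curve's
# "point of fastest growth" is where this discrete derivative peaks.
log_vol = np.log10(volumes)
growth = np.diff(accuracy) / np.diff(log_vol)
i = np.argmax(growth)
fastest = volumes[i]  # left edge of the steepest segment

print(f"fastest growth between {fastest:.0e} and {volumes[i + 1]:.0e} words")
```

With curves like this drawn per skill (syntax, semantics, commonsense), the comparison in the paper reduces to comparing where each curve's growth peaks and where it saturates.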
The syntactic learning curves rise slightly earlier than the semantic ones. The commonsense curve (for Winograd coreference only) clearly rises far later. Considering that semantics permeates language and meaningless structures do not exist, the most one can say here is that "the skills targeted by syntactic probing can be mastered with the least data, compared to the other skills of interest".
Classifier Probing
- Most of the feature learning occurs with <100M words of pretraining data, and most learning curves reach their point of fastest growth around 1M words. The most notable exception to this pattern is the Winograd task, which only rises significantly between 1B and 30B pretraining words; we attribute this to the Winograd task being designed to test commonsense knowledge.
- The linguistic knowledge of RoBERTa pretrained on 100M words is already very close to that of RoBERTa pretrained on 30B words.
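A minimal sketch of what classifier probing means operationally: freeze the encoder's representations and train only a small (here linear) classifier on top, then read probe accuracy as a measure of how much of the probed property the features already encode. Synthetic vectors stand in for frozen RoBERTa activations, and the probe is a hand-rolled logistic regression in NumPy; none of this is the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen LM representations: two classes whose
# means differ, mimicking features that linearly encode some property.
n, d = 400, 16
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 0.8  # shift class-1 features so the property is decodable

# Linear probe: logistic regression trained by gradient descent.
# The "encoder" (X) is never updated; only w and b are learned.
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
    grad_w = X.T @ (p - y) / n              # cross-entropy gradient in w
    grad_b = np.mean(p - y)                 # cross-entropy gradient in b
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

probe_acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"probe accuracy: {probe_acc:.2f}")
```

Repeating this with representations from models pretrained on 1M, 10M, ..., 30B words (and plotting probe accuracy against volume) is what produces the learning curves summarized above.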