[Paper Summary] A Primer in BERTology: What We Know About How BERT Works [Rogers 2020]

Probing studies strive to learn about the types of linguistic knowledge (e.g. POS, dependencies; [Warstadt 2019] - Five Analysis Methods with NPIs; [Warstadt 2020] - RoBERTa acquires a preference for linguistic generalizations over surface generalizations) and world knowledge (e.g. [Talmor 2019 oLMpics]) learned by BERT, as well as where and how this knowledge may be stored in the model (e.g. in particular layers).
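As a rough illustration of what a probing classifier looks like (not the exact setup of any cited study), the sketch below freezes BERT, extracts token representations, and fits a linear classifier to predict POS tags. It assumes the HuggingFace transformers library, PyTorch and scikit-learn; the tiny POS-tagged data is invented purely for illustration.

```python
# Minimal probing sketch: a linear classifier on frozen BERT token features.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # BERT stays frozen; only the probe is trained

# Toy word-level POS data, invented purely for illustration.
sentences = [["the", "dog", "barks"], ["a", "cat", "sleeps"]]
pos_tags = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]]

features, labels = [], []
with torch.no_grad():
    for words, tags in zip(sentences, pos_tags):
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
        word_ids = enc.word_ids()                   # maps subword positions to words
        for i, wid in enumerate(word_ids):
            # use the first subword of each word as its representation
            if wid is not None and (i == 0 or word_ids[i - 1] != wid):
                features.append(hidden[i].numpy())
                labels.append(tags[wid])

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe accuracy (on the toy training data):", probe.score(features, labels))
```

The probe's accuracy is then read as evidence of how much of the target property is linearly recoverable from the frozen representations; the caveat quoted further below (a recoverable pattern is not necessarily a used pattern) applies.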


What does BERT know?

Syntax

  1. BERT representations are hierarchical rather than linear, i.e. there is something akin to a syntactic tree structure.
  2. BERT takes subject-predicate agreement into account when performing the cloze task (see the fill-mask sketch below).
  3. BERT is able to detect the presence of NPIs (negative polarity items).

Either BERT’s syntactic knowledge is incomplete, or it does not need to rely on it for solving tasks. The latter seems more likely.
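A minimal sketch of the agreement cloze probe mentioned in item 2 above, assuming the HuggingFace fill-mask pipeline; the agreement-with-attractor sentence is a generic example, not one taken from the paper.

```python
# Agreement cloze sketch: does BERT prefer the verb form that agrees with
# the head noun ("keys") rather than with the closer attractor ("cabinet")?
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("The keys to the cabinet [MASK] on the table.",
                      targets=["are", "is"]):
    print(candidate["token_str"], round(candidate["score"], 4))
# If BERT tracks subject-verb agreement, "are" should get the higher score.
```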

Semantics

  1. BERT encodes information about entity types, relations, semantic roles and proto-roles (this reflects what can be learned from the cloze task).
  2. BERT struggles with representations of numbers.
    This is partially due to wordpiece tokenization, since numbers of similar value can be split into substantially different wordpiece chunks (see the tokenizer sketch after this list).
  3. BERT does not actually form a generic idea of named entities, although its F1 scores on NER probing tasks are high.
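To see the tokenization issue from item 2 above concretely, here is a small sketch assuming the HuggingFace transformers tokenizer; the particular splits depend on the vocabulary, so the expected output is hedged.

```python
# Wordpiece tokenization of numbers: numerically close values can be split
# into quite different subword chunks, so the surface form does not reflect value.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for number in ["580", "581", "5800", "1000000"]:
    print(number, "->", tokenizer.tokenize(number))
# e.g. two adjacent values may come out as entirely different piece sequences.
```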

World Knowledge

  1. BERT struggles with pragmatic inference and role-based event knowledge.
  2. BERT does encode some world knowledge but to a large extent does not know how to use it (i.e. how to reason over it). E.g. it knows that people can walk into houses and that houses are big, but it cannot infer that houses are bigger than people. To retrieve BERT’s knowledge we need well-crafted template sentences (a template-sensitivity sketch follows this list) - in other words, it only catches the ball when you lob it straight at it; let it play freely and everything falls apart.
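A LAMA-style sketch of the template sensitivity mentioned above, assuming the HuggingFace fill-mask pipeline; the templates are invented examples, not the ones used in the cited work.

```python
# Knowledge retrieval via cloze templates: the same fact may be recovered or
# missed depending on how the template sentence is worded.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "The capital of France is [MASK].",                           # well-shaped template
    "To visit the seat of the French government, go to [MASK].",  # looser phrasing
]
for template in templates:
    top = fill(template)[0]
    print(template, "->", top["token_str"], round(top["score"], 3))
```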

The fact that a linguistic pattern is not observed by our probing classifier does not guarantee that it is not there, and the observation of a pattern does not tell us how it is used. [Tenney 2019, BERT Rediscovers the Classical NLP Pipeline]: POS, parsing, NER, semantic roles and coreference emerge in roughly that order across the layers.


Localizing Knowledge

Contextualized Embeddings

  1. Distilling contextualized representations into static embeddings yields better static representations that capture lexical semantic information.
  2. BERT embeddings occupy a narrow cone in the vector space, and this effect increases from the earlier to the later layers. That is, two random words will on average have a much higher cosine similarity than expected if embeddings were directionally uniform (isotropic); see the per-layer sketch after this list. Indeed, isotropy was shown to be beneficial for static word embeddings, and this might be a fruitful future direction.
  3. BERT’s contextualized embeddings form distinct clusters corresponding to word senses, making it successful at the word sense disambiguation task.
  4. Representations of the same word depend on the position of the sentence in which it occurs.
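A sketch of the narrow-cone/anisotropy observation from item 2 above: average cosine similarity between contextualized embeddings of unrelated tokens, layer by layer. It assumes transformers and PyTorch; the two sentences are arbitrary and chosen only to be topically unrelated.

```python
# Per-layer anisotropy check: mean cosine similarity between token embeddings
# drawn from two unrelated sentences. Higher values = narrower cone.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["The cat sat on the mat.", "Stock prices fell sharply on Monday."]
with torch.no_grad():
    hidden = [model(**tokenizer(s, return_tensors="pt")).hidden_states
              for s in sentences]

for layer in range(len(hidden[0])):                       # embedding layer + 12 encoder layers
    a = F.normalize(hidden[0][layer][0, 1:-1], dim=-1)    # drop [CLS]/[SEP]
    b = F.normalize(hidden[1][layer][0, 1:-1], dim=-1)
    mean_sim = (a @ b.T).mean().item()
    print(f"layer {layer:2d}: mean cosine similarity = {mean_sim:.3f}")
# If the narrow-cone effect holds, the mean similarity grows toward later layers.
```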

Self-Attention (SA) Heads

  1. Some heads seem to specialize in certain types of syntactic relations.
  2. In line with evidence of partial knowledge of syntax, it was found that no single head has the complete syntactic tree info.
  3. Attention weights are weak indicators of subject-verb agreement and reflexive anaphora.
  4. Self-attention as an interpretability mechanism is a popular idea, but visualization is typically limited to qualitative analysis, often with cherry-picked examples. After all, complex features may be encoded by a combination of heads.
  5. [Kovaleva 2019 Revealing the Dark Secrets of BERT] show that most SA heads do not directly encode any non-trivial linguistic information, at least when fine-tuned on GLUE. Much of the model produces the vertical attention pattern on [CLS], [SEP], and punctuation tokens (see the attention-inspection sketch after this list). This redundancy is likely related to the overparameterization issue.
  6. The function of special tokens is not yet well understood. [Clark 2019 What Does BERT Look at?] looks into attention to special tokens, noting that heads in early layers attend more to [CLS], in middle layers to [SEP], and in final layers to periods and commas.
  7. Many SA heads in vanilla BERT seem to naturally learn the same patterns. This explains why pruning them does not have too much impact. It raises the question of how far we could get by intentionally encouraging diverse SA patterns, as opposed to simply pre-setting patterns that we already know the model would learn.
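A sketch of inspecting attention to special tokens (items 5-6 above): the average share of attention mass each layer puts on [SEP]. It assumes transformers and PyTorch; the input sentence is an arbitrary example.

```python
# How much attention goes to [SEP], averaged over heads and query positions,
# layer by layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

enc = tokenizer("The dog chased the cat because it was bored.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions   # 12 tensors of shape (1, heads, seq, seq)

sep_index = enc["input_ids"][0].tolist().index(tokenizer.sep_token_id)
for layer, attn in enumerate(attentions):
    sep_share = attn[0, :, :, sep_index].mean().item()
    print(f"layer {layer:2d}: average attention to [SEP] = {sep_share:.3f}")
```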

BERT Layers

  1. Middle layers of Transformers are best-performing overall and the most transferable across tasks.
  2. There is conflicting evidence about where syntactic chunking is located (early vs. middle layers), and the studies involved have used different probing tasks.
  3. The final layers of BERT are the most task-specific. In pre-training, this means specificity to the MLM task.
  4. Semantics is spread across the entire model. This is rather to be expected: semantics permeates all language, and linguists debate whether meaningless structures can exist at all.

Training Choices

Architecture

  1. The number of heads was not as significant as the number of layers.
  2. The issue of depth is related to the flow of information from the initial layers, which appear to be task-invariant, to the most task-specific layers close to the classifier. If that is the case, a deeper model has more capacity to encode information that is not task-specific.

Training Regime

  1. Large-batch training is beneficial.
  2. Faster training: it can be done in a recursive manner, where a shallower version of the model is trained first and the trained parameters are then copied to the deeper layers. Such a “warm start” can lead to 25% faster training without sacrificing performance (a rough sketch of the idea follows).
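A rough sketch of the warm-start idea in item 2, assuming the HuggingFace transformers BertModel; it illustrates only the parameter-copying step, not the exact recipe of the cited work, and the layer counts are illustrative.

```python
# Progressive stacking sketch: initialize a 12-layer BERT by duplicating the
# layers of an (already trained) 6-layer BERT.
from transformers import BertConfig, BertModel

shallow = BertModel(BertConfig(num_hidden_layers=6))   # stand-in for a trained shallow model
deep = BertModel(BertConfig(num_hidden_layers=12))

# Copy the embeddings, then reuse each trained layer twice (positions i and i + 6).
deep.embeddings.load_state_dict(shallow.embeddings.state_dict())
for i, layer in enumerate(deep.encoder.layer):
    layer.load_state_dict(shallow.encoder.layer[i % 6].state_dict())
# Training then continues on the deeper model from this warm start.
```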

Pre-training Objectives

The paper compiles a whole pile of work on modified pre-training objectives, in considerable detail; if you want the specifics, go read the original paper~

Pre-training Data

  1. Incorporating explicit linguistic information
  2. Explicitly supplying structured knowledge (e.g. entity-enhanced models, knowledge base completion, linearized table data)
  3. [Prasanna 2020 When BERT plays the lottery, all tickets are winning] found that most weights of pre-trained BERT are useful in fine-tuning, although there are ‘better’ and ‘worse’ subnetworks. One explanation is that pre-trained weights help the fine-tuned BERT find wider and flatter areas with smaller generalization error.

Given the large number and variety of proposed modifications, one would wish to know how much impact each of them has. However, due to the overall trend towards larger models, systematic ablations have become expensive.
Most new models claim superiority on standard benchmarks, but gains are often marginal, and estimates of model stability and significance testing are very rare.
Yeah, sadly we are long past the era when significance testing was something one could afford.
Related concerns are about fair comparisons of architectures and reproducibility.

Fine-tuning

  1. [Kovaleva 2019 Revealing Dark Secrets of BERT] reported that most changes during fine-tuning occur in the last two layers, but those changes cause SA to focus on [SEP] rather than on linguistically interpretable patterns. If [SEP] can be interpreted as a ‘no-op’ indicator, as suggested by [Clark 2019], then fine-tuning basically tells BERT what to ignore.
  2. Adapter modules facilitate multi-task learning and cross-lingual transfer (a minimal adapter sketch follows this list).
  3. Gradually drifting into the realm of alchemy: a big methodological challenge in current NLP is that the reported performance improvements of new models may well be within the variation induced by environmental factors (e.g. weight initialization, training data order, random seeds).
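For item 2 above, a minimal sketch of what an adapter module looks like: a bottleneck down-projection, nonlinearity, up-projection and residual connection inserted into each Transformer layer while the pre-trained weights stay frozen. The sizes and activation are illustrative choices, not a specific published configuration.

```python
# Bottleneck adapter sketch: only these small modules (plus the task head)
# receive gradients during fine-tuning; the pre-trained BERT weights are frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection means a freshly initialized adapter starts
        # close to the identity, leaving the frozen layer's output intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```
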
Overparameterization

Pruning as a model analysis technique
  • Pruning is originally a compression technique that reduces the amount of computation by zeroing out certain parts of a large model. It has also been recommended to train larger models and compress them heavily rather than to compress smaller models slightly.
  • Whatever can be pruned without a substantial drop in performance is arguably not actually being used (see the head-pruning sketch after this list).
  • More alchemy: again, the network may not learn to carry out its functions in a disentangled way. There is evidence of a more diffuse representation spread across the full network; if so, ablating individual components harms the weight-sharing mechanism.
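A sketch of head pruning as an analysis tool, assuming the HuggingFace prune_heads API; the particular heads pruned here are arbitrary, chosen only for illustration.

```python
# Prune selected attention heads, then re-evaluate on the downstream task;
# if the metric barely moves, those heads were arguably not doing useful work.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Mapping of layer index -> list of head indices to remove.
model.prune_heads({0: [0, 1, 2, 3], 11: [6, 7, 8, 9, 10, 11]})
```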

Future Directions
  1. Call for more resources
    But harder datasets that cannot be resolved with shallow heuristics are unlikely to emerge if their development is not as valued as modeling work.
  2. Call for benchmarks for a full range of linguistic competence
    “Checklist behavioral testing”
  3. ‘Teach’ reasoning
    It turns out that large pre-trained models do have a lot of knowledge, but they don’t know how to reason on top of it.
  4. Learn what happens at inference time
    Discovering what knowledge actually gets used. This is related to [Elazar 2020 Amnesic Probing].