Before we start
In this post, I mainly focus on the conclusions the authors reach in the paper, and I think these conclusions are worth sharing.
In this paper, the authors study the attention maps of a pre-trained BERT model. Their analysis covers all 144 attention heads of BERT-base (12 layers × 12 heads per layer).
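To make the setup concrete, here is a minimal sketch (not the authors' code) of how one might extract these attention maps with the HuggingFace `transformers` library; the model name and the example sentence are just placeholders.

```python
# Minimal sketch: extract BERT's attention maps with HuggingFace transformers.
# BERT-base has 12 layers x 12 heads = 144 attention heads in total.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")   # adds [CLS] and [SEP] automatically

with torch.no_grad():
    outputs = model(**inputs)

# One attention tensor per layer, each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)      # 12 layers, 12 heads per layer
```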
Surface-Level Patterns in Attention
- There are heads that specialize in attending heavily to the next or previous token, especially in the earlier layers of the network.
- A substantial amount of BERT's attention focuses on just a few tokens. For example, over half of BERT's attention in layers 6-10 focuses on [SEP]. One possible explanation is that [SEP] is used to aggregate segment-level information that other heads can then read. However, if this explanation were true, the authors would expect heads processing [SEP] to attend broadly over the whole segment to build up these representations; instead, such heads attend almost entirely (more than 90%) to [SEP] itself and to the other [SEP] token. They speculate that attention to these special tokens may act as a sort of "no-op" when an attention head's function is not applicable. (A rough way to compute these per-head statistics is sketched after this list.)
- Some attention heads, especially in the lower layers, have very broad attention. The output of these heads is roughly a bag-of-vectors representation of the sentence.
- They also measured the attention entropy of the [CLS] token alone, across all heads. In the last layer, attention from [CLS] has high entropy, indicating very broad attention. This finding makes sense given that the representation of the [CLS] token is used as input for the "next sentence prediction" task during pre-training, so in the last layer it attends broadly to aggregate a representation of the whole input.
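As referenced in the list above, here is a rough sketch (my own approximation, not the paper's exact measurement) of how one might compute, for a single sentence, each head's average attention to [SEP] and the average entropy of its attention distribution. It reuses the `tokenizer`, `inputs`, and `attentions` objects from the snippet above.

```python
import torch

sep_id = tokenizer.sep_token_id
input_ids = inputs["input_ids"][0]               # (seq_len,)
sep_mask = (input_ids == sep_id).float()         # 1.0 at [SEP] positions, else 0.0

for layer_idx, layer_attn in enumerate(attentions):
    attn = layer_attn[0]                         # (num_heads, seq_len, seq_len)
    # (a) fraction of attention mass that lands on [SEP], averaged over query positions
    attn_to_sep = (attn * sep_mask).sum(dim=-1).mean(dim=-1)             # (num_heads,)
    # (b) entropy of each query position's attention distribution, averaged per head
    entropy = -(attn * torch.log(attn + 1e-9)).sum(dim=-1).mean(dim=-1)  # (num_heads,)
    print(f"layer {layer_idx + 1}: "
          f"mean attn to [SEP] = {attn_to_sep.mean().item():.3f}, "
          f"mean entropy = {entropy.mean().item():.3f}")
```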
Probing Individual Attention Heads
- There is no single attention head that does well at syntax "overall".
- However, they do find that certain attention heads specialize to specific dependency relations, sometimes achieving high accuracy (a toy illustration of this kind of per-head probing is sketched below).
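To illustrate what such a probe might look like, here is a toy sketch (my own illustration, not the paper's evaluation code) that treats a single head as a dependency "parser": for each token, predict that its syntactic head is the most-attended-to token, then score against gold head indices. The layer/head indices and the gold annotations below are hypothetical placeholders; `attentions` comes from the first snippet.

```python
import torch

def head_as_parser_accuracy(attn_head, gold_heads):
    """attn_head: (seq_len, seq_len) attention map of a single head.
    gold_heads: gold_heads[i] is the index of token i's syntactic head
                (hypothetical annotations, e.g. from a dependency treebank)."""
    predicted = attn_head.argmax(dim=-1)          # most-attended-to token per position
    correct = sum(int(predicted[i].item() == h) for i, h in enumerate(gold_heads))
    return correct / len(gold_heads)

# Layer 8, head 10 as an arbitrary example; the indices here are placeholders.
attn_head = attentions[7][0, 9]                   # (seq_len, seq_len)
gold_heads = [0] * attn_head.shape[0]             # placeholder gold annotations
print(head_as_parser_accuracy(attn_head, gold_heads))
```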