Before we start
In this post, I mainly focus on the conclusions the authors reach in the paper, and I think these conclusions are worth sharing.
In this paper, the authors study the attention maps of a pre-trained BERT model. Their analysis covers all 144 attention heads of BERT-base (12 layers × 12 heads per layer).
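To make the setup concrete, here is a minimal sketch (not the authors' code) of how one might extract these attention maps with the HuggingFace `transformers` library; the model name and the example sentence are just placeholders.

```python
# Minimal sketch: extract BERT's attention maps with HuggingFace transformers.
# BERT-base has 12 layers x 12 heads = 144 attention heads in total.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")   # adds [CLS] and [SEP] automatically

with torch.no_grad():
    outputs = model(**inputs)

# One attention tensor per layer, each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)      # 12 layers, 12 heads per layer
```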
Surface-Level Patterns in Attention
- There are heads that specialize in attending heavily to the next or previous token, especially in the earlier layers of the network.
- A substantial amount of BERT's attention focuses on just a few tokens. For example, over half of BERT's attention in layers 6-10 focuses on [SEP]. One possible explanation is that [SEP] is used to aggregate segment-level information that other heads can then read. However, if this explanation were true, the authors would expect heads processing [SEP] to attend broadly over the whole segment to build up these representations; instead, such heads attend almost entirely (more than 90%) to [SEP] itself and to the other [SEP] token. They speculate that attention to these special tokens may act as a sort of "no-op" when an attention head's function is not applicable. (A rough way to compute these per-head statistics is sketched after this list.)
- Some attention heads, especially in the lower layers, have very broad attention. The output of these heads is roughly a bag-of-vectors representation of the sentence.
- They also measured the attention entropy of the [CLS] token alone, across all heads. In the last layer, attention from [CLS] has high entropy, indicating very broad attention. This finding makes sense given that the representation of the [CLS] token is used as input for the "next sentence prediction" task during pre-training, so in the last layer it attends broadly to aggregate a representation of the whole input.
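As referenced in the list above, here is a rough sketch (my own approximation, not the paper's exact measurement) of how one might compute, for a single sentence, each head's average attention to [SEP] and the average entropy of its attention distribution. It reuses the `tokenizer`, `inputs`, and `attentions` objects from the snippet above.

```python
import torch

sep_id = tokenizer.sep_token_id
input_ids = inputs["input_ids"][0]               # (seq_len,)
sep_mask = (input_ids == sep_id).float()         # 1.0 at [SEP] positions, else 0.0

for layer_idx, layer_attn in enumerate(attentions):
    attn = layer_attn[0]                         # (num_heads, seq_len, seq_len)
    # (a) fraction of attention mass that lands on [SEP], averaged over query positions
    attn_to_sep = (attn * sep_mask).sum(dim=-1).mean(dim=-1)             # (num_heads,)
    # (b) entropy of each query position's attention distribution, averaged per head
    entropy = -(attn * torch.log(attn + 1e-9)).sum(dim=-1).mean(dim=-1)  # (num_heads,)
    print(f"layer {layer_idx + 1}: "
          f"mean attn to [SEP] = {attn_to_sep.mean().item():.3f}, "
          f"mean entropy = {entropy.mean().item():.3f}")
```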
Probing Individual Attention Heads
- There is no single attention head that does well at syntax "overall".
- However, they do find that certain attention heads specialize to specific dependency relations, sometimes achieving high accuracy (a toy illustration of this kind of per-head probing is sketched below).
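To illustrate what such a probe might look like, here is a toy sketch (my own illustration, not the paper's evaluation code) that treats a single head as a dependency "parser": for each token, predict that its syntactic head is the most-attended-to token, then score against gold head indices. The layer/head indices and the gold annotations below are hypothetical placeholders; `attentions` comes from the first snippet.

```python
import torch

def head_as_parser_accuracy(attn_head, gold_heads):
    """attn_head: (seq_len, seq_len) attention map of a single head.
    gold_heads: gold_heads[i] is the index of token i's syntactic head
                (hypothetical annotations, e.g. from a dependency treebank)."""
    predicted = attn_head.argmax(dim=-1)          # most-attended-to token per position
    correct = sum(int(predicted[i].item() == h) for i, h in enumerate(gold_heads))
    return correct / len(gold_heads)

# Layer 8, head 10 as an arbitrary example; the indices here are placeholders.
attn_head = attentions[7][0, 9]                   # (seq_len, seq_len)
gold_heads = [0] * attn_head.shape[0]             # placeholder gold annotations
print(head_as_parser_accuracy(attn_head, gold_heads))
```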