What Does BERT Look At? An Analysis of BERT’s Attention (Paper Summary)

This post analyzes BERT's attention mechanism. Some attention heads specialize in attending to the previous or next token, especially in the earlier layers. A large share of BERT's attention is concentrated on a few tokens such as [SEP]. The researchers speculate that [SEP] might be used to aggregate segment-level information, but the attention heads do not actually attend broadly over the whole segment. In addition, some attention heads exhibit syntax-sensitive behavior even though they were never explicitly trained on syntax. By combining attention heads, the study shows that BERT picks up some syntactic structure from self-supervised pre-training alone. Finally, clustering the heads shows that heads within the same layer tend to have similar attention distributions.

Links to previous posts

Before we start

In this post, I mainly focus on the conclusions the authors reach in the paper, which I think are worth sharing.

In this paper, the authors study the attention maps of a pre-trained BERT model. Their analysis focuses on the 144 attention heads in BERT.
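
To make this concrete, the attention maps of all 144 heads (12 layers × 12 heads in BERT-base) can be extracted with a few lines of code. The sketch below uses the HuggingFace `transformers` library, which is just a convenient assumption on my part and not the tooling used in the paper; it produces the same kind of per-head attention matrices the authors analyze.

```python
# Minimal sketch (assumes the HuggingFace transformers library, not the paper's own code).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer (12 for bert-base),
# each of shape (batch, num_heads, seq_len, seq_len) -> 12 x 12 = 144 heads in total.
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)
```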

Surface-Level Patterns in Attention

  1. There are heads that specialize to attending heavily on the next or previous token, especially in earlier layers of the network.
  2. A substantial amount of BERT’s attention focuses on a few tokens. For example, over half of BERT’s attention in layers 6-10 focuses on [SEP]. One possible explanation is that [SEP] is used to aggregate segment-level information which can then be read by other heads.

    If this explanation were true, however, they would expect attention heads processing [SEP] to attend broadly over the whole segment to build up these representations. Instead, these heads attend almost entirely (more than 90%) to [SEP] itself and the other [SEP] token.

    They speculate that attention over these special tokens might be used as a sort of “no-op” when the attention head’s function is not applicable.

  3. Some attention heads, especially in lower layers, have very broad attention. The output of these heads is roughly a bag-of-vectors representation of the sentence.

  4. They also measured entropies for all attention heads from only the [CLS] token. The last layer has a high entropy from [CLS], indicating very broad attention. This finding makes sense given that the representation for the [CLS] token is used as input for the “next sentence prediction” task during pre-training, so it attends broadly to aggregate a representation for the whole input in the last layer. (Both the [SEP] fraction from item 2 and this entropy are easy to compute from the attention tensors; see the sketch after this list.)
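
These surface-level statistics are straightforward to reproduce from the extracted attention tensors. The sketch below continues from the earlier snippet (reusing `attentions`, `inputs`, and `tokenizer`) and computes, for every head in every layer, the fraction of attention mass that lands on [SEP] and the entropy of the attention distribution from [CLS]. It is my own illustration rather than the authors' measurement code.

```python
# Continues from the sketch above: `attentions`, `inputs`, and `tokenizer` are reused.
import torch

input_ids = inputs["input_ids"][0]                    # (seq_len,)
sep_mask = (input_ids == tokenizer.sep_token_id)      # True at [SEP] positions
cls_index = 0                                         # [CLS] is the first token

for layer, att in enumerate(attentions):              # att: (1, heads, seq, seq)
    att = att[0]                                      # (heads, seq, seq)

    # For each head: fraction of attention landing on [SEP],
    # averaged over all source positions.
    sep_frac = att[:, :, sep_mask].sum(dim=-1).mean(dim=-1)        # (heads,)

    # For each head: entropy of the attention distribution from [CLS].
    cls_att = att[:, cls_index, :]                                 # (heads, seq)
    cls_entropy = -(cls_att * torch.log(cls_att + 1e-12)).sum(dim=-1)

    print(f"layer {layer}: mean [SEP] fraction = {sep_frac.mean().item():.3f}, "
          f"mean [CLS] entropy = {cls_entropy.mean().item():.3f}")
```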

Probing Individual Attention Heads

  1. There is no single attention head that does well at syntax “overall”.

  2. They do find that certain attention heads specialize to specific dependency relations, sometimes with high accuracy. (A toy illustration of this kind of head-level probing is sketched below.)
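
To illustrate how this kind of head-level probing can work, the sketch below scores a single word-level attention head on one dependency relation by checking whether each dependent word's most-attended word is its gold syntactic head. The gold annotations and the toy attention matrix here are made up for illustration; the paper's actual evaluation (including how wordpiece attention is converted to word-level attention) is more involved.

```python
import torch

def head_accuracy(attention, gold_heads):
    """Score one attention head on a dependency relation.

    attention:  (seq_len, seq_len) word-level attention map for a single head;
                row i is word i's distribution over the words it attends to.
    gold_heads: dict {dependent_index: gold_head_index} for the relation of interest.
    """
    correct = 0
    for dep, gold in gold_heads.items():
        predicted = int(torch.argmax(attention[dep]))  # most-attended word
        correct += int(predicted == gold)
    return correct / max(len(gold_heads), 1)

# Toy example with made-up numbers: a 4-word sentence where word 1 should
# attend to word 2 (e.g., a determiner to its noun).
toy_attention = torch.tensor([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.2, 0.4, 0.2],
    [0.1, 0.1, 0.1, 0.7],
])
print(head_accuracy(toy_attention, {1: 2}))  # -> 1.0
```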
