Original article: DECONSTRUCTING BERT, PART 2: VISUALIZING THE INNER WORKINGS OF ATTENTION
Continuing from the previous post: NLP models: an analysis of BERT's architecture — attention + transformer (translated from: Deconstructing BERT)
This is the second part of a two-part series on deconstructing BERT.
In part 1, Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters, I described how BERT’s attention mechanism can take on many different forms. For example, one attention head focused nearly all of the attention on the next word in the sequence; another focused on the previous word (see illustration below). In both cases, BERT essentially learned a sequential update pattern, something that human designers of neural networks have explicitly encoded in architectures such as Recurrent Neural Networks (though BERT’s version is more like a CNN). Later, I will show how BERT is able to mimic a Bag-of-Words model as well.
- This is the second part of the two-part series on deconstructing BERT.
- Part 1 described how BERT's attention mechanism can take many different forms: for example, one attention head focuses almost entirely on the next word in the sentence, another on the previous word. Both patterns essentially learn a sequential update, something that designers of RNNs encode explicitly.
- Next, we show how BERT can also mimic a Bag-of-Words model.
So how does BERT achieve these stunning feats of plasticity? To answer this question, I extend the visualization tool from Part 1 to probe deeper into the mind of BERT, to expose the neurons that give BERT its shape-shifting superpowers. You can try the updated visualization tool in this Colab notebook, or find it on Github.
The original visualization tool (based on the excellent Tensor2Tensor implementation by Llion Jones) attempted to answer the what of attention: that is, what attention structures is BERT learning? To answer the how, I added an attention details view, which visualizes how attention is computed. The details view is invoked by clicking on the ⊕ icon. You can see a demo below, or skip ahead to the screen shots:
- How does BERT achieve these feats of plasticity? To answer this, the visualization tool from Part 1 is extended to probe deeper and expose the neurons that give BERT its abilities.
- The original tool tried to answer what attention structures BERT learns; to answer how, an attention details view was added that shows how attention is computed. A sketch of how the raw attention weights behind the tool can be extracted follows below.
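Before walking through the tool itself, here is a minimal sketch (my own, not part of the article or of bertviz) of how the raw attention weights that the tool visualizes can be pulled from a pretrained BERT. It assumes the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint, neither of which the article itself relies on.

```python
# Minimal sketch: extract per-layer, per-head attention weights from BERT.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# output_attentions=True makes the model return attention probabilities.
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text_a = "the cat sat on the mat"
text_b = "the cat lay on the rug"
inputs = tokenizer(text_a, text_b, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)  # 12 layers, 12 heads for bert-base
```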
VISUALIZATION TOOL OVERVIEW
BERT is a bit like a Rube Goldberg machine: though the individual components are fairly intuitive, the end-to-end system can be hard to grasp. For now, I’ll walk through the parts of BERT’s attention architecture represented in the visualization tool. (For a comprehensive tutorial on BERT, I recommend The Illustrated Transformer and The Illustrated BERT.)
The new attention details view is shown below. Note that positive values are colored blue and negative values orange, with color intensity reflecting the magnitude of the value. All vectors are of length 64 and are specific to a particular attention head. Like the original visualization tool, connecting lines are weighted based on the attention score between the respective words.
- BERT is a bit like a Rube Goldberg machine: the individual components are fairly intuitive, but the end-to-end system is hard to grasp.
- The figure below shows the attention details view: positive values are colored blue, negative values orange, and the color intensity reflects the magnitude of the value.
- All vectors have length 64 and belong to a particular attention head.
- The connecting lines are weighted by the attention score between the corresponding words.
Let’s break this down:
Query q: the query vector q encodes the word/position on the left that is paying attention, i.e. the one that is “querying” the other words. In the example above, the query vector for “the” (the selected word) is highlighted.
Key k: the key vector k encodes the word on the right to which attention is being paid. The key vector together with the query vector determine the attention score between the respective words, as described below.
q×k (element-wise): the element-wise product of the query vector and a key vector. This product is computed between the selected query vector and each of the key vectors. This is a precursor to the dot product (the sum of the element-wise product) and is included for visualization purposes because it shows how individual elements in the query and key vectors contribute to the dot product.
q·k: the dot product of the selected query vector and each of the key vectors. This is the unnormalized attention score.
Softmax: the softmax of q·k / 8 across all target words. This normalizes the attention scores to be positive and sum to one. The constant factor 8 is the square root of the vector length (64), and is included for reasons described in this paper.
- To break this down:
- Query q: the query vector q encodes the attending word/position on the left, i.e. the one doing the "querying", such as the highlighted "the" in the figure above.
- Key k: the key vector k encodes the attended-to word on the right; k and q together determine the attention score.
- q×k (element-wise): the element-wise product of the selected query vector and each key vector. It is the precursor to the dot product and makes it easy to see which elements contribute most to the attention score.
- q·k: the dot product of q and k, the unnormalized attention score.
- Softmax: the softmax of q·k / 8 over all target words, which normalizes the attention scores to be positive and sum to one. The constant 8 is the square root of the vector length (64); see the paper linked above for the reasoning. A numerical sketch of these steps follows below.
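To make the steps in the list concrete, here is a small numerical sketch of the computation for a single head with 64-dimensional vectors. The vectors are random stand-ins rather than values from a real BERT checkpoint; only the arithmetic (element-wise product, dot product, softmax of q·k / 8) follows the description above.

```python
# Toy walk-through of one attention head's score computation.
import numpy as np

d_k = 64                           # per-head query/key dimension
rng = np.random.default_rng(0)

q = rng.normal(size=d_k)           # query vector for the selected word
K = rng.normal(size=(6, d_k))      # key vectors for 6 target words

elementwise = q * K                # q×k: shows which neurons drive each score
scores = elementwise.sum(axis=1)   # q·k: unnormalized attention scores

# softmax of q·k / sqrt(d_k); sqrt(64) = 8 is the scaling factor in the text
scaled = scores / np.sqrt(d_k)
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()

print(scores.round(2))
print(weights.round(3), weights.sum())  # positive, sums to 1
```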
EXPLAINING BERT’S ATTENTION PATTERNS
In Part 1, I identified several patterns in the attention structure across BERT’s attention heads. Let’s see if we can use the visualization tool to understand how BERT forms these patterns.
DELIMITER-FOCUSED ATTENTION PATTERNS
Let’s start with the simple case where most attention is focused on the sentence separator [SEP] token (Pattern 6 from Part 1). As suggested in Part 1, this pattern may be a way for BERT to propagate sentence-level state to the word level:
- Start with a simple case: most of the attention focuses on the sentence separator token [SEP] (Pattern 6 from Part 1), which may be a way for BERT to propagate sentence-level state to the word level.
So, how exactly is BERT able to fixate on the [SEP] tokens? Let’s see if the visualization tool can provide some clues. Here we see the attention details view of the example above:
- How exactly does BERT fixate on the [SEP] tokens? The details view is shown below:
In the Key column, the key vectors for the two occurrences of [SEP] carry a distinctive signature: they both have a small number of active neurons with high positive (blue) or low negative (orange) values, and a larger number of neurons with values close to zero (light blue/orange or white):
- In the Key column, the key vectors of the two [SEP] occurrences have a distinctive signature: only a small number of neurons are strongly active (deep blue or deep orange), while most have values close to zero (light colors or white).
The query vectors tend to match the [SEP] key vectors along those active neurons, resulting in high values for the element-wise product q×k, as in this example:
- The query vectors match the [SEP] key vectors on those active neurons, so the element-wise products q×k are large at those positions.
The query vectors for the other words follow a similar pattern: they match the [SEP] key vector along the same set of neurons. Thus it seems that BERT has designated a small set of neurons as “[SEP]-matching neurons,” and query vectors are assigned values that match the [SEP] key vectors at these positions. The result is the [SEP]-focused attention pattern.
- The query vectors of the other words follow the same pattern: they match the [SEP] key vector on the same set of neurons. It seems BERT designates a small set of neurons as "[SEP]-matching neurons"; query vectors take values that match the [SEP] key vectors at those positions, which produces the [SEP]-focused attention pattern. A sketch of how such dominant neurons might be identified appears below.
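As a hedged illustration of the kind of inspection described above (not part of the original article), the sketch below ranks the neurons of one head by how much they contribute to q·k for a [SEP] key vector. The vectors `q` and `k_sep` are random placeholders standing in for the per-head activations shown in the details view.

```python
# Rank neurons by their contribution to the q·k score for a [SEP] key.
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=64)       # query vector of the attending word (placeholder)
k_sep = rng.normal(size=64)   # key vector of a [SEP] token (placeholder)

contrib = q * k_sep                       # per-neuron contribution to q·k
top = np.argsort(-np.abs(contrib))[:5]    # the few dominant neurons
print(top, contrib[top].round(2), contrib.sum().round(2))
```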
BAG OF WORDS ATTENTION PATTERN
This is a less common pattern, which was not discussed in Part 1. In this pattern, attention is divided fairly evenly across all words in the same sentence:
- This is a less common pattern, not discussed in Part 1, in which attention is divided fairly evenly across all the words in the same sentence.
The effect of this pattern is to distribute sentence-level state to the word level, as was likely the case for the first pattern as well. BERT is essentially computing a Bag-of-Words embedding by taking an (almost) unweighted average of the word embeddings (which are the value vectors — see above-mentioned tutorials for details on this.)
So how does BERT finesse the queries and keys to form this attention pattern? Let’s again turn to the attention details view:
- The effect of this pattern is to distribute sentence-level state to the word level.
- BERT is essentially computing a Bag-of-Words embedding by taking an (almost) unweighted average of the value vectors; a toy illustration of this follows, and the details view is shown after it:
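As a toy check of that claim (my own sketch, not from the article): when the attention weights over a sentence are exactly uniform, the attention output reduces to the plain average of the value vectors, i.e. a Bag-of-Words style embedding.

```python
# Uniform attention weights turn the attention output into a plain average.
import numpy as np

rng = np.random.default_rng(2)
V = rng.normal(size=(6, 64))          # value vectors for 6 words in a sentence
uniform_weights = np.full(6, 1 / 6)   # attention divided evenly across words

attn_output = uniform_weights @ V     # weighted sum of value vectors
print(np.allclose(attn_output, V.mean(axis=0)))  # True: the unweighted mean
```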
In the q×k column, we see a clear pattern: a small number of neurons (2–4) dominate the calculation of the attention scores. When query and key vector are in the same sentence (the first sentence, in this case), the product shows high values (blue) at these neurons. When query and key vector are in different sentences, the product is strongly negative (orange) at these same positions, as in this example:
- In the q×k column, a small number of neurons (2–4) dominate the calculation of the attention scores. When the query and key vectors are in the same sentence (the first sentence here), the product is strongly positive (deep blue) at these neurons; when they are in different sentences, the product is strongly negative (deep orange) at the same positions.
When query and key are both from sentence 1, they tend to have values with the same sign along the active neurons, resulting in a positive product. When the query is from sentence 1, and the key is from sentence 2, the same neurons tend to have values with opposite signs, resulting in a negative product.
But how does BERT know the concept of “sentence”, especially in the first layer of the network before higher-level abstractions are formed? The answer lies in the sentence-level embeddings that are added to the input layer (see Figure 1, below). The information encoded in these sentence embeddings flows to downstream variables, i.e. queries and keys, and enables them to acquire sentence-specific values.
- When the query and key both come from sentence 1, they tend to have the same sign on the active neurons, giving a positive product; when they come from different sentences, the same neurons tend to have opposite signs, giving a negative product.
- How does BERT know the concept of a "sentence", especially in the first layer of the network before higher-level abstractions are formed? The answer lies in the sentence-level embeddings added at the input layer (see Figure 1 below).
- The information encoded in these sentence embeddings flows to downstream variables, i.e. the query and key vectors, giving them sentence-specific values. In the figure, the segment embeddings encode which sentence a token belongs to, the token embeddings encode the word itself, and the position embeddings encode the word's position. A sketch of this input construction appears after Figure 1.
Figure 1: Segment embeddings for Sentences A and B are added to the input embeddings, along with position embeddings. (From BERT paper.)
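The following is a simplified sketch of how the input representation in Figure 1 is assembled: token, segment (sentence A/B), and position embeddings are summed element-wise. The dimensions follow bert-base (30522-token vocabulary, hidden size 768, 512 positions); the real model additionally applies layer normalization and dropout to the sum, and the embedding weights here are randomly initialized rather than pretrained.

```python
# Simplified BERT input representation: token + segment + position embeddings.
import torch
import torch.nn as nn

vocab_size, hidden, max_pos = 30522, 768, 512

token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)        # sentence A vs. sentence B
position_emb = nn.Embedding(max_pos, hidden)

# Toy ids for a two-sentence input (101/102 are BERT's [CLS]/[SEP] ids).
token_ids = torch.tensor([[101, 1996, 4937, 102, 1996, 3899, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])                  # A=0, B=1
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

input_repr = (token_emb(token_ids)
              + segment_emb(segment_ids)
              + position_emb(position_ids))
print(input_repr.shape)  # (1, 7, 768)
```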
NEXT-WORD ATTENTION PATTERNS
In the next-word attention pattern, virtually all the attention is focused on the next word in the input sequence, except at the delimiter tokens:
- In this pattern, virtually all of the attention is focused on the next word in the input sequence, except at the delimiter tokens.
This attention pattern enables BERT to capture sequential relationships, e.g. bigrams. Let’s check out the attention detail view for the above example:
- This pattern lets BERT capture sequential relationships such as bigrams; the details view for the example above is shown below:
We see that the product of the query vector for “the” and the key vector for “store” (the next word) is strongly positive across most neurons. For tokens other than the next token, the key-query product contains some combination of positive and negative values. The result is a high attention score between “the” and “store”.
For this attention pattern, a large number of neurons figure into the attention score, and these neurons differ depending on the token position, as illustrated here:
- The product of the query vector for "the" and the key vector for "store" (the next word) is strongly positive across most neurons; for tokens other than the next one, the product mixes positive and negative values. The result is a high attention score between "the" and "store".
- In this pattern, a large number of neurons figure into the attention score, and which neurons they are depends on the token position.
This behavior differs from the delimiter-focused and the sentence-focused attention patterns, in which a small, fixed set of neurons determine the attention scores. For those two patterns, only a few neurons are required because the patterns are so simple, and there is little variation in the words that receive attention. In contrast, the next-word attention pattern needs to track which of the 512 words receives attention from a given position, i.e., which is the next word. To do so it needs to generate queries and keys such that each query vector matches with a unique key vector from the 512 possibilities. This would be difficult to accomplish using a small subset of neurons.
So how is BERT able to generate these position-aware queries and keys? In this case, the answer lies in BERT’s position embeddings, which are added to the word embeddings at the input layer (see Figure 1). BERT learns a unique position embedding for each of the 512 positions in the input sequence, and this position-specific information can flow through the model to the key and query vectors.
- This differs from the delimiter-focused and sentence-focused patterns above, where a small, fixed set of neurons determines the scores. Those two patterns are simple enough that few neurons suffice, whereas the next-word pattern must track which of the 512 positions receives attention from a given position, i.e. which word is the next one. Each query vector has to match a unique key vector out of 512 possibilities, which would be hard to accomplish with only a small subset of neurons.
- How does BERT produce these position-aware queries and keys? The answer lies in the position embeddings added at the input layer: BERT learns a distinct embedding for each of the 512 positions in the input sequence, and this position-specific information flows through the model into the query and key vectors. A quick check of these learned position embeddings follows below.
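As a quick, hedged check of that claim (assuming the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint, neither of which the article itself uses), the learned position embedding table can be inspected directly:

```python
# bert-base learns one 768-dimensional vector per input position (512 total).
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
pos = model.embeddings.position_embeddings.weight
print(pos.shape)  # torch.Size([512, 768])
```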
NOTES
I have only covered some of the coarse-level attention patterns discussed in Part 1 and have not touched on lower-level dynamics around linguistic phenomena such as coreference, synonymy, etc. I hope that this tool can help provide intuition in many of these cases.
- Only some of the coarse-level patterns from Part 1 are covered here; lower-level dynamics around linguistic phenomena such as coreference and synonymy are not, but the visualization tool can help build intuition in many of those cases as well.
TRY IT OUT!
Please try out the visualization tool and share what you find!
Colab: https://colab.research.google.com/drive/1Nlhh2vwlQdKleNMqpmLDBsAwrv_7NnrB
Github: https://github.com/jessevig/bertviz
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.
Takeaways:
- BERT learns relationships between the two sentences via the segment embeddings.
- It learns positional and sequential relationships between words via the position embeddings.
- It learns word representations via the token (word) embeddings.
- A single BERT model captures all of this information, which is why it can be applied in so many scenarios.