# Transformer Structure Analysis

## Self-attention

### 1. Input

$$
X = EmbeddingLookup(X) + PositionalEncoding \\
X.shape == (batch\_size, seq\_len, embedding\_dim)
$$
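
The input step can be sketched in NumPy as follows. This is only an illustration: the text does not say which positional encoding is used, so the sinusoidal form is an assumption, and `vocab_size`, `embed_table`, `token_ids` and `positional_encoding` are hypothetical names.

```python
import numpy as np


def positional_encoding(seq_len, embedding_dim):
    """Sinusoidal positional encoding (assumed), shape (seq_len, embedding_dim)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(embedding_dim)[None, :]      # (1, embedding_dim)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / embedding_dim)
    pe = np.zeros((seq_len, embedding_dim))
    pe[:, 0::2] = np.sin(angle[:, 0::2])       # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])       # odd dimensions use cosine
    return pe


batch_size, seq_len, embedding_dim, vocab_size = 2, 5, 8, 100
rng = np.random.default_rng(0)
embed_table = rng.normal(size=(vocab_size, embedding_dim))   # hypothetical embedding table
token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))

# Embedding lookup is plain row indexing; the encoding broadcasts over the batch axis
X = embed_table[token_ids] + positional_encoding(seq_len, embedding_dim)
print(X.shape)
```

The printed shape is `(batch_size, seq_len, embedding_dim)`, matching the formula above.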

### 2. Computing Q, K, V

$$
Q = Linear(X) = XW_{Q} \\
K = Linear(X) = XW_{K} \\
V = Linear(X) = XW_{V} \\
W.shape == (embedding\_dim, embedding\_dim) \\
Q.shape, K.shape, V.shape == (batch\_size, seq\_len, embedding\_dim)
$$
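
A small NumPy check of these shapes, assuming bias-free linear layers and random weights (the variable names are illustrative):

```python
import numpy as np

batch_size, seq_len, embedding_dim = 2, 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(batch_size, seq_len, embedding_dim))

# One (embedding_dim, embedding_dim) weight matrix per projection (bias omitted)
W_Q, W_K, W_V = (rng.normal(size=(embedding_dim, embedding_dim)) for _ in range(3))

# Matrix multiplication acts on the last axis, so the batch and sequence axes are untouched
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)
```

Each token vector is projected independently, which is why the output keeps the shape of `X`.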

### 3. Computing the attention weights

$$d = Q \cdot K^T \tag{1}$$
$$d = mask(d) \tag{2}$$
$$d = d / \sqrt{d_k} \tag{3}$$
$$d = softmax(d) \tag{4}$$

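• (1) computes the relationship (similarity) between every pair of tokens. With h heads, each head works on embedding_dim/h features, so the shapes are (batch_size, h, seq_len, embedding_dim/h) * (batch_size, h, embedding_dim/h, seq_len) ==> (batch_size, h, seq_len, seq_len). Q, K and V each consist of batch_size * h matrices: picture a (batch_size, h) chessboard with a (seq_len, embedding_dim/h) matrix on every square. The first two dimensions are left alone; the matrices on matching squares are taken out, multiplied, and the result is put back on the same square.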

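• (3) Dividing by $\sqrt{d_k}$ corrects the variance, which the dot product has pushed away from 1. My understanding: multiplying by the transposed matrix performs many inner products, and each inner product sums $d_k$ terms, which changes the distribution of the data. If the components have mean 0 and variance 1, the result tends towards $mean=0, variance=d_k$. To bring the variance back to 1, every result is divided by $\sqrt{d_k}$. Variance scales with the square of a constant factor, so this cancels the existing $d_k$.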

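```python
import numpy as np

# Mean 0, variance 1
a = np.random.randn(2, 3000)
b = np.random.randn(3000, 2)
c = a.dot(b)

print(np.var(a))
print(np.mean(c))
print(np.var(c))

# Output of one run (c has only 4 entries, so the values change a lot between runs):
# 1.0262973662546435
# 25.625943965792157
# 1347.432397285718
```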


> To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k=\sum_{i=1}^{d_{k}} q_{i} k_{i}$, has mean 0 and variance $d_k$.
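
To see the effect of the scaling itself, the following sketch estimates the variance of many dot products before and after dividing by $\sqrt{d_k}$. The values `d_k = 64` and `n_pairs = 20_000` are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_pairs = 64, 20_000

# Independent components with mean 0 and variance 1
q = rng.standard_normal((n_pairs, d_k))
k = rng.standard_normal((n_pairs, d_k))

dots = np.sum(q * k, axis=1)       # one dot product per (q, k) pair
scaled = dots / np.sqrt(d_k)       # the scaling used in step (3)

print(np.var(dots))      # close to d_k
print(np.var(scaled))    # close to 1
```

With this many samples the estimates are stable, unlike a product that yields only a handful of entries.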

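• (4) turns the scores into the weight that each token receives. Multiplying these weights by V, $d \cdot V$, blends the tokens' meanings according to the weights. Finally, the results of all heads are concatenated into one complete embedding, which is the output of self-attention.

(batch_size, h, seq_len, seq_len) * (batch_size, h, seq_len, embedding/h) ==> (batch_size, h, seq_len, embedding/h)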

```python
# (batch, seq_len, h, embed/head) -> (batch, h, seq_len, embed/head)
q = self.qry(y).view(y.size(0), y.size(1), self.head, -1).transpose(1, 2)
k = self.key(x).view(x.size(0), x.size(1), self.head, -1).transpose(1, 2)
v = self.val(x).view(x.size(0), x.size(1), self.head, -1).transpose(1, 2)
d = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))     # similarity (batch, h, seq, seq)
a = F.softmax(d, dim=-1)            # (batch, h, seq, seq)

# (batch, h, seq_len, seq_len) * (batch, h, seq_len, embedding/h)
# => (batch, h, seq_len, embedding/h)
# => (batch, seq_len, h, embedding/h)
c = torch.matmul(a, v).transpose(1, 2)

# (batch, seq_len, embedding)
c = c.contiguous().view(c.size(0), c.size(1), -1)
```
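
The same computation can be written as a self-contained NumPy function, which makes the shapes easy to check outside a model class. This is a sketch rather than the author's implementation: `multi_head_attention`, `split` and the convention that `mask` is `True` at blocked positions are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def multi_head_attention(Q, K, V, h, mask=None):
    """Q: (batch, seq_q, embed); K, V: (batch, seq_k, embed); mask: True where blocked."""
    batch, seq_q, embed = Q.shape
    seq_k = K.shape[1]
    d_k = embed // h

    def split(t, n):
        # (batch, n, embed) -> (batch, h, n, embed/h)
        return t.reshape(batch, n, h, d_k).transpose(0, 2, 1, 3)

    q, k, v = split(Q, seq_q), split(K, seq_k), split(V, seq_k)
    d = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)   # (batch, h, seq_q, seq_k)
    if mask is not None:
        d = np.where(mask, -1e9, d)                  # blocked scores get near-zero weight
    a = softmax(d, axis=-1)
    c = (a @ v).transpose(0, 2, 1, 3)                # (batch, seq_q, h, embed/h)
    return c.reshape(batch, seq_q, embed), a         # heads concatenated again


rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5, 8))
out, weights = multi_head_attention(X, X, X, h=2)
print(out.shape, weights.shape)
```

Every row of `weights` sums to 1, so each output vector is a weighted average of the value vectors of one head.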


## The complete Encoder process

1). Token embedding and positional encoding:
$$X = EmbeddingLookup(X) + PositionalEncoding \tag{eq.2}$$
$$X \in (batch\_size * seq\_len * embed\_dim)$$
2). Self-attention:
$$Q = Linear(X) = XW_{Q}$$
$$K = Linear(X) = XW_{K} \tag{eq.3}$$
$$V = Linear(X) = XW_{V}$$
$$X_{attention} = SelfAttention(Q, \ K, \ V) \tag{eq.4}$$
3). Residual connection and $Layer \ Normalization$:
$$X_{attention} = X + X_{attention} \tag{eq. 5}$$
$$X_{attention} = LayerNorm(X_{attention}) \tag{eq. 6}$$
4). Two linear maps with an activation function in between, for example $ReLU$:
$$X_{hidden} = Linear(Activate(Linear(X_{attention}))) \tag{eq. 7}$$
5). Repeat step 3):
$$X_{hidden} = X_{attention} + X_{hidden}$$
$$X_{hidden} = LayerNorm(X_{hidden})$$
$$X_{hidden} \in (batch\_size \ * \ seq\_len \ * \ embed\_dim)$$
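
Steps 2) to 5) can be traced in a compact NumPy sketch. To stay short it makes several simplifications that are assumptions of the example: a single attention head, no biases, no learned scale or shift in LayerNorm, and a hypothetical parameter dictionary `p`.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalise over the embedding axis (learned scale and shift omitted)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def encoder_layer(X, p):
    # 2) self-attention (single head to keep the sketch short)
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    X_attention = softmax(scores) @ V
    # 3) residual connection + LayerNorm
    X_attention = layer_norm(X + X_attention)
    # 4) two linear maps with ReLU in between
    X_hidden = np.maximum(X_attention @ p["W_1"], 0) @ p["W_2"]
    # 5) residual connection + LayerNorm again
    return layer_norm(X_attention + X_hidden)


rng = np.random.default_rng(0)
embed_dim = 8
p = {name: rng.normal(size=(embed_dim, embed_dim)) * 0.1
     for name in ["W_Q", "W_K", "W_V", "W_1", "W_2"]}
X = rng.normal(size=(2, 5, embed_dim))
X_hidden = encoder_layer(X, p)
print(X_hidden.shape)
```

The output has the same shape as the input, which is what allows several encoder layers to be stacked.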

## The complete Decoder process

### 1). Input data

• The embedding of the input y:
$$
X = EmbeddingLookup(X) + PositionalEncoding \\
X \in (batch\_size * seq\_len * embed\_dim)
$$

• The output of the encoder:
$$
h = encoder(x) \\
h \in (batch\_size * seq\_len * embed\_dim)
$$

```python
def get_pad(self, x):
    """
    :param x:
    """
    seq_len = x.size(1)
    for _ in range(2):
        ...


torch.triu(torch.ones(seq_len, seq_len).byte(), diagonal=1)

# Result for seq_len = 10:
# [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
# [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
# [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

```python
def get_att(head: int, seq_len: int):
    """
    :param seq_len: int
    :return:
    """
    # Upper-triangular matrix, diagonal not kept
    att = torch.triu(torch.ones(seq_len, seq_len).byte(), diagonal=1)
    for _ in range(2):
        # torch.squeeze() removes dimensions of size 1: (1, 3) ==> (3)
        # torch.unsqueeze() adds a dimension of size 1 at the given position: (3) ==> (1, 3)
        att = torch.unsqueeze(att, dim=0)
    # Tile it like roof tiles
```
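
A NumPy sketch of the same causal-mask construction. The listing above does not show the tiling step or the return value, so tiling along the head axis and the returned shape `(1, head, seq_len, seq_len)` are assumptions, and `causal_mask` is a hypothetical name.

```python
import numpy as np


def causal_mask(head, seq_len):
    """Return a (1, head, seq_len, seq_len) mask; 1 marks a future position to block."""
    # Upper-triangular matrix without the diagonal
    att = np.triu(np.ones((seq_len, seq_len), dtype=np.uint8), k=1)
    # Add two leading size-1 axes: (seq_len, seq_len) -> (1, 1, seq_len, seq_len)
    att = att[None, None, :, :]
    # Tile along the head axis (assumed); the batch axis is left to broadcasting
    return np.tile(att, (1, head, 1, 1))


mask = causal_mask(head=2, seq_len=4)
print(mask.shape)
print(mask[0, 0])
```

Row t of the mask has zeros up to and including column t, so position t may attend only to itself and to earlier positions.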


### 2). Stacked decoder layer structure

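• mask-self-attention + residual + LayerNorm. After masking, the meaning of y changes: each row represents the current token together with the tokens before it, that is, the semantics available at one time step. At time step 0 only the first token is available, while at the second time step the first two tokens are available. Each row of the mask_self_attention result therefore holds the semantic relations of one time step and represents the tokens translated so far.
$$
y_0 = mask\_self\_attention(y, y, m_y) \\
y = LayerNorm(y_0 + y)
$$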

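• self-attention + residual + LayerNorm. In every decoder layer, K and V are computed from the encoder output x, and Q is computed from the current y. The similarity between Q and K is computed and then applied to V to get the result. $Q_y, K_x, V_x$ here play the same role as attention in seq2seq: each time step of y takes inner products with all of x to find a weight for every x, then extracts the needed information from all of x. A single $Q_y$ already contains all decoder time steps. The result tells each time step which information to extract from the encoder. Before the heads are concatenated, $y_0$ has shape (batch_size, h, seq_len, embedding/h).
$$
y_0 = self\_attention(x, y, m_x) \\
y = LayerNorm(y_0 + y)
$$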

• Activation (feed-forward) layer:
$$
y_{0} = Linear(Activate(Linear(y))) \\
y = LayerNorm(y_0 + y)
$$
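
The attention over the encoder output in the second step is easiest to follow when the source and target lengths differ. The NumPy sketch below uses a single head and random weights; `src_len`, `tgt_len` and the weight names are illustrative assumptions.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
batch, src_len, tgt_len, embed = 2, 7, 4, 8

x = rng.normal(size=(batch, src_len, embed))   # encoder output
y = rng.normal(size=(batch, tgt_len, embed))   # decoder state after masked self-attention
W_Q, W_K, W_V = (rng.normal(size=(embed, embed)) for _ in range(3))

Q_y = y @ W_Q                  # queries come from the decoder
K_x, V_x = x @ W_K, x @ W_V    # keys and values come from the encoder

# One row per decoder position, one column per encoder position
weights = softmax(Q_y @ K_x.transpose(0, 2, 1) / np.sqrt(embed))
y_0 = weights @ V_x            # what each decoder position extracts from the encoder

print(weights.shape, y_0.shape)
```

The weight matrix is `(tgt_len, src_len)` per batch element, while the result keeps the decoder's sequence length.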

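```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class DecodeLayer(nn.Module):
    def __init__(self, embed_len, head):
        super(DecodeLayer, self).__init__()
        self.head = head
        # Index 0: masked self-attention; index 1: attention over the encoder output.
        # Each projection keeps the full embed_len; mul_att splits it into heads.
        self.qrys = nn.ModuleList([nn.Linear(embed_len, embed_len) for _ in range(2)])
        self.keys = nn.ModuleList([nn.Linear(embed_len, embed_len) for _ in range(2)])
        self.vals = nn.ModuleList([nn.Linear(embed_len, embed_len) for _ in range(2)])
        self.lal = nn.Sequential(nn.Linear(embed_len, embed_len),
                                 nn.ReLU(),
                                 nn.Linear(embed_len, embed_len))
        self.lns = nn.ModuleList([nn.LayerNorm(embed_len) for _ in range(3)])

    def mul_att(self, x, y, m, i):
        # x supplies K and V, y supplies Q
        # (batch, seq_len, h, embedding/h) -> (batch, h, seq_len, embedding/h)
        q = self.qrys[i](y).view(y.size(0), y.size(1), self.head, -1).transpose(1, 2)
        k = self.keys[i](x).view(x.size(0), x.size(1), self.head, -1).transpose(1, 2)
        v = self.vals[i](x).view(x.size(0), x.size(1), self.head, -1).transpose(1, 2)

        d = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        if m is not None:
            # Assumption: nonzero entries of m mark blocked positions and m
            # broadcasts to (batch, h, seq_len_y, seq_len_x)
            d = d.masked_fill(m.bool(), float("-inf"))
        a = F.softmax(d, dim=-1)

        # (batch, h, seq_len, seq_len) * (batch, h, seq_len, embedding/h)
        # => (batch, h, seq_len, embedding/h)
        # => (batch, seq_len, h, embedding/h)
        c = torch.matmul(a, v).transpose(1, 2)
        # Concatenate the heads: (batch, seq_len, embedding)
        c = c.contiguous().view(c.size(0), c.size(1), -1)
        return c

    def forward(self, y, x, my, mx):
        """
        :param y:  embedding with positional encoding added, (batch, seq_len, embedding)
        :param x:  output of the encoder, (batch, seq_len, embedding)
        :param my: mask for the masked self-attention over y
        :param mx: mask for the attention over the encoder output x
        :return:   (batch, seq_len, embedding)
        """
        r = y               # keep a copy for the residual connection
        y = self.mul_att(y, y, my, 0)
        y = self.lns[0](y + r)

        r = y
        y = self.mul_att(x, y, mx, 1)
        y = self.lns[1](y + r)

        r = y
        y = self.lal(y)
        return self.lns[2](y + r)
```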


### 3). Output

$$
logits = Linear(y) \\
probs = softmax(logits)
$$
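
A NumPy sketch of this output step, assuming a bias-free projection to a hypothetical vocabulary of size `vocab_size` and a greedy choice of the most probable token:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, embed, vocab_size = 2, 4, 8, 10

y = rng.normal(size=(batch, seq_len, embed))    # decoder output
W_out = rng.normal(size=(embed, vocab_size))    # hypothetical output projection

scores = y @ W_out                              # (batch, seq_len, vocab_size)
scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

next_tokens = probs.argmax(axis=-1)             # greedy choice per position
print(probs.shape, next_tokens.shape)
```

Each position gets a probability distribution over the vocabulary, and the distribution sums to 1.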
