To better understand the structure of the T5 model, this section walks through the model's overall execution flow.
The overall T5 execution flow
As T5 runs step by step, what mainly changes are the contents of key_states and value_states.
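Throughout this walkthrough the shapes correspond to t5-small (d_model = 512, 8 attention heads, 64 dims per head, 6 encoder and 6 decoder blocks) with batch size 1 and an 11-token input. A minimal setup sketch along those lines (the exact input sentence is my own assumption, not taken from the original run):
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Any sentence that tokenizes to 11 ids reproduces the (1, 11, ...) shapes traced below.
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
print(inputs.input_ids.shape)       # (1, sequence_length)
out = model.generate(inputs.input_ids, max_length=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))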
T5LayerSelfAttention in the 6 encoder blocks
The input hidden_states = (1,11,512). First, query_states is computed:
query_states = shape(self.q(hidden_states))
which gives
query_states = (1,8,11,64)
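For reference, the shape helper defined inside T5Attention does roughly the following (a sketch with the t5-small numbers written as literals):
import torch

hidden = torch.randn(1, 11, 512)    # e.g. the output of self.q(hidden_states)
n_heads, d_head = 8, 64

def shape(states):
    # (batch, seq_length, d_model) -> (batch, n_heads, seq_length, d_head)
    return states.view(1, -1, n_heads, d_head).transpose(1, 2)

print(shape(hidden).shape)          # torch.Size([1, 8, 11, 64])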
Then key_states and value_states are computed:
# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
The code being invoked here is:
def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
This gives key_states and value_states:
key_states = (1,8,11,64)
value_states = (1,8,11,64)
Next comes the position_bias computation:
............
else:
    position_bias = self.compute_bias(real_seq_length, key_length)
Note the arguments that self.compute_bias passes when it calls self._relative_position_bucket:
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)
In the encoder, bidirectional is passed as True; in the decoder it is False. Since this is the encoder, True is passed here.
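A toy illustration (not the library code) of the relative_position matrix that compute_bias builds before bucketing, and of what bidirectional changes:
import torch

query_length, key_length = 3, 3
context_position = torch.arange(query_length)[:, None]
memory_position = torch.arange(key_length)[None, :]
relative_position = memory_position - context_position   # shape (query_length, key_length)
print(relative_position)
# tensor([[ 0,  1,  2],
#         [-1,  0,  1],
#         [-2, -1,  0]])
# With bidirectional=True (encoder), positive and negative offsets are mapped to
# separate bucket ranges; with bidirectional=False (decoder), positive offsets
# (future positions) are collapsed to distance 0 before bucketing.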
The position_bias computed here has shape
position_bias = (1,8,11,11)
Next the mask is applied:
if mask is not None:
    position_bias = position_bias + mask
The mask here is either all zeros (when nothing is masked) or None, so we can leave it aside for now.
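A rough sketch of how that additive mask is built from a (batch, seq_len) attention_mask of ones and zeros (the -1e9 constant is my assumption; the library uses a similarly large negative value):
import torch

attention_mask = torch.tensor([[1, 1, 1, 0]])            # last position is padding
extended = attention_mask[:, None, None, :].float()      # (batch, 1, 1, key_length)
extended = (1.0 - extended) * -1e9                       # 0 where attended, -1e9 where masked
# position_bias + extended broadcasts over heads and query positions, so masked
# key positions get ~0 weight after the softmax.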
Then the rest of the code runs:
scores += position_bias
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
............
return outputs
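Putting the encoder self-attention together, a shape-only sketch with toy tensors (note that T5 does not scale the scores by sqrt(d_head)):
import torch

q = torch.randn(1, 8, 11, 64)
k = torch.randn(1, 8, 11, 64)
v = torch.randn(1, 8, 11, 64)
bias = torch.randn(1, 8, 11, 11)                          # position_bias (+ mask)

scores = torch.matmul(q, k.transpose(3, 2)) + bias        # (1, 8, 11, 11)
weights = torch.nn.functional.softmax(scores, dim=-1)     # (1, 8, 11, 11)
attn = torch.matmul(weights, v)                           # (1, 8, 11, 64)
out = attn.transpose(1, 2).contiguous().view(1, 11, 512)  # unshape -> (1, 11, 512)
print(out.shape)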
First call to T5LayerSelfAttention in the 6 decoder blocks
The input hidden_states = (1,1,512). Next:
query_states = shape(self.q(hidden_states))
which gives
query_states = (1,8,1,64)
Then key_states and value_states are computed:
# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
Inside the project function, the branch taken is:
if key_value_states is None:
    # self-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(hidden_states))
The input hidden_states is still (1,1,512); after the two linear projection layers, the resulting key_states and value_states are:
key_states = (1,8,1,64)
value_states = (1,8,1,64)
Then comes the position_bias computation:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
        # if self.gradient_checkpointing and self.training:
        #     position_bias.requires_grad = True
    else:
        position_bias = self.compute_bias(real_seq_length, key_length)

    # if key and values are already calculated
    # we want only the last query position bias
    if past_key_value is not None:
        position_bias = position_bias[:, :, -hidden_states.size(1) :, :]
Note the arguments that self.compute_bias passes when it calls self._relative_position_bucket:
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)
In the encoder, bidirectional is passed as True; in the decoder it is False. Since this is the decoder, False is passed here.
The position_bias computed here has shape
position_bias = (1,8,1,1)
Its contents on this first step are
position_bias =
tensor([[[[ 3.5000]],
[[ 0.4531]],
[[ 3.1875]],
[[ 0.9727]],
[[-5.4688]],
[[ 5.1875]],
[[ 2.1562]],
[[ 0.5391]]]])
Then position_bias is added to the scores and, after the usual remaining steps, the output is produced:
scores += position_bias
............
outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)
First call to T5LayerCrossAttention in the 6 decoder blocks
The call starts with:
batch_size, seq_length = hidden_states.shape[:2]
real_seq_length = seq_length
which gives
batch_size = 1, seq_length = 1, real_seq_length = 1
Then:
key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]
which gives key_length = 11
Next, query_states is computed:
query_states = shape(self.q(hidden_states))
which gives
query_states = (1,8,1,64)
Then key_states and value_states are computed:
# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
Here the input
key_value_states = (1,11,512)
is the output of the 6 encoder blocks; in this first T5LayerCrossAttention call, both key_states and value_states are derived from key_value_states:
elif past_key_value is None:
    # cross-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(key_value_states))
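A shape sketch of this branch with toy tensors: it is the encoder output, not the decoder hidden state, that gets projected into keys and values here.
import torch
import torch.nn as nn

encoder_output = torch.randn(1, 11, 512)          # key_value_states from the 6 encoder blocks
k_proj = nn.Linear(512, 512, bias=False)          # stands in for self.k

k = k_proj(encoder_output).view(1, -1, 8, 64).transpose(1, 2)
print(k.shape)                                    # torch.Size([1, 8, 11, 64])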
Next comes the position_bias computation:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
............
Here position_bias is an all-zero tensor.
Then come the usual operations:
scores += position_bias
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(scores)
......
Finally, the standard output:
present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)
Second call to T5LayerSelfAttention in the 6 decoder blocks
(Here "second" means that the 6 encoder T5LayerSelfAttention calls, and the 6 decoder T5LayerSelfAttention and T5LayerCrossAttention calls, have each already run once.)
This second pass happens after the first token has been predicted and the model moves to the next position. The past_key_value[0] passed in here is the key_states output by the same layer at the previous position, and past_key_value[1] is the corresponding value_states (for example, if this is the self-attention of the 4th decoder block in the second pass, then past_key_value comes from the self-attention of the 4th decoder block in the first pass).
Question: how is the previous pass's output handed to the next pass?
The hand-off happens in T5Stack:
for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)):
.............
The loop iterates over the entries of past_key_values and passes each one down into its block.
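A rough sketch of the cache layout being iterated over (my assumption about the structure, not the library code verbatim): each entry of past_key_values belongs to one decoder block and, for T5, carries four tensors, which the block splits between its self-attention and its cross-attention.
import torch

past_key_values = tuple(
    (torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64),    # self-attn K, V from the previous step
     torch.randn(1, 8, 11, 64), torch.randn(1, 8, 11, 64))  # cross-attn K, V over the encoder output
    for _ in range(6)
)
for i, past_key_value in enumerate(past_key_values):
    self_attn_past = past_key_value[:2]    # goes to T5LayerSelfAttention
    cross_attn_past = past_key_value[2:]   # goes to T5LayerCrossAttention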
Next we enter:
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
    else:
        # cross-attn
        hidden_states = past_key_value
If this is T5LayerSelfAttention, the first branch of the inner if is taken; if it is cross-attention, the else branch is taken.
For T5LayerSelfAttention, the following code runs inside project:
if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
............
return hidden_states
In this second pass, the outputs are:
key_states.size = torch.Size([1, 8, 2, 64])
value_states.size = torch.Size([1, 8, 2, 64])
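A quick, runnable illustration of how the cached key grows by one position per step via the torch.cat above (toy tensors):
import torch

past_key = torch.randn(1, 8, 1, 64)    # key_states cached from the previous step
new_key = torch.randn(1, 8, 1, 64)     # key projection of the current token
key_states = torch.cat([past_key, new_key], dim=2)
print(key_states.shape)                # torch.Size([1, 8, 2, 64])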
Next the scores are computed:
# compute scores
scores = torch.matmul(
    query_states, key_states.transpose(3, 2)
)  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9
which gives
scores = torch.Size([1, 8, 1, 2])
Next, look at the position_bias computation:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
        if self.gradient_checkpointing and self.training:
            position_bias.requires_grad = True
    else:
        position_bias = self.compute_bias(real_seq_length, key_length)
Note the arguments that self.compute_bias passes when it calls self._relative_position_bucket:
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)
In the encoder, bidirectional is passed as True; in the decoder it is False. Since this is the decoder, False is passed here.
The position_bias computed here (compute_bias with real_seq_length = 2 and key_length = 2) has shape
position_bias = (1,8,2,2)
The next step is annotated in the code with the comment:
"if key and values are already calculated, we want only the last query position bias".
The corresponding code is:
if past_key_value is not None:
    position_bias = position_bias[:, :, -hidden_states.size(1) :, :]
Note that the slice keeps only the last query position along the third dimension; after slicing, the result is
position_bias = torch.Size([1, 8, 1, 2])
This extends the earlier position_bias with one more key position; for example, the earlier position_bias was
position_bias =
tensor([[[[ 3.5000]],
[[ 0.4531]],
[[ 3.1875]],
[[ 0.9727]],
[[-5.4688]],
[[ 5.1875]],
[[ 2.1562]],
[[ 0.5391]]]])
and the position_bias now is
position_bias =
tensor([[[[ 3.9844, 3.5000]],
[[ 1.2266, 0.4531]],
[[ 4.3438, 3.1875]],
[[ 2.0312, 0.9727]],
[[ 0.7969, -5.4688]],
[[ 4.9375, 5.1875]],
[[ 4.7500, 2.1562]],
[[ 4.5000, 0.5391]]]])
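A toy check of the slice applied above: compute_bias(2, 2) yields a (1, 8, 2, 2) bias, and keeping only the last query row leaves (1, 8, 1, 2).
import torch

position_bias = torch.randn(1, 8, 2, 2)          # compute_bias(real_seq_length=2, key_length=2)
query_len = 1                                    # hidden_states.size(1) at this decoding step
position_bias = position_bias[:, :, -query_len:, :]
print(position_bias.shape)                       # torch.Size([1, 8, 1, 2])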
Then the following runs:
scores += position_bias
#scores = (1,8,1,2)
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.dropout, training=self.training
)  # (batch_size, n_heads, seq_length, key_length)

# Mask heads if we want to
if layer_head_mask is not None:
    attn_weights = attn_weights * layer_head_mask
Up to this point the shape is still (1,8,1,2).
Next:
attn_output = unshape(torch.matmul(attn_weights, value_states))
Here attn_weights = (1,8,1,2) and value_states = (1,8,2,64); their product is (1,8,1,64),
and unshape turns it into attn_output = (1,1,512).
Then the output projection is applied:
attn_output = self.o(attn_output)
which gives
attn_output = (1,1,512)
Second call to T5LayerCrossAttention in the 6 decoder blocks
The initial steps are the same as before:
batch_size, seq_length = hidden_states.shape[:2]
real_seq_length = seq_length
Here batch_size = 1, seq_length = 1, real_seq_length = 1.
Then:
key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]
which gives
key_length = 11
The only difference lies in how key_states and value_states are obtained:
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
The past_key_value[0] and past_key_value[1] passed in here are the results produced by the same layer in the previous pass: past_key_value[0] is the key_states output by this layer at the previous position and past_key_value[1] is the corresponding value_states (the same pattern described for the self-attention above).
Next we step into the project function:
def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:
        # cross-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(key_value_states))

    if past_key_value is not None:
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, key_length, dim_per_head)
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        else:
            # cross-attn
            hidden_states = past_key_value
    return hidden_states
Execution goes straight to the final else branch:
hidden_states = past_key_value
which gives hidden_states = torch.Size([1, 8, 11, 64]).
Next comes the position_bias computation; note that the cross-attention position_bias is always zero, since has_relative_attention_bias is False there:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
The resulting position_bias is
position_bias =
tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]])
To summarize the branches inside project: the first if handles the first self-attention call (in both the encoder and the decoder); the elif handles the first cross-attention call; within the second if, the inner if handles the 2nd through n-th self-attention calls and the inner else handles the 2nd through n-th cross-attention calls.
The remaining operations are much the same as before:
(1,8,1,64)*(1,8,64,11) = (1,8,1,11)
(1,8,1,11)*(1,8,11,64) = (1,8,1,64)
if position_bias is None:
......
if mask is not None:
......
Here the mask is not None; it is worth a closer look.
Appendix
Differences between the T5 model and the mT5 model
Adapted from 科学空间 (Scientific Spaces)
Original article link
After its release last October, T5 quietly received a minor upgrade this year; the details can be found on GitHub. The pre-upgrade version is officially called T5.1.0 and the upgraded one T5.1.1. The main change comes from the paper "GLU Variants Improve Transformer", which borrows the GLU (Gated Linear Unit) of "Language Modeling with Gated Convolutional Networks" to strengthen the FFN part. Concretely, the original T5 FFN was (T5 uses no biases)
$$\text{FFN}(x) = \text{relu}(xW_1)W_2 \tag{1}$$
which has now been changed to
$$\text{FFN}_{\text{GEGLU}}(x) = \bigl(\text{gelu}(xW_1) \otimes xW_2\bigr)W_3 \tag{2}$$
In other words, the first transformation, previously relu-activated, becomes a gelu-activated gated linear unit. This adds 50% more parameters to the FFN layer, but the paper reports a clear improvement in quality. T5.1.1 also changes the embedding layers: in T5.1.0 the encoder embedding, the decoder embedding, and the softmax layer that produces the decoder's output distribution all shared a single embedding matrix; in T5.1.1 only the encoder and decoder embeddings are shared, while the output softmax uses an independent embedding matrix. This increases the parameter count considerably, but Google concludes that it works better, a finding summarized in the recent paper "Rethinking embedding coupling in pre-trained language models". One final change: T5.1.1 removes dropout during pre-training and only uses dropout during downstream fine-tuning.
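A minimal sketch of the two FFN variants described above (the class and attribute names here are my own shorthand, not the exact HF classes quoted below):
import torch
import torch.nn as nn

class DenseReluDense(nn.Module):
    """T5 1.0 style: FFN(x) = relu(x W1) W2, no biases."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.wo(torch.relu(self.wi(x)))

class DenseGatedGeluDense(nn.Module):
    """T5 1.1 style: FFN_GEGLU(x) = (gelu(x W1) * x W2) W3."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.wo(nn.functional.gelu(self.wi_0(x)) * self.wi_1(x))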
This difference shows up in the configuration handled by T5LayerFF:
class T5LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = T5DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = T5DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{self.config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )
        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states
"relu" corresponds to T5 1.0, while "gated-gelu" corresponds to T5 1.1.
How decoder_input_ids is derived in T5ForConditionalGeneration
if labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
    decoder_input_ids = self._shift_right(labels)
First, inside the generate function, decoder_input_ids starts out as tensor([[0]]), so the statement above is not triggered.
Second, if labels = torch.tensor([0,1,2]) and decoder_input_ids is None, the statement above runs and produces
decoder_input_ids = torch.tensor([0,0,1])
Stepping into the _shift_right function:
def _shift_right(self, input_ids):
    decoder_start_token_id = self.config.decoder_start_token_id
    pad_token_id = self.config.pad_token_id
which gives
decoder_start_token_id = 0, pad_token_id = 0
Next comes the part that shifts the inputs to the right:
# shift inputs to the right
if is_torch_fx_proxy(input_ids):
    # Item assignment is not supported natively for proxies.
    shifted_input_ids = torch.full(input_ids.shape[:-1] + (1,), decoder_start_token_id)
    shifted_input_ids = torch.cat([shifted_input_ids, input_ids[..., :-1]], dim=-1)
else:
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
    shifted_input_ids[..., 0] = decoder_start_token_id
Looking at the is_torch_fx_proxy function in file_utils.py:
def is_torch_fx_proxy(x):
    if is_torch_fx_available():
        import torch.fx

        return isinstance(x, torch.fx.Proxy)
    return False
And the is_torch_fx_available function:
def is_torch_fx_available():
    return _torch_fx_available
_torch_fx_available = _torch_onnx_dict_inputs_support_available = False
We will not dig into this further; since the proxy branch is not taken, execution goes to the else branch, which shifts the tensor to the right and writes the start token 0 into the first position:
shifted_input_ids = input_ids.new_zeros(input_ids.shape)
shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
shifted_input_ids[..., 0] = decoder_start_token_id
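A standalone check of the shift, matching the labels = torch.tensor([0,1,2]) example above (decoder_start_token_id assumed to be 0, as reported earlier):
import torch

labels = torch.tensor([0, 1, 2])
decoder_start_token_id = 0

shifted_input_ids = labels.new_zeros(labels.shape)
shifted_input_ids[..., 1:] = labels[..., :-1].clone()
shifted_input_ids[..., 0] = decoder_start_token_id
print(shifted_input_ids)               # tensor([0, 0, 1])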