To better understand the structure of the T5 model, this section walks through the model's overall execution flow.
The overall T5 execution flow
As T5 runs step by step, what mainly changes are the contents of key_states and value_states.
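Throughout this walkthrough the shapes correspond to t5-small (d_model = 512, 8 attention heads, 64 dims per head, 6 encoder and 6 decoder blocks) with batch size 1 and an 11-token input. A minimal setup sketch along those lines (the exact input sentence is my own assumption, not taken from the original run):
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Any sentence that tokenizes to 11 ids reproduces the (1, 11, ...) shapes traced below.
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
print(inputs.input_ids.shape)       # (1, sequence_length)
out = model.generate(inputs.input_ids, max_length=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))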
T5LayerSelfAttention in the 6 encoder blocks
The input hidden_states = (1,11,512). First, query_states is computed:
query_states = shape(self.q(hidden_states))
which gives
query_states = (1,8,11,64)
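For reference, the shape helper defined inside T5Attention does roughly the following (a sketch with the t5-small numbers written as literals):
import torch

hidden = torch.randn(1, 11, 512)    # e.g. the output of self.q(hidden_states)
n_heads, d_head = 8, 64

def shape(states):
    # (batch, seq_length, d_model) -> (batch, n_heads, seq_length, d_head)
    return states.view(1, -1, n_heads, d_head).transpose(1, 2)

print(shape(hidden).shape)          # torch.Size([1, 8, 11, 64])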
Then key_states and value_states are computed:
# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
The code being invoked here is:
def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
This gives key_states and value_states:
key_states = (1,8,11,64)
value_states = (1,8,11,64)
Next comes the position_bias computation:
............
else:
    position_bias = self.compute_bias(real_seq_length, key_length)
Note the arguments that self.compute_bias passes when it calls self._relative_position_bucket:
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)
In the encoder, bidirectional is passed as True; in the decoder it is False. Since this is the encoder, True is passed here.
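A toy illustration (not the library code) of the relative_position matrix that compute_bias builds before bucketing, and of what bidirectional changes:
import torch

query_length, key_length = 3, 3
context_position = torch.arange(query_length)[:, None]
memory_position = torch.arange(key_length)[None, :]
relative_position = memory_position - context_position   # shape (query_length, key_length)
print(relative_position)
# tensor([[ 0,  1,  2],
#         [-1,  0,  1],
#         [-2, -1,  0]])
# With bidirectional=True (encoder), positive and negative offsets are mapped to
# separate bucket ranges; with bidirectional=False (decoder), positive offsets
# (future positions) are collapsed to distance 0 before bucketing.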
The position_bias computed here has shape
position_bias = (1,8,11,11)
Next the mask is applied:
if mask is not None:
    position_bias = position_bias + mask
The mask here is either all zeros (when nothing is masked) or None, so we can leave it aside for now.
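A rough sketch of how that additive mask is built from a (batch, seq_len) attention_mask of ones and zeros (the -1e9 constant is my assumption; the library uses a similarly large negative value):
import torch

attention_mask = torch.tensor([[1, 1, 1, 0]])            # last position is padding
extended = attention_mask[:, None, None, :].float()      # (batch, 1, 1, key_length)
extended = (1.0 - extended) * -1e9                       # 0 where attended, -1e9 where masked
# position_bias + extended broadcasts over heads and query positions, so masked
# key positions get ~0 weight after the softmax.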
Then the rest of the code runs:
scores += position_bias
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
............
return outputs
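Putting the encoder self-attention together, a shape-only sketch with toy tensors (note that T5 does not scale the scores by sqrt(d_head)):
import torch

q = torch.randn(1, 8, 11, 64)
k = torch.randn(1, 8, 11, 64)
v = torch.randn(1, 8, 11, 64)
bias = torch.randn(1, 8, 11, 11)                          # position_bias (+ mask)

scores = torch.matmul(q, k.transpose(3, 2)) + bias        # (1, 8, 11, 11)
weights = torch.nn.functional.softmax(scores, dim=-1)     # (1, 8, 11, 11)
attn = torch.matmul(weights, v)                           # (1, 8, 11, 64)
out = attn.transpose(1, 2).contiguous().view(1, 11, 512)  # unshape -> (1, 11, 512)
print(out.shape)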
First call to T5LayerSelfAttention in the 6 decoder blocks
The input hidden_states = (1,1,512). Next:
query_states = shape(self.q(hidden_states))
which gives
query_states = (1,8,1,64)
Then key_states and value_states are computed:
# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
Inside the project function, the branch taken is:
if key_value_states is None:
    # self-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(hidden_states))
The input hidden_states is still (1,1,512); after the two linear projection layers, the resulting key_states and value_states are:
key_states = (1,8,1,64)
value_states = (1,8,1,64)
Then comes the position_bias computation:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
        # if self.gradient_checkpointing and self.training:
        #     position_bias.requires_grad = True
    else:
        position_bias = self.compute_bias(real_seq_length, key_length)

    # if key and values are already calculated
    # we want only the last query position bias
    if past_key_value is not None:
        position_bias = position_bias[:, :, -hidden_states.size(1) :, :]
Note the arguments that self.compute_bias passes when it calls self._relative_position_bucket:
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)
In the encoder, bidirectional is passed as True; in the decoder it is False. Since this is the decoder, False is passed here.
The position_bias computed here has shape
position_bias = (1,8,1,1)
Its contents on this first step are
position_bias =
tensor([[[[ 3.5000]],
[[ 0.4531]],
[[ 3.1875]],
[[ 0.9727]],
[[-5.4688]],
[[ 5.1875]],
[[ 2.1562]],
[[ 0.5391]]]])
Then position_bias is added to the scores and, after the usual remaining steps, the output is produced:
scores += position_bias
............
outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)
First call to T5LayerCrossAttention in the 6 decoder blocks
The call starts with:
batch_size, seq_length = hidden_states.shape[:2]
real_seq_length = seq_length
which gives
batch_size = 1, seq_length = 1, real_seq_length = 1
Then:
key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]
which gives key_length = 11
Next, query_states is computed:
query_states = shape(self.q(hidden_states))
which gives
query_states = (1,8,1,64)
Then key_states and value_states are computed:
# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
Here the input
key_value_states = (1,11,512)
is the output of the 6 encoder blocks; in this first T5LayerCrossAttention call, both key_states and value_states are derived from key_value_states:
elif past_key_value is None:
    # cross-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(key_value_states))
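A shape sketch of this branch with toy tensors: it is the encoder output, not the decoder hidden state, that gets projected into keys and values here.
import torch
import torch.nn as nn

encoder_output = torch.randn(1, 11, 512)          # key_value_states from the 6 encoder blocks
k_proj = nn.Linear(512, 512, bias=False)          # stands in for self.k

k = k_proj(encoder_output).view(1, -1, 8, 64).transpose(1, 2)
print(k.shape)                                    # torch.Size([1, 8, 11, 64])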
Next comes the position_bias computation:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
............
Here position_bias is an all-zero tensor.
Then come the usual operations:
scores += position_bias
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(scores)
......
Finally, the standard output:
present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)
Second call to T5LayerSelfAttention in the 6 decoder blocks
(Here "second" means that the 6 encoder T5LayerSelfAttention calls, and the 6 decoder T5LayerSelfAttention and T5LayerCrossAttention calls, have each already run once.)
This second pass happens after the first token has been predicted and the model moves to the next position. The past_key_value[0] passed in here is the key_states output by the same layer at the previous position, and past_key_value[1] is the corresponding value_states (for example, if this is the self-attention of the 4th decoder block in the second pass, then past_key_value comes from the self-attention of the 4th decoder block in the first pass).
Question: how is the previous pass's output handed to the next pass?
The hand-off happens in T5Stack:
for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)):
.............
The loop iterates over the entries of past_key_values and passes each one down into its block.
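A rough sketch of the cache layout being iterated over (my assumption about the structure, not the library code verbatim): each entry of past_key_values belongs to one decoder block and, for T5, carries four tensors, which the block splits between its self-attention and its cross-attention.
import torch

past_key_values = tuple(
    (torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64),    # self-attn K, V from the previous step
     torch.randn(1, 8, 11, 64), torch.randn(1, 8, 11, 64))  # cross-attn K, V over the encoder output
    for _ in range(6)
)
for i, past_key_value in enumerate(past_key_values):
    self_attn_past = past_key_value[:2]    # goes to T5LayerSelfAttention
    cross_attn_past = past_key_value[2:]   # goes to T5LayerCrossAttention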
Next we enter:
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
    else:
        # cross-attn
        hidden_states = past_key_value
If this is T5LayerSelfAttention, the first branch of the inner if is taken; if it is cross-attention, the else branch is taken.
For T5LayerSelfAttention, the following code runs inside project:
if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
............
return hidden_states
In this second pass, the outputs are:
key_states.size = torch.Size([1, 8, 2, 64])
value_states.size = torch.Size([1, 8, 2, 64])
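A quick, runnable illustration of how the cached key grows by one position per step via the torch.cat above (toy tensors):
import torch

past_key = torch.randn(1, 8, 1, 64)    # key_states cached from the previous step
new_key = torch.randn(1, 8, 1, 64)     # key projection of the current token
key_states = torch.cat([past_key, new_key], dim=2)
print(key_states.shape)                # torch.Size([1, 8, 2, 64])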
Next the scores are computed:
# compute scores
scores = torch.matmul(
    query_states, key_states.transpose(3, 2)
)  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9
which gives
scores = torch.Size([1, 8, 1, 2])
Next, look at the position_bias computation:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
        if self.gradient_checkpointing and self.training:
            position_bias.requires_grad = True
    else:
        position_bias = self.compute_bias(real_seq_length, key_length)
Note the arguments that self.compute_bias passes when it calls self._relative_position_bucket:
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)
In the encoder, bidirectional is passed as True; in the decoder it is False. Since this is the decoder, False is passed here.
The position_bias computed here (compute_bias with real_seq_length = 2 and key_length = 2) has shape
position_bias = (1,8,2,2)
The next step is annotated in the code with the comment:
"if key and values are already calculated, we want only the last query position bias".
The corresponding code is:
if past_key_value is not None:
    position_bias = position_bias[:, :, -hidden_states.size(1) :, :]
Note that the slice keeps only the last query position along the third dimension; after slicing, the result is
position_bias = torch.Size([1, 8, 1, 2])
This extends the earlier position_bias with one more key position; for example, the earlier position_bias was
position_bias =
tensor([[[[ 3.5000]],
[[ 0.4531]],
[[ 3.1875]],
[[ 0.9727]],
[[-5.4688]],
[[ 5.1875]],
[[ 2.1562]],
[[ 0.5391]]]])
and the position_bias now is
position_bias =
tensor([[[[ 3.9844, 3.5000]],
[[ 1.2266, 0.4531]],
[[ 4.3438, 3.1875]],
[[ 2.0312, 0.9727]],
[[ 0.7969, -5.4688]],
[[ 4.9375, 5.1875]],
[[ 4.7500, 2.1562]],
[[ 4.5000, 0.5391]]]])
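A toy check of the slice applied above: compute_bias(2, 2) yields a (1, 8, 2, 2) bias, and keeping only the last query row leaves (1, 8, 1, 2).
import torch

position_bias = torch.randn(1, 8, 2, 2)          # compute_bias(real_seq_length=2, key_length=2)
query_len = 1                                    # hidden_states.size(1) at this decoding step
position_bias = position_bias[:, :, -query_len:, :]
print(position_bias.shape)                       # torch.Size([1, 8, 1, 2])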
Then the following runs:
scores += position_bias
#scores = (1,8,1,2)
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.dropout, training=self.training
)  # (batch_size, n_heads, seq_length, key_length)

# Mask heads if we want to
if layer_head_mask is not None:
    attn_weights = attn_weights * layer_head_mask
Up to this point the shape is still (1,8,1,2).
Next:
attn_output = unshape(torch.matmul(attn_weights, value_states))
Here attn_weights = (1,8,1,2) and value_states = (1,8,2,64); their product is (1,8,1,64),
and unshape turns it into attn_output = (1,1,512).
Then the output projection is applied:
attn_output = self.o(attn_output)
which gives
attn_output = (1,1,512)
Second call to T5LayerCrossAttention in the 6 decoder blocks
The initial steps are the same as before:
batch_size, seq_length = hidden_states.shape[:2]
real_seq_length = seq_length
Here batch_size = 1, seq_length = 1, real_seq_length = 1.
Then:
key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]
which gives
key_length = 11
The only difference lies in how key_states and value_states are obtained:
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
The past_key_value[0] and past_key_value[1] passed in here are the results produced by the same layer in the previous pass: past_key_value[0] is the key_states output by this layer at the previous position and past_key_value[1] is the corresponding value_states (the same pattern described for the self-attention above).
Next we step into the project function:
def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:
        # cross-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(key_value_states))

    if past_key_value is not None:
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, key_length, dim_per_head)
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        else:
            # cross-attn
            hidden_states = past_key_value
    return hidden_states
Execution goes straight to the final else branch:
hidden_states = past_key_value
which gives hidden_states = torch.Size([1, 8, 11, 64]).
Next comes the position_bias computation; note that the cross-attention position_bias is always zero, since has_relative_attention_bias is False there:
if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
The resulting position_bias is
position_bias =
tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]])
To summarize the branches inside project: the first if handles the first self-attention call (in both the encoder and the decoder); the elif handles the first cross-attention call; within the second if, the inner if handles the 2nd through n-th self-attention calls and the inner else handles the 2nd through n-th cross-attention calls.
The remaining operations are much the same as before:
(1,8,1,64)*(1,8,64,11) = (1,8,1,11)
(1,8,1,11)*(1,8,11,64) = (1,8,1,64)
if position_bias is None:
......
if mask is not None:
......
Here the mask is not None; it is worth a closer look.
Appendix
Differences between the T5 model and the mT5 model
Adapted from 科学空间 (Scientific Spaces)
Original article link
After its release last October, T5 quietly received a minor upgrade this year; the details can be found on GitHub. The pre-upgrade version is officially called T5.1.0 and the upgraded one T5.1.1. The main change comes from the paper "GLU Variants Improve Transformer", which borrows the GLU (Gated Linear Unit) of "Language Modeling with Gated Convolutional Networks" to strengthen the FFN part. Concretely, the original T5 FFN was (T5 uses no biases)
$$\text{FFN}(x) = \text{relu}(xW_1)W_2 \tag{1}$$
which has now been changed to
$$\text{FFN}_{\text{GEGLU}}(x) = \bigl(\text{gelu}(xW_1) \otimes xW_2\bigr)W_3 \tag{2}$$
In other words, the first transformation, previously relu-activated, becomes a gelu-activated gated linear unit. This adds 50% more parameters to the FFN layer, but the paper reports a clear improvement in quality. T5.1.1 also changes the embedding layers: in T5.1.0 the encoder embedding, the decoder embedding, and the softmax layer that produces the decoder's output distribution all shared a single embedding matrix; in T5.1.1 only the encoder and decoder embeddings are shared, while the output softmax uses an independent embedding matrix. This increases the parameter count considerably, but Google concludes that it works better, a finding summarized in the recent paper "Rethinking embedding coupling in pre-trained language models". One final change: T5.1.1 removes dropout during pre-training and only uses dropout during downstream fine-tuning.
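A minimal sketch of the two FFN variants described above (the class and attribute names here are my own shorthand, not the exact HF classes quoted below):
import torch
import torch.nn as nn

class DenseReluDense(nn.Module):
    """T5 1.0 style: FFN(x) = relu(x W1) W2, no biases."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.wo(torch.relu(self.wi(x)))

class DenseGatedGeluDense(nn.Module):
    """T5 1.1 style: FFN_GEGLU(x) = (gelu(x W1) * x W2) W3."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.wo(nn.functional.gelu(self.wi_0(x)) * self.wi_1(x))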
This difference shows up in the configuration handled by T5LayerFF:
class T5LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = T5DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = T5DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{self.config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )
        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states
"relu" corresponds to T5 1.0, while "gated-gelu" corresponds to T5 1.1.
How decoder_input_ids is derived in T5ForConditionalGeneration
if labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
    decoder_input_ids = self._shift_right(labels)
First, inside the generate function, decoder_input_ids starts out as tensor([[0]]), so the statement above is not triggered.
Second, if labels = torch.tensor([0,1,2]) and decoder_input_ids is None, the statement above runs and produces
decoder_input_ids = torch.tensor([0,0,1])
Stepping into the _shift_right function:
def _shift_right(self, input_ids):
    decoder_start_token_id = self.config.decoder_start_token_id
    pad_token_id = self.config.pad_token_id
which gives
decoder_start_token_id = 0, pad_token_id = 0
Next comes the part that shifts the inputs to the right:
# shift inputs to the right
if is_torch_fx_proxy(input_ids):
    # Item assignment is not supported natively for proxies.
    shifted_input_ids = torch.full(input_ids.shape[:-1] + (1,), decoder_start_token_id)
    shifted_input_ids = torch.cat([shifted_input_ids, input_ids[..., :-1]], dim=-1)
else:
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
    shifted_input_ids[..., 0] = decoder_start_token_id
Looking at the is_torch_fx_proxy function in file_utils.py:
def is_torch_fx_proxy(x):
    if is_torch_fx_available():
        import torch.fx

        return isinstance(x, torch.fx.Proxy)
    return False
And the is_torch_fx_available function:
def is_torch_fx_available():
    return _torch_fx_available
_torch_fx_available = _torch_onnx_dict_inputs_support_available = False
We will not dig into this further; since the proxy branch is not taken, execution goes to the else branch, which shifts the tensor to the right and writes the start token 0 into the first position:
shifted_input_ids = input_ids.new_zeros(input_ids.shape)
shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
shifted_input_ids[..., 0] = decoder_start_token_id
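A standalone check of the shift, matching the labels = torch.tensor([0,1,2]) example above (decoder_start_token_id assumed to be 0, as reported earlier):
import torch

labels = torch.tensor([0, 1, 2])
decoder_start_token_id = 0

shifted_input_ids = labels.new_zeros(labels.shape)
shifted_input_ids[..., 1:] = labels[..., :-1].clone()
shifted_input_ids[..., 0] = decoder_start_token_id
print(shifted_input_ids)               # tensor([0, 0, 1])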