Summary: the overall structure and flow of T5 in transformers

To better understand the structure of the T5 model, this post lays out the overall flow of the model.

The overall T5 flow

As T5 runs, the values that mainly change from step to step are key_states and value_states.
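
To make the later walkthrough concrete, here is a minimal sketch of that flow: the encoder runs once, and the decoder is then called position by position, feeding the cached key/value states back in. The checkpoint name t5-small and the example sentence are assumptions chosen to match the shapes used below (d_model = 512, 8 heads of 64 dims); they are not from the original post.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
encoder_outputs = model.encoder(input_ids=enc.input_ids)   # the 6 encoder blocks run exactly once

decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])  # tensor([[0]])
past_key_values = None
for _ in range(5):                                          # a few greedy decoding steps
    out = model(
        encoder_outputs=encoder_outputs,
        decoder_input_ids=decoder_input_ids[:, -1:],        # only the newest decoder position
        past_key_values=past_key_values,
        use_cache=True,
    )
    past_key_values = out.past_key_values                   # the key_states/value_states cache per layer
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)

Each loop iteration corresponds to one "pass" through the 6 decoder blocks described below.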

T5LayerSelfAttention in the 6 encoder blocks

The input hidden_states has shape (1, 11, 512).
First query_states is computed:

query_states = shape(self.q(hidden_states))

This yields

query_states = (1,8,11,64)
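
The shape helper used above splits the 512-dim projection into 8 heads of 64 dims and moves the head axis forward. A small sketch consistent with those shapes (the tensor here is random, purely to show the reshape):

import torch

batch_size, seq_length, n_heads, d_kv = 1, 11, 8, 64
projected = torch.randn(batch_size, seq_length, n_heads * d_kv)   # output of self.q(hidden_states), (1, 11, 512)
query_states = projected.view(batch_size, -1, n_heads, d_kv).transpose(1, 2)
print(query_states.shape)                                         # torch.Size([1, 8, 11, 64])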

Then key_states and value_states are computed:

# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

The code invoked here is:

def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))

This produces key_states and value_states:

key_states = (1,8,11,64)
value_states = (1,8,11,64)

Next comes the computation of position_bias:

............
else:
    position_bias = self.compute_bias(real_seq_length, key_length)

Note the arguments that self.compute_bias passes to self._relative_position_bucket:

relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)

The bidirectional argument is True in the encoder and False in the decoder; since this is the encoder, True is passed here.
The position_bias computed here has shape

position_bias = (1, 8, 11, 11)
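
For intuition, a hedged sketch of the quantity that compute_bias buckets and embeds (the variable names mirror the ones used inside compute_bias):

import torch

query_length = key_length = 11
context_position = torch.arange(query_length)[:, None]
memory_position = torch.arange(key_length)[None, :]
relative_position = memory_position - context_position   # (11, 11), values from -10 to 10
# _relative_position_bucket maps these integers to bucket ids; with bidirectional=True
# (encoder) positive and negative offsets get separate buckets, with bidirectional=False
# (decoder) offsets to the right are clamped so only the causal past is distinguished.
# The bucket ids index an nn.Embedding(num_buckets, n_heads); after a permute the bias
# has shape (1, n_heads, query_length, key_length) = (1, 8, 11, 11).
print(relative_position.shape)                            # torch.Size([11, 11])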

Next the mask is applied:

if mask is not None:
	position_bias = position_bias+mask

The mask here is either all zeros or None, so it can be ignored for now.
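
For reference, a rough sketch (not the exact transformers code) of how that additive mask is derived from a (batch, seq) attention mask; with no padding it really is all zeros:

import torch

attention_mask = torch.ones(1, 11)                 # 1 = keep the token, 0 = padding; no padding here
extended_mask = attention_mask[:, None, None, :]   # (1, 1, 1, 11), broadcastable over heads and queries
extended_mask = (1.0 - extended_mask) * -1e9       # 0 where kept, a large negative number where padded
print(torch.count_nonzero(extended_mask))          # tensor(0) -> the "all zeros" case noted above
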
Then the rest of the code runs:

scores += position_bias
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
............
return outputs

First call to T5LayerSelfAttention in the 6 decoder blocks

The input hidden_states has shape (1, 1, 512). Next:

query_states = shape(self.q(hidden_states))

which gives

query_states = (1,8,1,64)

Then key_states and value_states:

# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

Inside the project function this branch is taken:

if key_value_states is None:
    # self-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(hidden_states))

The input hidden_states is still (1, 1, 512); after the self.k and self.v linear layers, key_states and value_states come out as

key_states = (1,8,1,64)
value_states = (1,8,1,64)

Then we enter the position_bias computation:

if position_bias is None:
   if not self.has_relative_attention_bias:
       position_bias = torch.zeros(
           (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
       )
       #if self.gradient_checkpointing and self.training:
       #    position_bias.requires_grad = True
   else:
       position_bias = self.compute_bias(real_seq_length, key_length)

       # if key and values are already calculated
       # we want only the last query position bias
       if past_key_value is not None:
           position_bias = position_bias[:, :, -hidden_states.size(1) :, :]

Again, note the arguments that self.compute_bias passes to self._relative_position_bucket:

relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)

The bidirectional argument is True in the encoder and False in the decoder; since this is the decoder, False is passed here.
The position_bias computed here has shape

position_bias = (1, 8, 1, 1)

The position_bias computed at this first step is

position_bias = 
tensor([[[[ 3.5000]],
         [[ 0.4531]],
         [[ 3.1875]],
         [[ 0.9727]],
         [[-5.4688]],
         [[ 5.1875]],
         [[ 2.1562]],
         [[ 0.5391]]]])

Then position_bias is added to the scores, and after the usual remaining operations the layer produces its outputs:

scores += position_bias
............
outputs = (attn_output,)+(present_key_value_state,)+(position_bias,)

First call to T5LayerCrossAttention in the 6 decoder blocks

The call starts with:

batch_size,seq_length = hidden_states.shape[:2]
real_seq_length = seq_length

giving

batch_size = 1, seq_length = 1, real_seq_length = 1

Then:

key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]

which gives key_length = 11.
Next, query_states:

query_states = shape(self.q(hidden_states))

giving

query_states = (1, 8, 1, 64)

Then key_states and value_states:

# get key/value states
key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

Here the input

key_value_states = (1, 11, 512)

is the output of the 6 encoder blocks. In this first T5LayerCrossAttention call, both key_states and value_states are projected from key_value_states:

elif past_key_value is None:
    # cross-attn
    # (batch_size, n_heads, seq_length, dim_per_head)
    hidden_states = shape(proj_layer(key_value_states))

Then position_bias:

if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )
        ............

Here position_bias is an all-zero tensor.
Then the usual operations follow:

scores += position_bias
attn_weights = nn.functional.softmax(scores.float(),dim=-1).type_as(scores)
......

Finally, the usual outputs:

present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)

Second call to T5LayerSelfAttention in the 6 decoder blocks

(By the second call we mean that the 6 encoder T5LayerSelfAttention modules, as well as the 6 decoder T5LayerSelfAttention and T5LayerCrossAttention modules, have each already run once.)
This second call corresponds to the second decoding position, after the first token has been predicted. The past_key_value[0] used here is the key_states output by the same layer at the previous position, and past_key_value[1] is the corresponding value_states. For example, if we are now in the self-attention of the 4th decoder block during the second pass, the cache comes from the self-attention of the 4th decoder block during the first pass.
Question: how is the output of one pass handed over to the next pass?
The hand-off happens in T5Stack:

for i, (layer_module, past_key_value) in enumerate(zip(self.block, past_key_values)):
	.............

This loop iterates over past_key_values and hands each layer's cached entry down into that layer.
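
For reference, a hedged sketch of what that cache looks like after the first decoding step (t5-small and the example sentence are illustrative assumptions): one entry per decoder block, and each entry holds the self-attention key/value plus the cross-attention key/value of that block.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
enc = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
out = model(input_ids=enc.input_ids, decoder_input_ids=torch.tensor([[0]]), use_cache=True)

past_key_values = out.past_key_values
print(len(past_key_values))                       # 6 -> one cache entry per decoder block
self_k, self_v, cross_k, cross_v = past_key_values[0]
print(self_k.shape)                               # torch.Size([1, 8, 1, 64]) self-attention cache after step 1
print(cross_k.shape)                              # (1, 8, src_len, 64) -> the encoder length stays fixed
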
Next we reach:

key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)
if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
    else:
        # cross-attn
        hidden_states = past_key_value

If this is T5LayerSelfAttention, the first branch (the torch.cat) runs; if it is T5LayerCrossAttention, the else branch runs.
In the T5LayerSelfAttention case, the project function therefore executes:

if past_key_value is not None:
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, key_length, dim_per_head)
        hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        ............
return hidden_states

In this second pass we obtain

key_states.size = torch.Size([1, 8, 2, 64])
value_states.size = torch.Size([1, 8, 2, 64])
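
A minimal sketch of the concatenation that produces those shapes (random tensors, purely illustrative):

import torch

past_key = torch.randn(1, 8, 1, 64)     # key_states cached by this layer at step 1
new_key = torch.randn(1, 8, 1, 64)      # projection of the single new decoder position
key_states = torch.cat([past_key, new_key], dim=2)
print(key_states.shape)                 # torch.Size([1, 8, 2, 64])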

Next the scores are computed:

# compute scores
scores = torch.matmul(
    query_states, key_states.transpose(3, 2)
)  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9

giving

scores = torch.Size([1, 8, 1, 2])

Next, look at the position_bias computation:

if position_bias is None:
     if not self.has_relative_attention_bias:
         position_bias = torch.zeros(
             (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
         )
         if self.gradient_checkpointing and self.training:
             position_bias.requires_grad = True
     else:
         position_bias = self.compute_bias(real_seq_length, key_length)

Again, note the arguments that self.compute_bias passes to self._relative_position_bucket:

relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (query_length, key_length)
    bidirectional=(not self.is_decoder),
    num_buckets=self.relative_attention_num_buckets,
)

The bidirectional argument is True in the encoder and False in the decoder; since this is the decoder, False is passed here.
The position_bias computed here has shape

position_bias = (1, 8, 2, 2)

The next step is annotated in the source with the comment:

if key and values are already calculated,
we want only the last query position bias.

and the corresponding code:

if past_key_value is not None:
   position_bias = position_bias[:, :, -hidden_states.size(1) :, :]

Note that the slice keeps only the last query position along dim 2, so after slicing position_bias = (1, 8, 1, 2).
The position_bias obtained here:

position_bias = torch.Size([1, 8, 1, 2])
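
A minimal sketch of that slice (random values, illustrative only):

import torch

position_bias = torch.randn(1, 8, 2, 2)               # compute_bias(real_seq_length=2, key_length=2)
hidden_states = torch.randn(1, 1, 512)                # only the newest decoder position is fed in
position_bias = position_bias[:, :, -hidden_states.size(1):, :]
print(position_bias.shape)                            # torch.Size([1, 8, 1, 2]) -> bias for the last query only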

This is an extension of the earlier position_bias. For example, the earlier position_bias was

position_bias = 
tensor([[[[ 3.5000]],

         [[ 0.4531]],

         [[ 3.1875]],

         [[ 0.9727]],

         [[-5.4688]],

         [[ 5.1875]],

         [[ 2.1562]],

         [[ 0.5391]]]])

and the current position_bias is

position_bias = 
tensor([[[[ 3.9844,  3.5000]],

         [[ 1.2266,  0.4531]],

         [[ 4.3438,  3.1875]],

         [[ 2.0312,  0.9727]],

         [[ 0.7969, -5.4688]],

         [[ 4.9375,  5.1875]],

         [[ 4.7500,  2.1562]],

         [[ 4.5000,  0.5391]]]])

Then:

scores += position_bias
#scores = (1,8,1,2)
attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
    scores
)  # (batch_size, n_heads, seq_length, key_length)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.dropout, training=self.training
)  # (batch_size, n_heads, seq_length, key_length)

# Mask heads if we want to
if layer_head_mask is not None:
    attn_weights = attn_weights * layer_head_mask

Up to this point the shape remains (1, 8, 1, 2).
Next:

attn_output = unshape(torch.matmul(attn_weights, value_states))

attn_weights = (1, 8, 1, 2) and value_states = (1, 8, 2, 64); their product has shape (1, 8, 1, 64).
After unshape, the output is produced:

attn_output = unshape(torch.matmul(attn_weights,value_states))
#attn_output = (1,1,512)
attn_output = self.o(attn_output)

giving

attn_output = (1,1,512)
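
unshape is the inverse of shape: it merges the 8 heads of 64 dims back into a 512-dim vector per position. A sketch consistent with the shapes above (random tensor, illustrative only):

import torch

batch_size, n_heads, d_kv = 1, 8, 64
attn = torch.randn(batch_size, n_heads, 1, d_kv)    # torch.matmul(attn_weights, value_states)
attn_output = attn.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_kv)
print(attn_output.shape)                            # torch.Size([1, 1, 512]), then self.o maps 512 -> 512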

Second call to T5LayerCrossAttention in the 6 decoder blocks

The call starts the same way:

batch_size,seq_length = hidden_states.shape[:2]
real_seq_length = seq_length

Here batch_size = 1, seq_length = 1, real_seq_length = 1.
Then:

key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]

which gives

key_length = 11

The only difference lies in how key_states and value_states are obtained:

key_states = project(
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
)
value_states = project(
    hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
)

The past_key_value[0] and past_key_value[1] passed in here are the results of the same layer from the previous pass: past_key_value[0] is the key_states and past_key_value[1] the value_states that this layer output at the previous position (for example, the cross-attention of the 4th decoder block in the second pass reuses the cache from the cross-attention of the 4th decoder block in the first pass).
Next we enter the project function:

def project(hidden_states, proj_layer, key_value_states, past_key_value):
    """projects hidden states correctly to key/query states"""
    if key_value_states is None:
        # self-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(hidden_states))
    elif past_key_value is None:
        # cross-attn
        # (batch_size, n_heads, seq_length, dim_per_head)
        hidden_states = shape(proj_layer(key_value_states))

    if past_key_value is not None:
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, key_length, dim_per_head)
            hidden_states = torch.cat([past_key_value, hidden_states], dim=2)
        else:
            # cross-attn
            hidden_states = past_key_value
    return hidden_states

The final else branch runs directly:

hidden_states = past_key_value

giving hidden_states = torch.Size([1, 8, 11, 64]).
Next position_bias; note that for T5LayerCrossAttention position_bias is always zero:

if position_bias is None:
    if not self.has_relative_attention_bias:
        position_bias = torch.zeros(
            (1, self.n_heads, real_seq_length, key_length), device=scores.device, dtype=scores.dtype
        )

The resulting position_bias:

position_bias = 
tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]])

To summarize the project function: the first if handles the first call for self-attention (encoder and decoder alike); the elif handles the first call for cross-attention; in the final if, the torch.cat branch handles self-attention from the second call onward, and the else branch handles cross-attention from the second call onward.
The remaining operations are much the same:

(1, 8, 1, 64) * (1, 8, 64, 11) = (1, 8, 1, 11)
(1, 8, 1, 11) * (1, 8, 11, 64) = (1, 8, 1, 64)
if position_bias is None:
    ......
    if mask is not None:
    ......

Here the mask is not None; it is worth examining further.

Appendix

Differences between the T5 and mT5 models

Adapted from Scientific Spaces (科学空间); see the original article.
After its release last October, T5 quietly received a small upgrade this year (see the GitHub link for details). The pre-upgrade model is officially called T5.1.0 and the upgraded one T5.1.1. The main change comes from the paper "GLU Variants Improve Transformer", which borrows the GLU (Gated Linear Unit) of "Language Modeling with Gated Convolutional Networks" to strengthen the FFN part. Concretely, the original T5 FFN was (T5 uses no bias terms)
FFN(x) = relu(x W1) W2        (1)

which is now changed to
FFN_GEGLU(x) = (gelu(x W1) ⊗ x W2) W3        (2)

In other words, the first (relu-activated) transformation is replaced by a gelu-activated gated linear unit. This adds roughly 50% more parameters to the FFN, but the paper reports a clear gain in quality. T5.1.1 also changes the embeddings: in T5.1.0 the encoder embedding, the decoder embedding, and the softmax layer that produces the decoder's output distribution all shared one embedding matrix; in T5.1.1 only the encoder and decoder embeddings are shared, and the output softmax uses an independent matrix. This increases the parameter count considerably, but Google reports better results, a conclusion summarized in the recent paper "Rethinking Embedding Coupling in Pre-trained Language Models". Finally, T5.1.1 drops dropout during pre-training and only applies dropout during downstream fine-tuning.

In the transformers code this difference shows up in the configuration of T5LayerFF:

class T5LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = T5DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = T5DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{self.config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )
        self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states

relu corresponds to T5 1.0, while gated-gelu corresponds to T5 1.1.
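
For illustration, a hedged sketch of the gated-GELU FFN from equation (2); the parameter names (wi_0, wi_1, wo) mirror the T5DenseGatedGeluDense layer selected above, but the dimensions here are arbitrary example values, not taken from any particular checkpoint:

import torch
from torch import nn

class GatedGeluFFNSketch(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, dropout_rate=0.1):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)    # W1: the gelu-activated gate branch
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)    # W2: the linear branch
        self.wo = nn.Linear(d_ff, d_model, bias=False)       # W3: projection back to d_model
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, hidden_states):
        hidden_gelu = nn.functional.gelu(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        return self.wo(self.dropout(hidden_gelu * hidden_linear))  # (gelu(xW1) ⊗ xW2) W3

print(GatedGeluFFNSketch()(torch.randn(1, 11, 512)).shape)          # torch.Size([1, 11, 512])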

The transformation of decoder_input_ids in T5ForConditionalGeneration

if labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
	decoder_input_ids = self._shift_right(labels)

First, inside the generate() function decoder_input_ids starts out as tensor([[0]]), so the statement above is not triggered.
Second, if labels = torch.tensor([0, 1, 2]) and decoder_input_ids is None, the statement above runs and produces

decoder_input_ids = torch.tensor([0,0,1])

Stepping into _shift_right:

def _shift_right(self,input_ids):
	decoder_start_token_id = self.config.decoder_start_token_id
	pad_token_id = self.config.pad_token_id

which gives

decoder_start_token_id = 0, pad_token_id = 0

Next comes the part that shifts the inputs to the right:

# shift inputs to the right
if is_torch_fx_proxy(input_ids):
    # Item assignment is not supported natively for proxies.
    shifted_input_ids = torch.full(input_ids.shape[:-1] + (1,), decoder_start_token_id)
    shifted_input_ids = torch.cat([shifted_input_ids, input_ids[..., :-1]], dim=-1)
else:
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
    shifted_input_ids[..., 0] = decoder_start_token_id

is_torch_fx_proxy is defined in file_utils.py:

def is_torch_fx_proxy(x):
    if is_torch_fx_available():
        import torch.fx

        return isinstance(x, torch.fx.Proxy)
    return False

and is_torch_fx_available:

def is_torch_fx_available():
    return _torch_fx_available
_torch_fx_available = _torch_onnx_dict_inputs_support_available = False

We can set this aside and go straight to the else branch, which shifts the tensor right and places a 0 at the first position:

shifted_input_ids = input_ids.new_zeros(input_ids.shape)
shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
shifted_input_ids[..., 0] = decoder_start_token_id
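
A minimal worked example of that shift, matching the labels used above:

import torch

labels = torch.tensor([0, 1, 2])                    # the example labels from above
decoder_start_token_id = 0
shifted_input_ids = labels.new_zeros(labels.shape)
shifted_input_ids[..., 1:] = labels[..., :-1].clone()
shifted_input_ids[..., 0] = decoder_start_token_id
print(shifted_input_ids)                            # tensor([0, 0, 1])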