Longformer code structure walkthrough


While extending the maximum sequence length, Longformer also introduces quite a few structural difficulties; this post works through them step by step.
The overall test code to run

import torch
import json
from pythonicforbert import FullTokenizer
from pythonicforbert import get_model_function

# Paths to the pretrained Longformer weights and config
longformer_bin_file = '/home/xiaoguzai/模型/Longformer/pytorch_model.bin'
longformer_config_file = '/home/xiaoguzai/模型/Longformer/config.json'

# Fetch the Longformer model class, config class and weight-loading helper
LongFormerModel, LongFormerConfig, get_data = get_model_function('longformer-base')

with open(longformer_config_file, 'r', encoding='utf8') as fp:
    json_data = json.load(fp)

longformerconfig = LongFormerConfig(**json_data)
longformer = LongFormerModel(longformerconfig)
longformermodel = get_data(longformer, longformer_bin_file)
longformermodel.eval()

# Dummy input: batch of 2 sequences of length 1025 (deliberately not a multiple of 512)
input_ids = torch.ones(2, 1025).long()
output_ids = longformermodel(input_ids)
print('output_ids = ')
print(output_ids)

Shape transformation inside _sliding_chunks_query_key_matmul

The hardest part to understand here is these few lines:

# Double the number of chunks (to 2*n-1) while halving the stride along the chunk
# dimension, so that consecutive chunks overlap by half a chunk.
query_size = list(query.size())
query_size[1] = query_size[1] * 2 - 1
query_stride = list(query.stride())
query_stride[1] = query_stride[1] // 2
query = query.as_strided(size=query_size, stride=query_stride)

First, the sequence length behind query_size and key_size is always an integer multiple of 512: Longformer's maximum length is 4096, and inputs whose length is not a multiple are automatically padded up to one.
The incoming query_size can be (24,1,512,64), (24,2,512,64), …, (24,n,512,64) and so on. Here batch_size = 2, the leading 24 = batch_size*num_heads, the second dimension (1, 2, …) counts how many chunks of 512 there are, and the last two dimensions are generally fixed at 512 and 64: 512 is the fixed chunk length used by Longformer, and 64 is size_per_head, the size of one attention head.
So essentially query_size[1] is multiplied by 2 and reduced by 1, query_stride[1] is halved, and as_strided then yields a new view of the tensor.
Because the real query and key are far too large to read the transformed tensor off directly, we can simplify: look at how a small tensor is transformed and infer the rule from there.

import torch
import numpy as np
import random

def setup_seed(seed):
    # Fix all random seeds so the example is reproducible
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True

# Set the random seed
setup_seed(20)
data = torch.rand(5, 2, 4, 3)
print('data = ')
print(data)

data_size = list(data.size())
data_size[1] = data_size[1] * 2 - 1
data_stride = list(data.stride())
data_stride[1] = data_stride[1] // 2
data = data.as_strided(size=data_size, stride=data_stride)

print('data = ')
print(data)

1. When size[1] = 1, the data stays unchanged.

2. When size[1] = 2, the chunk dimension becomes 2*2-1 = 3, and the data changes as follows:

The original data:

data = 
tensor([[[[0.5615, 0.1774, 0.8147],
          [0.3295, 0.2319, 0.7832],
          [0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156]],

         [[0.9423, 0.2536, 0.7388],
          [0.5404, 0.4356, 0.4430],
          [0.6257, 0.0379, 0.7130],
          [0.3229, 0.9631, 0.2284]]]])

The data after as_strided:

data = 
tensor([[[[0.5615, 0.1774, 0.8147],
          [0.3295, 0.2319, 0.7832],
          [0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156]],
          
         [[0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156],
          [0.9423, 0.2536, 0.7388],
          [0.5404, 0.4356, 0.4430]],

         [[0.9423, 0.2536, 0.7388],
          [0.5404, 0.4356, 0.4430],
          [0.6257, 0.0379, 0.7130],
          [0.3229, 0.9631, 0.2284]]]])

As you can see, because the chunk length 4 is even, the middle chunk is formed by concatenating the second half of the first chunk with the first half of the second chunk:

[[0.8544, 0.1012, 0.1877],
 [0.9310, 0.0899, 0.3156],
 [0.9423, 0.2536, 0.7388],
 [0.5404, 0.4356, 0.4430]]

If the chunk length is odd, let's give it a try. The original data:

data = 
tensor([[[[0.5615, 0.1774, 0.8147],
          [0.3295, 0.2319, 0.7832],
          [0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156],
          [0.9423, 0.2536, 0.7388]],

         [[0.5404, 0.4356, 0.4430],
          [0.6257, 0.0379, 0.7130],
          [0.3229, 0.9631, 0.2284],
          [0.4489, 0.2113, 0.6839],
          [0.7478, 0.4627, 0.7742]]]])

After as_strided, the new data is:

data = 
tensor([[[[0.5615, 0.1774, 0.8147],
          [0.3295, 0.2319, 0.7832],
          [0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156],
          [0.9423, 0.2536, 0.7388]],

         [[0.1012, 0.1877, 0.9310],
          [0.0899, 0.3156, 0.9423],
          [0.2536, 0.7388, 0.5404],
          [0.4356, 0.4430, 0.6257],
          [0.0379, 0.7130, 0.3229]],

         [[0.7388, 0.5404, 0.4356],
          [0.4430, 0.6257, 0.0379],
          [0.7130, 0.3229, 0.9631],
          [0.2284, 0.4489, 0.2113],
          [0.6839, 0.7478, 0.4627]]]])

You can see that the middle chunk is taken from the following span of the underlying storage (it starts in the middle of a row, because with an odd chunk length stride[1]//2 no longer lands on a row boundary):

          0.1012, 0.1877],
 [0.9310, 0.0899, 0.3156],
 [0.9423, 0.2536, 0.7388]],

[[0.5404, 0.4356, 0.4430],
 [0.6257, 0.0379, 0.7130],
 [0.3229,

and the last chunk is taken from this span:

                  0.7388]],
[[0.5404, 0.4356, 0.4430],
 [0.6257, 0.0379, 0.7130],
 [0.3229, 0.9631, 0.2284],
 [0.4489, 0.2113, 0.6839],
 [0.7478, 0.4627,

Together these make up the new tensor.

3. When size[1] = 3, the data changes as follows:

data = 
tensor([[[[0.5615, 0.1774, 0.8147],
          [0.3295, 0.2319, 0.7832],
          [0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156]],

         [[0.9423, 0.2536, 0.7388],
          [0.5404, 0.4356, 0.4430],
          [0.6257, 0.0379, 0.7130],
          [0.3229, 0.9631, 0.2284]],

         [[0.4489, 0.2113, 0.6839],
          [0.7478, 0.4627, 0.7742],
          [0.3861, 0.0727, 0.8736],
          [0.3510, 0.3279, 0.3254]]]])

The content after the transform (analogous to the size[1] = 2 case):

data = 
tensor([[[[0.5615, 0.1774, 0.8147],
          [0.3295, 0.2319, 0.7832],
          [0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156]],

         [[0.8544, 0.1012, 0.1877],
          [0.9310, 0.0899, 0.3156],
          [0.9423, 0.2536, 0.7388],
          [0.5404, 0.4356, 0.4430]],

         [[0.9423, 0.2536, 0.7388],
          [0.5404, 0.4356, 0.4430],
          [0.6257, 0.0379, 0.7130],
          [0.3229, 0.9631, 0.2284]],

         [[0.6257, 0.0379, 0.7130],
          [0.3229, 0.9631, 0.2284],
          [0.4489, 0.2113, 0.6839],
          [0.7478, 0.4627, 0.7742]],

         [[0.4489, 0.2113, 0.6839],
          [0.7478, 0.4627, 0.7742],
          [0.3861, 0.0727, 0.8736],
          [0.3510, 0.3279, 0.3254]]]])
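
To convince myself that this as_strided view is nothing more than a set of overlapping chunks (each new chunk starting half a chunk after the previous one), I wrote the small check below. overlapping_chunks is my own helper, not part of any library, and it assumes an even chunk length:

import torch

def overlapping_chunks(x):
    # x: (batch, n_chunks, chunk_len, dim), chunk_len assumed even
    batch, n_chunks, chunk_len, dim = x.shape
    flat = x.reshape(batch, n_chunks * chunk_len, dim)   # merge the chunk and row axes
    half = chunk_len // 2
    # every chunk starts half a chunk after the previous one, so consecutive chunks overlap
    chunks = [flat[:, i * half : i * half + chunk_len] for i in range(2 * n_chunks - 1)]
    return torch.stack(chunks, dim=1)                    # (batch, 2*n_chunks-1, chunk_len, dim)

x = torch.rand(5, 2, 4, 3)
size = list(x.size()); size[1] = size[1] * 2 - 1
stride = list(x.stride()); stride[1] = stride[1] // 2
strided = x.as_strided(size=size, stride=stride)
print(torch.equal(strided, overlapping_chunks(x)))       # True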

Shape changes of attention_scores inside the attention

Here

attentions = torch.matmul(query, key.transpose(-1, -2))

yields

attentions = (24, 5, 512, 512)
(the chunked query and key fed into this matmul are each of shape (24, 5, 512, 64); attentions is referred to as attention_scores below)

Next, one extra row of zeros is padded onto the end of the second-to-last dimension:

attention_scores = nn.functional.pad(
    attention_scores, (0, 0, 0, 1)
)

This gives attention_scores = (24, 5, 513, 512). Then view is called:

attention_scores = attention_scores.view(*attention_scores.size()[:-2], attention_scores.size(-1), attention_scores.size(-2))

Here each block goes from (24, 5, 513, 512) to (24, 5, 512, 513). Before the view, each of the 24*5 blocks ended with one full row of 512 zeros (the 513 just means one extra row; the zero count has nothing to do with 513). After the view the same 512 zeros are still there, but because every new row is one element longer, each row is shifted by one position relative to the row above it; the last row of every block now holds a single non-zero value followed by the 512 zeros. This is the skewing trick that turns diagonals of the original matrix into columns.
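Here is a tiny numeric illustration of that pad-then-view skewing (my own toy example, not taken from the Longformer source):

import torch
import torch.nn as nn

# Toy 4x4 "attention_scores" block: entry values are just recognizable numbers.
x = torch.arange(1, 17).view(4, 4)
padded = nn.functional.pad(x, (0, 0, 0, 1))        # append one row of zeros -> (5, 4)
skewed = padded.view(x.size(-1), x.size(-2) + 1)   # reinterpret the same storage as (4, 5)
print(skewed)
# tensor([[ 1,  2,  3,  4,  5],
#         [ 6,  7,  8,  9, 10],
#         [11, 12, 13, 14, 15],
#         [16,  0,  0,  0,  0]])
# Row i now starts one element further along than row i of x did, so the main
# diagonal of x (1, 6, 11, 16) lines up in column 0, and the last row ends in zeros.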
The code continues with:

diagonal_attention_scores = attention_scores.new_empty(
    (batch_size*num_attention_heads,seq_len//one_sided_attn_window_size,one_sided_attn_window_size,one_sided_attn_window_size*2+1)
)

Here our input is inputs = (2, 1025), so batch_size = 2, num_attention_heads = 12, seq_len = 1024 + 512 = 1536 (the single token beyond 1024 is padded out to a full block of 512), one_sided_attn_window_size = 256, and seq_len//one_sided_attn_window_size = 1536//256 = 6.
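A hedged sketch of the padding arithmetic (my own illustration, not necessarily the library's exact helper; the idea is to pad the length up to the next multiple of the attention window):

seq_len_in = 1025
attention_window = 512        # = 2 * one_sided_attn_window_size
padding_len = (attention_window - seq_len_in % attention_window) % attention_window
print(seq_len_in + padding_len)   # 1536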
With that, we move on to filling diagonal_attention_scores.

Walking through the diagonal_attention_scores assignments

1. The first diagonal_attention_scores assignment

diagonal_attention_scores[0:batch_size*num_attention_heads,0:seq_len//one_sided_attn_window_size-1,\
                          0:one_sided_attn_window_size,one_sided_attn_window_size:one_sided_attn_window_size*2+1] = \
attention_scores[0:batch_size*num_attention_heads,0:seq_len//one_sided_attn_window_size-1,0:one_sided_attn_window_size,0:one_sided_attn_window_size+1]
diagonal_attention_scores[0:2*12,0:1536//256-1,0:256,256:256*2+1] = attention_scores[0:2*12,0:1536//256-1,0:256,0:256+1]

diagonal_attention_scores[0:24,0:5,0:256,256:513] = attention_scores[0:24,0:5,0:256,0:257]

The corresponding figure:
[Figure 1: the attention_scores block (bottom) is assigned into the diagonal_attention_scores block (top).]

Note that all of the slice intervals are half-open (the right endpoint is excluded).

2. The second diagonal_attention_scores assignment

diagonal_attention_scores[0:batch_size*num_attention_heads,-1,0:one_sided_attn_window_size,one_sided_attn_window_size:one_sided_attn_window_size*2+1] = \
attention_scores[0:batch_size*num_attention_heads,-1,one_sided_attn_window_size:one_sided_attn_window_size*2,0:one_sided_attn_window_size+1]
diagonal_attention_scores[0:2*12,-1,0:256,256:256*2+1] = attention_scores[0:2*12,-1,256:256*2,0:256+1]

diagonal_attention_scores[0:24,-1,0:256,256:513] = attention_scores[0:24,-1,256:512,0:257]

[Figure 2: the attention_scores block (bottom) is assigned into the diagonal_attention_scores block (top).]

3. The third diagonal_attention_scores assignment

diagonal_attention_scores[0:batch_size*num_attention_heads,1:seq_len//one_sided_attn_window_size,\
                          0:one_sided_attn_window_size,0:one_sided_attn_window_size] = \
attention_scores[0:batch_size*num_attention_heads,0:seq_len//one_sided_attn_window_size-1,\
                 one_sided_attn_window_size-1:one_sided_attn_window_size*2-1,one_sided_attn_window_size+1:one_sided_attn_window_size*2+1]
diagonal_attention_scores[0:2*12,1:1536//256,0:256,0:256] = \
attention_scores[0:2*12,0:1536//256-1,256-1:256*2-1,256+1:256*2+1]

diagonal_attention_scores[0:24,1:6,0:256,0:256] = \
attention_scores[0:24,0:5,255:511,257:513]

[Figure 3: the attention_scores block (bottom) is assigned into the diagonal_attention_scores block (top).]

4. The fourth diagonal_attention_scores assignment

diagonal_attention_scores[0:batch_size*num_attention_heads,0,1:one_sided_attn_window_size,1:one_sided_attn_window_size] = \
attention_scores[0:batch_size*num_attention_heads,0,0:one_sided_attn_window_size-1,one_sided_attn_window_size+2:2*one_sided_attn_window_size+1]
diagonal_attention_scores[0:24,0,1:256,1:256] = attention_scores[0:24,0,0:255,258:513]

[Figure 4: the attention_scores block (bottom) is assigned into the diagonal_attention_scores block (top).]

Summary: what this filling step in Longformer produces

[Figures: the leading dimensions (roughly) and the last dimension; in both, the attention_scores block (bottom) is assigned into the diagonal_attention_scores block (top).]
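
Putting the four assignments together, here is a shape-only consolidation I wrote using the concrete numbers of this walkthrough (24 = 2*12 flattened batch*heads, 5 chunks, w = 256); it only checks that the slices line up, the values are random:

import torch

w = 256                                                  # one_sided_attn_window_size
attention_scores = torch.randn(24, 5, 2 * w, 2 * w + 1)  # shape after the pad + view step
diagonal = attention_scores.new_empty(24, 6, w, 2 * w + 1)

diagonal[:, :-1, :, w:] = attention_scores[:, :, :w, : w + 1]          # assignment 1
diagonal[:, -1, :, w:] = attention_scores[:, -1, w:, : w + 1]          # assignment 2
diagonal[:, 1:, :, :w] = attention_scores[:, :, w - 1 : -1, w + 1 :]   # assignment 3
diagonal[:, 0, 1:w, 1:w] = attention_scores[:, 0, : w - 1, w + 2 :]    # assignment 4
print(diagonal.shape)                                    # torch.Size([24, 6, 256, 513])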

Extracting and rearranging data from the array (the rest of _sliding_chunks_query_key_matmul)

The remaining part of _sliding_chunks_query_key_matmul performs a few more operations:

diagonal_attention_scores = diagonal_attention_scores.view(batch_size,num_attention_heads,seq_len,2*one_sided_attn_window_size+1).transpose(2,1)

which produces the shape

diagonal_attention_scores = (2,12,1536,513)->(2,1536,12,513)
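
A quick sanity check of that view + transpose with the concrete numbers (my own snippet):

import torch

x = torch.empty(24, 6, 256, 513)                 # (batch*heads, seq_len//w, w, 2*w+1)
x = x.view(2, 12, 1536, 513).transpose(2, 1)     # split batch and heads, then swap seq and heads
print(x.shape)                                   # torch.Size([2, 1536, 12, 513])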

The overall picture is as follows: the code that comes next masks an upper triangle and a lower triangle, which yields the (dilated) sliding-window pattern.
[Figure: the dilated sliding window in Longformer.]
The next operations are:

beginning_mask_2d = diagonal_attention_scores.new_ones(one_sided_attn_window_size,one_sided_attn_window_size+1).tril().flip(dims=[0])
beginning_mask = beginning_mask_2d[None,:,None,:]
ending_mask = beginning_mask.flip(dims=(1,3))

This builds a (256, 257) matrix whose upper-left triangle is ones and whose lower-right is zeros. The resulting beginning_mask and ending_mask look like:

beginning_mask = 
tensor([[[[1., 1., 1., 1., 1., 1.,........................................ 1., 1., 0.]],
         [[1., 1., 1., 1., 1., 1.,........................................ 1., 0., 0.]],
         [[1., 1., 1., 1., 1., 1.,........................................ 0., 0., 0.]],
         ..............................................................................
         [[1., 0., 0., 0., 0., 0.,.......................................  0., 0., 0.]]]])
ending_mask = 
tensor([[[[0., 0., 0., 0., 0., 0.,.......................................  0., 0., 1.]],
         [[0., 0., 0., 0., 0., 0.,........................................ 0., 1., 1.]],
         ..............................................................................
         [[0., 1., 1., 1., 1., 1.,........................................ 1., 1., 1.]]]])
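
To make the triangle pattern concrete without the ellipses, here is the same construction with a tiny window (w = 3 instead of 256, score values made up by me); the second half previews the masking step discussed next:

import torch

w = 3
beginning_mask_2d = torch.ones(w, w + 1).tril().flip(dims=[0])
print(beginning_mask_2d)
# tensor([[1., 1., 1., 0.],
#         [1., 1., 0., 0.],
#         [1., 0., 0., 0.]])

beginning_mask = beginning_mask_2d[None, :, None, :]
scores = torch.arange(1.0, 13.0).view(1, w, 1, w + 1)    # fake slice shaped like input[:, :w, :, :w+1]
masked = scores.masked_fill(beginning_mask == 1, -float("inf"))
print(masked.squeeze())
# tensor([[-inf, -inf, -inf,  4.],
#         [-inf, -inf,  7.,  8.],
#         [-inf, 10., 11., 12.]])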

Now compare beginning_input before and after it is masked by beginning_mask, i.e. after running

beginning_input.masked_fill_(beginning_mask==1,-float("inf"))

The contents before and after the mask. beginning_input before masking:

tensor([[[[0,0,.........0,0,0,0,0,2.2124],
          [0,0,.........0,0,0,0,0,5.6258],
          ...............................
          [0,0,.........0,0,0,0,0,-5.7776]],
         [[0,2.2124,2.2124,........2.2124],
          [0,0.9077,0.9077,........0.9077],
          ................................
          [0,-5.7776,-5.7776,.....-5.7776]],
          ................................
          ................................
          ................................
         [[0,2.2124,2.2124,........2.2124],
          [0,5.6258,5.6258,........5.6258],
          ................................
          [0,-5.7776,-5.7776,.....-5.7776]]]])

beginning_input after masking:

tensor([[[[-inf,-inf,......-inf,-inf,2.2124],
          [-inf,-inf,......-inf,-inf,5.6258],
          ..................................
          [-inf,-inf,......-inf,-inf,-5.7776]],
         [[-inf,-inf,......-inf,2.2124,2.2124],
          [-inf,-inf,......-inf,5.6258,5.6258],
          ...................................
          [-inf,-inf,......-inf,-5.7776,-5.7776]],
         [[-inf,-inf,......2.2124,2.2124,2.2124],
          [-inf,-inf,......5.6258,5.6258,5.6258],
          ...................................
          [-inf,-inf,.....-5.7776,-5.7776,-5.7776]],
          ....................................
          ....................................
          ....................................
          ....................................
         [[-inf,2.2124,....2.2124,2.2124,2.2124],
          [-inf,5.6258,....5.6258,5.6258,5.6258],
          ....................................
          [-inf,-5.7776...-5.7776,-5.7776,-5.7776]]]])

You can see that the masked positions sit in the upper-left corner. The same comparison can be made for ending_input before and after masking.
ending_input before masking:

tensor([[[[2.2124,2.2124,2.2124,......2.2124,2.2124,2.2124],
          [5.6258,5.6258,5.6258,......5.6258,5.6258,5.6258],
          .................................................
         [-5.7776,-5.7776,-5.7776,...-5.7776,-5.7776,-5.7776]],
         .....................................................
         .....................................................
         .....................................................
         .....................................................
         [[2.2124,2.2124,2.2124.......2.2124,2.2124,2.2124],
          [5.6258,5.6258,5.6258.......5.6258,5.6258,5.6258],
          ....................................................
        [-5.7776,-5.7776,-5.7776.....-5.7776,-5.7776,-5.7776]],
        [[2.2124,0.0000,0.0000........0.0000,0.0000,0.0000],
         [5.6258,0.0000,0.0000........0.0000,0.0000,0.0000],
         .....................................................
        [-5.7776,0.0000,0.0000........0.0000,0.0000,0.0000]]]])

ending_input after masking:

tensor([[[[2.2124,2.2124,2.2124......2.2124,2.2124,-inf],
          [5.6258,5.6258,5.6258......5.6258,5.6258,-inf],
          ..............................................
        [-5.7776,-5.7776,-5.7776....-5.7776,-5.7776,-inf]],
         [[2.2124,2.2124,2.2124......2.2124,-inf,-inf],
          [5.6258,5.6258,5.6258......5.6258,-inf,-inf],
          ..............................................
        [-5.7776,-5.7776,-5.7776............-inf,-inf]]
          ...........................................
        

Closing thoughts: overall impression

The raw inputs here are query = (72, 2, 512, 64) and key = (72, 2, 512, 64) (a run with a different batch size, hence 72 = batch_size*num_heads instead of 24). Skipping the series of shape transforms described above, the matmul again produces blocks whose last dimensions have length 512. The new diagonal_attention_scores content is then gathered by picking values at regular offsets, so each block is reduced to roughly half, a (256, 512) matrix; after a further series of flips and reshapes it is still (256, 512), and finally a (256, 256) part of it is sliced out as the final result.
All in all, Longformer's structure really is very complex and its design quite unusual. For now I have only walked through the flow once; for many of the intricate parts I still only know that it works, not yet why.
