Decoding the difficult parts of the Longformer model
While extending the maximum sequence length, Longformer's architecture also contains many tricky parts; this post analyzes them step by step.
The overall test script
import torch
#from tokenization import FullTokenizer
#from bertmodels import Bert
from pythonicforbert import FullTokenizer
from pythonicforbert import Nezha,NezhaConfig
from pythonicforbert import get_model_function
import json
longformer_bin_file = '/home/xiaoguzai/模型/Longformer/pytorch_model.bin'
longformer_config_file = '/home/xiaoguzai/模型/Longformer/config.json'
LongFormerModel,LongFormerConfig,get_data = get_model_function('longformer-base')
with open('/home/xiaoguzai/模型/Longformer/config.json','r',encoding='utf8') as fp:
    json_data = json.load(fp)
longformerconfig = LongFormerConfig(**json_data)
longformer = LongFormerModel(longformerconfig)
longformermodel = get_data(longformer,longformer_bin_file)
#bert.eval()
longformermodel.eval()
input_ids = torch.ones(2,1025).long()
output_ids = longformermodel(input_ids)
print('output_id2 = ')
print(output_ids)
The shape transformation inside _sliding_chunks_query_key_matmul
The hardest lines to understand here are these:
query_size = list(query.size())
query_size[1] = query_size[1]*2-1
query_stride = list(query.stride())
query_stride[1] = query_stride[1]//2
query = query.as_strided(size=query_size,stride=query_stride)
First of all, the sequence length behind query_size and key_size is always a multiple of 512: Longformer's maximum length is 4096, and any input whose length is not a multiple of 512 is automatically padded up to one.
The query_size passed in here can be (24,1,512,64), (24,2,512,64), …, (24,n,512,64), and so on, where batch_size = 2, the leading 24 = batch_size*num_heads, the second dimension (1, 2, …) counts how many 512-blocks there are, and the last two dimensions are fixed at 512 and 64: 512 is the fixed chunk length of one Longformer period, and 64 is size_per_head, the width of one attention head.
So in essence this doubles query_size[1] and subtracts 1, halves query_stride[1], and then reads out the new tensor contents via as_strided.
Because the real query and key are too large to inspect directly, we simplify: apply the same transform to a small tensor first, and look for the pattern.
import torch
import numpy as np
import random
def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
# set the random seed
setup_seed(20)
data = torch.rand(5,2,4,3)
print('data = ')
print(data)
data_size = list(data.size())
data_size[1] = data_size[1]*2-1
data_stride = list(data.stride())
data_stride[1] = data_stride[1]//2
data = data.as_strided(size=data_size,stride=data_stride)
print('data = ')
print(data)
1. When size[1] = 1, data stays unchanged.
2. When size[1] = 2, the chunk dimension becomes 2*2-1 = 3 and a new chunk appears in the middle; the data changes as follows:
The original data:
data =
tensor([[[[0.5615, 0.1774, 0.8147],
[0.3295, 0.2319, 0.7832],
[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156]],
[[0.9423, 0.2536, 0.7388],
[0.5404, 0.4356, 0.4430],
[0.6257, 0.0379, 0.7130],
[0.3229, 0.9631, 0.2284]]]])
The transformed data:
data =
tensor([[[[0.5615, 0.1774, 0.8147],
[0.3295, 0.2319, 0.7832],
[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156]],
[[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156],
[0.9423, 0.2536, 0.7388],
[0.5404, 0.4356, 0.4430]],
[[0.9423, 0.2536, 0.7388],
[0.5404, 0.4356, 0.4430],
[0.6257, 0.0379, 0.7130],
[0.3229, 0.9631, 0.2284]]]])
You can see that because the chunk length 4 is even, stride[1]//2 is a whole number of rows (12//2 = 6 elements = 2 rows of width 3), so the middle chunk is stitched from the bottom half of the first chunk and the top half of the second:
[[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156],
[0.9423, 0.2536, 0.7388],
[0.5404, 0.4356, 0.4430]]
If the chunk length is odd, let's experiment; the original data is:
data =
tensor([[[[0.5615, 0.1774, 0.8147],
[0.3295, 0.2319, 0.7832],
[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156],
[0.9423, 0.2536, 0.7388]],
[[0.5404, 0.4356, 0.4430],
[0.6257, 0.0379, 0.7130],
[0.3229, 0.9631, 0.2284],
[0.4489, 0.2113, 0.6839],
[0.7478, 0.4627, 0.7742]]]])
After as_strided, the new data is:
data =
tensor([[[[0.5615, 0.1774, 0.8147],
[0.3295, 0.2319, 0.7832],
[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156],
[0.9423, 0.2536, 0.7388]],
[[0.1012, 0.1877, 0.9310],
[0.0899, 0.3156, 0.9423],
[0.2536, 0.7388, 0.5404],
[0.4356, 0.4430, 0.6257],
[0.0379, 0.7130, 0.3229]],
[[0.7388, 0.5404, 0.4356],
[0.4430, 0.6257, 0.0379],
[0.7130, 0.3229, 0.9631],
[0.2284, 0.4489, 0.2113],
[0.6839, 0.7478, 0.4627]]]])
You can see why the result looks scrambled: stride[1] = 5*3 = 15 halves to 7, which is not a whole number of rows, so each chunk now starts in the middle of a row. The middle chunk picks out the elements
0.1012, 0.1877],
[0.9310, 0.0899, 0.3156],
[0.9423, 0.2536, 0.7388]],
[[0.5404, 0.4356, 0.4430],
[0.6257, 0.0379, 0.7130],
[0.3229,
while the last chunk picks out the elements
0.7388]],
[[0.5404, 0.4356, 0.4430],
[0.6257, 0.0379, 0.7130],
[0.3229, 0.9631, 0.2284],
[0.4489, 0.2113, 0.6839],
[0.7478, 0.4627,
which together form the new tensor contents.
3. When size[1] = 3, the data changes as follows:
data =
tensor([[[[0.5615, 0.1774, 0.8147],
[0.3295, 0.2319, 0.7832],
[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156]],
[[0.9423, 0.2536, 0.7388],
[0.5404, 0.4356, 0.4430],
[0.6257, 0.0379, 0.7130],
[0.3229, 0.9631, 0.2284]],
[[0.4489, 0.2113, 0.6839],
[0.7478, 0.4627, 0.7742],
[0.3861, 0.0727, 0.8736],
[0.3510, 0.3279, 0.3254]]]])
After the transform (analogous to the size[1] = 2 case):
data =
tensor([[[[0.5615, 0.1774, 0.8147],
[0.3295, 0.2319, 0.7832],
[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156]],
[[0.8544, 0.1012, 0.1877],
[0.9310, 0.0899, 0.3156],
[0.9423, 0.2536, 0.7388],
[0.5404, 0.4356, 0.4430]],
[[0.9423, 0.2536, 0.7388],
[0.5404, 0.4356, 0.4430],
[0.6257, 0.0379, 0.7130],
[0.3229, 0.9631, 0.2284]],
[[0.6257, 0.0379, 0.7130],
[0.3229, 0.9631, 0.2284],
[0.4489, 0.2113, 0.6839],
[0.7478, 0.4627, 0.7742]],
[[0.4489, 0.2113, 0.6839],
[0.7478, 0.4627, 0.7742],
[0.3861, 0.0727, 0.8736],
[0.3510, 0.3279, 0.3254]]]])
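The pattern in the even-length cases above can be checked directly: when the chunk length is even, halving stride[1] moves each chunk's start by exactly half a chunk of rows, so the as_strided result equals explicit half-overlapping slices. A minimal sketch (using an arange tensor instead of random data so the rows are easy to track):

```python
import torch

# With chunk length 4 (even) and row width 3, stride[1] = 4*3 = 12
# halves cleanly to 6 elements = 2 rows, so consecutive chunks overlap
# by exactly half a chunk.
data = torch.arange(1 * 2 * 4 * 3, dtype=torch.float32).view(1, 2, 4, 3)

size = list(data.size())
size[1] = size[1] * 2 - 1          # 2 chunks -> 3 overlapping chunks
stride = list(data.stride())
stride[1] = stride[1] // 2         # start each chunk half a chunk later
overlapped = data.as_strided(size=size, stride=stride)

# The same result obtained by explicit slicing over the flattened rows:
flat = data.reshape(1, 8, 3)       # 2 chunks * 4 rows
expected = torch.stack([flat[:, 0:4], flat[:, 2:6], flat[:, 4:8]], dim=1)
print(torch.equal(overlapped, expected))  # True
```

In the odd case, stride[1]//2 lands mid-row, which is exactly the element-level shift seen in the dumps above.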
Shape changes of attention_scores inside the attention computation
Here,
attention_scores = torch.matmul(query,key.transpose(-1,-2))
yields
attention_scores = (24,5,512,512)
(the incoming query, key and value are each (24,5,512,64))
Next, padding fills the bottom of each block with a row of zeros:
attention_scores = nn.functional.pad(
attention_scores,(0,0,0,1)
)
giving attention_scores = (24,5,513,512).
Then view is called:
attention_scores = attention_scores.view(*attention_scores.size()[:-2],attention_scores.size(-1),attention_scores.size(-2))
The underlying memory does not change: each (513,512) block still ends in the same 512 zeros (24*5 such rows in total; the 513 simply means one extra row was added). But reading the block back as (512,513) starts each new row one element later relative to the original row boundaries, so the k-th diagonal of the original (512,512) scores ends up in column k of the new block, with the padding zeros absorbing the tail.
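This pad + view step is easiest to see at toy scale; a small sketch with a (3,3) matrix standing in for a (512,512) block:

```python
import torch
import torch.nn.functional as F

# Pad one row of zeros under a (3,3) matrix, then view the (4,3)
# buffer as (3,4). The memory is unchanged, but each new row starts
# one element later relative to the original row boundaries, so
# column k of the result holds the k-th diagonal of the original.
x = torch.arange(1, 10, dtype=torch.float32).view(3, 3)
padded = F.pad(x, (0, 0, 0, 1))                        # (4,3): one extra zero row
skewed = padded.view(padded.size(-1), padded.size(-2)) # (3,4)
print(skewed)
# rows: [1,2,3,4], [5,6,7,8], [9,0,0,0]
# column 0 = main diagonal (1,5,9); column 1 = superdiagonal (2,6) + padding
```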
The code continues:
diagonal_attention_scores = attention_scores.new_empty(
(batch_size*num_attention_heads,seq_len//one_sided_attn_window_size,one_sided_attn_window_size,one_sided_attn_window_size*2+1)
)
Here our input is inputs = (2,1025), so batch_size = 2, num_attention_heads = 12, and seq_len = 1536 (1025 is not a multiple of 512, so it is padded up to the next multiple). one_sided_attn_window_size = 256, so seq_len//one_sided_attn_window_size = 1536//256 = 6.
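The padding rule used here is easy to check numerically; a minimal sketch of the pad-to-multiple computation, assuming the attention_window = 512 convention described above:

```python
# seq_len is rounded up to a multiple of the full attention window
# (2 * one_sided_attn_window_size = 512).
attention_window = 512
for seq_len in (1025, 1536, 4096):
    padding_len = (attention_window - seq_len % attention_window) % attention_window
    print(seq_len, '->', seq_len + padding_len)
# 1025 -> 1536
# 1536 -> 1536
# 4096 -> 4096
```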
Next we move on to filling diagonal_attention_scores:
Walkthrough of the diagonal_attention_scores fills
1. The first diagonal_attention_scores assignment
diagonal_attention_scores[0:batch_size*num_attention_heads,0:seq_len//one_sided_attn_window_size-1,\
0:one_sided_attn_window_size,one_sided_attn_window_size:one_sided_attn_window_size*2+1] = \
attention_scores[0:batch_size*num_attention_heads,0:seq_len//one_sided_attn_window_size-1,0:one_sided_attn_window_size,0:one_sided_attn_window_size+1]
diagonal_attention_scores[0:2*12,0:1536//256-1,0:256,256:256*2+1] = attention_scores[0:2*12,0:1536//256-1,0:256,0:256+1]
diagonal_attention_scores[0:24,0:5,0:256,256:513] = attention_scores[0:24,0:5,0:256,0:257]
The corresponding transformation figure (omitted here) shows the attention_scores block below being copied into the diagonal_attention_scores block above.
Note that all the slice ranges are half-open (left-closed, right-open).
2. The second diagonal_attention_scores assignment
diagonal_attention_scores[0:batch_size*num_attention_heads,-1,0:one_sided_attn_window_size,one_sided_attn_window_size:one_sided_attn_window_size*2+1] = \
attention_scores[0:batch_size*num_attention_heads,-1,one_sided_attn_window_size:one_sided_attn_window_size*2,0:one_sided_attn_window_size+1]
diagonal_attention_scores[0:2*12,-1,0:256,256:256*2+1] = attention_scores[0:2*12,-1,256:256*2,0:256+1]
diagonal_attention_scores[0:24,-1,0:256,256:513] = attention_scores[0:24,-1,256:512,0:257]
3. The third diagonal_attention_scores assignment
diagonal_attention_scores[0:batch_size*num_attention_heads,1:seq_len//one_sided_attn_window_size,\
0:one_sided_attn_window_size,0:one_sided_attn_window_size] = \
attention_scores[0:batch_size*num_attention_heads,0:seq_len//one_sided_attn_window_size-1,\
one_sided_attn_window_size-1:one_sided_attn_window_size*2-1,one_sided_attn_window_size+1:one_sided_attn_window_size*2+1]
diagonal_attention_scores[0:2*12,1:1536//256,0:256,0:256] = \
attention_scores[0:2*12,0:1536//256-1,256-1:256*2-1,256+1:256*2+1]
diagonal_attention_scores[0:24,1:6,0:256,0:256] = \
attention_scores[0:24,0:5,255:511,257:513]
4. The fourth diagonal_attention_scores assignment
diagonal_attention_scores[0:batch_size*num_attention_heads,0,1:one_sided_attn_window_size,1:one_sided_attn_window_size] = \
attention_scores[0:batch_size*num_attention_heads,0,0:one_sided_attn_window_size-1,one_sided_attn_window_size+2:2*one_sided_attn_window_size+1]
diagonal_attention_scores[0:24,0,1:256,1:256] = attention_scores[0:24,0,0:255,258:513]
To summarize, the figures for this part (omitted here: one covering the leading chunks, roughly, and one for the last chunk) all show slices of attention_scores below being copied into diagonal_attention_scores above.
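The four assignments can be replayed end to end at toy scale, which makes the slicing easier to inspect. A sketch with batch_size*num_attention_heads = 1, one_sided_attn_window_size w = 2, seq_len = 8, so the skewed attention_scores are (1,3,4,5) and the target is (1,4,2,5):

```python
import torch

bh, w, seq_len = 1, 2, 8                   # toy sizes
chunks = seq_len // w                      # 4 target chunks
# skewed attention scores: (bh, chunks-1, 2w, 2w+1) = (1, 3, 4, 5)
scores = torch.arange(bh * (chunks - 1) * 2 * w * (2 * w + 1),
                      dtype=torch.float32).view(bh, chunks - 1, 2 * w, 2 * w + 1)
diag = scores.new_zeros(bh, chunks, w, 2 * w + 1)

# 1. main diagonals for every chunk except the last
diag[:, :-1, :, w:] = scores[:, :, :w, : w + 1]
# 2. main diagonals for the last chunk
diag[:, -1, :, w:] = scores[:, -1, w:, : w + 1]
# 3. the left (lower-diagonal) half for chunks 1..end
diag[:, 1:, :, :w] = scores[:, :, w - 1 : 2 * w - 1, w + 1 : 2 * w + 1]
# 4. the upper-left corner of the very first chunk
diag[:, 0, 1:w, 1:w] = scores[:, 0, : w - 1, w + 2 : 2 * w + 1]
print(diag.shape)  # torch.Size([1, 4, 2, 5])
```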
Extracting and rearranging the data (the rest of _sliding_chunks_query_key_matmul)
The remaining part of _sliding_chunks_query_key_matmul performs a few more operations:
diagonal_attention_scores = diagonal_attention_scores.view(batch_size,num_attention_heads,seq_len,2*one_sided_attn_window_size+1).transpose(2,1)
The resulting shape:
diagonal_attention_scores = (2,12,1536,513)->(2,1536,12,513)
The overall correspondence was shown in a figure (omitted here). The code that follows masks the upper-triangular and lower-triangular corners, producing the dilated_sliding_window content. The next operations are:
beginning_mask_2d = diagonal_attention_scores.new_ones(one_sided_attn_window_size,one_sided_attn_window_size+1).tril().flip(dims=[0])
beginning_mask = beginning_mask_2d[None,:,None,:]
ending_mask = beginning_mask.flip(dims=(1,3))
This builds a (256,257) upper-left triangular matrix of ones whose lower-right corner is zero.
The resulting beginning_mask and ending_mask look like:
beginning_mask =
tensor([[[[1., 1., 1., 1., 1., 1.,........................................ 1., 1., 0.]],
[[1., 1., 1., 1., 1., 1.,........................................ 1., 0., 0.]],
[[1., 1., 1., 1., 1., 1.,........................................ 0., 0., 0.]],
..............................................................................
[[1., 0., 0., 0., 0., 0.,....................................... 0., 0., 0.]]]])
ending_mask =
tensor([[[[0., 0., 0., 0., 0., 0.,....................................... 0., 0., 1.]],
[[0., 0., 0., 0., 0., 0.,........................................ 0., 1., 1.]],
..............................................................................
[[0., 1., 1., 1., 1., 1.,........................................ 1., 1., 1.]]]])
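The same construction at toy scale (w = 3 instead of 256) shows the two triangle shapes directly:

```python
import torch

# new_ones(3, 4) gives a (3,4) block of ones, tril() keeps the lower
# triangle, and flip(dims=[0]) turns it upside down, leaving zeros in
# the lower-right corner.
w = 3
beginning_mask_2d = torch.ones(w, w + 1).tril().flip(dims=[0])
print(beginning_mask_2d)
# rows: [1,1,1,0], [1,1,0,0], [1,0,0,0]
beginning_mask = beginning_mask_2d[None, :, None, :]
ending_mask = beginning_mask.flip(dims=(1, 3))
print(ending_mask[0, :, 0, :])
# rows: [0,0,0,1], [0,0,1,1], [0,1,1,1]
```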
Here we compare the contents of beginning_input before and after beginning_mask is applied, i.e. before and after running
beginning_input.masked_fill_(beginning_mask==1,-float("inf"))
beginning_input before the mask:
tensor([[[[0,0,.........0,0,0,0,0,2.2124],
[0,0,.........0,0,0,0,0,5.6258],
...............................
[0,0,.........0,0,0,0,0,-5.7776]],
[[0,2.2124,2.2124,........2.2124],
[0,0.9077,0.9077,........0.9077],
................................
[0,-5.7776,-5.7776,.....-5.7776]],
................................
................................
................................
[[0,2.2124,2.2124,........2.2124],
[0,5.6258,5.6258,........5.6258],
................................
[0,-5.7776,-5.7776,.....-5.7776]]]])
beginning_input after the mask:
tensor([[[[-inf,-inf,......-inf,-inf,2.2124],
[-inf,-inf,......-inf,-inf,5.6258],
..................................
[-inf,-inf,......-inf,-inf,-5.7776]],
[[-inf,-inf,......-inf,2.2124,2.2124],
[-inf,-inf,......-inf,5.6258,5.6258],
...................................
[-inf,-inf,......-inf,-5.7776,-5.7776]],
[[-inf,-inf,......2.2124,2.2124,2.2124],
[-inf,-inf,......5.6258,5.6258,5.6258],
...................................
[-inf,-inf,.....-5.7776,-5.7776,-5.7776]],
....................................
....................................
....................................
....................................
[[-inf,2.2124,....2.2124,2.2124,2.2124],
[-inf,5.6258,....5.6258,5.6258,5.6258],
....................................
[-inf,-5.7776...-5.7776,-5.7776,-5.7776]]]])
You can see that the masked region sits in the upper-left corner. The same applies to ending_input before and after masking.
ending_input before the mask:
tensor([[[[2.2124,2.2124,2.2124,......2.2124,2.2124,2.2124],
[5.6258,5.6258,5.6258,......5.6258,5.6258,5.6258],
.................................................
[-5.7776,-5.7776,-5.7776,...-5.7776,-5.7776,-5.7776]],
.....................................................
.....................................................
.....................................................
.....................................................
[[2.2124,2.2124,2.2124.......2.2124,2.2124,2.2124],
[5.6258,5.6258,5.6258.......5.6258,5.6258,5.6258],
....................................................
[-5.7776,-5.7776,-5.7776.....-5.7776,-5.7776,-5.7776]],
[[2.2124,0.0000,0.0000........0.0000,0.0000,0.0000],
[5.6258,0.0000,0.0000........0.0000,0.0000,0.0000],
.....................................................
[-5.7776,0.0000,0.0000........0.0000,0.0000,0.0000]]]])
ending_input after the mask:
tensor([[[[2.2124,2.2124,2.2124......2.2124,2.2124,-inf],
[5.6258,5.6258,5.6258......5.6258,5.6258,-inf],
..............................................
[-5.7776,-5.7776,-5.7776....-5.7776,-5.7776,-inf]],
[[2.2124,2.2124,2.2124......2.2124,-inf,-inf],
[5.6258,5.6258,5.6258......5.6258,-inf,-inf],
..............................................
[-5.7776,-5.7776,-5.7776............-inf,-inf]]
...........................................
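The masking itself can likewise be tried at toy scale; a sketch with w = 3, using an arange tensor in place of the real scores:

```python
import torch

# Same masked_fill_ pattern as above, at toy scale: positions where the
# beginning mask is 1 (the upper-left triangle) are filled with -inf.
w = 3
beginning_mask = torch.ones(w, w + 1).tril().flip(dims=[0])
beginning_input = torch.arange(1.0, 13.0).view(w, w + 1)
beginning_input.masked_fill_(beginning_mask == 1, -float("inf"))
print(beginning_input)
# rows: [-inf,-inf,-inf,4], [-inf,-inf,7,8], [-inf,10,11,12]
```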
Closing remarks: the overall picture
The inputs here are query = (72,2,512,64) and key = (72,2,512,64) (I will skip the long chain of shape shuffles that precedes this point). After the chunked matmul, the last dimension of each product block is again 512; strided extraction then builds the new diagonal_attention_scores, in effect keeping only about half of each product as a (256,512) block, which a further series of flips and rearrangements reshapes into (256,512), from which the final pieces of length (256,256) are taken.
All in all, Longformer's architecture really is complex, and its design is quite unusual. So far I have only walked through the flow once; for many of its intricacies I still only know that they work, not yet why.