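🚩🚩🚩 Transformer in Practice: series table of contents
If you have any questions, feel free to leave a comment below.
All code in this article was run in PyCharm.
The code resources accompanying this article have been uploaded.
Vision Transformer Source Code Walkthrough 1
Vision Transformer Source Code Walkthrough 2
Vision Transformer Source Code Walkthrough 3
Vision Transformer Source Code Walkthrough 4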
4. The Embeddings class: constructor
self.embeddings = Embeddings(config, img_size=img_size)

class Embeddings(nn.Module):
    """Construct the embeddings from patch, position embeddings.
    """
    def __init__(self, config, img_size, in_channels=3):
        super(Embeddings, self).__init__()
        self.hybrid = None
        img_size = _pair(img_size)

        if config.patches.get("grid") is not None:
            grid_size = config.patches["grid"]
            patch_size = (img_size[0] // 16 // grid_size[0], img_size[1] // 16 // grid_size[1])
            n_patches = (img_size[0] // 16) * (img_size[1] // 16)
            self.hybrid = True
        else:
            patch_size = _pair(config.patches["size"])
            n_patches = (img_size[0] // patch_size[0]) * (img_size[1] // patch_size[1])
            self.hybrid = False

        if self.hybrid:
            self.hybrid_model = ResNetV2(block_units=config.resnet.num_layers,
                                         width_factor=config.resnet.width_factor)
            in_channels = self.hybrid_model.width * 16
        self.patch_embeddings = Conv2d(in_channels=in_channels,
                                       out_channels=config.hidden_size,
                                       kernel_size=patch_size,
                                       stride=patch_size)
        self.position_embeddings = nn.Parameter(torch.zeros(1, n_patches+1, config.hidden_size))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
        self.dropout = Dropout(config.transformer["dropout_rate"])

    def forward(self, x):
        # print(x.shape)
        B = x.shape[0]
        cls_tokens = self.cls_token.expand(B, -1, -1)
        # print(cls_tokens.shape)
        if self.hybrid:
            x = self.hybrid_model(x)
        x = self.patch_embeddings(x)  # Conv2d: Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
        # print(x.shape)
        x = x.flatten(2)
        # print(x.shape)
        x = x.transpose(-1, -2)
        # print(x.shape)
        x = torch.cat((cls_tokens, x), dim=1)
        # print(x.shape)
        embeddings = x + self.position_embeddings
        # print(embeddings.shape)
        embeddings = self.dropout(embeddings)
        # print(embeddings.shape)
        return embeddings
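To make the commented-out print calls concrete, here is a minimal sketch that replays the non-hybrid forward pass on standalone tensors and records the shape after every step. It assumes a batch of 2, a 224×224 RGB input, a patch size of 16 and a hidden size of 768; the variable names are illustrative and are not taken from the repository.

```python
import torch
import torch.nn as nn

# Assumed values for illustration only.
batch, hidden_size, patch = 2, 768, 16

patch_embeddings = nn.Conv2d(3, hidden_size, kernel_size=patch, stride=patch)
cls_token = torch.zeros(1, 1, hidden_size)
position_embeddings = torch.zeros(1, 196 + 1, hidden_size)

x = torch.randn(batch, 3, 224, 224)
shapes = [tuple(x.shape)]                 # (2, 3, 224, 224)

x = patch_embeddings(x)                   # one 768-dim vector per 16x16 patch
shapes.append(tuple(x.shape))             # (2, 768, 14, 14)
x = x.flatten(2)                          # merge the 14x14 grid into 196 positions
shapes.append(tuple(x.shape))             # (2, 768, 196)
x = x.transpose(-1, -2)                   # sequence axis before feature axis
shapes.append(tuple(x.shape))             # (2, 196, 768)

cls_tokens = cls_token.expand(batch, -1, -1)
x = torch.cat((cls_tokens, x), dim=1)     # prepend the class token
shapes.append(tuple(x.shape))             # (2, 197, 768)

embeddings = x + position_embeddings      # broadcast over the batch axis
shapes.append(tuple(embeddings.shape))    # (2, 197, 768)

for s in shapes:
    print(s)
```

Dropout is left out of the sketch because it does not change the shape.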
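Continuing the earlier debug session, keep stepping into the model construction code until you reach the Embeddings class:
- The constructor receives the image size 224×224, the channel count 3, and the config object
- patch_size=[16,16]: each 16×16 region yields one feature vector; this value is set by the user in the config
- n_patches: the number of 16×16 patches that a 224×224 image can be cut into, (224/16) × (224/16) = 14 × 14 = 196
- 196 is therefore the length of the patch sequence
- patch_embeddings: a 2D convolution with 3 input channels, 768 output channels, kernel size patch_size = 16×16 and stride 16×16. With a stride of 16, the 224×224 image becomes a 14×14 feature map after the convolution
- position_embeddings: initialized to all zeros with shape [1,197,768], where 197 = 196 + 1. The extra position is for a learnable class token that is prepended to the sequence; during training it helps the model capture global information for the classification task. position_embeddings itself is a learnable parameter that records position information
- cls_token: initialized to all zeros with shape [1,1,768]
- dropout: a Dropout layer with rate config.transformer["dropout_rate"], applied in forward to the sum of the token embeddings and the position embeddings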
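The numbers in the list above follow from the usual convolution output-size formula. The sketch below works through the arithmetic in plain Python, assuming no padding and no dilation (which matches the Conv2d call in the constructor). The helper name conv_out is hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of a convolution with no padding and no dilation."""
    return (size - kernel) // stride + 1

img, patch, hidden = 224, 16, 768

grid = conv_out(img, kernel=patch, stride=patch)   # patches per side
n_patches = grid * grid                            # number of patch tokens
seq_len = n_patches + 1                            # plus the class token

print(grid, n_patches, seq_len)                    # 14 196 197
print((1, seq_len, hidden))                        # shape of position_embeddings
```

Because the kernel size equals the stride, the patches do not overlap and the result is the same as the integer division img // patch used in the constructor.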
5. The Encoder class: constructor
self.encoder = Encoder(config, vis)

class Encoder(nn.Module):
    def __init__(self, config, vis):
        super(Encoder, self).__init__()
        self.vis = vis
        self.layer = nn.ModuleList()
        self.encoder_norm = LayerNorm(config.hidden_size, eps=1e-6)
        for _ in range(config.transformer["num_layers"]):
            layer = Block(config, vis)
            self.layer.append(copy.deepcopy(layer))

    def forward(self, hidden_states):
        # print(hidden_states.shape)
        attn_weights = []
        for layer_block in self.layer:
            hidden_states, weights = layer_block(hidden_states)
            if self.vis:
                attn_weights.append(weights)
        encoded = self.encoder_norm(hidden_states)
        return encoded, attn_weights
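Continuing the earlier debug session, step into the Encoder class during model construction:
- The constructor receives the config object
- vis: a flag that controls whether attention weights are collected for visualization
- layer: an nn.ModuleList, PyTorch's list container for submodules
- encoder_norm: a LayerNorm. Batch Normalization normalizes across the batch, whereas LayerNorm normalizes across the features of each individual token
- Adding Block instances in a loop: the loop runs config.transformer["num_layers"] times, each time creating a Block instance and appending it to self.layer. Block is the class that defines one Transformer encoder layer, consisting of self-attention and a feed-forward network. copy.deepcopy(layer) ensures that every entry added to the ModuleList is a new, independent copy of the Block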
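A minimal sketch of what "independent copy" means in practice. nn.Linear stands in for Block here, purely as an assumption to keep the example self-contained, and the helper n_unique_params is hypothetical.

```python
import copy
import torch
import torch.nn as nn

# nn.Linear is a stand-in for Block (an assumption for brevity).
template = nn.Linear(4, 4)

# The same object appended three times: all entries share one set of weights.
shared = nn.ModuleList([template for _ in range(3)])
# Deep copies: every entry owns its own weights.
independent = nn.ModuleList([copy.deepcopy(template) for _ in range(3)])

def n_unique_params(modules: nn.ModuleList) -> int:
    # parameters() yields each shared tensor only once
    return sum(p.numel() for p in modules.parameters())

print(n_unique_params(shared), n_unique_params(independent))   # 20 60

# Changing one copy leaves the others untouched.
with torch.no_grad():
    independent[0].weight.fill_(1.0)
same_after_edit = torch.equal(independent[0].weight, independent[1].weight)
print(same_after_edit)                                         # False
```

Note that the loop in the constructor already builds a fresh Block on every iteration, so the deepcopy there acts as an extra safeguard; it would be essential only if a single Block were created once and reused.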
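Earlier ConvNet tasks all used batch normalization, so why does the Transformer use layer normalization instead? The Transformer was proposed for NLP tasks, where every sentence has a different number of words: sentences that are too long are truncated, and short ones are padded with zeros. With batch normalization, the later positions of a long sentence would be normalized together with the zero padding of the short sentences, which distorts the statistics. Layer normalization computes its statistics within each token, independently of the other sequences in the batch, and in practice this gives noticeably better results.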
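The effect of padding can be illustrated on a toy batch. The sketch below computes the normalization statistics by hand, without the learnable scale and shift, so that the axes are explicit. The tensor sizes and the helper names layer_norm and batch_norm are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
hidden = 8
long_sentence = torch.randn(6, hidden)                            # 6 real tokens
short_padded = torch.cat([torch.randn(2, hidden),
                          torch.zeros(4, hidden)])                # 2 tokens + 4 zero pads
other_full = torch.randn(6, hidden)                               # another 6-token sentence

def layer_norm(x, eps=1e-6):
    # statistics over the feature axis of each token
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

def batch_norm(x, eps=1e-6):
    # statistics over the batch and sequence axes, separately per feature
    mean = x.mean(dim=(0, 1), keepdim=True)
    var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

batch_a = torch.stack([long_sentence, short_padded])   # neighbor is mostly padding
batch_b = torch.stack([long_sentence, other_full])     # neighbor is a full sentence

# Compare the normalized long sentence across the two batches.
ln_same = torch.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = torch.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
print(ln_same, bn_same)
```

With layer normalization the output for the long sentence does not depend on what else is in the batch, while the batch-style statistics shift as soon as the neighboring sequence is padded.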