Paper write-up: Do You Even Need Attention?
Source code: Do You Even Need Attention?
Since I have not yet worked with the transformer framework (leaving a placeholder here, to be filled in later), this post analyzes only the feed-forward layer described in the paper; a walkthrough of the full open-source code will follow. The paper replaces the attention layers of the vision transformer with feed-forward layers applied over the patch dimension, so the resulting architecture is simply a sequence of feed-forward layers applied alternately to the patch and feature dimensions.
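To make the alternation concrete, here is a minimal sketch of the idea (the variable names are mine, not from the repo): a linear layer over the last axis mixes features within each patch, and a linear layer applied to the transposed tensor mixes patches within each feature channel.

import torch
import torch.nn as nn

# Illustrative sketch only, not the repo's code: alternate linear mixing
# over the feature dimension and the patch dimension.
x = torch.randn(2, 197, 192)        # (batch, patches, features)
mix_features = nn.Linear(192, 192)  # acts on the feature dim of each patch
mix_patches = nn.Linear(197, 197)   # acts on the patch dim of each channel

x = mix_features(x)                                     # feature mixing
x = mix_patches(x.transpose(-2, -1)).transpose(-2, -1)  # patch mixing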
As the figure shows, when multiple layers are stacked, the feed-forward layers alternately process features and patches, and each LinearBlock uses residual connections. The module prints as follows:
LinearBlock(
  (mlp1): Mlp(
    (fc1): Linear(in_features=192, out_features=768, bias=True)
    (act): GELU()
    (fc2): Linear(in_features=768, out_features=192, bias=True)
    (drop): Dropout(p=0.0, inplace=False)
  )
  (norm1): LayerNorm((192,), eps=1e-06, elementwise_affine=True)
  (mlp2): Mlp(
    (fc1): Linear(in_features=197, out_features=788, bias=True)
    (act): GELU()
    (fc2): Linear(in_features=788, out_features=197, bias=True)
    (drop): Dropout(p=0.0, inplace=False)
  )
  (norm2): LayerNorm((197,), eps=1e-06, elementwise_affine=True)
  (drop_path): Identity()
)
Working through the shapes: mlp1 expands the 192-dim features to 768 (the 4x mlp_ratio) and projects back, while mlp2 does the same over the 197 tokens (196 patches plus the class token), expanding to 788. The resulting feed-forward structure is diagrammed below:
Source code for the LinearBlock module:
import torch.nn as nn
from timm.models.layers import DropPath, Mlp


class LinearBlock(nn.Module):
    # num_heads, qkv_bias, qk_scale, and attn_drop are unused here; they are
    # kept so the signature matches the attention-based block it replaces.
    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, num_tokens=197):
        super().__init__()
        # First stage: MLP over the feature dimension (applied per patch)
        self.mlp1 = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), act_layer=act_layer, drop=drop)
        self.norm1 = norm_layer(dim)
        # Second stage: MLP over the token (patch) dimension (applied per feature channel)
        self.mlp2 = Mlp(in_features=num_tokens, hidden_features=int(num_tokens * mlp_ratio),
                        act_layer=act_layer, drop=drop)
        self.norm2 = norm_layer(num_tokens)
        # Dropout (or a variant)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        # x: (batch, num_tokens, dim); residual MLP over the feature dimension
        x = x + self.drop_path(self.mlp1(self.norm1(x)))
        # swap tokens and features, apply a residual MLP over the token dimension, swap back
        x = x.transpose(-2, -1)
        x = x + self.drop_path(self.mlp2(self.norm2(x)))
        x = x.transpose(-2, -1)
        return x
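As a quick sanity check, here is a minimal usage example (my own, not from the repo); it assumes timm is installed and uses the defaults above: 197 tokens of width 192. Note that the block preserves the input shape, so it can be stacked like an ordinary transformer block.

import torch

block = LinearBlock(dim=192, num_heads=0, num_tokens=197)  # num_heads is ignored
x = torch.randn(2, 197, 192)  # (batch, tokens, features)
out = block(x)
print(out.shape)              # torch.Size([2, 197, 192])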