This paper came out quite a while ago; here are some brief notes on it.
Paper: https://arxiv.org/pdf/2003.10027.pdf
Abstract:
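Rectified linear units (ReLU) are commonly used in deep neural networks. So far, ReLU and its variants (non-parametric or parametric) have been static: they behave identically for every input sample. This paper proposes Dynamic ReLU (DY-ReLU), a dynamic rectifier whose parameters are generated by a hyper function over all input elements. The key insight is that DY-ReLU encodes the global context into the hyper function and adapts the piecewise linear activation function accordingly. Compared with its static counterpart, DY-ReLU adds negligible computational cost but significantly more representation capability, especially for lightweight neural networks. Simply using DY-ReLU in MobileNetV2 raises ImageNet top-1 accuracy from 72.0% to 76.2% with only 5% additional FLOPs.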
Introduction:
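Existing activation functions are shown in the figure below. The commonly used ReLU and its variants all behave the same way for every input. This static treatment leads the authors to ask: should an activation function be static or dynamic?
In response, the paper proposes a dynamic ReLU activation function (Dynamic ReLU): a parametric piecewise linear function whose parameters are computed by a hyper function. The figure below illustrates this dynamic activation function. Its core idea is that the hyper function encodes the global context of the input, and that encoding then steers the piecewise linear activation function.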
Dynamic ReLU:
A.DY-ReLU
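DY-ReLU consists of two functions, a hyper function and an activation function:
Hyper function θ(x): computes the parameters of the activation function.
Activation function f_θ(x)(x): computes the activation output for the input; its parameters are generated by the hyper function above.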
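To make the split between the two functions concrete, here is a minimal pure-Python sketch for a single scalar input. It assumes K = 2 linear pieces and mirrors the normalisation used by the PyTorch code later in this post: raw outputs are squashed to (-1, 1), scaled by 1.0 for slopes and 0.5 for intercepts, and added to initial values of 1 for the first slope and 0 elsewhere. The names `coefficients` and `dy_relu` are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coefficients(raw, k=2, lambda_a=1.0, lambda_b=0.5):
    """Map 2*k raw hyper-function outputs to k slopes and k intercepts."""
    assert len(raw) == 2 * k
    lambdas = [lambda_a] * k + [lambda_b] * k      # residual scales
    init = [1.0] + [0.0] * (2 * k - 1)             # initial values: a plain ReLU
    coefs = [
        init[i] + lambdas[i] * (2.0 * sigmoid(raw[i]) - 1.0)
        for i in range(2 * k)
    ]
    return coefs[:k], coefs[k:]


def dy_relu(x, slopes, intercepts):
    """Piecewise linear activation: the maximum over k linear pieces."""
    return max(a * x + b for a, b in zip(slopes, intercepts))


# With all raw outputs at zero the residuals vanish and the result is ReLU.
slopes, intercepts = coefficients([0.0, 0.0, 0.0, 0.0])
print(slopes, intercepts)                # [1.0, 0.0] [0.0, 0.0]
print(dy_relu(-2.0, slopes, intercepts)) # 0.0
print(dy_relu(3.0, slopes, intercepts))  # 3.0
```

In the real layer the raw values come from a small network applied to the pooled input, so the slopes and intercepts change from sample to sample.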
B.Variations of Dynamic ReLU
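The paper proposes three variants of DyReLU:
1. DyReLUA: coefficients shared across both spatial positions and channels.
2. DyReLUB: coefficients shared across spatial positions but not across channels.
3. DyReLUC: coefficients shared across neither spatial positions nor channels.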
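One way to see the difference between the variants is to count how many activation coefficients the hyper function has to emit. The sketch below follows the output sizes of the final layers in the PyTorch code in this post (`2 * k` for DyReLUA, `2 * k * channels` for DyReLUB and DyReLUC); the helper name `hyper_output_size` is hypothetical.

```python
def hyper_output_size(variant, channels, k=2):
    """Number of slope/intercept values produced per input sample."""
    if variant == "A":
        # One set of k slopes and k intercepts for the whole feature map.
        return 2 * k
    if variant in ("B", "C"):
        # One set of k slopes and k intercepts per channel.
        return 2 * k * channels
    raise ValueError(f"unknown variant: {variant!r}")


for variant in ("A", "B", "C"):
    print(variant, hyper_output_size(variant, channels=64))
```

DyReLUC produces the same number of per-channel coefficients as DyReLUB; in the code in this post, its spatial dependence comes from an additional one-channel attention map with one value per position.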
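From their experiments the authors report the following findings:
1. DyReLUB and DyReLUC are better suited to image classification.
2. DyReLUB and DyReLUC are better suited to the backbone network in keypoint detection, while DyReLUC is better suited to the head network.
3. For image classification, plugging DyReLU into MobileNetV2 improves accuracy by 4.2%.
4. For keypoint detection, using DyReLU yields a gain of 3.5 AP.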
Code:
import torch
import torch.nn as nn


class DyReLU(nn.Module):
    """Base class: the hyper function that produces the activation coefficients."""

    def __init__(self, channels, reduction=4, k=2, conv_type='2d'):
        super(DyReLU, self).__init__()
        self.channels = channels
        self.k = k
        self.conv_type = conv_type
        assert self.conv_type in ['1d', '2d']

        self.fc1 = nn.Linear(channels, channels // reduction)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(channels // reduction, 2 * k)
        self.sigmoid = nn.Sigmoid()

        # lambdas scale the residuals (1.0 for slopes, 0.5 for intercepts);
        # init_v holds the initial coefficients (first slope 1, all others 0).
        self.register_buffer('lambdas', torch.tensor([1.] * k + [0.5] * k))
        self.register_buffer('init_v', torch.tensor([1.] + [0.] * (2 * k - 1)))

    def get_relu_coefs(self, x):
        # Global average pooling over the spatial dimension(s): BxCx... -> BxC
        theta = torch.mean(x, dim=-1)
        if self.conv_type == '2d':
            theta = torch.mean(theta, dim=-1)
        theta = self.fc1(theta)
        theta = self.relu(theta)
        theta = self.fc2(theta)
        # Squash the residuals into (-1, 1)
        theta = 2 * self.sigmoid(theta) - 1
        return theta

    def forward(self, x):
        raise NotImplementedError
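

class DyReLUA(DyReLU):
    """Coefficients shared across spatial positions and channels."""

    def __init__(self, channels, reduction=4, k=2, conv_type='2d'):
        super(DyReLUA, self).__init__(channels, reduction, k, conv_type)
        # One set of 2*k coefficients per sample
        self.fc2 = nn.Linear(channels // reduction, 2 * k)

    def forward(self, x):
        assert x.shape[1] == self.channels
        theta = self.get_relu_coefs(x)

        # Bx2k: residuals scaled by lambdas and added to the initial values
        relu_coefs = theta.view(-1, 2 * self.k) * self.lambdas + self.init_v
        # Move the batch dimension last so it broadcasts against the Bxk coefficients
        # 1d: BxCxL -> LxCxBx1 (2d: BxCxHxW -> WxCxHxBx1)
        x_perm = x.transpose(0, -1).unsqueeze(-1)
        output = x_perm * relu_coefs[:, :self.k] + relu_coefs[:, self.k:]
        # Max over the k linear pieces, then restore the layout
        # 1d: LxCxBxk -> BxCxL (2d: WxCxHxBxk -> BxCxHxW)
        result = torch.max(output, dim=-1)[0].transpose(0, -1)

        return result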


class DyReLUB(DyReLU):
    """Coefficients shared across spatial positions but not across channels."""

    def __init__(self, channels, reduction=8, k=2, conv_type='2d'):
        super(DyReLUB, self).__init__(channels, reduction, k, conv_type)
        # One set of 2*k coefficients per channel
        self.fc2 = nn.Linear(channels // reduction, 2 * k * channels)

    def forward(self, x):
        assert x.shape[1] == self.channels
        theta = self.get_relu_coefs(x)

        # BxCx2k: residuals scaled by lambdas and added to the initial values
        relu_coefs = theta.view(-1, self.channels, 2 * self.k) * self.lambdas + self.init_v

        if self.conv_type == '1d':
            # BxCxL -> LxBxCx1
            x_perm = x.permute(2, 0, 1).unsqueeze(-1)
            output = x_perm * relu_coefs[:, :, :self.k] + relu_coefs[:, :, self.k:]
            # Max over the k linear pieces: LxBxCxk -> BxCxL
            result = torch.max(output, dim=-1)[0].permute(1, 2, 0)

        elif self.conv_type == '2d':
            # BxCxHxW -> HxWxBxCx1
            x_perm = x.permute(2, 3, 0, 1).unsqueeze(-1)
            output = x_perm * relu_coefs[:, :, :self.k] + relu_coefs[:, :, self.k:]
            # Max over the k linear pieces: HxWxBxCxk -> BxCxHxW
            result = torch.max(output, dim=-1)[0].permute(2, 3, 0, 1)

        return result


class DyReLUC(nn.Module):
    """Coefficients shared across neither spatial positions nor channels (2D inputs)."""

    def __init__(self,
                 channels,
                 reduction=4,
                 k=2,
                 tau=10,
                 gamma=1/3):
        super().__init__()
        self.channels = channels
        self.reduction = reduction
        self.k = k
        self.tau = tau
        self.gamma = gamma

        # Channel branch: global pooling followed by two 1x1 convolutions
        self.coef = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, 2 * k * channels, 1),
            nn.Sigmoid()
        )
        # Spatial branch: a 1x1 convolution giving one attention logit per position
        self.spatial = nn.Conv2d(channels, 1, 1)

        # default parameter setting
        # lambdaA = 1.0, lambdaB = 0.5;
        # alphaA1 = 1, alphaA2=alphaB1=alphaB2=0
        self.register_buffer('lambdas', torch.tensor([1.] * k + [0.5] * k))
        self.register_buffer('bias', torch.tensor([1.] + [0.] * (2 * k - 1)))

    def forward(self, x):
        N, C, H, W = x.size()
        coef = self.coef(x)
        coef = 2 * coef - 1

        # coefficient update: NxCx2k
        coef = coef.view(-1, self.channels, 2 * self.k) * self.lambdas + self.bias

        # spatial attention
        gamma = self.gamma * H * W
        spatial = self.spatial(x)
        # The 1x1 conv outputs a single channel, so flatten to Nx1x(H*W);
        # viewing it as N x channels x -1 fails or mixes positions incorrectly.
        spatial = spatial.view(N, 1, -1) / self.tau
        spatial = torch.softmax(spatial, dim=-1) * gamma
        spatial = torch.clamp(spatial, 0, 1).view(N, 1, H, W)

        # activations
        # NCHW --> HWNC1
        x_perm = x.permute(2, 3, 0, 1).unsqueeze(-1)
        # HWNC1 * NCK --> HWNCK
        output = x_perm * coef[:, :, :self.k] + coef[:, :, self.k:]

        # permute spatial from N1HW to HWN11
        spatial = spatial.permute(2, 3, 0, 1).unsqueeze(-1)
        # this implementation scales every linear piece by the spatial attention
        output = spatial * output

        # maxout and HWNC --> NCHW
        result = torch.max(output, dim=-1)[0].permute(2, 3, 0, 1)
        return result
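
The spatial branch of DyReLUC is the least obvious part of the code, so here is a small pure-Python sketch of that step alone for one sample. It assumes the softmax is taken over all H*W positions with temperature `tau`, that the result is scaled by `gamma * H * W`, and that it is then clamped to [0, 1], matching the defaults `tau=10` and `gamma=1/3` in the class. The function name `spatial_attention` is illustrative.

```python
import math


def spatial_attention(logits, tau=10.0, gamma_ratio=1.0 / 3.0):
    """Attention weights for a flat list of H*W logits from one sample."""
    n = len(logits)
    scaled = [z / tau for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    gamma = gamma_ratio * n
    # Softmax, scale by gamma, then clamp the upper end to 1.
    return [min(gamma * e / total, 1.0) for e in exps]


# Uniform logits on a 3x3 map: every position gets roughly the same weight.
uniform = spatial_attention([0.0] * 9)
print(uniform[0])

# One dominant position: its weight saturates at the clamp value.
peaked = spatial_attention([100.0, 0.0, 0.0, 0.0])
print(peaked[0])
```

The temperature flattens the softmax so that attention is not concentrated on a single position early in training, and the clamp keeps each weight within [0, 1].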
Experiments:
1. Comparison with ReLU and its variants:
2. Improvements from DY-ReLU: