Abstract
At the heart of foundation models is the philosophy of "more is different", exemplified by the astonishing success of computer vision and natural language processing. However, the challenges of optimization and the inherent complexity of Transformer models call for a paradigm shift towards simplicity. In this study we introduce VanillaNet, a neural network architecture that embraces elegance in design. By avoiding high depth, shortcuts, and sophisticated operations such as self-attention, VanillaNet is refreshingly concise yet remarkably powerful. Every layer is carefully crafted to be compact and straightforward, and the extra non-linear activation functions are pruned after training to restore the original architecture. VanillaNet overcomes the challenges of inherent complexity, making it ideal for resource-constrained environments. Its easy-to-understand and highly simplified architecture opens new possibilities for efficient deployment. Extensive experiments show that VanillaNet delivers performance on par with renowned deep neural networks and vision Transformers, showcasing the power of minimalism in deep learning. This visionary journey of VanillaNet has significant potential to redefine and challenge the status quo of foundation models, setting a new path for elegant and effective model design.
1. VanillaNet
Over the past decades, researchers have reached some consensus on the basic design of neural networks. Most state-of-the-art image classification architectures consist of three parts:
- A stem block, which converts the input image from 3 channels to many channels and downsamples it to learn useful low-level information.
- A main body, usually with four stages, each derived by stacking identical blocks. After each stage the feature channels expand while the height and width shrink; different networks build deep models by stacking different kinds of blocks.
- A fully connected layer that produces the classification output.
Despite the success of existing deep networks, they rely on a large number of complex layers to extract high-level features for downstream tasks. For example, the well-known ResNet needs 34 or 50 layers with shortcuts to exceed 70% top-1 accuracy on ImageNet, and the base version of ViT consists of 62 layers because the queries, keys, and values in self-attention take multiple layers to compute. As AI chips become ever more powerful, the bottleneck of neural network inference speed is no longer FLOPs or parameter count, since modern GPUs parallelize computation easily; instead, the complex designs and large depths of these models hold their speed back. To this end we propose the vanilla network, VanillaNet, whose framework is shown in Figure 1. We follow the popular layout of stem, body, and fully connected classifier, but unlike existing deep networks we use only one layer per stage, building an extremely simple network with as few layers as possible. The network deliberately avoids shortcuts (which add memory-access time) and complex modules such as self-attention.

In deep learning, it is common to boost a model's performance by introducing extra capacity during the training phase. To this end, we propose deep-training techniques to raise the capability of VanillaNet during training.
- Optimization strategy 1: deep training, shallow inference
To strengthen the non-linearity of the VanillaNet architecture, we first propose a deep training strategy: during training, each convolutional layer is split into two convolutional layers with the following non-linear operation inserted between them:

$$A'(x) = (1-\lambda)\,A(x) + \lambda x$$

where $A$ is a conventional non-linear activation function (ReLU in the simplest case) and $\lambda$ gradually approaches 1 as optimization proceeds. Once $\lambda = 1$ the inserted operation is the identity, so the two convolutional layers can be merged into one without changing the structure of VanillaNet; see the sketch below.
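As a minimal sketch (plain numpy with made-up shapes, not the repository's fusion code), the identity below shows why two stacked 1×1 convolutions collapse into a single one once λ = 1 and the inserted activation becomes the identity:

import numpy as np

# Two 1x1 convs over C channels are just matrix products over the channel axis.
cin, cmid, cout = 4, 8, 6                           # toy channel counts
x = np.random.randn(cin, 32 * 32)                   # one feature map, flattened spatially
W1, b1 = np.random.randn(cmid, cin), np.random.randn(cmid, 1)
W2, b2 = np.random.randn(cout, cmid), np.random.randn(cout, 1)
y_train = W2 @ (W1 @ x + b1) + b2                   # training graph, identity in between
W_m, b_m = W2 @ W1, W2 @ b1 + b2                    # fold both layers into one conv
print(np.allclose(y_train, W_m @ x + b_m))          # True: the merge is exact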
- Optimization strategy 2: a stronger activation function
Since the goal is to increase VanillaNet's non-linearity, a more direct route is to ask whether there exists an activation function with stronger non-linearity that also parallelizes well and runs fast. To satisfy both requirements at once, we propose a series-informed activation function that stacks multiple ReLUs with learnable weights and biases:

$$A_s(x) = \sum_{i=1}^{n} a_i\, A(x + b_i)$$

$$A_s(x_{h,w,c}) = \sum_{i,j \in \{-n,\,n\}} a_{i,j,c}\, A(x_{i+h,\,j+w,\,c} + b_c)$$

The second form additionally aggregates spatial neighbors. The activation is then fine-tuned to improve its ability to perceive information; a toy illustration follows.
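As a toy illustration (hypothetical coefficients, plain numpy) of the first form above: summing a few shifted, weighted ReLUs gives a piecewise-linear function with several kinks instead of one, i.e. strictly more non-linearity than a single ReLU:

import numpy as np

def relu(v):
    return np.maximum(v, 0)

def series_act(v, a, b):
    # A_s(x) = sum_i a_i * ReLU(x + b_i), the scalar form without the spatial sum
    return sum(ai * relu(v + bi) for ai, bi in zip(a, b))

v = np.linspace(-2, 2, 9)
print(relu(v))                                               # one kink, at x = 0
print(series_act(v, a=[0.5, 1.0, 0.5], b=[-1.0, 0.0, 1.0]))  # three kinks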
2. Code Reproduction
2.1 Install and import the required libraries
!pip install paddlex
%matplotlib inline
import paddle
import numpy as np
import matplotlib.pyplot as plt
from paddle.vision.datasets import Cifar10
from paddle.vision.transforms import Transpose
from paddle.io import Dataset, DataLoader
from paddle import nn
import paddle.nn.functional as F
import paddle.vision.transforms as transforms
import os
from matplotlib.pyplot import figure
import paddlex
import itertools
2.2 Create the dataset
train_tfm = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
transforms.ColorJitter(brightness=0.2,contrast=0.2, saturation=0.2),
transforms.RandomHorizontalFlip(0.5),
transforms.RandomRotation(20),
paddlex.transforms.MixupImage(),
transforms.ToTensor(),
transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
test_tfm = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
paddle.vision.set_image_backend('cv2')
# Use the CIFAR-10 dataset
train_dataset = Cifar10(data_file='data/data152754/cifar-10-python.tar.gz', mode='train', transform = train_tfm, )
val_dataset = Cifar10(data_file='data/data152754/cifar-10-python.tar.gz', mode='test',transform = test_tfm)
print("train_dataset: %d" % len(train_dataset))
print("val_dataset: %d" % len(val_dataset))
train_dataset: 50000
val_dataset: 10000
batch_size=128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, drop_last=False, num_workers=4)
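An optional sanity check on the loaders (the shapes below assume the 224×224 pipeline and the batch_size of 128 defined above):

x, y = next(iter(train_loader))
print(x.shape, y.shape)   # expect [128, 3, 224, 224] and [128]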
2.3 Building the model
2.3.1 Label smoothing
class LabelSmoothingCrossEntropy(nn.Layer):
def __init__(self, smoothing=0.1):
super().__init__()
self.smoothing = smoothing
def forward(self, pred, target):
confidence = 1. - self.smoothing
log_probs = F.log_softmax(pred, axis=-1)
idx = paddle.stack([paddle.arange(log_probs.shape[0]), target], axis=1)
nll_loss = paddle.gather_nd(-log_probs, index=idx)
smooth_loss = paddle.mean(-log_probs, axis=-1)
loss = confidence * nll_loss + self.smoothing * smooth_loss
return loss.mean()
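A quick consistency check with toy logits: with smoothing=0.0 the class above should reduce to the ordinary cross-entropy loss:

logits = paddle.randn([4, 10])
target = paddle.to_tensor([1, 3, 5, 7])
plain = LabelSmoothingCrossEntropy(smoothing=0.0)(logits, target)
print(float(plain), float(F.cross_entropy(logits, target)))   # the two values should agree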
2.3.2 DropPath
def drop_path(x, drop_prob=0.0, training=False):
"""
Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ...
"""
if drop_prob == 0.0 or not training:
return x
keep_prob = paddle.to_tensor(1 - drop_prob)
shape = (paddle.shape(x)[0],) + (1,) * (x.ndim - 1)
random_tensor = keep_prob + paddle.rand(shape, dtype=x.dtype)
random_tensor = paddle.floor(random_tensor) # binarize
output = x.divide(keep_prob) * random_tensor
return output
class DropPath(nn.Layer):
def __init__(self, drop_prob=None):
super(DropPath, self).__init__()
self.drop_prob = drop_prob
def forward(self, x):
return drop_path(x, self.drop_prob, self.training)
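Expected behavior on a toy tensor: DropPath is the identity at inference, while in training mode each sample's whole path is dropped with probability drop_prob and the survivors are scaled by 1/keep_prob so the expectation is preserved:

dp = DropPath(drop_prob=0.5)
x = paddle.ones([8, 4])
dp.eval()
print((dp(x) == x).all())   # identity in eval mode
dp.train()
print(dp(x))                # each row is either all 0 or all 2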
2.3.3 Building the VanillaNet model
class activation(nn.ReLU):
    # Series-informed activation from Section 1: a plain ReLU followed by a
    # learnable depthwise conv over a (2*act_num+1)^2 window, i.e. the weighted
    # sum of shifted activations; the BatchNorm is fused away at deploy time.
def __init__(self, dim, act_num=3, deploy=False):
super(activation, self).__init__()
self.deploy = deploy
self.weight = self.create_parameter(shape=(dim, 1, act_num*2 + 1, act_num*2 + 1), default_initializer=nn.initializer.TruncatedNormal(std=.02))
self.bias = None
self.bn = nn.BatchNorm2D(dim, epsilon=1e-6)
self.dim = dim
self.act_num = act_num
def forward(self, x):
if self.deploy:
return F.conv2d(
super(activation, self).forward(x),
self.weight, self.bias, padding=(self.act_num*2 + 1)//2, groups=self.dim)
else:
return self.bn(F.conv2d(
super(activation, self).forward(x),
self.weight, padding=self.act_num, groups=self.dim))
    def _fuse_bn_tensor(self, weight, bn):
        kernel = weight
        running_mean = bn._mean
        running_var = bn._variance
        gamma = bn.weight
        beta = bn.bias
        eps = bn._epsilon
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape((-1, 1, 1, 1))
        return kernel * t, beta + (0 - running_mean) * gamma / std
    def switch_to_deploy(self):
        # Fold the BatchNorm statistics into the depthwise conv so inference
        # runs a single conv (Paddle parameters are updated via set_value)
        kernel, bias = self._fuse_bn_tensor(self.weight, self.bn)
        self.weight.set_value(kernel)
        self.bias = self.create_parameter(shape=[self.dim], default_initializer=nn.initializer.Constant(0.0))
        self.bias.set_value(bias)
        self.__delattr__('bn')
        self.deploy = True
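A minimal equivalence check for the fusion above (run in eval mode so BatchNorm uses its running statistics): the deploy branch should reproduce the training-time branch up to floating-point error:

act = activation(dim=8, act_num=3)
act.eval()
x = paddle.randn([2, 8, 16, 16])
y_train = act(x)
act.switch_to_deploy()
y_deploy = act(x)
print(float((y_train - y_deploy).abs().max()))   # should be on the order of 1e-6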
class Block(nn.Layer):
def __init__(self, dim, dim_out, act_num=3, stride=2, deploy=False, ada_pool=None):
super().__init__()
self.act_learn = 1
self.deploy = deploy
if self.deploy:
self.conv = nn.Conv2D(dim, dim_out, kernel_size=1)
else:
self.conv1 = nn.Sequential(
nn.Conv2D(dim, dim, kernel_size=1),
nn.BatchNorm2D(dim, epsilon=1e-6),
)
self.conv2 = nn.Sequential(
nn.Conv2D(dim, dim_out, kernel_size=1),
nn.BatchNorm2D(dim_out, epsilon=1e-6)
)
if not ada_pool:
self.pool = nn.Identity() if stride == 1 else nn.MaxPool2D(stride)
else:
self.pool = nn.Identity() if stride == 1 else nn.AdaptiveMaxPool2D((ada_pool, ada_pool))
self.act = activation(dim_out, act_num)
    def forward(self, x):
        if self.deploy:
            x = self.conv(x)
        else:
            x = self.conv1(x)
            # leaky_relu with slope act_learn equals A'(x) = (1 - lambda) * ReLU(x) + lambda * x:
            # slope 0 is a plain ReLU, slope 1 is the identity
            x = F.leaky_relu(x, self.act_learn)
            x = self.conv2(x)
        x = self.pool(x)
        x = self.act(x)
        return x
    def _fuse_bn_tensor(self, conv, bn):
        kernel = conv.weight
        bias = conv.bias
        running_mean = bn._mean
        running_var = bn._variance
        gamma = bn.weight
        beta = bn.bias
        eps = bn._epsilon
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape((-1, 1, 1, 1))
        return kernel * t, beta + (bias - running_mean) * gamma / std
    def switch_to_deploy(self):
        # Fuse conv1 with its BN, then merge the (now purely linear) conv1
        # into conv2 so the block runs a single convolution at deploy time
        kernel, bias = self._fuse_bn_tensor(self.conv1[0], self.conv1[1])
        self.conv1[0].weight.set_value(kernel)
        self.conv1[0].bias.set_value(bias)
        kernel, bias = self._fuse_bn_tensor(self.conv2[0], self.conv2[1])
        self.conv = self.conv2[0]
        self.conv.weight.set_value(paddle.matmul(kernel.transpose([0, 3, 2, 1]),
            self.conv1[0].weight.squeeze(3).squeeze(2)).transpose([0, 3, 2, 1]))
        self.conv.bias.set_value(bias + (self.conv1[0].bias.reshape((1, -1, 1, 1)) * kernel).sum(3).sum(2).sum(1))
        self.__delattr__('conv1')
        self.__delattr__('conv2')
        self.act.switch_to_deploy()
        self.deploy = True
class VanillaNet(nn.Layer):
def __init__(self, in_chans=3, num_classes=1000, dims=[96, 192, 384, 768],
drop_rate=0, act_num=3, strides=[2,2,2,1], deploy=False, ada_pool=None, **kwargs):
super().__init__()
self.deploy = deploy
if self.deploy:
self.stem = nn.Sequential(
nn.Conv2D(in_chans, dims[0], kernel_size=4, stride=4),
activation(dims[0], act_num)
)
else:
self.stem1 = nn.Sequential(
nn.Conv2D(in_chans, dims[0], kernel_size=4, stride=4),
nn.BatchNorm2D(dims[0], epsilon=1e-6),
)
self.stem2 = nn.Sequential(
nn.Conv2D(dims[0], dims[0], kernel_size=1, stride=1),
nn.BatchNorm2D(dims[0], epsilon=1e-6),
activation(dims[0], act_num)
)
self.act_learn = 1
self.stages = nn.LayerList()
for i in range(len(strides)):
if not ada_pool:
stage = Block(dim=dims[i], dim_out=dims[i+1], act_num=act_num, stride=strides[i], deploy=deploy)
else:
stage = Block(dim=dims[i], dim_out=dims[i+1], act_num=act_num, stride=strides[i], deploy=deploy, ada_pool=ada_pool[i])
self.stages.append(stage)
self.depth = len(strides)
if self.deploy:
self.cls = nn.Sequential(
nn.AdaptiveAvgPool2D((1,1)),
nn.Dropout(drop_rate),
nn.Conv2D(dims[-1], num_classes, 1),
)
else:
self.cls1 = nn.Sequential(
nn.AdaptiveAvgPool2D((1,1)),
nn.Dropout(drop_rate),
nn.Conv2D(dims[-1], num_classes, 1),
nn.BatchNorm2D(num_classes, epsilon=1e-6),
)
self.cls2 = nn.Sequential(
nn.Conv2D(num_classes, num_classes, 1)
)
self.apply(self._init_weights)
def _init_weights(self, m):
tn = nn.initializer.TruncatedNormal(std=.02)
zero = nn.initializer.Constant(0.0)
if isinstance(m, (nn.Conv2D, nn.Linear)):
tn(m.weight)
zero(m.bias)
    def change_act(self, m):
        # m is the current lambda of the deep-training schedule: 0 at the start
        # of training (leaky_relu acts as ReLU), annealed to 1 (identity), at
        # which point the paired convolutions can be merged for deployment
        for i in range(self.depth):
            self.stages[i].act_learn = m
        self.act_learn = m
def forward(self, x):
if self.deploy:
x = self.stem(x)
else:
x = self.stem1(x)
x = F.leaky_relu(x,self.act_learn)
x = self.stem2(x)
for i in range(self.depth):
x = self.stages[i](x)
if self.deploy:
x = self.cls(x)
else:
x = self.cls1(x)
x = F.leaky_relu(x,self.act_learn)
x = self.cls2(x)
return x.reshape((x.shape[0],-1))
    def _fuse_bn_tensor(self, conv, bn):
        kernel = conv.weight
        bias = conv.bias
        running_mean = bn._mean
        running_var = bn._variance
        gamma = bn.weight
        beta = bn.bias
        eps = bn._epsilon
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape((-1, 1, 1, 1))
        return kernel * t, beta + (bias - running_mean) * gamma / std
    def switch_to_deploy(self):
        # Fuse the stem: fold each BN into its conv, then merge stem2's 1x1
        # conv into stem1's 4x4 conv (einsum contracts the channel dimension)
        self.stem2[2].switch_to_deploy()
        kernel, bias = self._fuse_bn_tensor(self.stem1[0], self.stem1[1])
        self.stem1[0].weight.set_value(kernel)
        self.stem1[0].bias.set_value(bias)
        kernel, bias = self._fuse_bn_tensor(self.stem2[0], self.stem2[1])
        self.stem1[0].weight.set_value(paddle.einsum('oi,icjk->ocjk', kernel.squeeze(3).squeeze(2), self.stem1[0].weight))
        self.stem1[0].bias.set_value(bias + (self.stem1[0].bias.reshape((1, -1, 1, 1)) * kernel).sum(3).sum(2).sum(1))
        self.stem = nn.Sequential(self.stem1[0], self.stem2[2])
        self.__delattr__('stem1')
        self.__delattr__('stem2')
        for i in range(self.depth):
            self.stages[i].switch_to_deploy()
        # Fuse the classifier head in the same way: fold BN into cls1's conv,
        # then merge cls2's 1x1 conv into it
        kernel, bias = self._fuse_bn_tensor(self.cls1[2], self.cls1[3])
        self.cls1[2].weight.set_value(kernel)
        self.cls1[2].bias.set_value(bias)
        kernel, bias = self.cls2[0].weight, self.cls2[0].bias
        self.cls1[2].weight.set_value(paddle.matmul(kernel.transpose([0, 3, 2, 1]),
            self.cls1[2].weight.squeeze(3).squeeze(2)).transpose([0, 3, 2, 1]))
        self.cls1[2].bias.set_value(bias + (self.cls1[2].bias.reshape((1, -1, 1, 1)) * kernel).sum(3).sum(2).sum(1))
        self.cls = nn.Sequential(self.cls1[0], self.cls1[1], self.cls1[2])
        self.__delattr__('cls1')
        self.__delattr__('cls2')
        self.deploy = True
def vanillanet_5(pretrained=False,in_22k=False, **kwargs):
model = VanillaNet(dims=[128*4, 256*4, 512*4, 1024*4], strides=[2,2,2], **kwargs)
return model
def vanillanet_6(pretrained=False,in_22k=False, **kwargs):
model = VanillaNet(dims=[128*4, 256*4, 512*4, 1024*4, 1024*4], strides=[2,2,2,1], **kwargs)
return model
def vanillanet_7(pretrained=False,in_22k=False, **kwargs):
model = VanillaNet(dims=[128*4, 128*4, 256*4, 512*4, 1024*4, 1024*4], strides=[1,2,2,2,1], **kwargs)
return model
def vanillanet_8(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(dims=[128*4, 128*4, 256*4, 512*4, 512*4, 1024*4, 1024*4], strides=[1,2,2,1,2,1], **kwargs)
return model
def vanillanet_9(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(dims=[128*4, 128*4, 256*4, 512*4, 512*4, 512*4, 1024*4, 1024*4], strides=[1,2,2,1,1,2,1], **kwargs)
return model
def vanillanet_10(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(
dims=[128*4, 128*4, 256*4, 512*4, 512*4, 512*4, 512*4, 1024*4, 1024*4],
strides=[1,2,2,1,1,1,2,1],
**kwargs)
return model
def vanillanet_11(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(
dims=[128*4, 128*4, 256*4, 512*4, 512*4, 512*4, 512*4, 512*4, 1024*4, 1024*4],
strides=[1,2,2,1,1,1,1,2,1],
**kwargs)
return model
def vanillanet_12(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(
dims=[128*4, 128*4, 256*4, 512*4, 512*4, 512*4, 512*4, 512*4, 512*4, 1024*4, 1024*4],
strides=[1,2,2,1,1,1,1,1,2,1],
**kwargs)
return model
def vanillanet_13(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(
dims=[128*4, 128*4, 256*4, 512*4, 512*4, 512*4, 512*4, 512*4, 512*4, 512*4, 1024*4, 1024*4],
strides=[1,2,2,1,1,1,1,1,1,2,1],
**kwargs)
return model
def vanillanet_13_x1_5(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(
dims=[128*6, 128*6, 256*6, 512*6, 512*6, 512*6, 512*6, 512*6, 512*6, 512*6, 1024*6, 1024*6],
strides=[1,2,2,1,1,1,1,1,1,2,1],
**kwargs)
return model
def vanillanet_13_x1_5_ada_pool(pretrained=False, in_22k=False, **kwargs):
model = VanillaNet(
dims=[128*6, 128*6, 256*6, 512*6, 512*6, 512*6, 512*6, 512*6, 512*6, 512*6, 1024*6, 1024*6],
strides=[1,2,2,1,1,1,1,1,1,2,1],
ada_pool=[0,40,20,0,0,0,0,0,0,10,0],
**kwargs)
return model
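Before training, the whole deploy-time fusion can be verified end to end. A sketch with vanillanet_5: act_learn is 1 right after initialization, so every inserted leaky_relu is already the identity and the fused deploy graph should match the training graph up to floating-point error:

model = vanillanet_5(num_classes=10, deploy=False)
model.eval()                        # BatchNorm must use running statistics for the fold to hold
x = paddle.randn([1, 3, 224, 224])
y_ref = model(x)
model.switch_to_deploy()
print(float((y_ref - model(x)).abs().max()))   # expect ~1e-5 or smaller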
2.3.4 Model parameters
model = vanillanet_5(num_classes=10, deploy=False)
paddle.summary(model, (1, 3, 224, 224))

model = vanillanet_5(num_classes=10, deploy=True)
paddle.summary(model, (1, 3, 224, 224))

2.4 Training
learning_rate = 0.001
n_epochs = 50
decay_epochs = 20
paddle.seed(42)
np.random.seed(42)
work_path = 'work/model'
# VanillaNet-5
model = vanillanet_5(num_classes=10, deploy=False)  # CIFAR-10 has 10 classes
criterion = LabelSmoothingCrossEntropy()
scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=learning_rate, T_max=50000 // batch_size * n_epochs, verbose=False)
optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=scheduler, weight_decay=1e-5)
best_acc = 0.0
val_acc = 0.0
loss_record = {'train': {'loss': [], 'iter': []}, 'val': {'loss': [], 'iter': []}} # for recording loss
acc_record = {'train': {'acc': [], 'iter': []}, 'val': {'acc': [], 'iter': []}} # for recording accuracy
loss_iter = 0
acc_iter = 0
for epoch in range(n_epochs):
# ---------- Training ----------
if epoch <= decay_epochs:
act_learn = epoch / decay_epochs * 1.0
else:
act_learn = 1.0
model.change_act(act_learn)
model.train()
train_num = 0.0
train_loss = 0.0
val_num = 0.0
val_loss = 0.0
accuracy_manager = paddle.metric.Accuracy()
val_accuracy_manager = paddle.metric.Accuracy()
print("#===epoch: {}, lr={:.10f}===#".format(epoch, optimizer.get_lr()))
for batch_id, data in enumerate(train_loader):
x_data, y_data = data
labels = paddle.unsqueeze(y_data, axis=1)
logits = model(x_data)
loss = criterion(logits, y_data)
acc = paddle.metric.accuracy(logits, labels)
accuracy_manager.update(acc)
if batch_id % 10 == 0:
loss_record['train']['loss'].append(loss.numpy())
loss_record['train']['iter'].append(loss_iter)
loss_iter += 1
loss.backward()
optimizer.step()
scheduler.step()
optimizer.clear_grad()
train_loss += loss
train_num += len(y_data)
total_train_loss = (train_loss / train_num) * batch_size
train_acc = accuracy_manager.accumulate()
acc_record['train']['acc'].append(train_acc)
acc_record['train']['iter'].append(acc_iter)
acc_iter += 1
# Print the information.
print("#===epoch: {}, train loss is: {}, train acc is: {:2.2f}%===#".format(epoch, total_train_loss.numpy(), train_acc*100))
# ---------- Validation ----------
model.eval()
for batch_id, data in enumerate(val_loader):
x_data, y_data = data
labels = paddle.unsqueeze(y_data, axis=1)
with paddle.no_grad():
logits = model(x_data)
loss = criterion(logits, y_data)
acc = paddle.metric.accuracy(logits, labels)
val_accuracy_manager.update(acc)
val_loss += loss
val_num += len(y_data)
total_val_loss = (val_loss / val_num) * batch_size
loss_record['val']['loss'].append(total_val_loss.numpy())
loss_record['val']['iter'].append(loss_iter)
val_acc = val_accuracy_manager.accumulate()
acc_record['val']['acc'].append(val_acc)
acc_record['val']['iter'].append(acc_iter)
print("#===epoch: {}, val loss is: {}, val acc is: {:2.2f}%===#".format(epoch, total_val_loss.numpy(), val_acc*100))
# ===================save====================
if val_acc > best_acc:
best_acc = val_acc
paddle.save(model.state_dict(), os.path.join(work_path, 'best_model.pdparams'))
paddle.save(optimizer.state_dict(), os.path.join(work_path, 'best_optimizer.pdopt'))
print(best_acc)
paddle.save(model.state_dict(), os.path.join(work_path, 'final_model.pdparams'))
paddle.save(optimizer.state_dict(), os.path.join(work_path, 'final_optimizer.pdopt'))

2.5 Analysis of results
def plot_learning_curve(record, title='loss', ylabel='CE Loss'):
''' Plot learning curve of your CNN '''
maxtrain = max(map(float, record['train'][title]))
maxval = max(map(float, record['val'][title]))
ymax = max(maxtrain, maxval) * 1.1
mintrain = min(map(float, record['train'][title]))
minval = min(map(float, record['val'][title]))
ymin = min(mintrain, minval) * 0.9
total_steps = len(record['train'][title])
x_1 = list(map(int, record['train']['iter']))
x_2 = list(map(int, record['val']['iter']))
figure(figsize=(10, 6))
plt.plot(x_1, record['train'][title], c='tab:red', label='train')
plt.plot(x_2, record['val'][title], c='tab:cyan', label='val')
plt.ylim(ymin, ymax)
plt.xlabel('Training steps')
plt.ylabel(ylabel)
plt.title('Learning curve of {}'.format(title))
plt.legend()
plt.show()
plot_learning_curve(loss_record, title='loss', ylabel='CE Loss')

plot_learning_curve(acc_record, title='acc', ylabel='Accuracy')

import time
work_path = 'work/model'
model = vanillanet_5(num_classes=10, deploy=False)
model.change_act(1.0)
model_state_dict = paddle.load(os.path.join(work_path, 'best_model.pdparams'))
model.set_state_dict(model_state_dict)
model.eval()
aa = time.time()
for batch_id, data in enumerate(val_loader):
x_data, y_data = data
labels = paddle.unsqueeze(y_data, axis=1)
with paddle.no_grad():
logits = model(x_data)
bb = time.time()
print("Throughout:{}".format(int(len(val_dataset)//(bb - aa))))
Throughout:462
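For a fairer speed reading, one can fuse the trained model into deploy mode and time it again; the sketch below (reusing model, val_loader, and time from above) exercises the single-conv-per-stage deploy graph, which is where VanillaNet's latency advantage should show up:

model.switch_to_deploy()
model.eval()
aa = time.time()
for batch_id, data in enumerate(val_loader):
    x_data, y_data = data
    with paddle.no_grad():
        logits = model(x_data)
bb = time.time()
print("Deploy-mode throughput: {} images/s".format(int(len(val_dataset) // (bb - aa))))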
Summary
The VanillaNet proposed here invites us to rethink the effectiveness of shallow networks and how to design a shallow network that reaches performance comparable to deep ones. It also offers several insights:
- Shortcuts are not necessarily helpful in shallow networks (a shallow network has few layers to begin with, and shortcuts may further aggravate the shortage of non-linearity).
- The performance gap between shallow and deep networks may come from the difference in their non-linear expressive power.
References
This article is a repost.
Original project link