Image Classification
Based on NVIDIA's Deep Learning Examples
GitHub: https://github.com/NVIDIA/DeepLearningExamples
This series is mainly to push myself to properly study some excellent open-source projects 😁; when I run into related tasks later, they can serve as an entry point, and I can absorb the best parts.
Models
ResNet50
Residual connections; a classic.
The network in NVIDIA's implementation is called ResNet50 v1.5, a modified version of the original ResNet50.
Change: in the bottleneck blocks that downsample, v1 uses stride = 2 in the first 1x1 convolution, while v1.5 uses stride = 2 in the 3x3 convolution (see the sketch after this list).
Accuracy: ⬆️ ~0.5% top-1
Throughput: ⬇️ ~5% imgs/sec
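A minimal PyTorch sketch of the two variants (illustrative only; the function and argument names are mine, not the repo's):

import torch.nn as nn

def bottleneck_convs(in_planes, planes, version="1.5"):
    # Downsampling bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand.
    # v1 strides in the 1x1 reduction; v1.5 moves the stride to the 3x3,
    # so the 3x3 convolution sees the full-resolution input.
    stride_1x1, stride_3x3 = (2, 1) if version == "1" else (1, 2)
    return nn.Sequential(
        nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride_1x1, bias=False),
        nn.BatchNorm2d(planes),
        nn.ReLU(inplace=True),
        nn.Conv2d(planes, planes, kernel_size=3, stride=stride_3x3, padding=1, bias=False),
        nn.BatchNorm2d(planes),
        nn.ReLU(inplace=True),
        nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False),
        nn.BatchNorm2d(planes * 4),
    )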
Training: mixed precision
For mixed-precision training it supports ::channels last (NHWC):: 🤔️
Recommended reading: (beta) Channels Last Memory Format in PyTorch — PyTorch Tutorials 1.5.1 documentation
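A minimal sketch of opting into channels last (standard torch API; the model and tensor names are placeholders):

import torch
import torchvision.models as models

model = models.resnet50().cuda().to(memory_format=torch.channels_last)
images = torch.randn(32, 3, 224, 224, device="cuda")
images = images.to(memory_format=torch.channels_last)  # NHWC strides, same logical shape

with torch.cuda.amp.autocast():  # mixed precision pairs naturally with NHWC
    out = model(images)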
Initialization: He initialization
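In PyTorch this corresponds to nn.init.kaiming_normal_; a sketch of applying it to every conv layer (the fan_out/relu settings follow the common ResNet recipe and are my assumption here):

import torch.nn as nn

def he_init(model):
    # He (Kaiming) normal initialization for ReLU networks
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")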
Optimizer:
- SGD
- Momentum: 0.875
- Learning rate: 0.256 for batch size 256, linearly scaled for other batch sizes
- Learning rate schedule: cosine LR schedule
- ::Batch size 512 and up::: linear warmup of the learning rate (see Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour)
- Weight decay: 1/32768, not applied to BatchNorm trainable parameters (gamma/beta)
- Label smoothing = 0.1, a training trick and a form of regularization
- Mixup regularization
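A minimal sketch of the optimizer, schedule, and label-smoothing pieces (mixup omitted; the parameter-name filter and the 5-epoch warmup length are my own assumptions, not the repo's exact code, and nn.CrossEntropyLoss(label_smoothing=...) needs PyTorch 1.10+):

import math
import torch
import torch.nn as nn

def build_optimizer(model, batch_size, epochs, steps_per_epoch):
    # Linearly scale the base LR with batch size (0.256 at batch size 256)
    lr = 0.256 * batch_size / 256

    # Skip weight decay on BatchNorm parameters and biases
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "bn" in name or name.endswith(".bias"):  # heuristic name filter
            no_decay.append(p)
        else:
            decay.append(p)

    optimizer = torch.optim.SGD(
        [{"params": decay, "weight_decay": 1 / 32768},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, momentum=0.875,
    )

    # Cosine schedule; for batch size >= 512 add a linear warmup
    warmup_steps = 5 * steps_per_epoch if batch_size >= 512 else 0
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, scheduler, criterion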
Data augmentation:
- Training
  - Normalization
  - Random resized crop to 224x224
    - Scale from 8% to 100%
    - Aspect ratio from 3/4 to 4/3
  - Random horizontal flip
- Inference
  - Normalization
  - Scale to 256x256
  - Center crop to 224x224
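The same pipeline sketched with torchvision transforms (standard API; the ImageNet mean/std constants are the usual values, assumed here):

import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])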
::We use NVIDIA DALI, which speeds up data loading when the CPU becomes a bottleneck. DALI can use the CPU or GPU, and outperforms the PyTorch native data loader.::
- Optimization for NVIDIA Ampere GPUs: ::TensorFloat-32::
TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x
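In PyTorch, TF32 is toggled via the standard backend flags:

import torch

# Allow TF32 on matmuls and cuDNN convolutions (effective on Ampere and newer GPUs)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True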
ResNeXt101-32x4d
paper: Aggregated Residual Transformations for Deep Neural Networks
Essence: split-transform-merge / grouped convolutions
In "32x4d", 32 is the number of groups (the cardinality) and 4d means each group is 4 channels wide in the bottleneck (see the sketch below).
::cuDNN 7 release notes: Grouped Convolutions for models such as ResNeXt and Xception and CTC (Connectionist Temporal Classification) loss layer for temporal classification::
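A grouped convolution is just the groups argument of nn.Conv2d; a sketch of the 3x3 layer in a ResNeXt 32x4d bottleneck stage (the channel count is chosen for illustration):

import torch.nn as nn

# 32 groups x 4 channels per group = 128 bottleneck channels
conv3x3 = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)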
Apart from the network structure, the other settings are basically the same as above.
SE-ResNeXt101-32x4d
On top of ResNeXt, an SE (squeeze-and-excitation) module is added.
It is an attention mechanism over channels.
import torch
import torch.nn as nn

class SqueezeAndExcitation(nn.Module):
    def __init__(self, planes, squeeze):
        super(SqueezeAndExcitation, self).__init__()
        self.squeeze = nn.Linear(planes, squeeze)
        self.expand = nn.Linear(squeeze, planes)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Squeeze: global average pooling over the spatial dimensions
        out = torch.mean(x.view(x.size(0), x.size(1), -1), 2)
        # Excitation: bottleneck MLP yields one weight per channel
        out = self.squeeze(out)
        out = self.relu(out)
        out = self.expand(out)
        out = self.sigmoid(out)
        # Reshape to (N, C, 1, 1) so it broadcasts over the feature map
        out = out.unsqueeze(2).unsqueeze(3)
        return out
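Note that the module returns only the per-channel weights; the caller multiplies them back onto the feature map, e.g. (illustrative usage):

x = torch.randn(8, 256, 56, 56)
se = SqueezeAndExcitation(planes=256, squeeze=16)
y = x * se(x)  # channel-wise reweighting of the features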
Apart from the network structure, the other settings are basically the same as above.
Model comparison
Basic concepts
throughput
: The quantity of data being sent and received within a unit of time
The amount of data processed per unit time (per-process batch size × number of distributed processes)
e.g. images/sec
import torch

def calc_ips(batch_size, time):
    # Count every distributed worker so throughput reflects the whole job
    world_size = (
        torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1
    )
    tbs = world_size * batch_size  # total batch size across all processes
    return tbs / time              # e.g. images per second
latency
: The time taken for a packet to be transferred across a network. You can measure this as one-way to its destination or as a round trip.
Reported as AVG, 90%, 95%, and 99%
Measured over the time taken to process one batch of data
Avg: the arithmetic mean
The others are quantiles of the recorded times:
np.quantile(self.vals, self.q, interpolation="nearest")
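A self-contained sketch of these latency statistics (variable names are mine; newer NumPy prefers method= over the deprecated interpolation= keyword):

import numpy as np

# Per-batch latencies in seconds, collected during inference (dummy data here)
batch_times = np.random.uniform(0.01, 0.03, size=1000)

stats = {
    "avg": batch_times.mean(),
    "p90": np.quantile(batch_times, 0.90, interpolation="nearest"),
    "p95": np.quantile(batch_times, 0.95, interpolation="nearest"),
    "p99": np.quantile(batch_times, 0.99, interpolation="nearest"),
}
print(stats)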
FLOPS
Note the all-caps S: short for floating point operations per second, i.e. how many floating-point operations are executed per second; a measure of computation speed, used to rate hardware performance.
FLOPs
Note the lowercase s: short for floating point operations (the s marks the plural), i.e. the total number of floating-point operations; a measure of computational cost, used to gauge the complexity of an algorithm/model.
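A back-of-the-envelope example relating the two (the ~4 GFLOPs figure for one ResNet50 forward pass at 224x224 is a commonly cited approximation):

model_flops = 4e9       # FLOPs: cost of one ResNet50 forward pass (approximate)
hardware_flops = 1e13   # FLOPS: a GPU sustaining 10 TFLOPS
ideal_latency = model_flops / hardware_flops
print(f"{ideal_latency * 1e3:.2f} ms per image, ignoring memory and launch overheads")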
(Figure: Accuracy vs FLOPs comparison plot)
(Figure: Throughput vs Latency comparison plot)
Summary
::Reading through NVIDIA's deep learning examples, you can pick up a number of tricks for speeding up models::
Follow-up work can also be built on top of this codebase; in particular, when doing multi-GPU parallel training, many of its idioms are worth borrowing.
Question
I don't quite understand why channels last gives a speedup. If anyone knows, please explain in the comments section, many thanks 🙏