Image Classification
Based on NVIDIA's Deep Learning Examples
GitHub: https://github.com/NVIDIA/DeepLearningExamples
This series is mainly to push myself to properly study some excellent open-source projects 😁; when I run into related tasks later, they can serve as an entry point, and I can absorb the best parts.
Models
ResNet50
Residual connections; a classic.
The network in NVIDIA's implementation is called ResNet50 v1.5, a modified version of the original ResNet50.
Change: in the bottleneck blocks that downsample, v1 uses stride = 2 in the first 1x1 convolution, while v1.5 uses stride = 2 in the 3x3 convolution (see the sketch after this list).
Accuracy: ⬆️ ~0.5% top-1
Throughput: ⬇️ ~5% imgs/sec
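A minimal PyTorch sketch of the two variants (illustrative only; the function and argument names are mine, not the repo's):

import torch.nn as nn

def bottleneck_convs(in_planes, planes, version="1.5"):
    # Downsampling bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand.
    # v1 strides in the 1x1 reduction; v1.5 moves the stride to the 3x3,
    # so the 3x3 convolution sees the full-resolution input.
    stride_1x1, stride_3x3 = (2, 1) if version == "1" else (1, 2)
    return nn.Sequential(
        nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride_1x1, bias=False),
        nn.BatchNorm2d(planes),
        nn.ReLU(inplace=True),
        nn.Conv2d(planes, planes, kernel_size=3, stride=stride_3x3, padding=1, bias=False),
        nn.BatchNorm2d(planes),
        nn.ReLU(inplace=True),
        nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False),
        nn.BatchNorm2d(planes * 4),
    )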
Training: mixed precision
For mixed-precision training it supports ::channels last (NHWC):: 🤔️
Recommended reading: (beta) Channels Last Memory Format in PyTorch — PyTorch Tutorials 1.5.1 documentation
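A minimal sketch of opting into channels last (standard torch API; the model and tensor names are placeholders):

import torch
import torchvision.models as models

model = models.resnet50().cuda().to(memory_format=torch.channels_last)
images = torch.randn(32, 3, 224, 224, device="cuda")
images = images.to(memory_format=torch.channels_last)  # NHWC strides, same logical shape

with torch.cuda.amp.autocast():  # mixed precision pairs naturally with NHWC
    out = model(images)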
Initialization: He initialization
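In PyTorch this corresponds to nn.init.kaiming_normal_; a sketch of applying it to every conv layer (the fan_out/relu settings follow the common ResNet recipe and are my assumption here):

import torch.nn as nn

def he_init(model):
    # He (Kaiming) normal initialization for ReLU networks
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")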
Optimizer:
- SGD
- Momentum: 0.875
- Learning rate: 0.256 for batch size 256, linearly scaled for other batch sizes
- Learning rate schedule: cosine LR schedule
- ::Batch size 512 and up::: linear warmup of the learning rate (see Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour)
- Weight decay: 1/32768, not applied to BatchNorm trainable parameters (gamma/beta)
- Label smoothing = 0.1, a training trick and a form of regularization
- Mixup regularization
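A minimal sketch of the optimizer, schedule, and label-smoothing pieces (mixup omitted; the parameter-name filter and the 5-epoch warmup length are my own assumptions, not the repo's exact code, and nn.CrossEntropyLoss(label_smoothing=...) needs PyTorch 1.10+):

import math
import torch
import torch.nn as nn

def build_optimizer(model, batch_size, epochs, steps_per_epoch):
    # Linearly scale the base LR with batch size (0.256 at batch size 256)
    lr = 0.256 * batch_size / 256

    # Skip weight decay on BatchNorm parameters and biases
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "bn" in name or name.endswith(".bias"):  # heuristic name filter
            no_decay.append(p)
        else:
            decay.append(p)

    optimizer = torch.optim.SGD(
        [{"params": decay, "weight_decay": 1 / 32768},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, momentum=0.875,
    )

    # Cosine schedule; for batch size >= 512 add a linear warmup
    warmup_steps = 5 * steps_per_epoch if batch_size >= 512 else 0
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, scheduler, criterion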
Data augmentation:
- Training
  - Normalization
  - Random resized crop to 224x224
    - Scale from 8% to 100%
    - Aspect ratio from 3/4 to 4/3
  - Random horizontal flip
- Inference
  - Normalization
  - Scale to 256x256
  - Center crop to 224x224
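The same pipeline sketched with torchvision transforms (standard API; the ImageNet mean/std constants are the usual values, assumed here):

import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])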
::We use NVIDIA DALI, which speeds up data loading when the CPU becomes a bottleneck. DALI can use the CPU or GPU, and outperforms the PyTorch native data loader.::
- Optimization for NVIDIA Ampere GPUs: ::TensorFloat-32::
TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x
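In PyTorch, TF32 is toggled via the standard backend flags:

import torch

# Allow TF32 on matmuls and cuDNN convolutions (effective on Ampere and newer GPUs)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True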
ResNeXt101-32x4d
paper: Aggregated Residual Transformations for Deep Neural Networks
Essence: split-transform-merge / grouped convolutions
In "32x4d", 32 is the number of groups (the cardinality) and 4d means each group is 4 channels wide in the bottleneck (see the sketch below).
::cuDNN 7 release notes: Grouped Convolutions for models such as ResNeXt and Xception and CTC (Connectionist Temporal Classification) loss layer for temporal classification::
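A grouped convolution is just the groups argument of nn.Conv2d; a sketch of the 3x3 layer in a ResNeXt 32x4d bottleneck stage (the channel count is chosen for illustration):

import torch.nn as nn

# 32 groups x 4 channels per group = 128 bottleneck channels
conv3x3 = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)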
Apart from the network structure, the other settings are basically the same as above.
SE-ResNeXt101-32x4d
On top of ResNeXt, an SE (squeeze-and-excitation) module is added.
It is an attention mechanism over channels.
import torch
import torch.nn as nn

class SqueezeAndExcitation(nn.Module):
    def __init__(self, planes, squeeze):
        super(SqueezeAndExcitation, self).__init__()
        self.squeeze = nn.Linear(planes, squeeze)
        self.expand = nn.Linear(squeeze, planes)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Squeeze: global average pooling over the spatial dimensions
        out = torch.mean(x.view(x.size(0), x.size(1), -1), 2)
        # Excitation: bottleneck MLP yields one weight per channel
        out = self.squeeze(out)
        out = self.relu(out)
        out = self.expand(out)
        out = self.sigmoid(out)
        # Reshape to (N, C, 1, 1) so it broadcasts over the feature map
        out = out.unsqueeze(2).unsqueeze(3)
        return out
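Note that the module returns only the per-channel weights; the caller multiplies them back onto the feature map, e.g. (illustrative usage):

x = torch.randn(8, 256, 56, 56)
se = SqueezeAndExcitation(planes=256, squeeze=16)
y = x * se(x)  # channel-wise reweighting of the features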
Apart from the network structure, the other settings are basically the same as above.
Model comparison
Basic concepts
throughput
: The quantity of data being sent and received within a unit of time
The amount of data processed per unit time (per-process batch size × number of distributed processes)
e.g. images/sec
import torch

def calc_ips(batch_size, time):
    # Count every distributed worker so throughput reflects the whole job
    world_size = (
        torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1
    )
    tbs = world_size * batch_size  # total batch size across all processes
    return tbs / time              # e.g. images per second
latency
: The time taken for a packet to be transferred across a network. You can measure this as one-way to its destination or as a round trip.
Reported as AVG, 90%, 95%, and 99%
Measured over the time taken to process one batch of data
Avg: the arithmetic mean
The others are quantiles of the recorded times:
np.quantile(self.vals, self.q, interpolation="nearest")
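A self-contained sketch of these latency statistics (variable names are mine; newer NumPy prefers method= over the deprecated interpolation= keyword):

import numpy as np

# Per-batch latencies in seconds, collected during inference (dummy data here)
batch_times = np.random.uniform(0.01, 0.03, size=1000)

stats = {
    "avg": batch_times.mean(),
    "p90": np.quantile(batch_times, 0.90, interpolation="nearest"),
    "p95": np.quantile(batch_times, 0.95, interpolation="nearest"),
    "p99": np.quantile(batch_times, 0.99, interpolation="nearest"),
}
print(stats)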
FLOPS
Note the all-caps S: short for floating point operations per second, i.e. how many floating-point operations are executed per second; a measure of computation speed, used to rate hardware performance.
FLOPs
Note the lowercase s: short for floating point operations (the s marks the plural), i.e. the total number of floating-point operations; a measure of computational cost, used to gauge the complexity of an algorithm/model.
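A back-of-the-envelope example relating the two (the ~4 GFLOPs figure for one ResNet50 forward pass at 224x224 is a commonly cited approximation):

model_flops = 4e9       # FLOPs: cost of one ResNet50 forward pass (approximate)
hardware_flops = 1e13   # FLOPS: a GPU sustaining 10 TFLOPS
ideal_latency = model_flops / hardware_flops
print(f"{ideal_latency * 1e3:.2f} ms per image, ignoring memory and launch overheads")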
(Figure: Accuracy vs FLOPs comparison plot)
(Figure: Throughput vs Latency comparison plot)
Summary
::Reading through NVIDIA's deep learning examples, you can pick up a number of tricks for speeding up models::
Follow-up work can also be built on top of this codebase; in particular, when doing multi-GPU parallel training, many of its idioms are worth borrowing.
Question
I don't quite understand why channels last gives a speedup. If anyone knows, please explain in the comments section, many thanks 🙏