My Machine Learning Side Quest: Model Complexity

昊大侠

Last modified: 2022-09-19 18:17:31

First published: 2022-06-06 10:14:28

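Model complexity

Model complexity usually refers to the amount of computation in the forward pass (which reflects the compute time the model needs) and the number of parameters (which reflects the memory the model needs).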

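Time complexity

Time complexity measures how efficiently a model runs, which in practice usually means how fast it runs.

Computational complexity is measured in floating-point operations (FLOPs).
Parallelism also affects running speed. It can be assessed with the minimum number of sequential operations, the throughput (image/s), and the inference time (ms/batch).

Throughput and inference time depend not only on the model but also on hardware performance.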
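The relation between the two hardware-dependent metrics can be sketched as follows. This is a minimal illustration, not a benchmarking tool: `dummy_forward` is a hypothetical stand-in for a real forward pass, and the helper names are my own. On a GPU the device would typically also need to be synchronized before reading the clock.

```python
import time


def throughput_from_latency(batch_size, ms_per_batch):
    """Images per second implied by a per-batch latency in milliseconds."""
    return batch_size / (ms_per_batch / 1000.0)


def measure_latency_ms(run_batch, n_warmup=3, n_runs=10):
    """Average wall-clock time of run_batch() in milliseconds."""
    for _ in range(n_warmup):  # warm-up runs are not timed
        run_batch()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_batch()
    return (time.perf_counter() - start) * 1000.0 / n_runs


def dummy_forward():
    # Hypothetical stand-in for a model forward pass on one batch.
    return sum(i * i for i in range(10_000))


latency_ms = measure_latency_ms(dummy_forward)
images_per_s = throughput_from_latency(2, latency_ms)  # batch size of 2 is assumed
print(f"{latency_ms:.4f} ms/batch, {images_per_s:.1f} image/s")
```

Because throughput is derived from the measured latency, both numbers change when the same model is run on different hardware.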

FLOPs

1. Convolution

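$\begin{aligned} FLOPs\;&=\;(2\times C_{input}\cdot S_{filter_h}\cdot S_{filter_w}-1)^{*}\cdot C_{output}\cdot S_{input_h}\cdot S_{input_w}\\ e.g.\quad &C_{input}=3\quad C_{output}=4\quad S_{filter_h}=S_{filter_w}=3\quad S_{input_h}=S_{input_w}=6\\ FLOPs\;&=\;(2\times3\times3^2-1)\times4\times6^2=7632 \end{aligned}$
* If the convolution has a bias term, the -1 is not needed. The formula uses the input size $S_{input}$, so it assumes the output feature map has the same spatial size as the input.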
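The convolution formula above can be checked with a few lines of Python. This is a sketch under the same assumption as the formula (the feature-map size passed in is the one the kernel is applied over); the function name and arguments are my own.

```python
def conv2d_flops(c_in, c_out, k_h, k_w, s_h, s_w, bias=False):
    """FLOPs of one convolution layer, counting multiplies and adds separately."""
    # Each output value needs c_in*k_h*k_w multiplies and one fewer add;
    # a bias term contributes one more add, which cancels the -1.
    per_output = 2 * c_in * k_h * k_w - (0 if bias else 1)
    return per_output * c_out * s_h * s_w


print(conv2d_flops(3, 4, 3, 3, 6, 6))             # 7632, the worked example
print(conv2d_flops(3, 4, 3, 3, 6, 6, bias=True))  # 7776
```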

2. Attention

$FLOPs=\begin{cases} 2D_kND_x\;+\;D_vND_x\;+\;(D_k+D_v)N^2\;+\;1\\ 3D^2N\;+\;2DN^2\;+\;1\quad if\quad D_x=D_k=D_v=Our\,D_{model}=D \end{cases}$
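For the equal-dimension case, the attention count can be evaluated directly. The sketch below only implements the $3D^2N + 2DN^2 + 1$ expression. The split into a "projection" part and an "attention" part is my reading of the two terms, not something the formula states.

```python
def attention_flops_equal_dims(d, n):
    """Evaluate 3*D^2*N + 2*D*N^2 + 1 for D_x = D_k = D_v = D."""
    projections = 3 * d * d * n  # assumed: the three linear maps for Q, K and V
    attention = 2 * d * n * n    # assumed: Q K^T scores plus the weighted sum over V
    return projections + attention + 1


# The N^2 term dominates once the sequence length exceeds 1.5 * D.
for n in (10, 100, 1000):
    print(n, attention_flops_equal_dims(64, n))
```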

3. Fully connected

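Assume the fully connected block has three layers: input, hidden, and output. The input layer holds D neurons for each of the N batch entries, the hidden layer holds 4D neurons for each of the N batch entries, and the output layer applies a nonlinear activation. The count below covers the input-to-hidden projection.

$\begin{aligned} FLOPs\;&=\;(D+D-1)^{*}\cdot 4D\cdot N\\ &=\;8D^2N-4DN \end{aligned}$
* If the fully connected layer has a bias term, the -1 is not needed.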
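A small sketch confirms that the two forms of the fully connected count agree. The `expansion` argument is my own generalization of the factor 4 and is an assumption of this example.

```python
def fc_flops(d, n, expansion=4, bias=False):
    """FLOPs of a D -> expansion*D linear map applied to N inputs."""
    # D multiplies and D-1 adds per hidden neuron; a bias adds one more add.
    per_neuron = 2 * d - (0 if bias else 1)
    return per_neuron * expansion * d * n


d, n = 4, 2
print(fc_flops(d, n))             # 224
print(8 * d * d * n - 4 * d * n)  # 224, the expanded form
```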

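Space complexity

Space complexity measures how much memory a model occupies, which usually determines whether the model can run at all.

Parameter count (Parameters)
Data width (Data bits)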

Parameters

$Parameters=Volume(Tensor_{Weight})$

Data bits

$Float32\quad or\quad Float64\quad\cdots$
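Taken together, the parameter count and the data width give a rough lower bound on weight storage. The sketch below assumes memory is simply parameters times bytes per parameter and ignores activations, optimizer state, and framework overhead. The example count of 6,618,400 corresponds to the $6.6184\times 10^6$ parameters reported for SpT UNet later in this post.

```python
def parameter_memory_bytes(num_params, bits_per_param=32):
    """Bytes needed to store the weights alone."""
    return num_params * bits_per_param // 8


num_params = 6_618_400  # 6.6184e6, as reported for SpT UNet
for bits in (32, 64):
    megabytes = parameter_memory_bytes(num_params, bits) / 1e6
    print(f"Float{bits}: {megabytes:.2f} MB")
```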

Survey of deep learning models

0. Attention Is All You Need

Per-layer complexity, minimum number of sequential operations for different layer types and maximum path length

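$n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions, and $r$ is the size of the neighborhood in restricted self-attention.

This paper was the first to propose the Transformer, a natural language processing network built entirely on attention and fully connected layers. For a maximum path length of $O (x)$, a larger $x$ means that passing information between positions with long-range dependencies is harder and that more information is lost along the way.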
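To make the comparison concrete, the asymptotic entries of that table in the paper can be evaluated for specific sizes. This is only an illustration: constant factors are dropped, so the numbers are useful for relative comparison and not as real operation counts. The function name and the chosen values of $n$, $d$, $k$, $r$ are my own.

```python
import math


def layer_costs(n, d, k, r):
    """(per-layer complexity, sequential operations, maximum path length)."""
    return {
        "self-attention": (n * n * d, 1, 1),
        "recurrent": (n * d * d, n, n),
        "convolutional": (k * n * d * d, 1, math.log(n, k)),
        "restricted self-attention": (r * n * d, 1, n / r),
    }


for name, (cost, sequential, path) in layer_costs(n=100, d=512, k=3, r=10).items():
    print(f"{name:26s} cost={cost:>12,d} sequential={sequential:>4d} path={path:.1f}")
```

With $n < d$, as in this example, self-attention is cheaper per layer than a recurrent layer while keeping both the sequential operations and the path length constant.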

1. Densely Connected Convolutional Networks

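DenseNet-L $(k = n)$ with the BottleNeck structure: L is the model depth, that is, the number of learnable layers (convolutional and fully connected), and $k$ is the number of feature channels added when the input feature passes through one Dense Layer of a Dense Block. After each Dense Block, the Transition Layer that follows halves the number of feature channels.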

“If a dense block contains m feature-maps, we let the following transition layer generate $⌊ θ m ⌋$ output featuremaps, where $0 < θ \leq 1$ is referred to as the compression factor.”

“We refer the DenseNet with $θ < 1$ as DenseNet-C, and we set $θ = 0.5$ in our experiment. When both the bottleneck and transition layers with $θ < 1$ are used, we refer to our model as DenseNet-BC.”

2. Deep Residual Learning for Image Recognition

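The values the paper labels as FLOPs are actually MACs (multiply-accumulate operations), so the real FLOPs are about twice those values. In "L-layer", L is the number of learnable layers.

With the bottleneck structure the parameter count drops noticeably, which made networks of more than 1000 layers possible.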

3. https://github.com/sovrasov/flops-counter.pytorch

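Parameter counts and multiply-accumulate operations (MACs) of mainstream convolutional models, computed with the external library flops-counter, together with the corresponding Top-1 and Top-5 accuracy.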

| Model | Input Resolution | Params (M) | MACs (G) | Acc@1 (%) | Acc@5 (%) |
|---|---|---|---|---|---|
| alexnet | 224x224 | 61.1 | 0.72 | 56.432 | 79.194 |
| densenet121 | 224x224 | 7.98 | 2.88 | 74.646 | 92.136 |
| densenet161 | 224x224 | 28.68 | 7.82 | 77.56 | 93.798 |
| densenet169 | 224x224 | 14.15 | 3.42 | 76.026 | 92.992 |
| densenet201 | 224x224 | 20.01 | 4.37 | 77.152 | 93.548 |
| dpn107 | 224x224 | 86.92 | 18.42 | 79.746 | 94.684 |
| dpn131 | 224x224 | 79.25 | 16.13 | 79.432 | 94.574 |
| dpn68 | 224x224 | 12.61 | 2.36 | 75.868 | 92.774 |
| dpn68b | 224x224 | 12.61 | 2.36 | 77.034 | 93.59 |
| dpn92 | 224x224 | 37.67 | 6.56 | 79.4 | 94.62 |
| dpn98 | 224x224 | 61.57 | 11.76 | 79.224 | 94.488 |
| inceptionv3 | 299x299 | 27.16 | 5.73 | 77.294 | 93.454 |
| inceptionv4 | 299x299 | 42.68 | 12.31 | 80.062 | 94.926 |
| resnet101 | 224x224 | 44.55 | 7.85 | 77.438 | 93.672 |
| resnet152 | 224x224 | 60.19 | 11.58 | 78.428 | 94.11 |
| resnet18 | 224x224 | 11.69 | 1.82 | 70.142 | 89.274 |
| resnet34 | 224x224 | 21.8 | 3.68 | 73.554 | 91.456 |
| resnet50 | 224x224 | 25.56 | 4.12 | 76.002 | 92.98 |
| se_resnet101 | 224x224 | 49.33 | 7.63 | 78.396 | 94.258 |
| se_resnet152 | 224x224 | 66.82 | 11.37 | 78.658 | 94.374 |
| se_resnet50 | 224x224 | 28.09 | 3.9 | 77.636 | 93.752 |
| vgg11 | 224x224 | 132.86 | 7.63 | 68.97 | 88.746 |
| vgg13 | 224x224 | 133.05 | 11.34 | 69.662 | 89.264 |
| vgg16 | 224x224 | 138.36 | 15.5 | 71.636 | 90.354 |
| vgg19 | 224x224 | 143.67 | 19.67 | 72.08 | 90.822 |
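Since one multiply-accumulate is commonly counted as two floating-point operations (one multiply and one add), the MACs column can be converted to approximate FLOPs by doubling. The sketch below applies that rule of thumb to a few rows copied from the table. It is an approximation, and the helper name is my own.

```python
# MACs in units of 1e9, copied from the table above.
macs_g = {
    "resnet18": 1.82,
    "resnet50": 4.12,
    "densenet121": 2.88,
    "vgg16": 15.5,
}


def macs_to_flops(macs):
    """Approximate FLOPs, counting each multiply-accumulate as two operations."""
    return 2 * macs


for model, macs in macs_g.items():
    print(f"{model:12s} MACs={macs:6.2f} G  FLOPs~{macs_to_flops(macs):6.2f} G")
```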

4. AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

ViT is a vision network built entirely on attention and fully connected layers.

5. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Swin is a vision network built entirely on shifted-window attention and fully connected layers.

Model comparison for imaging through scattering media

All measurements below use a batch size of 2.

1. Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media

Input Resolution : $256\times 256$
Parameters : $21.8505\times 10^6$
FLOPs : $0.0577\times 10^9$
Throughput : $8.9\,image/s$
Inference time : $223.2022\,ms/batch$

2. High-generalization deep sparse pattern reconstruction: feature extraction of speckles using self-attention armed convolutional neural networks

SA-CNN

Input Resolution : $256\times 256$
Parameters : $13.9231\times 10^6$
FLOPs : $17.4204\times 10^9$
Throughput : $40.8\,image/s$
Inference time : $49.0446\,ms/batch$

SA-CNN-Single

Input Resolution : $256\times 256$
Parameters : $13.5972\times 10^6$
FLOPs : $8.9002\times 10^9$
Throughput : $44.4\,image/s$
Inference time : $45.0413\,ms/batch$

The -Single variant has only the one attention layer in the middle.

3. Our SpT UNet

SpT UNet

Input Resolution : $200\times 200\quad 224\times 224\quad 256\times 256$
Parameters : $6.6184\times 10^6$
FLOPs : $19.3602\times 10^9\quad 24.2856\times 10^9\quad 31.7197\times 10^9$
Throughput : $86.9\,image/s\quad 83.3\,image/s\quad 62.5\,image/s$
Inference time : $23.0214\,ms/batch\quad 24.0215\,ms/batch\quad 31.3427\,ms/batch$

SpT UNet-B

Input Resolution : $200\times 200\quad 224\times 224\quad 256\times 256$
Parameters : $2.4179\times 10^6$
FLOPs : $8.2659\times 10^9\quad 16.2256\times 10^9\quad 21.2318\times 10^9$
Throughput : $105.2\,image/s\quad 95.2\,image/s\quad 72.9\,image/s$
Inference time : $19.0217\,ms/batch\quad 21.0189\,ms/batch\quad 27.4584\,ms/batch$