Contents
- Model complexity
- Time complexity
- Space complexity
- Survey of deep learning models
- 0. Attention Is All You Need
- 1. Densely Connected Convolutional Networks
- 2. Deep Residual Learning for Image Recognition
- 3. https://github.com/sovrasov/flops-counter.pytorch
- 4. AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
- 5. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Model comparison in scattering imaging
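Model complexity
Model complexity usually refers to the amount of computation in the forward pass (which reflects the compute time the model needs) and the number of parameters (which reflects the memory the model needs).
Time complexity
Time complexity measures how efficiently a model runs, which in practice usually means how fast it runs.
- Computational complexity is measured in floating-point operations (FLOPs).
- Parallelism also affects running speed. It can be measured by the minimum number of sequential operations, the throughput (image/s), and the inference time (ms/batch).
Throughput and inference time depend on hardware performance as well as on the model.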
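Throughput and inference time are two views of the same measurement once the batch size is fixed. The sketch below shows the conversion, assuming inference time is the latency of one batch in milliseconds; the function name is hypothetical, and the two latencies are the ones reported for SA-CNN and SpT UNet later in this article (batch size 2).

```python
def throughput_images_per_s(batch_size: int, ms_per_batch: float) -> float:
    """Images processed per second, given the latency of one batch in milliseconds."""
    return batch_size / (ms_per_batch / 1000.0)


# Latencies reported later in this article, measured with a batch size of 2.
print(round(throughput_images_per_s(2, 49.0446), 1))  # SA-CNN at 256x256 -> 40.8
print(round(throughput_images_per_s(2, 23.0214), 1))  # SpT UNet at 200x200 -> 86.9
```

Both results agree with the throughput figures listed for those models, which suggests the reported inference times are milliseconds per batch.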
FLOPs
1. Convolution
![卷积](https://i-blog.csdnimg.cn/blog_migrate/7241934c3c626e1d115b7aa5f749201f.jpeg)
$$
FLOPs=(2\times C_{input}\cdot S_{filter_h}\cdot S_{filter_w}-1)^{*}\cdot C_{output}\cdot S_{input_h}\cdot S_{input_w}
$$
$$
\begin{aligned}
e.g.\quad &C_{input}=3\quad C_{output}=4\quad S_{filter_h}=S_{filter_w}=3\quad S_{input_h}=S_{input_w}=6\\
&FLOPs=(2\times3\times3^2-1)\times4\times6^2=7632
\end{aligned}
$$
* If the convolution has a bias, the −1 is not needed.
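The convolution formula can be checked with a few lines of Python. This is a minimal sketch with a hypothetical function name. It assumes the output feature map has the same spatial size as the input (stride 1 with "same" padding), which is what the formula implies by using the input size.

```python
def conv2d_flops(c_in: int, c_out: int, k_h: int, k_w: int,
                 out_h: int, out_w: int, bias: bool = False) -> int:
    """FLOPs of one 2D convolution, counting multiplications and additions separately."""
    # Each output value needs c_in*k_h*k_w multiplications and one fewer additions.
    # A bias contributes one more addition, which cancels the -1.
    flops_per_output = 2 * c_in * k_h * k_w - (0 if bias else 1)
    return flops_per_output * c_out * out_h * out_w


print(conv2d_flops(3, 4, 3, 3, 6, 6))             # 7632, the worked example
print(conv2d_flops(3, 4, 3, 3, 6, 6, bias=True))  # 7776, with a bias term
```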
2. Attention
![image-20220122113102360](https://i-blog.csdnimg.cn/blog_migrate/7783a59d9596464824ad7c4fc1a6ed50.png)
$$
FLOPs=\begin{cases}
2D_kND_x\;+\;2D_kN^2\;+\;1\\
3D^2N\;+\;2DN^2\;+\;1\quad if\quad D_x=D_k=D_v=Our\,D_{model}=D
\end{cases}
$$
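One way to read the leading terms of the second case is as a count of multiply-accumulate operations in single-head self-attention with $D_x=D_k=D_v=D$. The sketch below is an interpretation rather than the article's derivation: the function name is hypothetical, the trailing $+1$ term is not modeled, and the softmax is ignored.

```python
def self_attention_macs(n: int, d: int) -> dict:
    """Multiply-accumulate counts of the matrix products in single-head self-attention.

    n is the sequence length and d the shared width of the input, keys, and values.
    """
    return {
        "qkv_projections": 3 * n * d * d,  # three (n x d) @ (d x d) products
        "scores_qk_t": n * n * d,          # (n x d) @ (d x n)
        "weighted_sum_av": n * n * d,      # (n x n) @ (n x d)
    }


macs = self_attention_macs(n=196, d=768)
total = sum(macs.values())
# The total equals 3*D^2*N + 2*D*N^2, the leading terms of the second case.
print(total == 3 * 768**2 * 196 + 2 * 768 * 196**2)
```

Under this reading the $N^2$ terms dominate once the sequence length exceeds the model width, which is why attention cost grows quickly with input resolution.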
3. Fully connected
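Assume the fully connected block has three layers: input, hidden, and output. The input layer holds $D$ neurons for each of $N$ batch items, the hidden layer holds $4D$ neurons for each of $N$ batch items, and the output layer applies a nonlinear activation.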
$$
\begin{aligned}
FLOPs\;&=\;(D+D-1)^{*}\cdot 4D\cdot N\\
&=\;8D^2N-4DN
\end{aligned}
$$
* If the fully connected layer has a bias, the −1 is not needed.
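The same counting rule can be written as a small helper and compared with the closed form. This is a sketch with hypothetical function names; it covers only the input-to-hidden projection ($D \to 4D$) that the formula describes.

```python
def linear_flops(d_in: int, d_out: int, n: int, bias: bool = False) -> int:
    """FLOPs of a fully connected layer applied to n input vectors."""
    # Each output neuron needs d_in multiplications and d_in - 1 additions.
    # A bias adds one more addition, which cancels the -1.
    flops_per_output = 2 * d_in - (0 if bias else 1)
    return flops_per_output * d_out * n


def hidden_layer_flops(d: int, n: int, bias: bool = False) -> int:
    """FLOPs of the D -> 4D projection used in the formula above."""
    return linear_flops(d, 4 * d, n, bias)


d, n = 8, 2
print(hidden_layer_flops(d, n))       # 960
print(8 * d * d * n - 4 * d * n)      # 960, the closed form 8*D^2*N - 4*D*N
```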
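Space complexity
Space complexity measures how much memory a model occupies, which in practice usually determines whether the model can run at all.
- Number of parameters (Parameters)
- Numeric precision (Data bits)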
Parameters
$$
Parameters=Volume(Tensor_{Weight})
$$
Data bits
$$
Float32\quad or\quad Float64\quad\cdots
$$
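Taken together, the parameter count and the data width give a rough estimate of the memory needed for the weights alone. The sketch below uses hypothetical helper names and ignores activations, gradients, and optimizer state, so it is a lower bound rather than a full memory estimate.

```python
from math import prod

# Bytes per element for common floating-point formats.
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4, "float64": 8}


def parameter_count(weight_shapes) -> int:
    """Total number of elements over all weight tensors (the 'volume' of each tensor)."""
    return sum(prod(shape) for shape in weight_shapes)


def weight_memory_mib(num_params: int, dtype: str = "float32") -> float:
    """Memory taken by the weights alone, in MiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**20


# The 3x3 convolution from the FLOPs example: weight (4, 3, 3, 3) plus a bias (4,).
conv_shapes = [(4, 3, 3, 3), (4,)]
print(parameter_count(conv_shapes))  # 112
```

Switching from Float32 to Float64 doubles the weight memory without changing the parameter count, which is why both quantities are listed.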
Survey of deep learning models
0. Attention Is All You Need
Per-layer complexity, minimum number of sequential operations for different layer types and maximum path length
![image-20220121125955712](https://i-blog.csdnimg.cn/blog_migrate/3098959c3884383109a49d76e10693c1.png)
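$n$ is the sequence length, $d$ is the representation dimension, $k$ is the convolution kernel size, and $r$ is the neighborhood size in restricted self-attention.
This paper was the first to propose the Transformer, a natural language processing network built entirely on attention and fully connected layers. For a maximum path length of $O(x)$, a larger $x$ means that when information passes between nodes with a long-range dependency, the interaction is harder and more information is lost.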
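The asymptotic expressions from that table in the paper can be evaluated for concrete sizes to see where each layer type becomes expensive. The sketch below drops all constant factors, so the numbers are only meaningful relative to each other; the function name and the example sizes are hypothetical.

```python
import math


def per_layer_costs(n: int, d: int, k: int, r: int) -> dict:
    """Big-O terms for each layer type, evaluated with constants dropped."""
    return {
        "self_attention": {
            "complexity": n * n * d, "sequential_ops": 1, "max_path": 1},
        "recurrent": {
            "complexity": n * d * d, "sequential_ops": n, "max_path": n},
        "convolutional": {
            "complexity": k * n * d * d, "sequential_ops": 1,
            "max_path": math.log(n, k)},
        "restricted_self_attention": {
            "complexity": r * n * d, "sequential_ops": 1, "max_path": n / r},
    }


# Example sizes: a sentence of 100 tokens with a 512-dimensional representation.
for name, cost in per_layer_costs(n=100, d=512, k=3, r=10).items():
    print(name, cost)
```

With $n < d$, as in this example, the self-attention term $n^2 d$ is smaller than the recurrent term $n d^2$, while self-attention keeps both the sequential operations and the maximum path length constant.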
1. Densely Connected Convolutional Networks
![image-20220120211612502](https://i-blog.csdnimg.cn/blog_migrate/aba7b2ffd74b8da2a5dec24d9c9f622e.png)
![image-20220120204504919](https://i-blog.csdnimg.cn/blog_migrate/2cdafd21ac7e312f9ccda41b064c0d6c.png)
DenseNet-$L$ $(k=n)$ with the BottleNeck structure: $L$ is the model depth, that is, the number of learnable layers (convolutional and fully connected). $k$ is the number of feature channels added each time the input feature passes through one Dense Layer inside a Dense Block. After each Dense Block, the Transition Layer that follows halves the number of feature channels.
“If a dense block contains $m$ feature-maps, we let the following transition layer generate $\lfloor\theta m\rfloor$ output feature-maps, where $0<\theta\le 1$ is referred to as the compression factor.”
“We refer the DenseNet with $\theta<1$ as DenseNet-C, and we set $\theta=0.5$ in our experiment. When both the bottleneck and transition layers with $\theta<1$ are used, we refer to our model as DenseNet-BC.”
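The channel bookkeeping described above can be traced with a short script. This is a sketch with a hypothetical function name. The example values (growth rate 32, blocks of 6, 12, 24, and 16 layers, 64 initial channels) are assumed here as the usual DenseNet-121 configuration and do not come from this article.

```python
import math


def densenet_channel_trace(growth_rate, block_layers, init_channels, theta=0.5):
    """Channel count after each dense block and each transition layer."""
    channels = init_channels
    trace = []
    for i, num_layers in enumerate(block_layers):
        # Every dense layer appends growth_rate (k) new feature maps.
        channels += num_layers * growth_rate
        trace.append(("dense_block", channels))
        # A transition layer follows every block except the last one.
        if i != len(block_layers) - 1:
            channels = math.floor(theta * channels)
            trace.append(("transition", channels))
    return trace


for stage, channels in densenet_channel_trace(32, (6, 12, 24, 16), 64):
    print(stage, channels)
```

With $\theta=0.5$ each transition halves the channel count, so the width grows roughly linearly across blocks instead of accumulating without bound.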
2. Deep Residual Learning for Image Recognition
![image-20220121203650386](https://i-blog.csdnimg.cn/blog_migrate/4e6b307671bafec5758e2405222533d6.png)
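The values labeled FLOPs in this table are actually MACs (multiply-accumulate operations), so the true FLOPs are about twice the listed values. In "L-layer", L is the number of learnable layers.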
![image-20220120205312977](https://i-blog.csdnimg.cn/blog_migrate/c7ea434fd584227e590ff930d2d22b47.png)
With the bottleneck structure the number of parameters drops noticeably, which made networks with more than 1000 layers feasible.
3. https://github.com/sovrasov/flops-counter.pytorch
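Parameter counts and multiply-accumulate operations (MACs) of mainstream convolutional models, computed with the external flops-counter library, together with the corresponding Top-1 and Top-5 accuracy.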
Model | Input Resolution | Params(M) | MACs(G) | Acc@1 | Acc@5 |
---|---|---|---|---|---|
alexnet | 224x224 | 61.1 | 0.72 | 56.432 | 79.194 |
densenet121 | 224x224 | 7.98 | 2.88 | 74.646 | 92.136 |
densenet161 | 224x224 | 28.68 | 7.82 | 77.56 | 93.798 |
densenet169 | 224x224 | 14.15 | 3.42 | 76.026 | 92.992 |
densenet201 | 224x224 | 20.01 | 4.37 | 77.152 | 93.548 |
dpn107 | 224x224 | 86.92 | 18.42 | 79.746 | 94.684 |
dpn131 | 224x224 | 79.25 | 16.13 | 79.432 | 94.574 |
dpn68 | 224x224 | 12.61 | 2.36 | 75.868 | 92.774 |
dpn68b | 224x224 | 12.61 | 2.36 | 77.034 | 93.59 |
dpn92 | 224x224 | 37.67 | 6.56 | 79.4 | 94.62 |
dpn98 | 224x224 | 61.57 | 11.76 | 79.224 | 94.488 |
inceptionv3 | 299x299 | 27.16 | 5.73 | 77.294 | 93.454 |
inceptionv4 | 299x299 | 42.68 | 12.31 | 80.062 | 94.926 |
resnet101 | 224x224 | 44.55 | 7.85 | 77.438 | 93.672 |
resnet152 | 224x224 | 60.19 | 11.58 | 78.428 | 94.11 |
resnet18 | 224x224 | 11.69 | 1.82 | 70.142 | 89.274 |
resnet34 | 224x224 | 21.8 | 3.68 | 73.554 | 91.456 |
resnet50 | 224x224 | 25.56 | 4.12 | 76.002 | 92.98 |
se_resnet101 | 224x224 | 49.33 | 7.63 | 78.396 | 94.258 |
se_resnet152 | 224x224 | 66.82 | 11.37 | 78.658 | 94.374 |
se_resnet50 | 224x224 | 28.09 | 3.9 | 77.636 | 93.752 |
vgg11 | 224x224 | 132.86 | 7.63 | 68.97 | 88.746 |
vgg13 | 224x224 | 133.05 | 11.34 | 69.662 | 89.264 |
vgg16 | 224x224 | 138.36 | 15.5 | 71.636 | 90.354 |
vgg19 | 224x224 | 143.67 | 19.67 | 72.08 | 90.822 |
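Because the table reports MACs, comparing it with FLOPs figures needs a conversion. The sketch below uses the common approximation that one multiply-accumulate equals two floating-point operations; the three rows are copied from the table, and the helper name is hypothetical.

```python
# MACs in units of 10^9, copied from three rows of the table above.
MACS_G = {"resnet18": 1.82, "resnet50": 4.12, "vgg16": 15.5}


def macs_to_flops(macs: float) -> float:
    """Approximate FLOPs from MACs: one multiply-accumulate is one multiply plus one add."""
    return 2.0 * macs


for model, macs in MACS_G.items():
    print(f"{model}: {macs} GMACs is about {macs_to_flops(macs):.2f} GFLOPs")
```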
4. AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
![image-20220122152835106](https://i-blog.csdnimg.cn/blog_migrate/a204697d998d550c40138b8ecc7c1134.png)
![image-20220122154046936](https://i-blog.csdnimg.cn/blog_migrate/7a249ede0e4352a01e3af201b0a7320c.png)
ViT is a vision network built entirely on attention and fully connected layers.
5. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
![image-20220122152002688](https://i-blog.csdnimg.cn/blog_migrate/f78e4ebe649f479e523d3b7baa05cf82.png)
![image-20220122153651890](https://i-blog.csdnimg.cn/blog_migrate/451843167dceda51912df70cbd41aaec.png)
Swin is a vision network built entirely on shifted-window attention and fully connected layers.
Model comparison in scattering imaging
All measurements below use a batch size of 2.
1. Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media
![image-20220122144343659](https://i-blog.csdnimg.cn/blog_migrate/fb0bc879a57f7018498645d650d3ae4f.png)
- Input Resolution: $256\times 256$
- Parameters: $21.8505\times 10^6$
- FLOPs: $0.0577\times 10^9$
- Throughput: $8.9\,image/s$
- Inference time: $223.2022\,ms/batch$
2. High-generalization deep sparse pattern reconstruction: feature extraction of speckles using self-attention armed convolutional neural networks
![image-20220122144525611](https://i-blog.csdnimg.cn/blog_migrate/a008a5b5ee6272513041e6aad8619ff4.png)
SA-CNN
- Input Resolution: $256\times 256$
- Parameters: $13.9231\times 10^6$
- FLOPs: $17.4204\times 10^9$
- Throughput: $40.8\,image/s$
- Inference time: $49.0446\,ms/batch$
SA-CNN-Single
- Input Resolution: $256\times 256$
- Parameters: $13.5972\times 10^6$
- FLOPs: $8.9002\times 10^9$
- Throughput: $44.4\,image/s$
- Inference time: $45.0413\,ms/batch$
The -Single variant keeps only the single attention layer in the middle.
3. Our SpT UNet
![Xnip2022-01-22_15-06-45](https://i-blog.csdnimg.cn/blog_migrate/574c857e69ff4f891a71d3f51051f7fa.png)
SpT UNet
- Input Resolution: $200\times 200\quad 224\times 224\quad 256\times 256$
- Parameters: $6.6184\times 10^6$
- FLOPs: $19.3602\times 10^9\quad 24.2856\times 10^9\quad 31.7197\times 10^9$
- Throughput: $86.9\,image/s\quad 83.3\,image/s\quad 62.5\,image/s$
- Inference time: $23.0214\,ms/batch\quad 24.0215\,ms/batch\quad 31.3427\,ms/batch$
SpT UNet-B
- Input Resolution: $200\times 200\quad 224\times 224\quad 256\times 256$
- Parameters: $2.4179\times 10^6$
- FLOPs: $8.2659\times 10^9\quad 16.2256\times 10^9\quad 21.2318\times 10^9$
- Throughput: $105.2\,image/s\quad 95.2\,image/s\quad 72.9\,image/s$
- Inference time: $19.0217\,ms/batch\quad 21.0189\,ms/batch\quad 27.4584\,ms/batch$
In the -B variant, the puffed downsampling and leaky upsampling use a Bottleneck structure.