MobileNet v1
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Contents
Section 1 & 2
There are two common approaches to building small and efficient neural networks:
- compressing pretrained networks
- training small networks directly
MobileNet is a class of networks built for resource-restricted settings where latency and model size are constrained.
Section 3
Depthwise Separable Convolution
Assume we have an input feature map $F$ of shape $D_F \times D_F \times M$, an expected output feature map $G$ of shape $D_G \times D_G \times N$, and a convolution kernel $K$ of shape $D_K \times D_K \times M$.
For a standard convolution, to achieve this result we need $N$ kernels $K_1, K_2, \dots, K_N$, and the computation cost is:

$$\boxed{D_K \times D_K \times M} \times \boxed{D_G \times D_G} \times N$$

Since $D_G \le D_F$, the maximum computation cost is:

$$\boxed{D_K \times D_K \times M} \times \boxed{D_F \times D_F} \times N$$
| name | shape |
| --- | --- |
| input $F$ | $D_F \times D_F \times M$ |
| output $G$ | $D_G \times D_G \times N$ |
| kernels $K_1, \dots, K_N$ | $\boxed{D_K \times D_K \times M} \times N$ |
| maximum computation cost | $\boxed{D_K \times D_K \times M} \times \boxed{D_F \times D_F} \times N$ |
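The standard-convolution cost above can be computed directly. A minimal sketch (the function name and example sizes are illustrative, not from the paper):

```python
def standard_conv_cost(d_k, d_f, m, n):
    """Multiply count of a standard convolution:
    N kernels of shape D_K x D_K x M applied at D_F x D_F positions."""
    return d_k * d_k * m * d_f * d_f * n

# Example: 3x3 kernels on a 14x14 feature map, 512 -> 512 channels
print(standard_conv_cost(3, 14, 512, 512))  # 462422016
```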
For depthwise separable convolution, the kernel is separated into two parts: $M$ depthwise convolution kernels of shape $D_K \times D_K \times 1$ and $N$ pointwise convolution kernels of shape $1 \times 1 \times M$. Thus, the computation cost is:
$$\boxed{D_K \times D_K \times 1} \times \boxed{D_G \times D_G} \times M + \boxed{1 \times 1 \times M} \times \boxed{D_G \times D_G} \times N$$

Since $D_G \le D_F$, the maximum computation cost is:

$$\boxed{D_K \times D_K \times 1} \times \boxed{D_F \times D_F} \times M + \boxed{1 \times 1 \times M} \times \boxed{D_F \times D_F} \times N$$
which is:

$$D_K \times D_K \times D_F \times D_F \times M + D_F \times D_F \times N \times M$$
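Dividing this by the standard-convolution cost gives the paper's reduction factor $\frac{1}{N} + \frac{1}{D_K^2}$. A sketch of both costs (function names and example sizes are illustrative assumptions):

```python
def standard_conv_cost(d_k, d_f, m, n):
    return d_k * d_k * m * d_f * d_f * n

def depthwise_separable_cost(d_k, d_f, m, n):
    depthwise = d_k * d_k * d_f * d_f * m  # M filters of D_K x D_K x 1
    pointwise = m * d_f * d_f * n          # N filters of 1 x 1 x M
    return depthwise + pointwise

d_k, d_f, m, n = 3, 14, 512, 512
ratio = depthwise_separable_cost(d_k, d_f, m, n) / standard_conv_cost(d_k, d_f, m, n)
print(ratio)  # ~0.113, matching 1/N + 1/D_K**2
```

For 3×3 kernels this is roughly an 8–9× reduction in multiplications, which is the core saving MobileNet builds on.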
| name | shape |
| --- | --- |
| input $F$ | $D_F \times D_F \times M$ |
| output $G$ | $D_G \times D_G \times N$ |
| depthwise kernels | $\boxed{D_K \times D_K \times 1} \times M$ |
| pointwise kernels | $\boxed{1 \times 1 \times M} \times N$ |
| maximum computation cost — depthwise part | $\boxed{D_K \times D_K \times 1} \times \boxed{D_F \times D_F} \times M$ |
| maximum computation cost — pointwise part | $\boxed{1 \times 1 \times M} \times \boxed{D_F \times D_F} \times N$ |
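As a sketch of the kernel split above, a minimal NumPy forward pass (stride 1, no padding, so $D_G = D_F - D_K + 1$; all names and sizes here are illustrative, not from the paper):

```python
import numpy as np

def depthwise_separable_conv(f, k_dw, k_pw):
    """f: (D_F, D_F, M) input; k_dw: (D_K, D_K, M) holds M depthwise
    filters (one D_K x D_K x 1 filter per channel); k_pw: (M, N) holds
    N pointwise 1 x 1 x M filters. Stride 1, 'valid' padding."""
    d_f, _, m = f.shape
    d_k = k_dw.shape[0]
    d_g = d_f - d_k + 1
    # Depthwise step: filter each input channel independently
    g_dw = np.empty((d_g, d_g, m))
    for i in range(d_g):
        for j in range(d_g):
            patch = f[i:i + d_k, j:j + d_k, :]            # (D_K, D_K, M)
            g_dw[i, j, :] = (patch * k_dw).sum(axis=(0, 1))
    # Pointwise step: 1x1 convolution mixes channels
    return g_dw @ k_pw                                    # (D_G, D_G, N)

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8, 4))
out = depthwise_separable_conv(f, rng.standard_normal((3, 3, 4)),
                               rng.standard_normal((4, 6)))
print(out.shape)  # (6, 6, 6)
```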
Network Structure
Shrinking hyper-parameters
- Width multiplier $\alpha$
  This hyper-parameter thins the number of input channels to $M' = \alpha M$ and the number of output channels to $N' = \alpha N$. Thus, the maximum computation cost becomes:
  $$D_K \times D_K \times D_F \times D_F \times \alpha M + D_F \times D_F \times \alpha N \times \alpha M$$
  which is:
  $$\alpha \times D_K \times D_K \times D_F \times D_F \times M + \alpha^{2} \times D_F \times D_F \times N \times M$$
- Resolution multiplier $\rho$
  This hyper-parameter reduces the input resolution to $\rho D_F \times \rho D_F$ and the output resolution to $\rho D_G \times \rho D_G$ (the kernel size $D_K$ is unchanged), so the maximum computation cost becomes:
  $$D_K \times D_K \times \rho D_F \times \rho D_F \times M + \rho D_F \times \rho D_F \times N \times M$$
  which is:
  $$\rho^{2} \times D_K \times D_K \times D_F \times D_F \times M + \rho^{2} \times D_F \times D_F \times N \times M$$
- When combining $\alpha$ and $\rho$, we get the final maximum computation cost:
  $$D_K \times D_K \times \rho D_F \times \rho D_F \times \alpha M + \rho D_F \times \rho D_F \times \alpha N \times \alpha M$$
  where $0 < \alpha \leq 1$ and $0 < \rho \leq 1$.
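The effect of the two multipliers on the depthwise-separable cost can be sketched as follows (function name and example sizes are illustrative assumptions):

```python
def mobilenet_cost(d_k, d_f, m, n, alpha=1.0, rho=1.0):
    """Depthwise-separable cost with width multiplier alpha
    (thins channels) and resolution multiplier rho (shrinks D_F)."""
    m_s, n_s = alpha * m, alpha * n  # thinned channel counts
    d_f_s = rho * d_f                # reduced resolution
    return d_k * d_k * d_f_s * d_f_s * m_s + d_f_s * d_f_s * m_s * n_s

base = mobilenet_cost(3, 14, 512, 512)
# rho scales both terms by rho**2, so the ratio is exactly 0.25
print(mobilenet_cost(3, 14, 512, 512, rho=0.5) / base)
# alpha scales the terms by alpha and alpha**2, so the ratio
# lands between 0.25 and 0.5
print(mobilenet_cost(3, 14, 512, 512, alpha=0.5) / base)
```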
Section 4 & 5
From the results shown in Table 4, MobileNet does reduce the number of parameters greatly with only a small sacrifice in accuracy. The hyper-parameters $\alpha$ and $\rho$ perform well when set larger than 0.5. According to the paper, MobileNet also performs well on other tasks.