Introduction
A standard convolution layer is described by the triple $(channel_{input}, channel_{output}, filter\_size)$. For instance, with
$Input\text{-}channel: 10$
$Output\text{-}channel: 20$
$Filter\text{-}size: 7$
we obtain
$$Parameters = (7\times7\times10+1)\times20 = 9820$$
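The count can be reproduced with a few lines of Python. This is a minimal sketch; the function name `standard_conv_params` and its arguments are illustrative and do not come from any library. It assumes a square filter and one bias term per output channel, which is what the `+1` in the formula represents.

```python
def standard_conv_params(in_channels, out_channels, filter_size, bias=True):
    """Count learnable parameters of a standard 2D convolution (square filter)."""
    # Each output channel owns one filter spanning all input channels.
    weights_per_filter = filter_size * filter_size * in_channels
    bias_per_filter = 1 if bias else 0
    return (weights_per_filter + bias_per_filter) * out_channels


print(standard_conv_params(10, 20, 7))  # 9820
```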
A standard convolution has so many parameters that the model becomes more prone to over-fitting. Depth-wise convolution and depth-wise separable convolution were proposed to mitigate this.
Depth-wise convolution
Each channel of the filter is applied to exactly one input channel.
To get the same result from a standard convolution, pick one channel, set every filter element outside that channel to zero, and then convolve; repeating this for each channel yields the depth-wise outputs.
Although the number of parameters stays the same, depth-wise convolution produces 3 output channels from a single 3-channel filter, whereas standard convolution produces only 1 output channel from the same filter.
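The equivalence described above can be checked numerically. The following is a small pure-Python sketch, not an optimized implementation; all function names (`conv2d_single`, `standard_conv`, `depthwise_conv`, `mask_filter`) and the toy input are made up for illustration, and it assumes stride 1, no padding and no bias.

```python
def conv2d_single(image, kernel):
    """Valid cross-correlation of one 2D channel with one square 2D kernel."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [sum(image[i + a][j + b] * kernel[a][b]
             for a in range(k) for b in range(k))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def standard_conv(x, filt):
    """One multi-channel filter gives one output channel (sum over channels)."""
    per_channel = [conv2d_single(xc, fc) for xc, fc in zip(x, filt)]
    h, w = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(pc[i][j] for pc in per_channel) for j in range(w)]
            for i in range(h)]


def depthwise_conv(x, filt):
    """Same filter, but each filter channel only sees its own input channel."""
    return [conv2d_single(xc, fc) for xc, fc in zip(x, filt)]


def mask_filter(filt, keep):
    """Zero out every filter channel except channel `keep`."""
    return [fc if c == keep else [[0] * len(row) for row in fc]
            for c, fc in enumerate(filt)]


# Toy data: 3 input channels of size 4x4 and one 3-channel 2x2 filter.
x = [[[c + i + j for j in range(4)] for i in range(4)] for c in range(3)]
filt = [[[1, 0], [0, -1]], [[2, 1], [1, 0]], [[0, 1], [1, 1]]]

dw = depthwise_conv(x, filt)
for c in range(3):
    # Depth-wise channel c equals a standard convolution with a masked filter.
    assert dw[c] == standard_conv(x, mask_filter(filt, c))
print(len(dw))  # 3 output channels from a single 3-channel filter
```

The same filter fed to `standard_conv` returns a single 2D map, which is the difference in output channels noted above.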
Depth-wise separable convolution
We first perform a depth-wise convolution over the spatial dimensions (height and width), and then apply a $1\times1$ convolution across the depth dimension, so that we can produce any number of output channels we want.
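The two stages can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions (stride 1, no padding, no bias); the names `depthwise`, `pointwise` and `depthwise_separable` and the example weights are hypothetical, chosen only to show that the $1\times1$ stage decides the number of output channels.

```python
def depthwise(x, dw_kernels):
    """Stage 1: filter each input channel with its own k x k kernel."""
    out = []
    for channel, kernel in zip(x, dw_kernels):
        k = len(kernel)
        h, w = len(channel) - k + 1, len(channel[0]) - k + 1
        out.append([
            [sum(channel[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w)]
            for i in range(h)
        ])
    return out


def pointwise(x, pw_weights):
    """Stage 2: 1x1 convolution, a weighted sum over channels at each pixel."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in pw_weights  # one row of weights per output channel
    ]


def depthwise_separable(x, dw_kernels, pw_weights):
    return pointwise(depthwise(x, dw_kernels), pw_weights)


x = [[[1] * 5 for _ in range(5)] for _ in range(3)]           # 3 channels, 5x5
dw_kernels = [[[1] * 3 for _ in range(3)] for _ in range(3)]  # one 3x3 kernel per channel
pw_weights = [[1, 0, 0], [0, 1, 1], [1, 1, 1], [1, -1, 0], [2, 0, 0]]  # 5 outputs x 3 inputs

y = depthwise_separable(x, dw_kernels, pw_weights)
print(len(y), len(y[0]), len(y[0][0]))  # 5 3 3
```

Changing the number of rows in `pw_weights` changes the number of output channels without touching the depth-wise kernels.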
Parameter count:
Settings:
$Input\text{-}channel: 3$
$Output\text{-}channel: 3$
$Filter\text{-}size: 3$
Standard convolution:
$$Parameters = (3\times3\times3+1)\times3 = 84$$
Depth-wise separable convolution:
$$Parameters = (3\times3+1)\times3 + 3\times3 = 39$$
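Both counts can be checked with a short script. This is a sketch with illustrative function names. It follows the formulas as written here: the depth-wise stage has one bias per input channel, and the $1\times1$ stage is counted without bias (an assumption read off the $3\times3$ term).

```python
def standard_conv_params(c_in, c_out, k):
    """(k*k*c_in weights + 1 bias) per output channel."""
    return (k * k * c_in + 1) * c_out


def separable_conv_params(c_in, c_out, k):
    """Depth-wise stage plus 1x1 point-wise stage."""
    depthwise = (k * k + 1) * c_in  # one k x k kernel and one bias per input channel
    pointwise = c_in * c_out        # 1x1 weights only; bias omitted as in the formula
    return depthwise + pointwise


print(standard_conv_params(3, 3, 3))   # 84
print(separable_conv_params(3, 3, 3))  # 39
```

For the larger example from the introduction (10 input channels, 20 output channels, filter size 7), the same functions give 9820 versus 700, so the saving grows with the channel counts and filter size.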
Having too many parameters lets the model memorize the training data rather than learn from it, which leads to over-fitting. Depth-wise separable convolution reduces this risk by cutting the parameter count.