Method
Progressive learning scheme
The proposed method learns the architecture distribution by sampling from a Dirichlet distribution, which already injects a certain amount of stochasticity. If we directly apply the method together with the partial channel connection, the accuracy of the final architecture decreases dramatically.
To address this, we propose to gradually increase the fraction of channels forwarded to the mixed operations while pruning the operation space based on the learnt distribution, as in the sketch below.
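As a rough illustration of the staged schedule (the fractions, the beta values, and the `train_stage` stub below are our own illustrative assumptions, not the source's implementation):

```python
# Hypothetical sketch of the progressive scheme; schedule values, the beta
# numbers, and the train_stage stub are illustrative assumptions only.
channel_fractions = [0.25, 0.5, 0.75, 1.0]   # fraction of channels fed to the mixed ops
ops_to_keep       = [8, 6, 5, 4]             # size of the operation space per stage

# Learnt Dirichlet concentrations per operation (dummy values; in the method
# these come from the optimized architecture distribution).
beta = {"sep_conv_3x3": 2.1, "sep_conv_5x5": 1.7, "dil_conv_3x3": 1.4,
        "dil_conv_5x5": 1.2, "max_pool_3x3": 1.1, "avg_pool_3x3": 0.9,
        "skip_connect": 1.6, "none": 0.5}
ops = list(beta)

def train_stage(fraction, ops):              # stub standing in for supernet training
    print(f"fraction={fraction:.2f}, ops={ops}")

for fraction, keep in zip(channel_fractions, ops_to_keep):
    # Prune: keep only the operations with the largest learnt concentration.
    ops = sorted(ops, key=lambda op: -beta[op])[:keep]
    train_stage(fraction, ops)
```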
In practice, since the fraction of channels fed into the mixed operations keeps growing, we must widen the convolution kernels and the channel dimension of the BatchNorm layers accordingly. To this end, a random mapping function similar to Net2Net is used to enlarge every convolution weight.
For example:
$$W_{old} \in \mathbb{R}^{out_o \times in_o \times h \times w}, \qquad W_{new} \in \mathbb{R}^{out_n \times in_n \times h \times w}$$
If we widen the input channels:
$$\begin{aligned} r &= in_n - in_o \\ index &= rand(0,\, in_o,\, size=(r,)) \\ W_{new} &= Concat(W_{old},\ W_{old}[:, index, :, :],\ dim=1) \end{aligned}$$
Similarly, we can obtain the corresponding $W_{new}$ for widening the output channels.
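A minimal PyTorch sketch of this widening, covering both directions (the function name and the example shapes are our own, not from the source):

```python
import torch

def widen_conv_weight(w_old: torch.Tensor, in_n: int, out_n: int) -> torch.Tensor:
    """Enlarge a conv weight of shape (out, in, h, w) by replicating
    randomly chosen channels, as in the random mapping above (Net2Net-style)."""
    out_o, in_o = w_old.shape[0], w_old.shape[1]

    # Widen the input channels: append r randomly chosen input slices (dim=1).
    r = in_n - in_o
    if r > 0:
        index = torch.randint(0, in_o, (r,))
        w_old = torch.cat([w_old, w_old[:, index, :, :]], dim=1)

    # Widen the output channels the same way along dim=0.
    r = out_n - out_o
    if r > 0:
        index = torch.randint(0, out_o, (r,))
        w_old = torch.cat([w_old, w_old[index, :, :, :]], dim=0)

    return w_old

# Example: grow a 16-in/16-out 3x3 kernel to 24 input and 24 output channels.
w_new = widen_conv_weight(torch.randn(16, 16, 3, 3), in_n=24, out_n=24)
assert w_new.shape == (24, 24, 3, 3)
```

The BatchNorm weights and running statistics can be enlarged with the same index trick along their single channel dimension.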
Experiment
Since the Dirichlet concentration $\beta$ must be positive, we apply the shifted exponential linear mapping $\beta = ELU(n) + 1$ and optimize over $n$, as in the sketch below.
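A minimal sketch of this mapping (the tensor size and the Dirichlet sampling usage are our assumptions; $ELU(n) + 1 > 0$ for all $n$, so the constraint is satisfied by construction):

```python
import torch
import torch.nn.functional as F

# n is the unconstrained parameter the optimizer actually updates.
n = torch.zeros(8, requires_grad=True)   # e.g. one entry per candidate operation

# Shifted ELU keeps the concentration strictly positive: ELU(n) + 1 > 0.
beta = F.elu(n) + 1.0

# beta is a valid Dirichlet concentration and stays differentiable w.r.t. n.
arch_weights = torch.distributions.Dirichlet(beta).rsample()
```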