Differentiable Architecture Search
1. Introduction
Before: discovering a state-of-the-art neural network architecture requires substantial effort from human experts.
Contributions:
1. We introduce a novel algorithm for differentiable network architecture search based on bilevel optimization. The proposed method, called DARTS, relaxes the search space to be continuous, so that the architecture can be optimized by gradient descent;
2. We achieve a remarkable efficiency improvement. Through extensive experiments on image classification and language modeling tasks, we show that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the state of the art on PTB;
The code of DARTS is available at https://github.com/quark0/darts.
2. Differentiable Architecture Search
2.1. Search Space
We search for a computation cell as the building block of the final architecture. The learned cell could either be stacked to form a convolutional network or connected to form a recurrent network.
cell: a cell is a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes;
node: $x^{(i)}$ is a latent representation (e.g., a feature map in a convolutional network);
directed edge: $(i,j)$ denotes the edge from node $i$ to node $j$;
operation: $o^{(i,j)}$ is an operation that transforms $x^{(i)}$.
Construction:
1. The cell has two input nodes and a single output node;
2. The input nodes are defined as the cell outputs in the previous two layers;
Each intermediate node is computed based on all of its predecessors:
\begin{equation} x^{(j)}=\sum_{i<j} o^{(i,j)}(x^{(i)}) \end{equation}
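As an illustrative sketch (the scalar inputs, operations, and helper names here are toy assumptions, not the authors' implementation), the node computation above can be expressed as:

```python
def compute_cell(inputs, ops, n_intermediate):
    """Compute a cell's nodes. inputs: the two input-node values;
    ops: dict mapping edge (i, j) to the callable o^{(i,j)}."""
    nodes = list(inputs)  # nodes 0 and 1 are the cell inputs
    for j in range(2, 2 + n_intermediate):
        # x^{(j)} = sum over predecessors i < j of o^{(i,j)}(x^{(i)})
        nodes.append(sum(ops[(i, j)](nodes[i]) for i in range(j)))
    return nodes

# toy scalar example with an identity op and a doubling op
ops = {(0, 2): lambda x: x, (1, 2): lambda x: 2 * x}
print(compute_cell([1.0, 3.0], ops, n_intermediate=1)[2])  # 1.0 + 2*3.0 = 7.0
```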
2.2. Continuous Relaxation and Optimization
2.2.1 Notation
- $\mathcal{O}$: a set of candidate operations
- $o(\cdot)$: some function applied to $x^{(i)}$
2.2.2 The mixed operation is defined as:
\begin{equation} \bar{o}^{(i,j)}(x)=\sum_{o\in \mathcal{O}}\frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'\in \mathcal{O}}\exp(\alpha_{o'}^{(i,j)})}\,o(x) \end{equation}
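To make the softmax mixing concrete, here is a minimal sketch (the candidate operations and scalar input are toy assumptions, not the paper's search space):

```python
import math

def mixed_op(x, alphas, ops):
    """bar{o}(x): a softmax over the alpha weights mixes the candidate ops."""
    exps = [math.exp(a) for a in alphas]
    total = sum(exps)
    return sum(e / total * op(x) for e, op in zip(exps, ops))

# two candidate ops, e.g. a "zero" op and an identity (skip) op
ops = [lambda x: 0.0, lambda x: x]
print(mixed_op(4.0, [0.0, 0.0], ops))  # equal weights: 0.5*0.0 + 0.5*4.0 = 2.0
```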
1. The operation mixing weights for a pair of nodes $(i,j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$.
2. The task of architecture search then reduces to learning a set of continuous variables $\alpha=\{\alpha^{(i,j)}\}$.
3. At the end of search, a discrete architecture can be obtained by replacing each mixed operation $\bar{o}^{(i,j)}$ with the most likely operation, i.e., $o^{(i,j)}=\mathrm{argmax}_{o\in \mathcal{O}}\,\alpha_o^{(i,j)}$.
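A minimal sketch of this discretization step (edge and operation names are illustrative; the paper additionally keeps only the strongest incoming edges per node, which is omitted here):

```python
def discretize(alpha, op_names):
    """Replace each mixed op with argmax_o alpha_o^{(i,j)}.
    alpha: dict mapping edge (i, j) to a list of |O| architecture weights."""
    return {edge: op_names[max(range(len(a)), key=a.__getitem__)]
            for edge, a in alpha.items()}

alpha = {(0, 2): [0.1, 1.5, 0.3], (1, 2): [2.0, 0.0, 0.1]}
print(discretize(alpha, ["zero", "conv3x3", "skip"]))
# {(0, 2): 'conv3x3', (1, 2): 'zero'}
```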
After relaxation, our goal is to jointly learn the architecture $\alpha$ and the weights $w$.
- $L_{train}$: training loss;
- $L_{val}$: validation loss;
Both losses are determined not only by the architecture $\alpha$, but also by the weights $w$ in the network.
2.2.3 Goals
- $\alpha^*$: minimizes the validation loss $L_{val}(w^*,\alpha^*)$
- $w^*$: minimizes the training loss $L_{train}(w,\alpha^*)$
This is a bilevel optimization problem with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:
\begin{equation} \begin{aligned} &\min_{\alpha}\ L_{val}(w^*(\alpha),\alpha)\\ &\text{s.t.}\ w^*(\alpha)=\mathrm{argmin}_w\, L_{train}(w,\alpha) \end{aligned} \end{equation}
Algorithm: alternate between a gradient step on $\alpha$ (using the validation loss) and a gradient step on $w$ (using the training loss) until convergence.
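The alternation can be sketched as follows (a toy first-order variant with scalar quadratic losses; the loss functions and learning rates are illustrative assumptions, not the paper's setup):

```python
def darts_search(w, alpha, grad_w_train, grad_a_val, lr_w, lr_a, steps):
    """Alternate: update alpha by descending the validation loss,
    then update w by descending the training loss."""
    for _ in range(steps):
        alpha = alpha - lr_a * grad_a_val(w, alpha)
        w = w - lr_w * grad_w_train(w, alpha)
    return w, alpha

# toy losses: L_train(w, a) = (w - a)^2, L_val(w, a) = (a - 0.5)^2
g_w = lambda w, a: 2 * (w - a)
g_a = lambda w, a: 2 * (a - 0.5)
w, alpha = darts_search(0.0, 0.0, g_w, g_a, lr_w=0.1, lr_a=0.1, steps=200)
print(round(alpha, 3), round(w, 3))  # both converge near 0.5
```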
2.3 Approximate Architecture Gradient
Evaluating the architecture gradient exactly can be prohibitive due to the expensive inner optimization, so we propose a simple approximation scheme:
\begin{equation} \nabla_{\alpha}L_{val}(w^*(\alpha),\alpha)\approx\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha) \end{equation}
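A quick numerical check of this approximation on a toy problem (all losses here are illustrative assumptions): with $L_{train}(w,\alpha)=(w-\alpha)^2$ we have $w^*(\alpha)=\alpha$, and with $L_{val}(w,\alpha)=w^2$ the exact gradient is $2\alpha$. A single inner step with $\xi=0.5$ lands exactly on $w^*(\alpha)$, so the approximation is exact in this case:

```python
def approx_grad(w, alpha, xi, eps=1e-5):
    """Central-difference estimate in alpha of
    L_val(w - xi * dL_train/dw, alpha) for the toy losses above."""
    def surrogate(a):
        w_prime = w - xi * 2 * (w - a)  # one SGD step on L_train = (w - a)^2
        return w_prime ** 2             # L_val = w^2
    return (surrogate(alpha + eps) - surrogate(alpha - eps)) / (2 * eps)

w, alpha = 5.0, 2.0
print(approx_grad(w, alpha, xi=0.5))  # matches the exact gradient 2*alpha = 4.0
```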
Meaning:
- $w$: the current weights maintained by the algorithm;
- $\xi$: the learning rate for a step of inner optimization;
- if $w$ is already a local optimum of the inner optimization, then $\nabla_w L_{train}(w,\alpha)=0$, so $w=w^*(\alpha)$ and $w-\xi\nabla_w L_{train}(w,\alpha)=w^*(\alpha)$, i.e., the approximation becomes exact;
Idea: approximate $w^*$ by adapting $w$ using only a single training step, without solving the inner optimization completely by training until convergence.
While we are not currently aware of convergence guarantees for our optimization algorithm, in practice it is able to reach a fixed point with a suitable choice of $\xi$.
citation: https://zhuanlan.zhihu.com/p/156832334
2.3.1 Procedure
- First, $\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha)$ can be rewritten as $\nabla_{\alpha}f(g_1(\alpha),g_2(\alpha))$, where
  - $f(\cdot,\cdot)=L_{val}(\cdot,\cdot)$
  - $g_1(\alpha)=w-\xi\nabla_w L_{train}(w,\alpha)$
  - $g_2(\alpha)=\alpha$
- Then, applying the chain rule:
\begin{equation} \begin{aligned} \nabla_{\alpha}f(g_1(\alpha),g_2(\alpha))&=\nabla_{\alpha}g_1(\alpha)\cdot D_1 f(g_1(\alpha),g_2(\alpha))+\nabla_{\alpha}g_2(\alpha)\cdot D_2 f(g_1(\alpha),g_2(\alpha))\\ \nabla_{\alpha}g_1(\alpha)&=-\xi\nabla^2_{\alpha,w}L_{train}(w,\alpha)\\ \nabla_{\alpha}g_2(\alpha)&=1\\ D_1 f(g_1(\alpha),g_2(\alpha))&=\nabla_{w'}L_{val}(w',\alpha)\\ D_2 f(g_1(\alpha),g_2(\alpha))&=\nabla_{\alpha}L_{val}(w',\alpha) \end{aligned} \end{equation}
where $w'=w-\xi\nabla_w L_{train}(w,\alpha)$. Combining the terms,
\begin{equation} \nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha)=\nabla_{\alpha}L_{val}(w',\alpha)-\xi\nabla^2_{\alpha,w}L_{train}(w,\alpha)\cdot\nabla_{w'}L_{val}(w',\alpha) \end{equation}
- Finally, the expensive second-order term is approximated using finite differences:
\begin{equation} \nabla^2_{\alpha,w}L_{train}(w,\alpha)\cdot\nabla_{w'}L_{val}(w',\alpha)\approx\frac{\nabla_{\alpha}L_{train}(w^+,\alpha)-\nabla_{\alpha}L_{train}(w^-,\alpha)}{2\epsilon} \end{equation}
where $w^{\pm}=w\pm\epsilon\nabla_{w'}L_{val}(w',\alpha)$ and $\epsilon=0.01/\|\nabla_{w'}L_{val}(w',\alpha)\|_2$. This is the central-difference formula
\begin{equation} f'(x_0)\cdot A\approx\frac{f(x_0+hA)-f(x_0-hA)}{2h} \end{equation}
applied with $f=\nabla_{\alpha}L_{train}(\cdot,\alpha)$ and $A=\nabla_{w'}L_{val}(w',\alpha)$.
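The finite-difference trick can be checked on a toy loss (hypothetical, chosen so the mixed second derivative is known in closed form): with $L_{train}(w,\alpha)=w^2\alpha$ we have $\nabla_{\alpha}L_{train}=w^2$ and $\nabla^2_{\alpha,w}L_{train}=2w$, so the product with a vector $v$ should be $2wv$:

```python
def grad_alpha_train(w):
    # gradient of the toy loss L_train(w, a) = w^2 * a with respect to a
    return w * w

def second_order_term(w, v, eps):
    # central-difference estimate of  (d^2 L_train / da dw) * v
    return (grad_alpha_train(w + eps * v)
            - grad_alpha_train(w - eps * v)) / (2 * eps)

print(second_order_term(3.0, 2.0, eps=1e-3))  # exact value 2 * w * v = 12.0
```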