DARTS: DIFFERENTIABLE ARCHITECTURE SEARCH

1. Introduction

Before: discovering state-of-the-art neural network architectures requires substantial effort from human experts.
Contributions:
1. We introduce a novel algorithm for differentiable network architecture search based on bilevel optimization. We propose an efficient architecture search method called DARTS that relaxes the search space to be continuous, so that the architecture can be optimized by gradient descent;
2. We achieve a remarkable efficiency improvement. Through extensive experiments on image classification and language modeling tasks, we show that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the state of the art on PTB;
The code of DARTS is available at https://github.com/quark0/darts.

2. Differentiable Architecture Search

2.1. Search Space

We search for a computation cell as the building block of the final architecture. The learned cell could either be stacked to form a convolutional network or connected to form a recurrent network.
cell: a cell is a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes;
node: $x^{(i)}$ is a latent representation (e.g., a feature map in a convolutional network);
directed edge: $(i,j)$;
operation: $o^{(i,j)}$ is an operation that transforms $x^{(i)}$.
Construction:
1. the cell has two input nodes and a single output node;
2. the input nodes are defined as the cell outputs of the previous two layers.
Each intermediate node is computed based on all of its predecessors:
$$x^{(j)}=\sum_{i<j} o^{(i,j)}\left(x^{(i)}\right)\tag{1}$$
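To make Eq. (1) concrete, below is a minimal sketch of such a DAG cell, assuming PyTorch; the name `ToyCell`, the choice of one 3×3 convolution per edge, and the concatenated output are illustrative assumptions, not the paper's exact cell:

```python
import torch
import torch.nn as nn

class ToyCell(nn.Module):
    """Minimal DAG cell: each intermediate node x^(j) is the sum of one
    operation o^(i,j) applied to every predecessor x^(i), as in Eq. (1)."""

    def __init__(self, channels, num_inter=4):
        super().__init__()
        self.num_inter = num_inter
        # One operation per directed edge (i, j) with i < j; nodes 0 and 1
        # are the cell's two input nodes.
        self.ops = nn.ModuleDict({
            f"{i}->{j}": nn.Conv2d(channels, channels, 3, padding=1)
            for j in range(2, 2 + num_inter) for i in range(j)
        })

    def forward(self, s0, s1):
        states = [s0, s1]                       # the two input nodes
        for j in range(2, 2 + self.num_inter):
            # x^(j) = sum_{i<j} o^(i,j)(x^(i))
            states.append(sum(self.ops[f"{i}->{j}"](states[i])
                              for i in range(j)))
        # Output node: concatenation of all intermediate nodes.
        return torch.cat(states[2:], dim=1)

cell = ToyCell(channels=16)
out = cell(torch.randn(1, 16, 8, 8), torch.randn(1, 16, 8, 8))  # -> (1, 64, 8, 8)
```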

2.2. Continuous Relaxation and Optimization

2.2.1 Meaning

  1. $O$: a set of candidate operations;
  2. $o(\cdot)$: a function to be applied to $x^{(i)}$.

2.2.2 The mixed operation is defined below:

$$\bar{o}^{(i,j)}(x)=\sum_{o\in O}\frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'\in O}\exp(\alpha_{o'}^{(i,j)})}\,o(x)\tag{2}$$
1. The operation mixing weights for a pair $(i,j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|O|$.
2. The task of architecture search then reduces to learning a set of continuous variables $\alpha=\{\alpha^{(i,j)}\}$.
3. At the end of search, a discrete architecture is obtained by replacing each mixed operation $\bar{o}^{(i,j)}$ with the most likely operation, i.e., $o^{(i,j)}=\mathrm{argmax}_{o\in O}\,\alpha_o^{(i,j)}$ (see the sketch below).
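A minimal sketch of the continuous relaxation of Eq. (2) and of the final discretization, again assuming PyTorch; `MixedOp` and its four-operation candidate set are illustrative stand-ins for the paper's operation pool:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Edge (i, j) as a softmax-weighted sum over all candidate operations,
    with alpha^(i,j) as the learnable architecture logits (Eq. 2)."""

    def __init__(self, channels):
        super().__init__()
        # Toy candidate set O; DARTS uses ~8 ops (convs, pooling, skip, zero).
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.Identity(),                       # skip connection
        ])
        # alpha^(i,j): one learnable logit per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        # exp(alpha_o) / sum_{o'} exp(alpha_{o'}) for each candidate o.
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

    def discretize(self):
        # After search: keep only the most likely operation (argmax).
        return self.candidates[int(self.alpha.argmax())]
```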
After relaxation, our goal is to jointly learn the architecture $\alpha$ and the weights $w$.
$L_{train}$: training loss;
$L_{val}$: validation loss.
Both losses are determined not only by the architecture $\alpha$, but also by the weights $w$ in the network.

2.2.3 Goals

  1. $\alpha^*$: minimizes the validation loss $L_{val}(w^*,\alpha^*)$;
  2. $w^*$: minimizes the training loss $L_{train}(w,\alpha^*)$.

It is a bilevel optimization problem with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:
$$\begin{aligned}&\min_{\alpha}\; L_{val}(w^*(\alpha),\alpha)\\&\text{s.t.}\ \ w^*(\alpha)=\mathrm{argmin}_w\, L_{train}(w,\alpha)\end{aligned}\tag{3}$$
Algorithm 1 (DARTS): while not converged, (1) update the architecture $\alpha$ by descending $\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha)$; (2) update the weights $w$ by descending $\nabla_w L_{train}(w,\alpha)$. Finally, derive the discrete architecture from the learned $\alpha$.
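A sketch of one iteration of this alternating scheme, shown in its simple first-order form ($\xi=0$) for brevity; `alpha_opt` and `w_opt` are assumed to be two optimizers, one over the architecture parameters $\alpha$ and one over the network weights $w$:

```python
import torch

def search_step(model, alpha_opt, w_opt, train_batch, val_batch, loss_fn):
    """One DARTS search iteration (first-order approximation, xi = 0)."""
    # Step 1: update architecture alpha by descending the validation loss.
    x_val, y_val = val_batch
    alpha_opt.zero_grad()
    w_opt.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    alpha_opt.step()                 # only the alpha group moves here

    # Step 2: update weights w by descending the training loss.
    x_tr, y_tr = train_batch
    alpha_opt.zero_grad()
    w_opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    w_opt.step()                     # only the w group moves here
```

The second-order variant replaces the gradient used in step 1 with the approximation derived in Sec. 2.3.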

2.3 Approximate Architecture Gradient

Evaluating the architecture gradient exactly can be prohibitive due to the expensive inner optimization, so we propose a simple approximation scheme:
$$\nabla_{\alpha}L_{val}(w^*(\alpha),\alpha)\approx\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha)\tag{4}$$
Meaning:

  1. $w$: the current weights maintained by the algorithm;
  2. $\xi$: the learning rate for one step of inner optimization;
  3. if $w$ is already a local optimum of the inner optimization, then $\nabla_w L_{train}(w,\alpha)=0$, so $w-\xi\nabla_w L_{train}(w,\alpha)=w=w^*(\alpha)$ and the approximation reduces to $\nabla_{\alpha}L_{val}(w,\alpha)$.
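To make the single-step surrogate concrete, a small autograd sketch of $w'=w-\xi\nabla_w L_{train}(w,\alpha)$; the function name and list-of-tensors convention are assumptions for illustration:

```python
import torch

def unrolled_weights(train_loss, weights, xi):
    """w' = w - xi * grad_w L_train(w, alpha): a single SGD step standing in
    for the fully optimized inner weights w*(alpha)."""
    grads = torch.autograd.grad(train_loss, weights, create_graph=True)
    # create_graph=True keeps w' differentiable w.r.t. alpha, which the
    # chain rule in Sec. 2.3.1 requires.
    return [w - xi * g for w, g in zip(weights, grads)]
```

With $\xi=0$ this returns $w$ unchanged, recovering the first-order approximation.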

Idea:
approximate $w^*$ by adapting $w$ using only a single training step, without solving the inner optimization completely by training until convergence.
While we are not currently aware of convergence guarantees for our optimization algorithm, in practice it is able to reach a fixed point with a suitable choice of $\xi$.
citation: https://zhuanlan.zhihu.com/p/156832334

2.3.1 Procedure

  1. Firstly, $\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha)$ can be written as $\nabla_{\alpha}f(g_1(\alpha),g_2(\alpha))$, where
$$f(\cdot,\cdot)=L_{val}(\cdot,\cdot),\qquad g_1(\alpha)=w-\xi\nabla_w L_{train}(w,\alpha),\qquad g_2(\alpha)=\alpha$$

  2. Applying the chain rule, with $w'=w-\xi\nabla_w L_{train}(w,\alpha)$:
$$\begin{aligned}
\nabla_{\alpha}f(g_1(\alpha),g_2(\alpha)) &= \nabla_{\alpha}g_1(\alpha)\cdot D_1 f(g_1(\alpha),g_2(\alpha)) + \nabla_{\alpha}g_2(\alpha)\cdot D_2 f(g_1(\alpha),g_2(\alpha))\\
\nabla_{\alpha}g_1(\alpha) &= -\xi\nabla^2_{\alpha,w}L_{train}(w,\alpha)\\
\nabla_{\alpha}g_2(\alpha) &= 1\\
D_1 f(g_1(\alpha),g_2(\alpha)) &= \nabla_{w'}L_{val}(w',\alpha)\\
D_2 f(g_1(\alpha),g_2(\alpha)) &= \nabla_{\alpha}L_{val}(w',\alpha)\\
\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha) &= \nabla_{\alpha}L_{val}(w',\alpha)-\xi\nabla^2_{\alpha,w}L_{train}(w,\alpha)\cdot\nabla_{w'}L_{val}(w',\alpha)
\end{aligned}$$
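The expansion above can be verified numerically with autograd; the following self-contained sketch uses toy quadratic losses (arbitrary stand-ins for the real $L_{train}$ and $L_{val}$) and compares differentiating through the unrolled step against the sum of the two chain-rule terms:

```python
import torch

def L_train(w, a): return ((w - a) ** 2).sum()        # toy training loss
def L_val(w, a):   return ((w * a - 1.0) ** 2).sum()  # toy validation loss

w = torch.randn(3, requires_grad=True)
a = torch.randn(3, requires_grad=True)
xi = 0.1

# Left side: differentiate through the unrolled step w' = w - xi * grad_w L_train.
gw = torch.autograd.grad(L_train(w, a), w, create_graph=True)[0]
w_prime = w - xi * gw
lhs = torch.autograd.grad(L_val(w_prime, a), a, retain_graph=True)[0]

# Right side: the two chain-rule terms, with w' treated as a constant.
w_p = w_prime.detach().requires_grad_(True)
loss_v = L_val(w_p, a)
direct, gv = torch.autograd.grad(loss_v, [a, w_p])    # D_2 term and grad_{w'} L_val
# Hessian-vector product grad^2_{a,w} L_train(w, a) . gv via double backprop.
hvp = torch.autograd.grad((gw * gv).sum(), a)[0]
rhs = direct - xi * hvp

print(torch.allclose(lhs, rhs, atol=1e-5))            # True: the expansion holds
```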
  3. The expensive matrix-vector product in the second term is approximated using a central finite difference:
$$\nabla^2_{\alpha,w}L_{train}(w,\alpha)\cdot\nabla_{w'}L_{val}(w',\alpha)\approx\frac{\nabla_{\alpha}L_{train}(w^+,\alpha)-\nabla_{\alpha}L_{train}(w^-,\alpha)}{2\epsilon}$$
where $w^{\pm}=w\pm\epsilon\,\nabla_{w'}L_{val}(w',\alpha)$ and $\epsilon=0.01/\|\nabla_{w'}L_{val}(w',\alpha)\|_2$. This is an instance of the central-difference identity $f'(x_0)\cdot A\approx\frac{f(x_0+hA)-f(x_0-hA)}{2h}$.
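A sketch of this finite-difference trick, checked against the exact Hessian-vector product from double backprop; the toy `L_train` and the random vector `gv` (standing in for $\nabla_{w'}L_{val}(w',\alpha)$) are illustrative:

```python
import torch

def L_train(w, a):                    # same toy training loss as above
    return ((w - a) ** 2).sum()

w = torch.randn(3, requires_grad=True)
a = torch.randn(3, requires_grad=True)
gv = torch.randn(3)                   # stands in for grad_{w'} L_val(w', alpha)

# Central difference with the paper's step size eps = 0.01 / ||gv||_2.
eps = 0.01 / gv.norm()
w_plus, w_minus = (w + eps * gv).detach(), (w - eps * gv).detach()
g_plus  = torch.autograd.grad(L_train(w_plus,  a), a)[0]
g_minus = torch.autograd.grad(L_train(w_minus, a), a)[0]
hvp_fd = (g_plus - g_minus) / (2 * eps)

# Exact Hessian-vector product via double backprop, for comparison.
gw = torch.autograd.grad(L_train(w, a), w, create_graph=True)[0]
hvp_exact = torch.autograd.grad((gw * gv).sum(), a)[0]
print(torch.allclose(hvp_fd, hvp_exact, atol=1e-3))   # True, up to O(eps^2) error
```

Only two extra gradient evaluations of $L_{train}$ are needed, so the cost of the second-order term drops from quadratic to linear in the number of weights.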
