Differentiable Architecture Search
1. Introduction
Before: discovering a state-of-the-art neural network architecture requires substantial effort from human experts.
Contributions:
1. We introduce a novel algorithm for differentiable network architecture search based on bilevel optimization. The proposed method, called DARTS, relaxes the search space to be continuous, so that the architecture can be optimized by gradient descent;
2. We achieve a remarkable efficiency improvement. Through extensive experiments on image classification and language modeling tasks, we show that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the state of the art on PTB;
The code of DARTS is available at https://github.com/quark0/darts.
2. Differentiable Architecture Search
2.1. Search Space
We search for a computation cell as the building block of the final architecture. The learned cell could either be stacked to form a convolutional network or connected to form a recurrent network.
cell: a cell is a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes;
node: $x^{(i)}$ is a latent representation (e.g., a feature map in a convolutional network);
directed edge: $(i,j)$ denotes the edge from node $i$ to node $j$;
operation: $o^{(i,j)}$ is an operation that transforms $x^{(i)}$.
Construction:
1. The cell has two input nodes and a single output node;
2. The input nodes are defined as the cell outputs in the previous two layers;
Each intermediate node is computed based on all of its predecessors:
\begin{equation} x^{(j)}=\sum_{i<j} o^{(i,j)}(x^{(i)}) \end{equation}
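As an illustrative sketch (the scalar inputs, operations, and helper names here are toy assumptions, not the authors' implementation), the node computation above can be expressed as:

```python
def compute_cell(inputs, ops, n_intermediate):
    """Compute a cell's nodes. inputs: the two input-node values;
    ops: dict mapping edge (i, j) to the callable o^{(i,j)}."""
    nodes = list(inputs)  # nodes 0 and 1 are the cell inputs
    for j in range(2, 2 + n_intermediate):
        # x^{(j)} = sum over predecessors i < j of o^{(i,j)}(x^{(i)})
        nodes.append(sum(ops[(i, j)](nodes[i]) for i in range(j)))
    return nodes

# toy scalar example with an identity op and a doubling op
ops = {(0, 2): lambda x: x, (1, 2): lambda x: 2 * x}
print(compute_cell([1.0, 3.0], ops, n_intermediate=1)[2])  # 1.0 + 2*3.0 = 7.0
```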
2.2. Continuous Relaxation and Optimization
2.2.1 Notation
- $\mathcal{O}$: a set of candidate operations
- $o(\cdot)$: some function applied to $x^{(i)}$
2.2.2 The mixed operation is defined as:
\begin{equation} \bar{o}^{(i,j)}(x)=\sum_{o\in \mathcal{O}}\frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'\in \mathcal{O}}\exp(\alpha_{o'}^{(i,j)})}\,o(x) \end{equation}
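To make the softmax mixing concrete, here is a minimal sketch (the candidate operations and scalar input are toy assumptions, not the paper's search space):

```python
import math

def mixed_op(x, alphas, ops):
    """bar{o}(x): a softmax over the alpha weights mixes the candidate ops."""
    exps = [math.exp(a) for a in alphas]
    total = sum(exps)
    return sum(e / total * op(x) for e, op in zip(exps, ops))

# two candidate ops, e.g. a "zero" op and an identity (skip) op
ops = [lambda x: 0.0, lambda x: x]
print(mixed_op(4.0, [0.0, 0.0], ops))  # equal weights: 0.5*0.0 + 0.5*4.0 = 2.0
```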
1. The operation mixing weights for a pair of nodes $(i,j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$.
2. The task of architecture search then reduces to learning a set of continuous variables $\alpha=\{\alpha^{(i,j)}\}$.
3. At the end of search, a discrete architecture can be obtained by replacing each mixed operation $\bar{o}^{(i,j)}$ with the most likely operation, i.e., $o^{(i,j)}=\mathrm{argmax}_{o\in \mathcal{O}}\,\alpha_o^{(i,j)}$.
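A minimal sketch of this discretization step (edge and operation names are illustrative; the paper additionally keeps only the strongest incoming edges per node, which is omitted here):

```python
def discretize(alpha, op_names):
    """Replace each mixed op with argmax_o alpha_o^{(i,j)}.
    alpha: dict mapping edge (i, j) to a list of |O| architecture weights."""
    return {edge: op_names[max(range(len(a)), key=a.__getitem__)]
            for edge, a in alpha.items()}

alpha = {(0, 2): [0.1, 1.5, 0.3], (1, 2): [2.0, 0.0, 0.1]}
print(discretize(alpha, ["zero", "conv3x3", "skip"]))
# {(0, 2): 'conv3x3', (1, 2): 'zero'}
```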
After relaxation, our goal is to jointly learn the architecture $\alpha$ and the weights $w$.
- $L_{train}$: training loss;
- $L_{val}$: validation loss;
Both losses are determined not only by the architecture $\alpha$, but also by the weights $w$ in the network.
2.2.3 Goals
- $\alpha^*$: minimizes the validation loss $L_{val}(w^*,\alpha^*)$
- $w^*$: minimizes the training loss $L_{train}(w,\alpha^*)$
This is a bilevel optimization problem with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:
\begin{equation} \begin{aligned} &\min_{\alpha}\ L_{val}(w^*(\alpha),\alpha)\\ &\text{s.t.}\ w^*(\alpha)=\mathrm{argmin}_w\, L_{train}(w,\alpha) \end{aligned} \end{equation}
Algorithm: alternate between a gradient step on $\alpha$ (using the validation loss) and a gradient step on $w$ (using the training loss) until convergence.
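The alternation can be sketched as follows (a toy first-order variant with scalar quadratic losses; the loss functions and learning rates are illustrative assumptions, not the paper's setup):

```python
def darts_search(w, alpha, grad_w_train, grad_a_val, lr_w, lr_a, steps):
    """Alternate: update alpha by descending the validation loss,
    then update w by descending the training loss."""
    for _ in range(steps):
        alpha = alpha - lr_a * grad_a_val(w, alpha)
        w = w - lr_w * grad_w_train(w, alpha)
    return w, alpha

# toy losses: L_train(w, a) = (w - a)^2, L_val(w, a) = (a - 0.5)^2
g_w = lambda w, a: 2 * (w - a)
g_a = lambda w, a: 2 * (a - 0.5)
w, alpha = darts_search(0.0, 0.0, g_w, g_a, lr_w=0.1, lr_a=0.1, steps=200)
print(round(alpha, 3), round(w, 3))  # both converge near 0.5
```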
2.3 Approximate Architecture Gradient
Evaluating the architecture gradient exactly can be prohibitive due to the expensive inner optimization, so we propose a simple approximation scheme:
\begin{equation} \nabla_{\alpha}L_{val}(w^*(\alpha),\alpha)\approx\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha) \end{equation}
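A quick numerical check of this approximation on a toy problem (all losses here are illustrative assumptions): with $L_{train}(w,\alpha)=(w-\alpha)^2$ we have $w^*(\alpha)=\alpha$, and with $L_{val}(w,\alpha)=w^2$ the exact gradient is $2\alpha$. A single inner step with $\xi=0.5$ lands exactly on $w^*(\alpha)$, so the approximation is exact in this case:

```python
def approx_grad(w, alpha, xi, eps=1e-5):
    """Central-difference estimate in alpha of
    L_val(w - xi * dL_train/dw, alpha) for the toy losses above."""
    def surrogate(a):
        w_prime = w - xi * 2 * (w - a)  # one SGD step on L_train = (w - a)^2
        return w_prime ** 2             # L_val = w^2
    return (surrogate(alpha + eps) - surrogate(alpha - eps)) / (2 * eps)

w, alpha = 5.0, 2.0
print(approx_grad(w, alpha, xi=0.5))  # matches the exact gradient 2*alpha = 4.0
```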
Meaning:
- $w$: the current weights maintained by the algorithm;
- $\xi$: the learning rate for a step of inner optimization;
- if $w$ is already a local optimum of the inner optimization, then $\nabla_w L_{train}(w,\alpha)=0$, so $w=w^*(\alpha)$ and $w-\xi\nabla_w L_{train}(w,\alpha)=w^*(\alpha)$, i.e., the approximation becomes exact;
Idea: approximate $w^*$ by adapting $w$ using only a single training step, without solving the inner optimization completely by training until convergence.
While we are not currently aware of convergence guarantees for our optimization algorithm, in practice it is able to reach a fixed point with a suitable choice of $\xi$.
citation: https://zhuanlan.zhihu.com/p/156832334
2.3.1 Procedure
- First, $\nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha)$ can be rewritten as $\nabla_{\alpha}f(g_1(\alpha),g_2(\alpha))$, where
  - $f(\cdot,\cdot)=L_{val}(\cdot,\cdot)$
  - $g_1(\alpha)=w-\xi\nabla_w L_{train}(w,\alpha)$
  - $g_2(\alpha)=\alpha$
- Then, applying the chain rule:
\begin{equation} \begin{aligned} \nabla_{\alpha}f(g_1(\alpha),g_2(\alpha))&=\nabla_{\alpha}g_1(\alpha)\cdot D_1 f(g_1(\alpha),g_2(\alpha))+\nabla_{\alpha}g_2(\alpha)\cdot D_2 f(g_1(\alpha),g_2(\alpha))\\ \nabla_{\alpha}g_1(\alpha)&=-\xi\nabla^2_{\alpha,w}L_{train}(w,\alpha)\\ \nabla_{\alpha}g_2(\alpha)&=1\\ D_1 f(g_1(\alpha),g_2(\alpha))&=\nabla_{w'}L_{val}(w',\alpha)\\ D_2 f(g_1(\alpha),g_2(\alpha))&=\nabla_{\alpha}L_{val}(w',\alpha) \end{aligned} \end{equation}
where $w'=w-\xi\nabla_w L_{train}(w,\alpha)$. Combining the terms,
\begin{equation} \nabla_{\alpha}L_{val}(w-\xi\nabla_w L_{train}(w,\alpha),\alpha)=\nabla_{\alpha}L_{val}(w',\alpha)-\xi\nabla^2_{\alpha,w}L_{train}(w,\alpha)\cdot\nabla_{w'}L_{val}(w',\alpha) \end{equation}
- Finally, the expensive second-order term is approximated using finite differences:
\begin{equation} \nabla^2_{\alpha,w}L_{train}(w,\alpha)\cdot\nabla_{w'}L_{val}(w',\alpha)\approx\frac{\nabla_{\alpha}L_{train}(w^+,\alpha)-\nabla_{\alpha}L_{train}(w^-,\alpha)}{2\epsilon} \end{equation}
where $w^{\pm}=w\pm\epsilon\nabla_{w'}L_{val}(w',\alpha)$ and $\epsilon=0.01/\|\nabla_{w'}L_{val}(w',\alpha)\|_2$. This is the central-difference formula
\begin{equation} f'(x_0)\cdot A\approx\frac{f(x_0+hA)-f(x_0-hA)}{2h} \end{equation}
applied with $f=\nabla_{\alpha}L_{train}(\cdot,\alpha)$ and $A=\nabla_{w'}L_{val}(w',\alpha)$.
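The finite-difference trick can be checked on a toy loss (hypothetical, chosen so the mixed second derivative is known in closed form): with $L_{train}(w,\alpha)=w^2\alpha$ we have $\nabla_{\alpha}L_{train}=w^2$ and $\nabla^2_{\alpha,w}L_{train}=2w$, so the product with a vector $v$ should be $2wv$:

```python
def grad_alpha_train(w):
    # gradient of the toy loss L_train(w, a) = w^2 * a with respect to a
    return w * w

def second_order_term(w, v, eps):
    # central-difference estimate of  (d^2 L_train / da dw) * v
    return (grad_alpha_train(w + eps * v)
            - grad_alpha_train(w - eps * v)) / (2 * eps)

print(second_order_term(3.0, 2.0, eps=1e-3))  # exact value 2 * w * v = 12.0
```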