Relative Neural Architecture Search via Slow-Fast Learning
First author: Hao Tan
NAS (Neural Architecture Search): automating the design of artificial neural networks
Motivation
To benefit from the merits of differentiable NAS and population-based NAS while overcoming their deficiencies.
Deficiencies
Differentiable NAS
Search by gradient can be ineffective due to the lack of proper diversity
Population-Based NAS
Search efficiency is poor due to the stochastic crossover/mutation and a large number of performance evaluations.
Method
continuous encoding scheme
Cell-based architecture
Two types of cells: the normal cell and the reduction cell (down-sampling to reduce the feature map size)
Encodes the node and the operation separately (each represented by a real-valued interval; see the decoding sketch after this list)
The network is a DAG (directed acyclic graph)
differences from DARTS: 1) no requirement of differentiability; 2) the operations between pairwise nodes are directly encoded as real values
Pros: provides more flexibility and versatility
networks can achieve promising performance and high transferability for different tasks by adjusting the total number of cells in the final architecture
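As a rough illustration of how such a continuous (interval-based) encoding could be decoded into discrete nodes and operations, here is a minimal sketch; the operation names, the equal-interval partitioning, and the helper functions are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

# Hypothetical candidate operations; the actual search space may differ.
OPERATIONS = ["sep_conv_3x3", "sep_conv_5x5", "max_pool_3x3", "skip_connect"]

def decode_operation(value: float) -> str:
    """Map a real value in [0, 1) to a discrete operation: each operation
    owns an equal sub-interval of [0, 1)."""
    index = min(int(value * len(OPERATIONS)), len(OPERATIONS) - 1)
    return OPERATIONS[index]

def decode_node(value: float, num_prev_nodes: int) -> int:
    """Map a real value in [0, 1) to the index of a predecessor node."""
    return min(int(value * num_prev_nodes), num_prev_nodes - 1)

# Example: decode a small continuous architecture vector.
alpha = np.random.rand(4)
print([decode_operation(v) for v in alpha])
print(decode_node(alpha[0], num_prev_nodes=3))
```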
slow-fast learning paradigm
inspired by *SlowFast Networks for Video Recognition* (Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6202-6211), which uses a Fast pathway (high frame rate) and a Slow pathway (low frame rate) for video recognition
in each pair of architecture vectors, the one with worse performance is regarded as the slow learner and the one with better performance as the fast learner
the architecture vectors are updated by a pseudo-gradient mechanism determined by the slow learner and the fast learner (see the sketch after this subsection)
At each generation, the population is randomly divided into $N/2$ pairs; the slow learner $\boldsymbol{\alpha}_{p, s}^{g}$ is updated by learning from the fast learner $\boldsymbol{\alpha}_{p, f}^{g}$ with:
$$\Delta \boldsymbol{\alpha}_{p, s}^{g}=\lambda_{1}\left(\boldsymbol{\alpha}_{p, f}^{g}-\boldsymbol{\alpha}_{p, s}^{g}\right)+\lambda_{2}\, \Delta \boldsymbol{\alpha}_{p, s}^{g-1}$$
Thanks to the pseudo-gradient-based mechanism, RelativeNAS is applicable to any other generic continuously encoded search space
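Below is a minimal sketch of one generation of the slow-fast update described above; the population representation, the fitness callable, the fixed values of λ1/λ2, and the clipping to [0, 1] are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def slow_fast_generation(population, momentum, fitness, lam1=0.7, lam2=0.2, rng=None):
    """One generation of slow-fast learning on continuous architecture vectors.

    population : (N, D) array of architecture vectors (assumed to lie in [0, 1]).
    momentum   : (N, D) array holding each vector's previous update (Delta alpha^{g-1}).
    fitness    : callable mapping a vector to an estimated validation loss (lower is better).
    """
    rng = rng or np.random.default_rng()
    indices = rng.permutation(len(population))

    # Randomly divide the population into N/2 pairs.
    for i, j in zip(indices[0::2], indices[1::2]):
        # The worse vector is the slow learner, the better one the fast learner.
        s, f = (i, j) if fitness(population[i]) > fitness(population[j]) else (j, i)

        # Pseudo-gradient update: pull the slow learner toward the fast learner,
        # plus a momentum term from its previous update.
        delta = lam1 * (population[f] - population[s]) + lam2 * momentum[s]
        population[s] = np.clip(population[s] + delta, 0.0, 1.0)
        momentum[s] = delta

    return population, momentum

# Toy usage with a dummy fitness function.
pop = np.random.rand(8, 5)
mom = np.zeros_like(pop)
pop, mom = slow_fast_generation(pop, mom, fitness=lambda a: float(np.sum(a ** 2)))
```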
novel performance estimation strategy
adopts an operation collection as a weight set to estimate the performance of candidate architectures
the weight set is not directly trained but updated in an online manner
it is thus feasible for RelativeNAS to use these performance estimations to obtain approximate validation losses for the candidate architectures (a rough sketch of such a weight set follows below)
Pros: saves substantial computation cost
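A rough sketch, under assumptions, of how a shared weight set could support cheap performance estimation with online updates; the class name, the pool structure, and the loss/gradient callables are hypothetical and only illustrate the idea of inheriting weights instead of training each candidate from scratch.

```python
import numpy as np

class WeightSet:
    """Shared pool of weights, one entry per candidate operation.

    A candidate architecture inherits the weights of the operations it uses,
    yielding an approximate validation loss without training from scratch; the
    inherited weights are then refined by a short online step and written back,
    so the pool keeps improving as the search proceeds.
    """

    def __init__(self, operations, dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.pool = {op: rng.normal(size=dim) for op in operations}

    def estimate(self, chosen_ops, loss_fn):
        """Return an approximate validation loss for a candidate using `chosen_ops`."""
        weights = [self.pool[op] for op in chosen_ops]
        return loss_fn(weights)

    def online_update(self, chosen_ops, grad_fn, lr=0.01):
        """Refine only the weights the candidate actually used, in an online manner."""
        for op in chosen_ops:
            self.pool[op] = self.pool[op] - lr * grad_fn(self.pool[op])
```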
Result
CIFAR-10 is used as the search dataset
- It takes only about nine hours with a single 1080Ti, or seven hours with a Tesla V100, to complete the above search procedure.
- RelativeNAS + Cutout achieves a low test error (2.34%) with a moderate parameter count (3.93M), i.e., it is efficient.
Transferability Analyses
- Intra-task Transferability: CIFAR-100, ImageNet
- Inter-task Transferability: Object Detection, Semantic Segmentation, Keypoint Detection
Conclusion
This work combines the merits of differentiable NAS and population-based NAS to be both more effective and more efficient. Moreover, the proposed slow-fast learning paradigm is also potentially applicable to other generic learning/optimization tasks.