[Evolutionary Computation] [Paper Reading] Completely Automated CNN Architecture Design Based on Blocks

Paper link: https://ieeexplore.ieee.org/document/8742788/


I. INTRODUCTION

Generally, given a CNN, denoted by $A$, having $n$ architecture-related parameters $\lambda_1, \dots, \lambda_n$ whose decision spaces are $\Lambda_1, \dots, \Lambda_n$, respectively, the CNN architecture design is to optimize the problem formulated as follows:

$$
\left\{
\begin{aligned}
& \arg\min_{\lambda} \; L(A_{\lambda}, D_{train}, D_{valid}) \\
& \text{s.t.} \; \lambda \in \Lambda
\end{aligned}
\right.
$$

where $\lambda = \{\lambda_1, \dots, \lambda_n\}$, $\Lambda = \Lambda_1 \times \cdots \times \Lambda_n$, $A_{\lambda}$ denotes the CNN $A$ adopting the architecture parameter setting $\lambda$, and $L(\cdot)$ measures the performance of $A_{\lambda}$ on the validation data $D_{valid}$ after $A_{\lambda}$ has been trained on the training data $D_{train}$. In the case of classification tasks, $L(\cdot)$ measures the classification error of the tasks to which $A$ is applied. Typically, gradient-based algorithms, such as stochastic gradient descent (SGD) [6], are employed to train the weights of $A_{\lambda}$, as $L(\cdot)$ is differentiable (or approximately differentiable) with respect to the weights.

However, we never know the best depth of the CNN for a new problem. To this end, Large-scale Evolution utilizes a variable-length encoding scheme in which the CNNs can adaptively change their depths for the problems at hand. However, Large-scale Evolution uses only the mutation operator and no crossover operator during the search process. In evolutionary algorithms, the crossover operator and the mutation operator play the complementary roles of local search and global search. Without the crossover operator, the mutation operator works just like a random search from different start positions. Nevertheless, it is not surprising that Large-scale Evolution does not use the crossover operator, since the crossover operator was originally designed for the fixed-length encoding scheme.

To achieve this goal, the objectives have been specified in the following.

  1. The proposed algorithm does not mandate any prerequisite knowledge from the users in base CNN design, the investigated data set, or GAs. The CNN whose architecture is designed by the proposed algorithm can be directly used without any recomposition, preprocessing, or postprocessing.
  2. The variable-length encoding scheme is employed for searching the optimal depth of the CNN. To adopt the variable-length encoding, a new crossover operator and a mutation operator are designed and incorporated into the proposed algorithm to collectively exploit and explore the search space in finding the best CNN architectures.
  3. An efficient encoding strategy is designed based on the ResNet block (RB) and the DenseNet block (DB) to speed up the architecture design, so that only limited computational resources are needed while the proposed algorithm still achieves promising performance. Note that, although RBs and DBs are used in the proposed algorithm, users are not required to have expertise in these blocks when using it.

II. BACKGROUND

A. Genetic Algorithms

GAs [30] are a class of heuristic population-based computational paradigm. Generally, a GA works as follows.

  • Step 1: Initialization of a population of individuals each of which represents a candidate solution of the problem through the employed encoding strategy.
  • Step 2: Evaluation of the fitness of each individual in the population based on the encoded information and the fitness function.
  • Step 3: Mating selection of promising parent individuals from the current population, and then, generate offspring with crossover and mutation operators.
  • Step 4: Evaluation of the fitness of the generated offspring.
  • Step 5: Environmental selection of a population of individuals with a promising performance from the current population, and then, replace the current population by the selected population.
  • Step 6: Go to Step 3 if the termination criterion is not met; otherwise, return the individual with the best fitness as the best solution for the problem.

Commonly, a maximal generation number is predefined as the termination criterion.
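The six steps above map directly onto a short loop. Below is a minimal, generic Python sketch (not the paper's implementation): `random_individual`, `fitness`, `crossover`, and `mutate` are assumed problem-specific callables, and parents are drawn uniformly at random here, whereas AE-CNN uses binary tournament selection (Section III-D).

```python
import random

def genetic_algorithm(random_individual, fitness, crossover, mutate,
                      pop_size=20, max_generations=20, p_c=0.9, p_m=0.2):
    # Step 1: initialize a population of candidate solutions.
    population = [random_individual() for _ in range(pop_size)]
    # Step 2: evaluate the fitness of each individual.
    scores = [fitness(ind) for ind in population]
    for _ in range(max_generations):  # Step 6: maximal generation number as the stop criterion
        offspring = []
        while len(offspring) < pop_size:
            # Step 3: mating selection (simple random pick here), crossover, mutation.
            p1, p2 = random.sample(population, 2)
            c1, c2 = crossover(p1, p2) if random.random() < p_c else (p1, p2)
            offspring += [mutate(c) if random.random() < p_m else c for c in (c1, c2)]
        # Step 4: evaluate the fitness of the generated offspring.
        off_scores = [fitness(ind) for ind in offspring]
        # Step 5: environmental selection keeps the best pop_size individuals.
        pool = sorted(zip(population + offspring, scores + off_scores),
                      key=lambda t: t[1], reverse=True)[:pop_size]
        population, scores = [p[0] for p in pool], [p[1] for p in pool]
    # Return the individual with the best fitness.
    return max(zip(population, scores), key=lambda t: t[1])[0]
```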

B. ResNet and DenseNet Blocks

Fig. 1. Example of the RB.
Fig. 2. Example of the DB including four convolutional layers.

Fig. 1 shows an example of an RB, which is composed of three convolutional layers and one skip connection. In this example, the convolutional layers are denoted as conv1, conv2, and conv3.

  • On conv1, the spatial size of the input is reduced by a smaller number of filters with a size of 1×1, to lower the computational complexity of conv2.
  • On conv2, filters with a larger size, such as 3 × 3, are used to learn features with the same spatial size.
  • On conv3, filters with a size of 1×1 are used again, and the spatial size is increased to generate more features. The input is added, denoted by ⊕, to the output of conv3 as the final output of the RB. Note that if the spatial sizes of the input and conv3’s output are unequal, a group of convolutional operations with 1×1 filters is applied to the input to match the spatial size of conv3’s output for the addition (see the code sketch after this list).
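As a concrete illustration of Fig. 1, here is a minimal PyTorch sketch of such a bottleneck RB (an assumption of the structure described above, with batch normalization omitted for brevity; channel counts are illustrative):

```python
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Bottleneck RB sketch: conv1 (1x1) shrinks the feature dimension,
    conv2 (3x3) learns features, conv3 (1x1) expands it again; the input is
    added to conv3's output, through a 1x1 projection when the sizes differ."""
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1)
        # Projection for the skip connection when input/output sizes are unequal.
        self.project = (nn.Conv2d(in_channels, out_channels, kernel_size=1)
                        if in_channels != out_channels else nn.Identity())
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return self.relu(out + self.project(x))  # the skip connection (⊕)
```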

Fig. 2 shows an example of a DB. For convenience of presentation, we give only four convolutional layers in the DB. In practice, a DB can have a different number of convolutional layers, which is tuned by the user. In a DB, each convolutional layer receives as input not only the input data but also the outputs of all previous convolutional layers. In addition, there is a parameter k controlling the spatial sizes of the input and output of the same convolutional layer: if the spatial size of the input is a, then the spatial size of the output is a + k, which is achieved by the convolutional operation using the corresponding number of filters.
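Similarly, here is a minimal PyTorch sketch of the DB in Fig. 2, reading the "spatial size" above as the number of feature maps, so each layer contributes k new maps; the class name and defaults are illustrative:

```python
import torch
import torch.nn as nn

class DenseNetBlock(nn.Module):
    """DB sketch: each 3x3 layer takes the concatenation of the block input
    and all previous layers' outputs, and contributes k new feature maps,
    so a maps grow to a + k per layer (a + num_layers * k in total)."""
    def __init__(self, in_channels, k, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * k, k, kernel_size=3, padding=1)
            for i in range(num_layers))
        self.relu = nn.ReLU()

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Dense connectivity: concatenate every earlier output on the channel axis.
            features.append(self.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)
```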

Efforts in [37] and [38] have investigated the mechanisms behind the success of RBs and DBs, revealing that both can mitigate the adverse impact of the gradient vanishing problem [39]; on this basis, a deep architecture is capable of effectively learning hierarchical representations of the input data, which in turn improves the final classification accuracy. In addition, the dense connections in DBs have also been claimed to reuse low-level features to increase the discriminability of the features learned at the top layers of CNNs [18].

III. PROPOSED ALGORITHM

For convenience, the proposed algorithm is named AE-CNN (automatically evolving CNN) for short, and the evolved CNNs are used solely for image classification tasks.

A. Algorithm Overview

AE-CNN follows the standard GA (genetic algorithm) framework.

Algorithm 1 shows the framework of AE-CNN, which is composed of three parts.

  • First, the population is randomly initialized with a predefined size of N (see line 1).
  • Then, the individuals are evaluated for the fitness (see line 2).
  • Next, all individuals in the population take part in the evolutionary process of GA with the maximal generation number of T (see lines 3–14).
  • Finally, the best CNN architecture is decoded from the best individual that is chosen from the final population based on the fitness (see line 15).

During the evolutionary process, an empty population is initialized for holding offspring (see line 5); then, new offspring are generated from selected parents with the crossover and mutation operations, while the parents are selected by binary tournament selection (see lines 6–10); after the fitness of the generated offspring has been evaluated (see line 11), a new population is selected with the environmental selection operation (see line 12) from the current population (containing the current individuals and the generated offspring) as the parent solutions surviving into the next evolutionary process (i.e., the next generation). Note that the symbol |·| shown in line 6 is the cardinality operator. The phases of “population initialization,” “fitness evaluation,” “offspring generation,” and “environmental selection” are documented in Sections III-B–III-E, respectively.

B. Population Initialization

Key points: no fully connected layers; a constraint on the number of pooling layers.

Generally, all individuals are initialized in a random manner with a uniform distribution; as introduced in Section II-A, each individual in a GA represents a candidate solution of the problem to be solved.

In the proposed algorithm, CNNs are constructed based on RBs, DBs, and pooling layers, motivated by the remarkable success of ResNet [17] and DenseNet [18], while fully connected layers are not considered. The main reason is that fully connected layers easily cause overfitting [40] due to their fully connected nature.

Next, we will explain the details of lines 8 and 11 because other parts of Algorithm 2 are straightforward.

Specifically (pooling layers receive dedicated treatment, constraining how much they can shrink the input), the pooling layers in CNNs perform dimension reduction on their input data, and the most commonly used pooling operation halves the input size, as can be seen in the state-of-the-art CNNs [2]–[5], [17], [18].

To this end, the employed pooling layers cannot be specified arbitrarily but must follow the constraint calculated as shown in line 2.

For example, if the input size is 32×32, the number of used pooling layers cannot be larger than five, because five pooling layers already reduce the dimension of the input data to 1×1, and one extra pooling layer on a 1×1 dimension would lead to a logic error.
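In code, the constraint amounts to one logarithm. The sketch below is an assumption of what line 2 of Algorithm 2 computes, given stride-2 poolings that halve the feature map:

```python
import math

def max_pooling_layers(input_size):
    # Each stride-2 pooling halves the feature map, so an s x s input
    # admits at most floor(log2(s)) poolings before reaching 1 x 1.
    return int(math.log2(input_size))

print(max_pooling_layers(32))  # -> 5, i.e., 32 -> 16 -> 8 -> 4 -> 2 -> 1
```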

In the proposed algorithm, we design a new encoding strategy aiming at effectively modeling CNNs with different architectures.

  • For the used RBs, based on the configuration of the state-of-the-art CNNs [17], [42], we set the filter size of conv2 to 3×3, which is also used for the convolutional layers in the used DBs.
  • For the used pooling layers, the stride is set to 2×2 based on conventions, which means that a single pooling layer in the evolved CNN halves the input dimension once.
  • To this end, the unknown parameter settings for RBs are the spatial sizes of input and output; those for DBs are the spatial sizes of input and output, as well as k; and those for pooling layers are only their types, i.e., max or mean pooling.

Note that the number of convolutional layers in a DB is known because it can be derived from the spatial sizes of the input and output, together with k. Accordingly, the proposed encoding strategy is based on three different types of units and their positions in the CNNs. The units are the RB unit (RBU), the DB unit (DBU), and the pooling layer unit (PU).

Specifically, an RBU and a DBU contain multiple RBs and DBs, respectively, while a PU is composed of only a single pooling layer.

Our justifications are as follows:

  1. by putting multiple RBs or DBs into one RBU or DBU, the depth of the CNN can be changed significantly compared with stacking RBs or DBs one by one, which speeds up the heuristic search of the proposed algorithm by making it easy to change the depth of the CNN;
  2. a PU consisting of a single pooling layer is more flexible than one consisting of multiple pooling layers, because the effect of multiple consecutive pooling layers can be achieved by stacking multiple PUs;
  3. in addition, we also add one parameter to represent the unit type, for the convenience of the algorithm implementation.

In summary, the encoded information for an RBU is its type, the number of RBs, the input spatial size, and the output spatial size, denoted as type, amount, in, and out, respectively. The encoded information for a DBU is the same as that of an RBU, plus the additional parameter k. Only one parameter is needed in a PU, encoding the pooling type.

Fig. 3 shows an example of the proposed encoding for a CNN containing nine units. Specifically, the number in the top-left corner of each block denotes the position of the unit in the CNN. A unit is an RBU, a DBU, or a PU if its type is 1, 2, or 3, respectively. Note that the proposed encoding strategy does not constrain the maximal length of an individual, which means that the proposed algorithm can adaptively find the best CNN architecture with a proper depth through the designed variable-length encoding strategy.
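To make the encoding concrete, here is a hedged Python sketch of the three unit types and a variable-length individual (field names follow the text; `in` is spelled `in_size` because `in` is a Python keyword, and all values are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Unit:
    type: int            # 1 = RBU, 2 = DBU, 3 = PU
    amount: int = 0      # number of RBs/DBs in the unit (unused by a PU)
    in_size: int = 0     # input spatial size ("in" in the text)
    out_size: int = 0    # output spatial size ("out" in the text)
    k: int = 0           # growth parameter, used only by DBUs
    pool_type: str = ""  # "max" or "mean", used only by PUs

# An individual is an ordered, variable-length list of units; a unit's
# index in the list is its position in the CNN (values are illustrative).
individual: List[Unit] = [
    Unit(type=1, amount=3, in_size=64, out_size=128),          # an RBU
    Unit(type=3, pool_type="max"),                             # a PU
    Unit(type=2, amount=1, in_size=128, out_size=192, k=16),   # a DBU
]
```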

C. Fitness Evaluation

Key points: classification accuracy as fitness; Xavier initializer; SGD.

Note that the weight initialization method and the training method are the Xavier initializer [43] and SGD with momentum, respectively, both commonly used in the deep learning community. Finally, the trained CNN is evaluated on the validation data (see line 5), and the resulting classification accuracy is taken as the fitness of the individual (see line 6).
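A hedged PyTorch sketch of this fitness evaluation is shown below; `decode`, `train_loader`, and `valid_loader` are assumed placeholders, and the epoch count and learning rate are illustrative rather than taken from the paper:

```python
import torch
import torch.nn as nn

def evaluate_fitness(individual, train_loader, valid_loader, epochs=10):
    model = decode(individual)  # assumed: builds the CNN from the encoded units
    for m in model.modules():   # Xavier weight initialization [43]
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):     # train on the training data
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
    correct = total = 0         # fitness = classification accuracy on validation data
    with torch.no_grad():
        for x, y in valid_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```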

D. Offspring Generation

Parent Selection

Adopting only the best individuals as parents can easily cause a loss of diversity in the population, which in turn leads to premature convergence [44], [45]; as a result, the best performance of the population cannot be achieved [46], [47] due to trapping in local minima [48], [49]. To address this problem, a common remedy is to select promising parents in a randomized way.

In the proposed AE-CNN algorithm, binary tournament selection [50], [51] is used for this purpose, following the conventions of the GA community.

Binary tournament selection randomly selects two individuals from the population, and the one with the higher fitness is chosen as one parent. Repeating this process chooses the other parent, and the two parents then perform the crossover operation (i.e., two tournament rounds select the two parents).

Note that two offspring are generated by each crossover operation, and N offspring are generated in each generation; i.e., the crossover operation is performed N/2 times per generation, where N is the population size.
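A minimal sketch of binary tournament selection and the resulting N/2 crossover loop, assuming the population is stored as (individual, fitness) pairs and `crossover` returns two offspring:

```python
import random

def binary_tournament(population):
    # Draw two (individual, fitness) pairs at random; the fitter one wins.
    a, b = random.sample(population, 2)
    return a[0] if a[1] > b[1] else b[0]

def generate_offspring(population, crossover, N):
    offspring = []
    for _ in range(N // 2):                  # N/2 crossovers yield N offspring
        p1 = binary_tournament(population)   # first tournament round
        p2 = binary_tournament(population)   # second tournament round
        offspring.extend(crossover(p1, p2))  # each crossover returns two children
    return offspring
```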

Crossover

In the proposed algorithm, we employ the one-point crossover operator. The reason is that one-point crossover has been widely used in genetic programming (GP) [31]; GP is another important class of evolutionary algorithms, and GP individuals commonly have different lengths. Algorithm 4 shows the crossover operation of the proposed algorithm.

Fig. 4. (a) Two selected parent individuals for the crossover operation and (b) generated offspring. The numbers in each block denote the corresponding configuration, and the red numbers in Fig. 4(b) denote the necessary changes after the crossover operation.

Note that some necessary changes are automatically made to the generated offspring when required. For example, the in of the current unit must equal the out of the previous unit, and this change triggers further cascade adjustments. For a better understanding of the crossover operation, an example is shown in Fig. 4, and a code sketch follows.
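A hedged sketch of this variable-length one-point crossover, reusing the `Unit` encoding sketched earlier; the `repair` pass is one plausible reading of the cascade adjustments in Fig. 4, propagating each unit's out to the next unit's in:

```python
import random
from copy import deepcopy

def one_point_crossover(parent1, parent2):
    # Split each parent (each assumed to have at least two units) at a
    # random point and swap the tails.
    cut1 = random.randint(1, len(parent1) - 1)
    cut2 = random.randint(1, len(parent2) - 1)
    child1 = deepcopy(parent1[:cut1] + parent2[cut2:])
    child2 = deepcopy(parent2[:cut2] + parent1[cut1:])
    return repair(child1), repair(child2)

def repair(individual):
    # Cascade adjustment: every RBU/DBU must take as input the output size
    # of the most recent RBU/DBU before it (PUs carry no in/out here).
    current = None
    for unit in individual:
        if unit.type == 3:      # PU: the running size is unchanged
            continue
        if current is not None:
            unit.in_size = current
        current = unit.out_size
    return individual
```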

Mutation

In the proposed algorithm, the available mutation types are as follows:

  1. adding (adding an RBU, adding a DBU, or adding a PU to the selected position);
  2. removing (removing the unit at the selected position);
  3. modifying (modifying the encoded information of the unit at the selected position).

Fig. 5. Example of the “adding an RBU” mutation. (a) The first and second rows denote the selected individual and the randomly initialized RBU for the “adding an RBU” mutation at the fourth position of the individual to be mutated. (b) Mutated individual. The red numbers denote the necessary changes after the mutation.

In addition, a series of necessary adjustments is also performed automatically, based on the logic of composing a valid CNN, as highlighted for the crossover operation. For a better understanding of mutation, an example of “adding an RBU” is shown in Fig. 5, and a code sketch follows.
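A minimal sketch of the three mutation types, reusing `repair` from the crossover sketch; `random_unit` is an assumed helper that builds a randomly initialized RBU, DBU, or PU, and "modifying" is simplified here to re-randomizing the selected unit:

```python
import random
from copy import deepcopy

def mutate(individual):
    child = deepcopy(individual)
    pos = random.randrange(len(child))
    op = random.choice(["add", "remove", "modify"])
    if op == "add":                          # insert a randomly initialized unit
        child.insert(pos, random_unit())
    elif op == "remove" and len(child) > 1:  # drop the selected unit
        child.pop(pos)
    else:                                    # re-randomize the unit's settings
        child[pos] = random_unit()
    return repair(child)                     # cascade in/out adjustments
```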

E. Environmental Selection

IV. EXPERIMENT DESIGN

A. Peer Competitors

B. Benchmark Data Sets

Benchmarks: CIFAR10 and CIFAR100.

Fig. 6. Randomly selected examples from each of the three categories of (a) CIFAR10 and (b) CIFAR100, and each category has ten examples.

C. Parameter Settings

  • Particularly, the population size and the maximal generation number are both set to 20.
  • The probabilities of crossover and mutation are set to 0.9 and 0.2, respectively.
  • Based on the conventions of the machine learning community, the validation data are randomly split from the training data with the proportion of 1/5.
  • Finally, all the classification error rates are evaluated on the same test data for the comparison.

V. EXPERIMENTAL RESULTS
