[Evolutionary Computation] [Paper Reading] Completely Automated CNN Architecture Design Based on Blocks

Paper link: https://ieeexplore.ieee.org/document/8742788/


I. INTRODUCTION

Generally, given a CNN, denoted by $A$, having $n$ architecture-related parameters $\lambda_1, \dots, \lambda_n$ whose decision spaces are $\Lambda_1, \dots, \Lambda_n$, respectively, the CNN architecture design is to optimize the problem formulated as follows:

$$
\left\{
\begin{aligned}
& \arg\min_{\lambda} \; L(A_{\lambda}, D_{train}, D_{valid}) \\
& \text{s.t.} \; \lambda \in \Lambda
\end{aligned}
\right.
$$

where $\lambda = \{\lambda_1, \dots, \lambda_n\}$, $\Lambda = \Lambda_1 \times \cdots \times \Lambda_n$, $A_{\lambda}$ denotes the CNN $A$ adopting the architecture parameter setting $\lambda$, and $L(\cdot)$ measures the performance of $A_{\lambda}$ on the validation data $D_{valid}$ after $A_{\lambda}$ has been trained on the training data $D_{train}$. In the case of classification tasks, $L(\cdot)$ measures the classification error of the tasks to which $A$ is applied. Typically, gradient-based algorithms, such as stochastic gradient descent (SGD) [6], are employed to train the weights of $A_{\lambda}$, as $L(\cdot)$ is differentiable (or approximately differentiable) with respect to the weights.

However, we never know the best depth of the CNN for a new problem. To this end, Large-scale Evolution utilizes a variable-length encoding scheme in which the CNNs can adaptively change their depths for the problems at hand. However, Large-scale Evolution uses only the mutation operator and no crossover operator during the search process. In evolutionary algorithms, the crossover operator and the mutation operator play the complementary roles of local search and global search. Without the crossover operator, the mutation operator works just like a random search from different start positions. Nevertheless, it is not surprising that Large-scale Evolution does not use the crossover operator, since the crossover operator was originally designed for the fixed-length encoding scheme.

To achieve this goal, the objectives have been specified in the following.

  1. The proposed algorithm does not mandate any prerequisite knowledge from the users in base CNN design, the investigated data set, or GAs. The CNN whose architecture is designed by the proposed algorithm can be directly used without any recomposition, preprocessing, or postprocessing.
  2. The variable-length encoding scheme is employed for searching the optimal depth of the CNN. To adopt the variable-length encoding, a new crossover operator and a mutation operator are designed and incorporated into the proposed algorithm to collectively exploit and explore the search space in finding the best CNN architectures.
  3. An efficient encoding strategy is designed based on the ResNet block (RB) and the DenseNet block (DB) to speed up the architecture design, so that only limited computational resources are needed while the proposed algorithm still achieves promising performance. Note that, although RBs and DBs are used in the proposed algorithm, users are not required to have expertise in these blocks when using it.

II. BACKGROUND

A. Genetic Algorithms

GAs [30] are a class of heuristic population-based computational paradigm. Generally, a GA works as follows.

  • Step 1: Initialization of a population of individuals each of which represents a candidate solution of the problem through the employed encoding strategy.
  • Step 2: Evaluation of the fitness of each individual in the population based on the encoded information and the fitness function.
  • Step 3: Mating selection of promising parent individuals from the current population, and then, generate offspring with crossover and mutation operators.
  • Step 4: Evaluation of the fitness of the generated offspring.
  • Step 5: Environmental selection of a population of individuals with a promising performance from the current population, and then, replace the current population by the selected population.
  • Step 6: Go to Step 3 if the termination criterion is not met; otherwise, return the individual with the best fitness as the best solution for the problem.

Commonly, a maximal generation number is predefined as the termination criterion.
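The six steps above map directly onto a short loop. Below is a minimal, generic Python sketch (not the paper's implementation): `random_individual`, `fitness`, `crossover`, and `mutate` are assumed problem-specific callables, and parents are drawn uniformly at random here, whereas AE-CNN uses binary tournament selection (Section III-D).

```python
import random

def genetic_algorithm(random_individual, fitness, crossover, mutate,
                      pop_size=20, max_generations=20, p_c=0.9, p_m=0.2):
    # Step 1: initialize a population of candidate solutions.
    population = [random_individual() for _ in range(pop_size)]
    # Step 2: evaluate the fitness of each individual.
    scores = [fitness(ind) for ind in population]
    for _ in range(max_generations):  # Step 6: maximal generation number as the stop criterion
        offspring = []
        while len(offspring) < pop_size:
            # Step 3: mating selection (simple random pick here), crossover, mutation.
            p1, p2 = random.sample(population, 2)
            c1, c2 = crossover(p1, p2) if random.random() < p_c else (p1, p2)
            offspring += [mutate(c) if random.random() < p_m else c for c in (c1, c2)]
        # Step 4: evaluate the fitness of the generated offspring.
        off_scores = [fitness(ind) for ind in offspring]
        # Step 5: environmental selection keeps the best pop_size individuals.
        pool = sorted(zip(population + offspring, scores + off_scores),
                      key=lambda t: t[1], reverse=True)[:pop_size]
        population, scores = [p[0] for p in pool], [p[1] for p in pool]
    # Return the individual with the best fitness.
    return max(zip(population, scores), key=lambda t: t[1])[0]
```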

B. ResNet and DenseNet Blocks

Fig. 1. Example of the RB.
Fig. 2. Example of the DB including four convolutional layers.

Fig. 1 shows an example of an RB, which is composed of three convolutional layers and one skip connection. In this example, the convolutional layers are denoted as conv1, conv2, and conv3.

  • On conv1, the spatial size of the input is reduced by a smaller number of filters with a size of 1×1, to lower the computational complexity of conv2.
  • On conv2, filters with a larger size, such as 3 × 3, are used to learn features with the same spatial size.
  • On conv3, filters with a size of 1×1 are used again, and the spatial size is increased to generate more features. The input is added, denoted by ⊕, to the output of conv3 as the final output of the RB. Note that if the spatial sizes of the input and conv3’s output are unequal, a group of convolutional operations with 1×1 filters is applied to the input to match the spatial size of conv3’s output for the addition (see the code sketch after this list).
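As a concrete illustration of Fig. 1, here is a minimal PyTorch sketch of such a bottleneck RB (an assumption of the structure described above, with batch normalization omitted for brevity; channel counts are illustrative):

```python
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Bottleneck RB sketch: conv1 (1x1) shrinks the feature dimension,
    conv2 (3x3) learns features, conv3 (1x1) expands it again; the input is
    added to conv3's output, through a 1x1 projection when the sizes differ."""
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1)
        # Projection for the skip connection when input/output sizes are unequal.
        self.project = (nn.Conv2d(in_channels, out_channels, kernel_size=1)
                        if in_channels != out_channels else nn.Identity())
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return self.relu(out + self.project(x))  # the skip connection (⊕)
```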

Fig. 2 shows an example of a DB. For convenience of presentation, we give only four convolutional layers in the DB. In practice, a DB can have a different number of convolutional layers, which is tuned by the user. In a DB, each convolutional layer receives as input not only the input data but also the outputs of all previous convolutional layers. In addition, there is a parameter k controlling the spatial sizes of the input and output of the same convolutional layer: if the spatial size of the input is a, then the spatial size of the output is a + k, which is achieved by the convolutional operation using the corresponding number of filters.
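Similarly, here is a minimal PyTorch sketch of the DB in Fig. 2, reading the "spatial size" above as the number of feature maps, so each layer contributes k new maps; the class name and defaults are illustrative:

```python
import torch
import torch.nn as nn

class DenseNetBlock(nn.Module):
    """DB sketch: each 3x3 layer takes the concatenation of the block input
    and all previous layers' outputs, and contributes k new feature maps,
    so a maps grow to a + k per layer (a + num_layers * k in total)."""
    def __init__(self, in_channels, k, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * k, k, kernel_size=3, padding=1)
            for i in range(num_layers))
        self.relu = nn.ReLU()

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Dense connectivity: concatenate every earlier output on the channel axis.
            features.append(self.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)
```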

Efforts in [37] and [38] have investigated the mechanisms behind the success of RBs and DBs, revealing that both can mitigate the adverse impact of the gradient vanishing problem [39]; on this basis, a deep architecture is capable of effectively learning hierarchical representations of the input data, which in turn improves the final classification accuracy. In addition, the dense connections in DBs have also been claimed to reuse low-level features to increase the discriminability of the features learned at the top layers of CNNs [18].

III. PROPOSED ALGORITHM

For convenience, the proposed algorithm is named AE-CNN (automatically evolving CNN) for short, and the evolved CNNs are used solely for image classification tasks.

A. Algorithm Overview

AE-CNN follows the standard GA (genetic algorithm) framework.

Algorithm 1 shows the framework of AE-CNN, which is composed of three parts.

  • First, the population is randomly initialized with a predefined size of N (see line 1).
  • Then, the individuals are evaluated for the fitness (see line 2).
  • Next, all individuals in the population take part in the evolutionary process of GA with the maximal generation number of T (see lines 3–14).
  • Finally, the best CNN architecture is decoded from the best individual that is chosen from the final population based on the fitness (see line 15).

During the evolutionary process, an empty population is initialized for holding offspring (see line 5); then, new offspring are generated from selected parents with the crossover and mutation operations, while the parents are selected by binary tournament selection (see lines 6–10); after the fitness of the generated offspring has been evaluated (see line 11), a new population is selected with the environmental selection operation (see line 12) from the current population (containing the current individuals and the generated offspring) as the parent solutions surviving into the next evolutionary process (i.e., the next generation). Note that the symbol |·| shown in line 6 is the cardinality operator. The phases of “population initialization,” “fitness evaluation,” “offspring generation,” and “environmental selection” are documented in Sections III-B–III-E, respectively.

B. Population Initialization

Key points: no fully connected layers; a constraint on the number of pooling layers.

Generally, all individuals are initialized in a random manner with a uniform distribution; as introduced in Section II-A, each individual in a GA represents a candidate solution of the problem to be solved.

In the proposed algorithm, CNNs are constructed based on RBs, DBs, and pooling layers, motivated by the remarkable success of ResNet [17] and DenseNet [18], while fully connected layers are not considered. The main reason is that fully connected layers easily cause overfitting [40] due to their fully connected nature.

Next, we will explain the details of lines 8 and 11 because other parts of Algorithm 2 are straightforward.

Specifically (pooling layers receive dedicated treatment, constraining how much they can shrink the input), the pooling layers in CNNs perform dimension reduction on their input data, and the most commonly used pooling operation halves the input size, as can be seen in the state-of-the-art CNNs [2]–[5], [17], [18].

To this end, the employed pooling layers cannot be specified arbitrarily but must follow the constraint calculated as shown in line 2.

For example, if the input size is 32×32, the number of used pooling layers cannot be larger than five, because five pooling layers already reduce the dimension of the input data to 1×1, and one extra pooling layer on a 1×1 dimension would lead to a logic error.
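In code, the constraint amounts to one logarithm. The sketch below is an assumption of what line 2 of Algorithm 2 computes, given stride-2 poolings that halve the feature map:

```python
import math

def max_pooling_layers(input_size):
    # Each stride-2 pooling halves the feature map, so an s x s input
    # admits at most floor(log2(s)) poolings before reaching 1 x 1.
    return int(math.log2(input_size))

print(max_pooling_layers(32))  # -> 5, i.e., 32 -> 16 -> 8 -> 4 -> 2 -> 1
```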

In the proposed algorithm, we design a new encoding strategy aiming at effectively modeling CNNs with different architectures.

  • For the used RBs, based on the configuration of the state-of-the-art CNNs [17], [42], we set the filter size of conv2 to 3×3, which is also used for the convolutional layers in the used DBs.
  • For the used pooling layers, the stride is set to 2×2 based on conventions, which means that a single pooling layer in the evolved CNN halves the input dimension once.
  • To this end, the unknown parameter settings for RBs are the spatial sizes of input and output; those for DBs are the spatial sizes of input and output, as well as k; and those for pooling layers are only their types, i.e., max or mean pooling.

Note that the number of convolutional layers in a DB is known because it can be derived from the spatial sizes of the input and output, together with k. Accordingly, the proposed encoding strategy is based on three different types of units and their positions in the CNNs. The units are the RB unit (RBU), the DB unit (DBU), and the pooling layer unit (PU).

Specifically, an RBU and a DBU contain multiple RBs and DBs, respectively, while a PU is composed of only a single pooling layer.

Our justifications are as follows:

  1. by putting multiple RBs or DBs into one RBU or DBU, the depth of the CNN can be changed significantly compared with stacking RBs or DBs one by one, which speeds up the heuristic search of the proposed algorithm by making it easy to change the depth of the CNN;
  2. a PU consisting of a single pooling layer is more flexible than one consisting of multiple pooling layers, because the effect of multiple consecutive pooling layers can be achieved by stacking multiple PUs;
  3. in addition, we also add one parameter to represent the unit type, for the convenience of the algorithm implementation.

In summary, the encoded information for an RBU is its type, the number of RBs, the input spatial size, and the output spatial size, denoted as type, amount, in, and out, respectively. The encoded information for a DBU is the same as that of an RBU, plus the additional parameter k. Only one parameter is needed in a PU, encoding the pooling type.

Fig. 3 shows an example of the proposed encoding for a CNN containing nine units. Specifically, the number in the top-left corner of each block denotes the position of the unit in the CNN. A unit is an RBU, a DBU, or a PU if its type is 1, 2, or 3, respectively. Note that the proposed encoding strategy does not constrain the maximal length of an individual, which means that the proposed algorithm can adaptively find the best CNN architecture with a proper depth through the designed variable-length encoding strategy.
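To make the encoding concrete, here is a hedged Python sketch of the three unit types and a variable-length individual (field names follow the text; `in` is spelled `in_size` because `in` is a Python keyword, and all values are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Unit:
    type: int            # 1 = RBU, 2 = DBU, 3 = PU
    amount: int = 0      # number of RBs/DBs in the unit (unused by a PU)
    in_size: int = 0     # input spatial size ("in" in the text)
    out_size: int = 0    # output spatial size ("out" in the text)
    k: int = 0           # growth parameter, used only by DBUs
    pool_type: str = ""  # "max" or "mean", used only by PUs

# An individual is an ordered, variable-length list of units; a unit's
# index in the list is its position in the CNN (values are illustrative).
individual: List[Unit] = [
    Unit(type=1, amount=3, in_size=64, out_size=128),          # an RBU
    Unit(type=3, pool_type="max"),                             # a PU
    Unit(type=2, amount=1, in_size=128, out_size=192, k=16),   # a DBU
]
```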

C. Fitness Evaluation

Key points: classification accuracy as fitness; Xavier initializer; SGD.

Note that the weight initialization method and the training method are the Xavier initializer [43] and SGD with momentum, respectively, both commonly used in the deep learning community. Finally, the trained CNN is evaluated on the validation data (see line 5), and the resulting classification accuracy is taken as the fitness of the individual (see line 6).
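A hedged PyTorch sketch of this fitness evaluation is shown below; `decode`, `train_loader`, and `valid_loader` are assumed placeholders, and the epoch count and learning rate are illustrative rather than taken from the paper:

```python
import torch
import torch.nn as nn

def evaluate_fitness(individual, train_loader, valid_loader, epochs=10):
    model = decode(individual)  # assumed: builds the CNN from the encoded units
    for m in model.modules():   # Xavier weight initialization [43]
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):     # train on the training data
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
    correct = total = 0         # fitness = classification accuracy on validation data
    with torch.no_grad():
        for x, y in valid_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```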

D. Offspring Generation

Parent Selection

Adopting only the best individuals as parents can easily cause a loss of diversity in the population, which in turn leads to premature convergence [44], [45]; as a result, the best performance of the population cannot be achieved [46], [47] due to trapping in local minima [48], [49]. To address this problem, a common remedy is to select promising parents in a randomized way.

In the proposed AE-CNN algorithm, binary tournament selection [50], [51] is used for this purpose, following the conventions of the GA community.

Binary tournament selection randomly selects two individuals from the population, and the one with the higher fitness is chosen as one parent. Repeating this process chooses the other parent, and the two parents then perform the crossover operation (i.e., two tournament rounds select the two parents).

Note that two offspring are generated by each crossover operation, and N offspring are generated in each generation; i.e., the crossover operation is performed N/2 times per generation, where N is the population size.
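A minimal sketch of binary tournament selection and the resulting N/2 crossover loop, assuming the population is stored as (individual, fitness) pairs and `crossover` returns two offspring:

```python
import random

def binary_tournament(population):
    # Draw two (individual, fitness) pairs at random; the fitter one wins.
    a, b = random.sample(population, 2)
    return a[0] if a[1] > b[1] else b[0]

def generate_offspring(population, crossover, N):
    offspring = []
    for _ in range(N // 2):                  # N/2 crossovers yield N offspring
        p1 = binary_tournament(population)   # first tournament round
        p2 = binary_tournament(population)   # second tournament round
        offspring.extend(crossover(p1, p2))  # each crossover returns two children
    return offspring
```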

Crossover

In the proposed algorithm, we employ the one-point crossover operator. The reason is that one-point crossover has been widely used in genetic programming (GP) [31]; GP is another important class of evolutionary algorithms, and GP individuals commonly have different lengths. Algorithm 4 shows the crossover operation of the proposed algorithm.

Fig. 4. (a) Two selected parent individuals for the crossover operation and (b) generated offspring. The numbers in each block denote the corresponding configuration, and the red numbers in Fig. 4(b) denote the necessary changes after the crossover operation.

Note that some necessary changes are automatically made to the generated offspring when required. For example, the in of the current unit must equal the out of the previous unit, and this change triggers further cascade adjustments. For a better understanding of the crossover operation, an example is shown in Fig. 4, and a code sketch follows.
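A hedged sketch of this variable-length one-point crossover, reusing the `Unit` encoding sketched earlier; the `repair` pass is one plausible reading of the cascade adjustments in Fig. 4, propagating each unit's out to the next unit's in:

```python
import random
from copy import deepcopy

def one_point_crossover(parent1, parent2):
    # Split each parent (each assumed to have at least two units) at a
    # random point and swap the tails.
    cut1 = random.randint(1, len(parent1) - 1)
    cut2 = random.randint(1, len(parent2) - 1)
    child1 = deepcopy(parent1[:cut1] + parent2[cut2:])
    child2 = deepcopy(parent2[:cut2] + parent1[cut1:])
    return repair(child1), repair(child2)

def repair(individual):
    # Cascade adjustment: every RBU/DBU must take as input the output size
    # of the most recent RBU/DBU before it (PUs carry no in/out here).
    current = None
    for unit in individual:
        if unit.type == 3:      # PU: the running size is unchanged
            continue
        if current is not None:
            unit.in_size = current
        current = unit.out_size
    return individual
```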

Mutation

In the proposed algorithm, the available mutation types are as follows:

  1. adding (adding an RBU, adding a DBU, or adding a PU to the selected position);
  2. removing (removing the unit at the selected position);
  3. modifying (modifying the encoded information of the unit at the selected position).

Fig. 5. Example of the “adding an RBU” mutation. (a) The first and second rows denote the selected individual and the randomly initialized RBU for the “adding an RBU” mutation at the fourth position of the individual to be mutated. (b) Mutated individual. The red numbers denote the necessary changes after the mutation.

In addition, a series of necessary adjustments is also performed automatically, based on the logic of composing a valid CNN, as highlighted for the crossover operation. For a better understanding of mutation, an example of “adding an RBU” is shown in Fig. 5, and a code sketch follows.
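A minimal sketch of the three mutation types, reusing `repair` from the crossover sketch; `random_unit` is an assumed helper that builds a randomly initialized RBU, DBU, or PU, and "modifying" is simplified here to re-randomizing the selected unit:

```python
import random
from copy import deepcopy

def mutate(individual):
    child = deepcopy(individual)
    pos = random.randrange(len(child))
    op = random.choice(["add", "remove", "modify"])
    if op == "add":                          # insert a randomly initialized unit
        child.insert(pos, random_unit())
    elif op == "remove" and len(child) > 1:  # drop the selected unit
        child.pop(pos)
    else:                                    # re-randomize the unit's settings
        child[pos] = random_unit()
    return repair(child)                     # cascade in/out adjustments
```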

E. Environmental Selection

IV. EXPERIMENT DESIGN

A. Peer Competitors

B. Benchmark Data Sets

Benchmarks: CIFAR10 and CIFAR100.

Fig. 6. Randomly selected examples from each of the three categories of (a) CIFAR10 and (b) CIFAR100, and each category has ten examples.

C. Parameter Settings

  • Particularly, the population size and the maximal generation number are both set to 20.
  • The probabilities of crossover and mutation are set to 0.9 and 0.2, respectively.
  • Based on the conventions of the machine learning community, the validation data are randomly split from the training data with the proportion of 1/5.
  • Finally, all the classification error rates are evaluated on the same test data for the comparison.

V. EXPERIMENTAL RESULTS
