Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)

Reposted from http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html.


tips/tricks:

Sec. 1: Data Augmentation

Data augmentation can compensate for a CNN's appetite for large training sets. Common methods include horizontal flipping, random crops and color jittering; several of them can also be combined, e.g., doing rotation and random scaling at the same time. In addition, you can try raising the saturation and value (the S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (the same exponent for all pixels within a patch), multiplying these values by a factor between 0.7 and 1.4, and adding to them a value between -0.1 and 0.1. You can also add a value between -0.1 and 0.1 to the hue (the H component of HSV) of all pixels in the image/patch.
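A minimal numpy sketch of the HSV jittering just described (it assumes the patch has already been converted to HSV with every channel scaled to [0, 1]; the color-space conversion itself is not shown):

import numpy as np

def jitter_hsv(hsv, rng=np.random):
    # hsv: H x W x 3 array in HSV space, all channels scaled to [0, 1]
    hsv = hsv.copy()
    power = rng.uniform(0.25, 4.0)    # same exponent for every pixel in the patch
    factor = rng.uniform(0.7, 1.4)
    shift = rng.uniform(-0.1, 0.1)
    # raise S and V to a power, scale them, then shift them
    hsv[..., 1:3] = np.clip(hsv[..., 1:3] ** power * factor + shift, 0.0, 1.0)
    # add a small shift to the hue channel (wrap around, since hue is cyclic)
    hsv[..., 0] = (hsv[..., 0] + rng.uniform(-0.1, 0.1)) % 1.0
    return hsv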

Krizhevsky et al. [1] proposed "fancy PCA" when training AlexNet in 2012. Fancy PCA alters the intensities of the RGB channels of the training images. In practice, you first perform PCA on the set of RGB pixel values across the training images. Then, for each training image, you add the following quantity to every RGB pixel I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T: [\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1\lambda_1, \alpha_2\lambda_2, \alpha_3\lambda_3]^T, where \mathbf{p}_i and \lambda_i are the i-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, and \alpha_i is a random variable drawn from a Gaussian with zero mean and standard deviation 0.1. Note that each \alpha_i is drawn only once per image per training pass; when the model processes the same training image again, another random \alpha_i is drawn for data augmentation.

In [1], they claimed that "fancy PCA could approximately capture an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination". In terms of classification performance, this scheme reduced the top-1 error rate by over 1% in the ImageNet 2012 competition.
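A minimal numpy sketch of this scheme (illustrative; the pixel array, its scale, and the per-image usage are assumptions about how one might wire it up):

import numpy as np

def fit_fancy_pca(pixels):
    # pixels: N x 3 array of RGB values gathered from the training set
    cov = np.cov(pixels, rowvar=False)        # 3 x 3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)    # lambda_i and p_i (columns of eigvecs)
    return eigvals, eigvecs

def fancy_pca_augment(image, eigvals, eigvecs, rng=np.random):
    # image: H x W x 3, on the same scale as the pixels used to fit the PCA
    alphas = rng.normal(0.0, 0.1, size=3)     # one alpha_i per image per pass
    offset = eigvecs @ (alphas * eigvals)     # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return image + offset                     # broadcast the 3-vector over all pixels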


Sec. 2: Pre-Processing

The first and simplest pre-processing approach is to zero-center the data and then normalize each dimension.

Another pre-processing approach, similar to the first one, is PCA whitening: the zero-centered data are decorrelated by projecting them onto the eigenbasis of the covariance matrix, and every dimension is then scaled to unit variance.
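A minimal numpy sketch of both approaches, roughly following the recipe in the CS231n notes [3]; X is assumed to be an N x D matrix with one flattened example per row:

import numpy as np

def zero_center_and_normalize(X):
    X = X - np.mean(X, axis=0)             # zero-center every dimension
    X = X / (np.std(X, axis=0) + 1e-8)     # normalize each dimension to unit std
    return X

def pca_whiten(X, eps=1e-5):
    X = X - np.mean(X, axis=0)             # zero-center first
    cov = np.dot(X.T, X) / X.shape[0]      # D x D covariance matrix
    U, S, _ = np.linalg.svd(cov)           # eigenbasis of the covariance
    Xrot = np.dot(X, U)                    # decorrelate (rotate into the eigenbasis)
    return Xrot / np.sqrt(S + eps)         # whiten: unit variance in every direction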

Sec. 3: Initializations (how to initialize the network's parameters)

All Zero Initialization

Initialization with Small Random Numbers

Calibrating the Variances

>>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n), where n is the fan-in

Current Recommendation (taking the influence of ReLU neurons into account)

>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation (He et al. [4])
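To see why the calibration matters, here is a small self-contained experiment (an illustration, not from the original post): push random data through 10 ReLU layers and compare how the standard deviation of the activations evolves under the three initializations above.

import numpy as np

rng = np.random.RandomState(0)
n = 256
x = rng.randn(200, n)

for init in ("small random", "1/sqrt(n)", "sqrt(2/n)"):
    h = x
    for _ in range(10):
        if init == "small random":
            W = rng.randn(n, n) * 0.01              # small random numbers: activations collapse toward 0
        elif init == "1/sqrt(n)":
            W = rng.randn(n, n) / np.sqrt(n)        # 1/sqrt(n) calibration: still shrinks under ReLU
        else:
            W = rng.randn(n, n) * np.sqrt(2.0 / n)  # He et al. [4]: keeps the variance roughly stable
        h = np.maximum(0.0, h.dot(W))               # ReLU non-linearity
    print(init, "-> std of activations after 10 ReLU layers:", h.std())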

Sec. 4: During Training

  • Filters and pooling size. During training, the input images are usually resized to a power-of-2 size, such as 32 (e.g., CIFAR-10), 64, 224 (e.g., the commonly used ImageNet crop), 384 or 512. Use small filters (e.g., 3×3) with small strides (e.g., 1) and zero-padding: this not only reduces the number of parameters but also improves accuracy. As a special case of the above, 3×3 filters with stride 1 and 1 pixel of zero-padding preserve the spatial size of the images/feature maps, since the output width is (W - 3 + 2·1)/1 + 1 = W. For pooling layers, the commonly used pooling size is 2×2.

  • Learning rate. In addition, as described in a blog post by Ilya Sutskever [2], he recommends dividing the gradients by the mini-batch size; if you do, you should not have to change the learning rate (LR) every time you change the mini-batch size. Using the validation set is an effective way to obtain an appropriate LR. A typical value of the LR at the beginning of training is 0.1. In practice, if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5) and keep going, which might give you a pleasant surprise (see the small sketch after this list).

  • Fine-tune on pre-trained models. Nowadays many state-of-the-art deep models are publicly released, e.g., in the Caffe Model Zoo and by the VGG Group. A very simple yet effective approach is to fine-tune such a pre-trained model on your own data (see the sketch after the table below). As shown in the following table, the two most important factors are the size of the new data set (small or big) and its similarity to the original data set.

  • Table: fine-tuning strategy as a function of the new data set.
      - Very little data, very similar to the original data: train a linear classifier on the top-layer features.
      - Very little data, very different from the original data: try a linear classifier on features from different stages of the network.
      - Quite a lot of data, very similar to the original data: fine-tune a few layers.
      - Quite a lot of data, very different from the original data: fine-tune a larger number of layers.
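As a concrete illustration of the "fine-tune a few layers" regime, here is a minimal sketch using PyTorch/torchvision (chosen only for brevity; the original post worked with Caffe models, and the backbone and class count below are placeholders): freeze the pre-trained layers and train only a new classifier on top.

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)            # any pre-trained backbone would do
for p in model.parameters():
    p.requires_grad = False                          # freeze the pre-trained layers
model.fc = nn.Linear(model.fc.in_features, 10)       # new classifier for 10 target classes

# only the new layer's parameters are passed to the optimizer
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)

And a tiny sketch of the learning-rate heuristic from the bullet above (update_lr and its patience value are hypothetical helpers, only to make the rule concrete): start from 0.1 and halve the LR whenever validation accuracy stops improving.

def update_lr(lr, val_history, patience=3, factor=2.0):
    # val_history: validation accuracy per epoch; halve the LR once no epoch
    # in the last `patience` epochs beats the best earlier epoch
    if len(val_history) > patience and max(val_history[-patience:]) <= max(val_history[:-patience]):
        return lr / factor
    return lr

lr = 0.1  # typical starting value mentioned above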

Sec. 5: Activation Functions

Sigmoid, tanh(x), Rectified Linear Unit (ReLU), Leaky ReLU, Parametric ReLU (PReLU), Randomized ReLU (RReLU)

[Figure: the ReLU family — ReLU, Leaky ReLU, PReLU and RReLU compared]
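A minimal numpy sketch of the activations listed above (the slopes are illustrative: in Leaky ReLU the slope is a small constant, in PReLU it is learned, and in RReLU it is drawn at random during training [5]):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):            # small fixed slope for x < 0
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):                       # alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def rrelu(x, lower=1.0/8, upper=1.0/3, rng=np.random):
    alpha = rng.uniform(lower, upper)      # random slope drawn during training [5]
    return np.where(x > 0, x, alpha * x)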

Sec. 6: Regularizations

  • L2 regularization

  • L1 regularization

  • Max norm constraints.

  • Dropout [6] (the most commonly used)
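A minimal numpy sketch of "inverted" dropout (the formulation used in the CS231n notes [3]), where the surviving activations are rescaled at training time so that nothing changes at test time; p is the keep probability:

import numpy as np

def dropout_forward(x, p=0.5, train=True, rng=np.random):
    if not train:
        return x                               # no scaling needed at test time
    mask = (rng.rand(*x.shape) < p) / p        # drop units, scale the survivors by 1/p
    return x * mask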

Sec. 7: Insights from Figures
[Figure: typical training curves (loss and train/validation accuracy over epochs), used to diagnose the learning rate and overfitting]

Sec. 8: Ensemble

In machine learning, ensemble methods [8] train multiple learners and then combine them for use.

Here are several skills for ensembling in the deep learning scenario:

  • Same model, different initialization. 

  • Top few models discovered during cross-validation. Pick the top few (e.g., 10) models found during cross-validation to form the ensemble. Actually, you could also directly select several state-of-the-art deep models from the Caffe Model Zoo to perform the ensemble.

  • Different checkpoints of a single model. The advantage of this approach is that it is very cheap.

  • Some practical examples. For example, in the Cultural Event Recognition challenge associated with ICCV'15, we utilized five different deep models trained on images from ImageNet, the Places Database and the cultural images supplied by the competition organizers. We then extracted five complementary deep features and treated them as multi-view data. Combining the "early fusion" and "late fusion" strategies described in [7], we achieved one of the best performances and ranked 2nd in that challenge (a minimal late-fusion sketch follows this list). Similar to our work, [9] presented the Stacked NN framework to fuse more deep networks at the same time.
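A minimal sketch of the simplest form of "late fusion" (hypothetical interface, not the actual pipeline of [7]): average the softmax outputs of several trained models and take the arg-max.

import numpy as np

def ensemble_predict(models, x):
    # models: trained models exposing a predict_proba(x) -> N x C array of
    # softmax scores (an assumed interface); average their outputs
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=1)   # predicted class for every example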

Miscellaneous

How to handle class-imbalanced data sets:
1) Balance the training data by directly up-sampling and down-sampling the imbalanced classes (see the sketch after this list).
2) Merely extract crops from the classes which have a small number of training images; on one hand this supplies diverse data sources, and on the other hand it alleviates the class-imbalance problem.
3) Adjust the fine-tuning strategy to overcome class imbalance. For example, you can divide your own data set into two parts: one contains the classes which have a large number of training samples (images/crops); the other contains the classes with a limited number of samples. Within each part, the class-imbalance problem will not be very serious.
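A minimal sketch of option 1) (a hypothetical helper; samples and labels are assumed to be aligned numpy arrays): resample every class to roughly the same number of examples.

import numpy as np

def balance_by_resampling(samples, labels, rng=np.random):
    classes, counts = np.unique(labels, return_counts=True)
    target = int(counts.mean())                      # up-sample small classes, down-sample big ones
    chosen = []
    for c in classes:
        idx = np.where(labels == c)[0]
        replace = len(idx) < target                  # sample with replacement only for small classes
        chosen.append(rng.choice(idx, size=target, replace=replace))
    chosen = np.concatenate(chosen)
    rng.shuffle(chosen)                              # mix the classes back together
    return samples[chosen], labels[chosen]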

References & Source Links

  1. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.

  2. A Brief Overview of Deep Learning, which is a guest post by Ilya Sutskever.

  3. CS231n: Convolutional Neural Networks for Visual Recognition of Stanford University, held by Prof. Fei-Fei Li and Andrej Karpathy.

  4. K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.

  5. B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolutional Network. In ICML Deep Learning Workshop, 2015.

  6. N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(Jun):1929−1958, 2014.

  7. X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition. In ICCV ChaLearn Looking at People Workshop, 2015.

  8. Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman & Hall/CRC, 2012. (ISBN 978-1-439-83003-1)

  9. M. Mohammadi, and S. Das. S-NN: Stacked Neural Networks. Project in Stanford CS231n Winter Quarter, 2015.

  10. P. Hensman, and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Degree Project in Computer Science, DD143X, 2015. 


