Slide: http://lamda.nju.edu.cn/weixs/slide/CNNTricks_slide.pdf
Blog post: http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html
1)data augmentation;
2)pre-processing on images;
3)initializations of Networks;
4)some tips during training;
5)selections of activation functions;
6)diverse regularizations;
7) some insights found from figures; and finally
8) methods of ensembling multiple deep networks.
Sec. 1: Data Augmentation
The training set is limited, so data augmentation can be used to enlarge it:
(1) Simple transforms: horizontal flipping, random crops and color jittering (see the sketch after this list).
(2) Combinations of the simple transforms in (1).
(3) Fancy PCA, proposed by Krizhevsky et al. [1]: alters the intensities of the RGB channels in training images.
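A minimal NumPy sketch of the simple augmentations in (1); the (H, W, C) array layout, crop size and jitter strength are assumptions for illustration:

import numpy as np

def horizontal_flip(img):
    # img is an (H, W, C) array; flip along the width axis
    return img[:, ::-1, :]

def random_crop(img, crop_h, crop_w):
    # cut out a random crop_h x crop_w patch
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def color_jitter(img, scale=0.05):
    # multiply each RGB channel by a small random factor (pixel values assumed in [0, 1])
    factors = 1.0 + scale * np.random.uniform(-1.0, 1.0, size=(1, 1, 3))
    return np.clip(img * factors, 0.0, 1.0)

# usage: one randomly augmented training example
img = np.random.rand(64, 64, 3)
aug = color_jitter(random_crop(horizontal_flip(img), 56, 56))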
Sec. 2: Pre-Processing
(1) Zero-center + normalize. Python implementation:
>>> import numpy as np
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize
(2) PCA whitening: zero-center --> compute the covariance matrix (the correlation structure of the data) --> decorrelate the data --> whitening. Python implementation:
>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix
Decorrelate the data by projecting the (already zero-centered) data onto the eigenbasis:
>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data
Whitening: divide every dimension in the eigenbasis by its eigenvalue to normalize the scale:
>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the eigenvalues (which are square roots of the singular values)
Sec. 3: Initializations
(1) All-zero initialization
Idea: in an idealized setting, after proper data normalization roughly half of the weights would be positive and half negative, so all zeros looks like a reasonable guess in expectation.
Drawback: there is no source of asymmetry between neurons (they all compute the same outputs and gradients).
(2) Initialization with small random numbers
Advantage: symmetry breaking.
Idea: the neurons are all random and unique in the beginning.
e.g. 1: w = a small constant (e.g., 0.01) times N(0, 1), where N(0, 1) is a zero-mean, unit-standard-deviation Gaussian.
e.g. 2: small numbers drawn from a uniform distribution.
(3) Calibrating the variances
Idea: normalize the variance of each neuron's output to 1 by scaling its weights by the square root of its fan-in; this calibration does not account for ReLUs. Python implementation:
>>> w = np.random.randn(n) / np.sqrt(n) # calibrate the variances with 1/sqrt(n), n = number of inputs
(4) Current recommendation
He et al. [4], focusing on ReLUs, recommend a variance of 2.0/n. Python implementation:
>>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation
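The one-liners above use a single dimension n; for a full weight matrix the same calibration applies with n taken as the fan-in (number of inputs of each neuron). A minimal sketch, with layer sizes chosen only for illustration:

import numpy as np

def he_init(n_in, n_out):
    # He et al. [4]: scale a standard Gaussian by sqrt(2 / fan_in) for ReLU layers
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

# a two-layer ReLU network, 784 -> 256 -> 10 (sizes are illustrative)
W1, b1 = he_init(784, 256), np.zeros(256)
W2, b2 = he_init(256, 10), np.zeros(10)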
Sec. 4: During Training
- Filters and pooling size. Input images: prefer power-of-2 sizes; use small filters (e.g., 3x3) and small strides (e.g., 1) with zero-padding; a typical pooling size is 2x2.
- Learning rate. Tune it with the validation set; in addition, following Ilya Sutskever [2], divide the gradients by the mini-batch size.
- Fine-tune on pre-trained models. Consider two factors: the size of the new dataset and its similarity to the dataset the pre-trained model was trained on (a sketch follows this list):
(1) If your data is similar to the pre-training data, simply train a linear classifier on features extracted from the top layers of the pre-trained model.
(2) If you also have plenty of data, fine-tune the top layers of the pre-trained model with a small learning rate.
(3) If your dataset is very different from the pre-training dataset but you have many training images, most of the layers should be fine-tuned on your own data with a small learning rate.
(4) If your dataset is small and very different from the pre-training dataset, train only a linear classifier.
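A minimal sketch of cases (1) and (2) using PyTorch/torchvision (neither is mentioned in the original notes; the model choice, class count and learning rates are assumptions): freeze the pre-trained backbone, train a new linear classifier, and optionally unfreeze the top block with a smaller learning rate.

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)  # pre-trained backbone (chosen for illustration)

# case (1): freeze all pre-trained weights and train only a new linear classifier
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 = number of new classes (assumed)
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)

# case (2): with more data, also fine-tune the top block with a smaller learning rate
for p in model.layer4.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(
    [{"params": model.fc.parameters()},
     {"params": model.layer4.parameters(), "lr": 1e-4}],  # small lr for pre-trained layers
    lr=1e-3, momentum=0.9)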
Sec. 5: Activation Functions (the non-linearities)
- Sigmoid: sigma(x) = 1 / (1 + exp(-x)). Squashes its input into [0, 1]: large negative numbers become 0 and large positive numbers become 1.
- tanh(x): squashes its input into the range [-1, 1]. (cons) its activations saturate; (pros) zero-centered.
- Rectified Linear Unit (ReLU): f(x) = max(0, x).
- Leaky ReLU: f(x) = x for x > 0 and alpha * x otherwise, with a small fixed alpha; fixes the "dying ReLU" problem. (cons) the results are not always consistent.
- Parametric ReLU (PReLU): like Leaky ReLU, but the slope of the negative part is learned during training; ReLU, Leaky ReLU, PReLU and RReLU are compared in [5].
- Randomized ReLU (RReLU): the slope of the negative part is a random variable within a given range during training and is fixed during testing [5]. (pros) its randomness may reduce overfitting.
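A minimal NumPy sketch of these non-linearities (the leaky and randomized slopes are illustrative values):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)              # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small fixed slope for x < 0

def rrelu_train(x, lower=0.125, upper=0.333):
    # RReLU: during training the negative slope is drawn uniformly from [lower, upper];
    # at test time a fixed slope (e.g., the mean of the range) is used instead [5]
    alpha = np.random.uniform(lower, upper, size=x.shape)
    return np.where(x > 0, x, alpha * x)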
Sec. 6: Regularizations
- L2 regularization: for every weight w in the network, add the term (1/2) * lambda * w^2 to the objective, where lambda is the regularization strength. (It heavily penalizes peaky weight vectors and prefers diffuse weight vectors.)
- L1 regularization: for every weight w, add the term lambda * |w| to the objective. It can be combined with L2: lambda1 * |w| + lambda2 * w^2 (elastic net regularization).
- Max norm constraints: enforce an absolute upper bound on the magnitude of the weight vector of every neuron, i.e. ||w||_2 < c (c is typically 3 or 4), and use projected gradient descent to enforce the constraint. Updates stay bounded, so the network cannot "explode".
- Dropout [6]: during training, keep each neuron active with some probability p (the dropout ratio, e.g., 0.5) and update only the parameters of the sampled network on the input data; at test time there is no dropout.
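A minimal NumPy sketch of (inverted) dropout following the training/testing behaviour above; p is the probability of keeping a neuron active (0.5 is a common default):

import numpy as np

def dropout(x, p=0.5, train=True):
    if train:
        # training: keep each activation with probability p and rescale by 1/p
        # (inverted dropout), so no extra scaling is needed at test time
        mask = (np.random.rand(*x.shape) < p) / p
        return x * mask
    # testing: no dropout
    return x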
Sec. 7: Insights from Figures
- Learning rate: the shape of the loss curve indicates whether the learning rate is appropriate (too high: the loss explodes or plateaus quickly; too low: it decreases slowly and almost linearly).
- Loss curve: the "width" (noisiness) of the curve is related to the batch size; smaller batches give noisier curves.
- Accuracy curve: the gap between the training and validation accuracy is informative (a large gap suggests overfitting; no gap with low accuracy suggests limited model capacity).
Sec. 8: Ensemble [8]
- Same model, different initialization. Use cross-validation to determine the best hyperparameters, then train multiple models with those hyperparameters but different random initializations.
- Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the top-n models to form the ensemble. (The risk is that it may include suboptimal models.)
- Different checkpoints of a single model. When training is very expensive, take different checkpoints of a single network over time and ensemble them. (Lacks diversity, but is cheap.)
- Some practical examples. If the task is high-level image semantics, multiple deep models trained on different data sources can be used to extract different, complementary deep representations.
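A minimal sketch of ensembling at test time by averaging class probabilities over several trained models (the models list and the predict_proba interface are assumptions for illustration):

import numpy as np

def ensemble_predict(models, X):
    # average the predicted class probabilities of all models,
    # then pick the class with the highest mean probability
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(probs, axis=1)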
Miscellaneous
Problem: class-imbalanced data, i.e., some classes have a large number of images/training instances while others have very few.
Method 1: balance the training data by directly up-sampling and down-sampling the imbalanced data [10].
Method 2: crop-based processing [7].
Method 3: adjust the fine-tuning strategy.
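A minimal NumPy sketch of method 1, over-sampling every class (with replacement) to the size of the largest class; the balancing target is an assumption:

import numpy as np

def oversample_balance(X, y):
    # up-sample each class to the size of the largest class
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        np.random.choice(np.where(y == c)[0], size=target, replace=True)
        for c in classes
    ])
    np.random.shuffle(idx)
    return X[idx], y[idx]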