Caffe Scale layer:
http://stackoverflow.com/questions/37410996/scale-layer-in-caffe
layer {
  bottom: "res2b_branch2b"
  top: "res2b_branch2b"
  name: "scale2b_branch2b"
  type: "Scale"
  scale_param { bias_term: true }
}

layer {
  name: "scaleToUnitInt"
  type: "Scale"
  bottom: "bot"
  top: "scaled"
  # lr_mult: 0 / decay_mult: 0 freeze the scale and the bias, so this layer
  # applies a fixed affine transform instead of learning one.
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
  }
  scale_param {
    filler { value: 0.5 }
    bias_term: true
    bias_filler { value: -2 }
  }
}
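For intuition, here is a minimal NumPy sketch (mine, not Caffe code) of what a Scale layer with the default axis = 1 computes: a per-channel multiplier and, when bias_term is true, a per-channel offset. In ResNet prototxts the first snippet above follows a BatchNorm layer, which in Caffe only normalizes and has no learnable scale/shift, so the Scale layer supplies the learned gamma/beta. The second layer freezes both parameters (lr_mult: 0), so it reduces to the constant transform y = 0.5 * x - 2.

import numpy as np

def scale_layer(x, scale, bias=None):
    # Rough equivalent of Caffe's Scale layer with axis=1 on an (N, C, H, W) blob:
    # y[n, c, h, w] = scale[c] * x[n, c, h, w] + bias[c]
    y = x * scale.reshape(1, -1, 1, 1)
    if bias is not None:
        y = y + bias.reshape(1, -1, 1, 1)
    return y

# The "scaleToUnitInt" layer above: scale fixed at 0.5, bias fixed at -2.
x = np.random.rand(2, 3, 4, 4).astype(np.float32)
C = x.shape[1]
y = scale_layer(x, np.full(C, 0.5, np.float32), np.full(C, -2.0, np.float32))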
Quora answer on some properties of BN:
https://www.quora.com/Why-does-batch-normalization-help
Batch Normalization addresses these problems under some additional assumptions. The following are properties of Batch Normalization when the mean and variance are computed per mini-batch:
- Faster learning: the learning rate can be increased compared to a non-batch-normalized version.
- Increased accuracy: flexibility in the mean and variance of every dimension in every hidden layer allows better learning, and hence better accuracy.
- Normalization (whitening) of the inputs to each layer: zero mean and unit variance, though not decorrelated.
- Removes the ill effect of internal covariate shift: successive transformations can make the data too big or too small and push each layer's input distribution away from its normalized state; BN counteracts this.
- Does not get stuck in the saturation regime of the nonlinearity, even if ReLU is not used.
- Integrates whitening into the gradient-descent optimization: whitening that is decoupled from the training steps, i.e. that modifies the network directly in between, undoes part of the optimization effort, so the model can blow up when the normalization parameters are computed outside the gradient-descent step.
- Full whitening within gradient descent would require the inverse square root of the covariance matrix, as well as its derivatives for backpropagation, which is expensive.
- Normalization of individual dimensions: each dimension of a hidden layer is normalized independently rather than jointly via the covariance matrix, so features are not decorrelated.
- Normalization per mini-batch: the mean and variance are estimated from each mini-batch rather than from the entire training set. The joint covariance is ignored anyway, since it would be singular given the small number of samples per mini-batch compared to the high dimensionality of the hidden layer.
- Learned scale and shift for every dimension: the scaled and shifted values are passed to the next layer, while the mean and variance are computed from all activations of the current layer within the mini-batch, so the whole mini-batch must be forwarded layer by layer. Backpropagation yields gradients for the weights as well as for the learned scale (gamma) and shift (beta). (See the training-mode sketch after this list.)
- Inference: at test time the moving averages of the mean and variance accumulated during mini-batch training are used instead of batch statistics (see the inference sketch below).
- Convolutional neural networks: whitening of intermediate layers, before or after the nonlinearity, opens up many new lines of work [11-15] (see the spatial-BN sketch below).
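To make the per-dimension, per-mini-batch points above concrete, here is a small NumPy sketch (an illustration using the BN paper's usual notation, not any particular library's implementation) of the training-time forward pass for a fully-connected activation of shape (N, D): each dimension is normalized independently with mini-batch statistics, then scaled and shifted by the learned gamma and beta.

import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini-batch of activations; gamma, beta: learned (D,) parameters.
    mu = x.mean(axis=0)                      # per-dimension mini-batch mean
    var = x.var(axis=0)                      # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance; not decorrelated
    return gamma * x_hat + beta, mu, var     # scaled/shifted output goes to the next layer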
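For the inference point, a continuation of the same sketch: during training a moving average of the mini-batch statistics is accumulated, and at test time those averages replace the batch statistics (the momentum value 0.9 is just an illustrative choice).

import numpy as np

def update_running_stats(run_mu, run_var, mu, var, momentum=0.9):
    # Called after each training mini-batch.
    run_mu = momentum * run_mu + (1.0 - momentum) * mu
    run_var = momentum * run_var + (1.0 - momentum) * var
    return run_mu, run_var

def batchnorm_infer(x, gamma, beta, run_mu, run_var, eps=1e-5):
    # Uses the accumulated averages, so the result does not depend on the test batch size.
    x_hat = (x - run_mu) / np.sqrt(run_var + eps)
    return gamma * x_hat + beta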
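And for the convolutional case, the usual "spatial" variant (again only a sketch): the statistics and the gamma/beta pair are shared per channel and computed over the batch and both spatial dimensions, so an (N, C, H, W) feature map needs only C means and variances.

import numpy as np

def spatial_batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W) conv feature map; gamma, beta: learned (C,) parameters.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # one mean per channel, shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)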
[1] What is buckling of a column?
[2] Why does buckling occur in columns?
[3] What is buckling?
[4] What is meant by buckling in engineering words?
[5] How does buckling analysis work?
[6] What is the difference between crippling and buckling?
[7] What is the difference between crushing and buckling failures of a column?
[8] What is the cylindrical buckling?
[9] What is difference between buckling and bending?
[10] Batch normalization
[11] How do I apply Batch Normalization to the convolutional layer of a CNN?
[12] How does batch normalization behave differently at training time and test time?
[13] How does a person choose the best size of mini-batch in the test when the model is using batch normalization?
[14] How does a person choose the best size of mini-batch in the test when the model is using batch normalization?
[15] What is local response normalization?
Blog post introducing BN:
https://standardfrancis.wordpress.com/2015/04/16/batch-normalization/