train()
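Enables the training behavior of BatchNormalization and Dropout: BN normalizes with the statistics of the current mini-batch and updates its running estimates, and Dropout randomly zeroes activations.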
eval()
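Disables the training behavior of BatchNormalization and Dropout so that neither changes during evaluation. PyTorch fixes both automatically: BN does not compute statistics from the current batch but uses the running values learned during training, and Dropout passes activations through unchanged. Otherwise, if the test batch_size is too small, the BN layers can easily distort the results.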
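The difference between the two modes can be checked directly. The following is a minimal sketch, assuming a toy model whose layer sizes are arbitrary (the model and variable names are illustrative, not from any particular codebase):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model: the layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(8, 2),
)
bn = model[1]
x = torch.randn(16, 4)

# train(): BN uses batch statistics and updates its running estimates;
# Dropout zeroes a random subset of activations, so outputs vary per call.
model.train()
mean_before = bn.running_mean.clone()
train_out_a = model(x)
train_out_b = model(x)
train_outputs_differ = not torch.equal(train_out_a, train_out_b)
running_mean_updated = not torch.equal(mean_before, bn.running_mean)

# eval(): BN uses the stored running statistics and Dropout is a no-op,
# so the output is deterministic and a batch of size 1 gives the same
# result for that sample as the full batch does.
model.eval()
mean_frozen = bn.running_mean.clone()
with torch.no_grad():
    eval_out_a = model(x)
    eval_out_b = model(x)
    single_out = model(x[:1])
eval_outputs_equal = torch.equal(eval_out_a, eval_out_b)
running_mean_frozen = torch.equal(mean_frozen, bn.running_mean)
single_matches_batch = torch.allclose(single_out, eval_out_a[:1], atol=1e-5)
```

Wrapping inference in torch.no_grad() is a separate concern from eval(): it only turns off gradient tracking and does not change how BN or Dropout behave.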
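The "batch" in BatchNormalization refers to mini-batches of data: the data is split into small batches for SGD, and during the forward propagation of each batch, normalization is applied at every layer.
One role of BN is to keep pre-activations out of the saturated regions of the tanh activation function, where the gradient is close to zero.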
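To make the normalization step and the tanh point concrete, here is a rough sketch that normalizes a mini-batch by hand, compares it with nn.BatchNorm1d, and measures how many tanh outputs are saturated. The offset 5.0, scale 3.0, and the 0.99 threshold are arbitrary choices for illustration, and saturated_fraction is a hypothetical helper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Pre-activations with a large offset and scale: most land where tanh is flat.
z = 5.0 + 3.0 * torch.randn(64, 10)

# Batch normalization by hand: per-feature mean and (biased) variance
# over the mini-batch dimension.
eps = 1e-5
mean = z.mean(dim=0)
var = z.var(dim=0, unbiased=False)
z_hat = (z - mean) / torch.sqrt(var + eps)

# The built-in layer in training mode should agree (affine=False drops
# the learnable scale and shift so the comparison is direct).
bn = nn.BatchNorm1d(10, affine=False, eps=eps)
bn.train()
matches_builtin = torch.allclose(z_hat, bn(z), atol=1e-5)

def saturated_fraction(pre_act, threshold=0.99):
    """Fraction of tanh outputs with magnitude above the threshold."""
    return (torch.tanh(pre_act).abs() > threshold).float().mean().item()

frac_raw = saturated_fraction(z)
frac_bn = saturated_fraction(z_hat)
print(f"saturated without BN: {frac_raw:.2f}, with BN: {frac_bn:.2f}")
```

With these assumed numbers, most raw pre-activations sit in the flat part of tanh, while the normalized ones mostly fall in the region where tanh still has a useful gradient.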
Why does batch normalization work?
(1) We know that normalizing the input features can speed up learning. One intuition is that doing the same thing for the hidden layers should also work.
(2) It addresses the problem of covariate shift.
Suppose you have trained your cat-recognition network on black cats but evaluate it on colored cats. The data distribution has changed (this is called covariate shift). Even if a true boundary separating cat from non-cat exists, you can't expect to learn that boundary from black cats alone, so you may need to retrain the network.
For a neural network, suppose the input distribution is constant. With fixed weights, the output distribution of a given hidden layer would then be constant too. But as the weights of that layer and the preceding layers change during training, its output distribution changes, which causes covariate shift from the perspective of the layer after it. Just as with the cat-recognition network, the following layers have to re-learn. To mitigate this problem, we use batch normalization to force a zero-mean, unit-variance distribution (before the learnable scale and shift). This lets the layer after it learn more independently of the previous layers and concentrate on its own task, which speeds up training.
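A small experiment can illustrate this. The sketch below keeps the input fixed, changes the weights of a hypothetical "previous" layer by hand (a crude stand-in for a few optimizer steps, not real training), and compares the statistics the next layer would see with and without BN:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(256, 20)          # fixed input distribution
layer = nn.Linear(20, 30)         # hypothetical "previous" layer
bn = nn.BatchNorm1d(30, affine=False)
bn.train()

def describe(h):
    """Overall mean and standard deviation of a batch of activations."""
    return h.mean().item(), h.std().item()

with torch.no_grad():
    h_before = layer(x)
    # Stand-in for a few optimizer steps: the weights and biases move.
    layer.weight.mul_(3.0)
    layer.bias.add_(2.0)
    h_after = layer(x)

    raw_before, raw_after = describe(h_before), describe(h_after)
    bn_before, bn_after = describe(bn(h_before)), describe(bn(h_after))

print("raw      :", raw_before, "->", raw_after)
print("after BN :", bn_before, "->", bn_after)
```

The raw activations shift in both mean and scale when the weights move, while the normalized activations stay close to mean 0 and standard deviation 1, so the next layer sees a more stable input distribution.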
(3) Batch normalization as (slight) regularization
In batch normalization, the mean and variance are computed on a mini-batch, which contains relatively few samples, so the estimated mean and variance are noisy. Like dropout, this adds some noise to the hidden layers' activations (dropout randomly multiplies activations by 0 or 1).
This is an extra, slight side effect; don't rely on it as a regularizer.
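The noise in mini-batch statistics is easy to see without any framework. This is a rough sketch using only the Python standard library; it assumes each activation is an independent standard-normal draw, which is a simplification, and batch_mean_spread is a hypothetical helper:

```python
import random
import statistics

random.seed(0)

def batch_mean_spread(batch_size, num_batches=200):
    """Standard deviation of the mini-batch mean across many batches.

    Each sample is drawn from a standard normal, standing in for one
    hidden unit's activation (a simplifying assumption).
    """
    means = [
        statistics.fmean([random.gauss(0.0, 1.0) for _ in range(batch_size)])
        for _ in range(num_batches)
    ]
    return statistics.stdev(means)

spread_small = batch_mean_spread(batch_size=8)
spread_large = batch_mean_spread(batch_size=512)
print(f"batch_size=8:   spread of batch mean = {spread_small:.3f}")
print(f"batch_size=512: spread of batch mean = {spread_large:.3f}")
```

The spread shrinks roughly as 1/sqrt(batch_size), so the noise BN injects is larger for small batches and fades as the batch grows.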