Distillation Networks and Defensive Distillation

From the paper "Towards Evaluating the Robustness of Neural Networks"

Distillation was initially proposed as an approach to reduce a large model (the teacher) down to a smaller distilled model. At a high level, distillation works by first training the teacher model on the training set in a standard manner. Then, we use the teacher to label each instance in the training set with soft labels (the output vector from the teacher network). For example, while the hard label for an image of a hand-written digit 7 will say it is classified as a seven, the soft labels might say it has an 80% chance of being a seven and a 20% chance of being a one. Then, we train the distilled model on the soft labels from the teacher, rather than on the hard labels from the training set. Distillation can potentially increase accuracy on the test set as well as the rate at which the smaller model learns to predict the hard labels.
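
To make the hard-label vs. soft-label distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper). The teacher and student logits are made-up values, chosen so the teacher's soft label roughly matches the 80%/20% example above.

```python
# A minimal NumPy sketch of training targets for distillation (illustration only).
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hard label for a hand-written "7" in a 10-class problem: a one-hot vector.
hard_label = np.zeros(10)
hard_label[7] = 1.0

# Soft label: the teacher's full output distribution over the 10 classes.
teacher_logits = np.array([0.0, 4.6, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 0.0])
soft_label = softmax(teacher_logits)   # ~0.79 on class 7, ~0.20 on class 1

def cross_entropy(target, predicted):
    return -np.sum(target * np.log(predicted + 1e-12))

# The distilled model is trained to match the soft label (the teacher's
# output) rather than the one-hot hard label from the training set.
student_probs = softmax(np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0]))
print("loss vs. hard label:", cross_entropy(hard_label, student_probs))
print("loss vs. soft label:", cross_entropy(soft_label, student_probs))
```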

Defensive distillation uses distillation in order to increase the robustness of a neural network, but with two significant changes. First, both the teacher model and the distilled model are identical in size: defensive distillation does not result in smaller models. Second, and more importantly, defensive distillation uses a large distillation temperature (described below) to force the distilled model to become more confident in its predictions. Recall that the softmax function is the last layer of a neural network. Defensive distillation modifies the softmax function to also include a temperature constant T.
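
In symbols, this is the standard temperature-parameterized softmax (reproduced here for clarity; x is the logit vector entering the softmax layer and T is the temperature constant):

```latex
% Temperature-parameterized softmax; x is the logit vector, T the temperature.
\[
  \operatorname{softmax}(x, T)_i = \frac{e^{x_i / T}}{\sum_{j} e^{x_j / T}}
\]
```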

It is easy to see that softmax(x, T) = softmax(x/T, 1). Intuitively, increasing the temperature causes a "softer" maximum, and decreasing it causes a "harder" maximum. In the limit as the temperature goes to 0, softmax approaches max; in the limit as it goes to infinity, softmax(x) approaches a uniform distribution.
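
A quick NumPy check of this identity and of the limiting behaviour (my own illustration; the logit vector x is an arbitrary example):

```python
# Verifies softmax(x, T) = softmax(x / T, 1) and shows the effect of T.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])

# Identity: raising the temperature is the same as rescaling the logits.
assert np.allclose(softmax(x, T=5.0), softmax(x / 5.0, T=1.0))

print("T = 0.01:", softmax(x, T=0.01))   # close to a hard max (one-hot)
print("T = 1   :", softmax(x, T=1.0))    # ordinary softmax
print("T = 100 :", softmax(x, T=100.0))  # close to a uniform distribution
```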

Defensive distillation proceeds in four steps (a code sketch follows the list):
1) Train a network, the teacher network, by setting the temperature of the softmax to T during the training phase.
2) Compute soft labels by applying the teacher network to each instance in the training set, again evaluating the softmax at temperature T.
3) Train the distilled network (a network with the same shape as the teacher network) on the soft labels, using softmax at temperature T.
4) Finally, when running the distilled network at test time (to classify new inputs), use temperature 1.
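
The following PyTorch sketch condenses the four steps above. It is my own illustration of the procedure rather than code from the paper; the two-layer model, the random MNIST-shaped data, and the hyper-parameters (T = 100, epochs, learning rate) are placeholder assumptions.

```python
# Condensed sketch of defensive distillation (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

T = 100.0  # distillation temperature used in steps 1-3

def make_model():
    # Teacher and distilled network share the same shape.
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(28 * 28, 256), nn.ReLU(),
                         nn.Linear(256, 10))

def train(model, loader, target_fn, epochs=2, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            # Steps 1 and 3: evaluate the softmax at temperature T while training.
            log_probs = F.log_softmax(model(x) / T, dim=1)
            loss = -(target_fn(x, y) * log_probs).sum(dim=1).mean()  # cross-entropy
            opt.zero_grad()
            loss.backward()
            opt.step()

# Dummy MNIST-shaped data so the sketch runs end to end.
xs = torch.randn(512, 1, 28, 28)
ys = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(xs, ys), batch_size=64)

# Step 1: train the teacher network at temperature T on the hard (one-hot) labels.
teacher = make_model()
train(teacher, loader, lambda x, y: F.one_hot(y, 10).float())

# Step 2: the soft labels are the teacher's softmax output, again at temperature T.
def soft_labels(x, y):
    with torch.no_grad():
        return F.softmax(teacher(x) / T, dim=1)

# Step 3: train the distilled network (same shape) on the soft labels, at temperature T.
distilled = make_model()
train(distilled, loader, soft_labels)

# Step 4: at test time, classify new inputs with temperature 1 (the plain softmax).
def classify(x):
    return F.softmax(distilled(x), dim=1).argmax(dim=1)

print(classify(xs[:5]))
```

One way to see why this makes the distilled network so confident: its weights were trained with the logits divided by T, so evaluating it at temperature 1 effectively scales the logits up by a factor of T, pushing the softmax output toward a nearly hard prediction.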