Distillation Networks and Defensive Distillation

From the paper "Towards Evaluating the Robustness of Neural Networks"

Distillation was initially proposed as an approach to reduce a large model (the teacher) down to a smaller distilled model. At a high level, distillation works by first training the teacher model on the training set in a standard manner. Then, we use the teacher to label each instance in the training set with soft labels (the output vector from the teacher network). For example, while the hard label for an image of a hand-written digit 7 will say it is classified as a seven, the soft labels might say it has an 80% chance of being a seven and a 20% chance of being a one. Then, we train the distilled model on the soft labels from the teacher, rather than on the hard labels from the training set. Distillation can potentially increase accuracy on the test set as well as the rate at which the smaller model learns to predict the hard labels.
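
To make the hard-label vs. soft-label distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper). The teacher and student logits are made-up values, chosen so the teacher's soft label roughly matches the 80%/20% example above.

```python
# A minimal NumPy sketch of training targets for distillation (illustration only).
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hard label for a hand-written "7" in a 10-class problem: a one-hot vector.
hard_label = np.zeros(10)
hard_label[7] = 1.0

# Soft label: the teacher's full output distribution over the 10 classes.
teacher_logits = np.array([0.0, 4.6, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 0.0])
soft_label = softmax(teacher_logits)   # ~0.79 on class 7, ~0.20 on class 1

def cross_entropy(target, predicted):
    return -np.sum(target * np.log(predicted + 1e-12))

# The distilled model is trained to match the soft label (the teacher's
# output) rather than the one-hot hard label from the training set.
student_probs = softmax(np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0]))
print("loss vs. hard label:", cross_entropy(hard_label, student_probs))
print("loss vs. soft label:", cross_entropy(soft_label, student_probs))
```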

Defensive distillation uses distillation in order to increase the robustness of a neural network, but with two significant changes. First, both the teacher model and the distilled model are identical in size: defensive distillation does not result in smaller models. Second, and more importantly, defensive distillation uses a large distillation temperature (described below) to force the distilled model to become more confident in its predictions. Recall that the softmax function is the last layer of a neural network. Defensive distillation modifies the softmax function to also include a temperature constant T.
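
In symbols, this is the standard temperature-parameterized softmax (reproduced here for clarity; x is the logit vector entering the softmax layer and T is the temperature constant):

```latex
% Temperature-parameterized softmax; x is the logit vector, T the temperature.
\[
  \operatorname{softmax}(x, T)_i = \frac{e^{x_i / T}}{\sum_{j} e^{x_j / T}}
\]
```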

It is easy to see that softmax(x, T) = softmax(x/T, 1). Intuitively, increasing the temperature causes a "softer" maximum, and decreasing it causes a "harder" maximum. In the limit as the temperature goes to 0, softmax approaches max; in the limit as it goes to infinity, softmax(x) approaches a uniform distribution.
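
A quick NumPy check of this identity and of the limiting behaviour (my own illustration; the logit vector x is an arbitrary example):

```python
# Verifies softmax(x, T) = softmax(x / T, 1) and shows the effect of T.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])

# Identity: raising the temperature is the same as rescaling the logits.
assert np.allclose(softmax(x, T=5.0), softmax(x / 5.0, T=1.0))

print("T = 0.01:", softmax(x, T=0.01))   # close to a hard max (one-hot)
print("T = 1   :", softmax(x, T=1.0))    # ordinary softmax
print("T = 100 :", softmax(x, T=100.0))  # close to a uniform distribution
```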

Defensive distillation proceeds in four steps (a code sketch follows the list):
1) Train a network, the teacher network, by setting the temperature of the softmax to T during the training phase.
2) Compute soft labels by applying the teacher network to each instance in the training set, again evaluating the softmax at temperature T.
3) Train the distilled network (a network with the same shape as the teacher network) on the soft labels, using softmax at temperature T.
4) Finally, when running the distilled network at test time (to classify new inputs), use temperature 1.
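
The following PyTorch sketch condenses the four steps above. It is my own illustration of the procedure rather than code from the paper; the two-layer model, the random MNIST-shaped data, and the hyper-parameters (T = 100, epochs, learning rate) are placeholder assumptions.

```python
# Condensed sketch of defensive distillation (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

T = 100.0  # distillation temperature used in steps 1-3

def make_model():
    # Teacher and distilled network share the same shape.
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(28 * 28, 256), nn.ReLU(),
                         nn.Linear(256, 10))

def train(model, loader, target_fn, epochs=2, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            # Steps 1 and 3: evaluate the softmax at temperature T while training.
            log_probs = F.log_softmax(model(x) / T, dim=1)
            loss = -(target_fn(x, y) * log_probs).sum(dim=1).mean()  # cross-entropy
            opt.zero_grad()
            loss.backward()
            opt.step()

# Dummy MNIST-shaped data so the sketch runs end to end.
xs = torch.randn(512, 1, 28, 28)
ys = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(xs, ys), batch_size=64)

# Step 1: train the teacher network at temperature T on the hard (one-hot) labels.
teacher = make_model()
train(teacher, loader, lambda x, y: F.one_hot(y, 10).float())

# Step 2: the soft labels are the teacher's softmax output, again at temperature T.
def soft_labels(x, y):
    with torch.no_grad():
        return F.softmax(teacher(x) / T, dim=1)

# Step 3: train the distilled network (same shape) on the soft labels, at temperature T.
distilled = make_model()
train(distilled, loader, soft_labels)

# Step 4: at test time, classify new inputs with temperature 1 (the plain softmax).
def classify(x):
    return F.softmax(distilled(x), dim=1).argmax(dim=1)

print(classify(xs[:5]))
```

One way to see why this makes the distilled network so confident: its weights were trained with the logits divided by T, so evaluating it at temperature 1 effectively scales the logits up by a factor of T, pushing the softmax output toward a nearly hard prediction.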