Evolving Losses for Unsupervised Video Representation Learning: Paper Notes
Distillation
Knowledge Distillation (notes adapted from zhihu)
Distill knowledge from the teacher model Net-T into the student model Net-S.
Goal: obtain a smaller model that is easier to deploy.
$$L = \alpha L_{soft} + \beta L_{hard}$$
$$L_{soft} = -\sum_{j}^{N} p_{j}^{T} \log\left(q_{j}^{T}\right), \text{ where } p_{i}^{T} = \frac{\exp\left(v_{i}/T\right)}{\sum_{k}^{N} \exp\left(v_{k}/T\right)}, \quad q_{i}^{T} = \frac{\exp\left(z_{i}/T\right)}{\sum_{k}^{N} \exp\left(z_{k}/T\right)}$$
$$L_{hard} = -\sum_{j}^{N} c_{j} \log\left(q_{j}^{1}\right), \text{ where } q_{i}^{1} = \frac{\exp\left(z_{i}\right)}{\sum_{j}^{N} \exp\left(z_{j}\right)}$$

Here $v_i$ are the teacher's logits, $z_i$ the student's logits, $T$ is the softmax temperature, and $c_j$ is the one-hot ground-truth label.
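Below is a minimal sketch of how this combined loss could be computed, assuming PyTorch. The function name `distillation_loss` and the default values of `T`, `alpha`, and `beta` are illustrative placeholders, not taken from the paper or the zhihu post.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=4.0, alpha=0.9, beta=0.1):
    # Hypothetical helper computing L = alpha * L_soft + beta * L_hard.
    # p^T: teacher distribution softened by temperature T.
    p = F.softmax(teacher_logits / T, dim=1)
    # log q^T: log of the student distribution at the same temperature.
    log_q = F.log_softmax(student_logits / T, dim=1)
    # L_soft = -sum_j p_j^T * log(q_j^T), averaged over the batch.
    l_soft = -(p * log_q).sum(dim=1).mean()
    # L_hard: ordinary cross-entropy against the ground-truth labels c.
    l_hard = F.cross_entropy(student_logits, labels)
    return alpha * l_soft + beta * l_hard
```

For example, with `student_logits = torch.randn(8, 10, requires_grad=True)`, `teacher_logits = torch.randn(8, 10)`, and `labels = torch.randint(0, 10, (8,))`, the returned loss can be backpropagated through the student only. One common refinement (from Hinton et al.'s original distillation paper) is to scale the soft term by $T^2$ so its gradient magnitude stays comparable to the hard term as $T$ varies.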