Zero-shot knowledge distillation in deep networks

Objective

Can we do Knowledge Distillation without (access to) training data (Zero-Shot)?

  • Data is precious and sensitive – won’t be shared
  • E.g. : Medical records, Biometric data, Proprietary data
  • Federated learning – Only models are available, not data

Existing methods

Knowledge Distillation (Hinton et al., 2015) enables the transfer of the complex mapping functions learned by cumbersome models to relatively simpler models.

Teacher Model: Teacher models generally deliver excellent performance, but they can be huge and computationally expensive. Hence, these models cannot be deployed in limited-resource environments or when real-time inference is expected.

Student Model: has a substantially smaller memory footprint, requires less computation, and thereby often achieves a much faster inference time than the much larger Teacher model.

Dark Knowledge: it is this knowledge that helps the Teacher generalize better, and it is transferred to the Student by matching their soft labels (the output of the softmax layer) instead of the one-hot encoded labels.

Main Idea

In the absence of any prior knowledge about the target data, we synthesize pseudo samples from the Teacher model, which then serve as the transfer set for performing the distillation.

Our approach obtains useful prior information about the underlying data distribution in the form of Class Similarities from the model parameters of the Teacher. Further, we utilize this prior in the crafting process by modelling the output space of the Teacher model as a Dirichlet distribution. We name the crafted samples Data Impressions (DI), as they are the impressions of the training data as understood by the Teacher model.

Knowledge Distillation

$$L = \sum_{(x,y)\in D} L_{KD}\big(S(x,\theta_S,\tau),\, T(x,\theta_T,\tau)\big) + \lambda\, L_{CE}(\hat{y}_S, y)$$

$L_{CE}$ is the cross-entropy loss computed on the labels $\hat{y}_S$ predicted by the Student and their corresponding ground-truth labels $y$.

$L_{KD}$ is the distillation loss (e.g., cross-entropy or mean squared error) comparing the soft labels (softmax outputs) predicted by the Student against the soft labels predicted by the Teacher. $T(x, \theta_T, \tau)$ represents the softmax output of the Teacher and $S(x, \theta_S, \tau)$ denotes the softmax output of the Student. Here, $\tau$ represents the softmax temperature.

Modelling the Data in Softmax Space

Let $s \sim p(s)$ be the random vector representing the softmax output of the Teacher, $T(x,\theta_T)$. We model $p(s^k)$, belonging to each class $k$, using a Dirichlet distribution, which is a distribution over vectors whose components lie in the $[0,1]$ range and sum to 1. Thus, the distribution representing the softmax outputs $s^k$ of class $k$ is modelled as $Dir(K, \alpha^k)$, where $k \in \{1,\dots,K\}$ is the class index, $K$ is the dimension of the output probability vector (the number of categories in the recognition problem), and $\alpha^k$ is the concentration parameter of the distribution modelling class $k$. The concentration parameter $\alpha^k$ is a $K$-dimensional vector of positive reals, i.e., $\alpha^k = [\alpha^k_1, \alpha^k_2, \dots, \alpha^k_K]$ with $\alpha^k_i > 0,\ \forall i$.

Concentration Parameter ($\alpha$):

Since the sample space of the Dirichlet distribution is interpreted as a discrete probability distribution (over the labels), intuitively the concentration parameter ($\alpha$) can be thought of as determining how "concentrated" the probability mass of a sample drawn from the Dirichlet distribution is likely to be.

Obtaining prior information for the concentration parameter is not straightforward. The parameter cannot be the same for all components, since that would make every set of probabilities equally likely, which is unrealistic. For instance, on the CIFAR-10 dataset, a softmax output assigning equal confidence to the dog and plane classes makes little sense (since they are visually dissimilar). Similarly, identical $\alpha_i$ values indicate the absence of any prior information favouring one component of the sampled softmax vector over another. Hence, the concentration parameter should be specified so as to reflect the similarities between the components of the softmax vector. Since these components denote the underlying categories of the recognition problem, $\alpha$ should reflect the visual similarities among them.

Class Similarity Matrix

The final layer of a typical recognition model is a fully connected layer with a softmax non-linearity.

Each neuron in this layer corresponds to a class ($k$), and its activation is treated as the probability predicted by the model for that class.

The weights connecting the previous layer to this neuron ($w_k$) can be considered the template of class $k$ learned by the Teacher network.

Reason: the predicted class probability is proportional to the alignment of the pre-final layer's output with the template ($w_k$). The predicted probability peaks when the pre-final layer's output is a positively scaled version of this template ($w_k$).
$$C(i,j) = \frac{w_i^T w_j}{\|w_i\|\,\|w_j\|}$$
Since the elements of the concentration parameter have to be positive real numbers, we further perform a min-max normalization over each row of the class similarity matrix.
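
As a concrete illustration, here is a minimal NumPy sketch (not the authors' code; the function name and the small positive offset are assumptions) of computing the class similarity matrix from the Teacher's final fully connected weights and row-wise min-max normalizing it so each row can serve as a concentration parameter:

```python
import numpy as np

def class_similarity_matrix(fc_weights, eps=1e-6):
    """fc_weights: (K, d) array whose k-th row is the class template w_k."""
    w = fc_weights / np.linalg.norm(fc_weights, axis=1, keepdims=True)
    c = w @ w.T                                 # cosine similarities, shape (K, K)
    # Row-wise min-max normalization; a tiny offset keeps every entry
    # strictly positive, as a Dirichlet concentration parameter requires.
    c_min = c.min(axis=1, keepdims=True)
    c_max = c.max(axis=1, keepdims=True)
    return (c - c_min) / (c_max - c_min) + eps
```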

Crafting Data Impression via Dirichlet Sampling

Let $Y^k = [y_1^k, y_2^k, \dots, y_N^k] \in \mathbb{R}^{K \times N}$ be the $N$ softmax vectors corresponding to class $k$, sampled from the $Dir(K, \alpha^k)$ distribution.
$$\alpha^k = [\alpha^k_1, \alpha^k_2, \dots, \alpha^k_K]$$
Each row $c_k$ of the class similarity matrix $C$ can be treated as the concentration parameter ($\alpha$) of the Dirichlet distribution $Dir$, which models the distribution of output probability vectors belonging to class $k$:
$$\alpha^k = c_k$$

Generate $\bar{x}_i^k$ as a random noisy image and update it over multiple iterations until the cross-entropy loss between the sampled softmax vector ($y_i^k$) and the softmax output predicted by the Teacher is minimized:

$$\bar{x}_i^k = \mathop{\arg\min}\limits_{x}\ L_{CE}\big(y_i^k,\, T(x, \theta_T, \tau)\big)$$
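
The crafting step can be sketched roughly as follows (assumed PyTorch; the function name, optimizer choice, and hyperparameter values are illustrative, not the authors' exact settings):

```python
import torch
import torch.nn.functional as F

def craft_data_impressions(teacher, alpha_k, n_samples, img_shape=(1, 32, 32),
                           tau=20.0, beta=1.0, steps=1500, lr=0.01):
    """alpha_k: (K,) tensor, the concentration parameter (row c_k) for class k."""
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)               # the Teacher stays frozen
    # Sample N target softmax vectors for class k from Dir(K, beta * alpha_k).
    y_k = torch.distributions.Dirichlet(beta * alpha_k).sample((n_samples,))
    # Start from random noise images and optimize the pixels directly.
    x = torch.randn(n_samples, *img_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = F.log_softmax(teacher(x) / tau, dim=1)
        # Cross-entropy between the sampled soft target y_k and T(x, theta_T, tau).
        loss = -(y_k * log_p).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return x.detach(), y_k
```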
Scaling Factor ($\beta$)

The probability density function of the Dirichlet distribution for $K$ random variables is supported on a $(K-1)$-dimensional probability simplex that lives in a $K$-dimensional space.

When $\alpha_i < 1,\ \forall i \in [1, K]$: the density congregates at the edges of the simplex.

When $\alpha_i > 1,\ \forall i \in [1, K]$: the density becomes more concentrated in the center of the simplex.

Thus, we define a scaling factor ($\beta$) that controls the range of the individual elements of the concentration parameter, which in turn decides the regions of the simplex from which sampling is performed.

The actual sampling of the probability vectors happens from:

$$p(s) = Dir(K, \beta \times \alpha)$$

$\beta$ controls the $l_1$-norm of the final concentration parameter which, in turn, is inversely related to the variance of the distribution.
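
A quick way to see the effect of $\beta$ is to sample with NumPy (the $\alpha$ vector here is a made-up 3-class example, not taken from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 0.6, 0.3])   # hypothetical row of the similarity matrix

for beta in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(beta * alpha, size=5000)
    # Small beta: mass near the corners/edges of the simplex (peaky vectors).
    # Large beta: mass concentrated around alpha / alpha.sum() (low variance).
    print(f"beta={beta:>4}: per-component std = {samples.std(axis=0).round(3)}")
```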

Zero-Shot Knowledge Distillation

We drop the cross-entropy loss from the general distillation objective, since it yields little or no performance improvement, and doing so removes the burden of tuning the hyperparameter $\lambda$:

$$\theta_S = \mathop{\arg\min}\limits_{\theta_S} \sum_{\bar{x}\in\bar{X}} L_{KD}\big(T(\bar{x},\theta_T,\tau),\, S(\bar{x},\theta_S,\tau)\big)$$

We generate a diverse set of pseudo training examples via Dirichlet sampling that provide enough information to train the Student model.
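
Putting the previous sketches together, the Student update might look roughly like this (assumed PyTorch; the training schedule and optimizer are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_student_zskd(student, teacher, data_impressions, tau=20.0,
                       epochs=100, batch_size=128, lr=0.01):
    """data_impressions: tensor of crafted samples (no ground-truth labels needed)."""
    teacher.eval()
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    loader = DataLoader(TensorDataset(data_impressions),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for (x_bar,) in loader:
            with torch.no_grad():
                p_teacher = F.softmax(teacher(x_bar) / tau, dim=1)
            log_p_student = F.log_softmax(student(x_bar) / tau, dim=1)
            loss = -(p_teacher * log_p_student).sum(dim=1).mean()   # L_KD only
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```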

Experiments

Datasets

MNIST, Fashion MNIST, CIFAR-10

As all the experiments on these three datasets deal with classification problems with 10 categories each, the value of the parameter $K$ in all our experiments is 10.

Experiment Details

Two scaling factors are used: $\beta_1 = 1.0$ and $\beta_2 = 0.1$. For each dataset, half of the Data Impressions are generated with $\beta_1$ and the other half with $\beta_2$.

A temperature value ($\tau$) of 20 is used across all the datasets.

We augment the samples using regular operations such as scaling, translation, rotation, flipping, etc., which has proven useful in further boosting model performance (Dao et al., 2018).
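
For reference, a pipeline of this kind could be written with torchvision as follows (the specific ranges are assumptions, not the paper's settings):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=10,             # rotation
                   translate=(0.1, 0.1),   # translation
                   scale=(0.9, 1.1)),      # scaling
    T.RandomHorizontalFlip(p=0.5),         # flipping
])
```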

Experiment Results

MNIST

LeNet-5 as the Teacher model; 61,706 parameters

LeNet-5-Half as the Student model; 35,820 parameters

| Model | Performance (%) | Explanation |
| --- | --- | --- |
| Teacher-CE | 99.34 | The classification accuracy of the Teacher model trained using the cross-entropy (CE) loss. |
| Student-CE | 98.92 | The performance of the Student model trained with all the training samples and their ground-truth labels using cross-entropy loss. |
| Student-KD (Hinton et al., 2015), 60K original data | 99.25 | The accuracy of the Student model trained on the actual training samples through Knowledge Distillation (KD) from the Teacher. |
| (Kimura et al., 2018), 200 original data | 86.70 | |
| (Lopes et al., 2017), uses metadata | 92.47 | |
| ZSKD (Ours), 24000 DIs and no original data | 98.77 | Outperforms the existing few-data (Kimura et al., 2018) and data-free (Lopes et al., 2017) counterparts by a great margin. It performs close to full-data (classical) Knowledge Distillation while using only 24000 DIs, i.e., 40% of the original training set size. |
Fashion MNIST

LeNet-5 as the Teacher model; 61,706 parameters

LeNet-5-Half as the Student model; 35,820 parameters

| Model | Performance (%) |
| --- | --- |
| Teacher-CE | 90.84 |
| Student-CE | 89.43 |
| Student-KD (Hinton et al., 2015), 60K original data | 89.66 |
| (Kimura et al., 2018), 200 original data | 72.50 |
| ZSKD (48000 DIs and no original data) | 79.62 |
CIFAR-10

AlexNet as the Teacher model.

Considering the complexity of the dataset, a larger transfer set of 40000 DI samples is used, which is still less than 20% of the original training set size.

Size of the Transfer Set

Different numbers of Data Impressions are used, such as 1%, 5%, 10%, ..., 80% of the training set size.

As the dataset becomes more complex, more Data Impressions need to be generated to capture the underlying patterns in the dataset. Note that a similar trend is also observed during distillation with the actual training samples.

We observe that the proposed Dirichlet modelling of the output space and the reconstructed impressions consistently outperform their counterparts by a large margin. Moreover, in the case of Class Impressions, the performance gain from increasing the transfer-set size is relatively small compared to Data Impressions. Note that, for a better understanding, the results shown here are obtained without any data augmentation during distillation.

Aside

Beta Distribution

$Beta(\alpha, \beta)$

Parameters:

$\alpha > 0,\ \beta > 0,\ x \in [0, 1]$

The probability density function (pdf) of the beta distribution, for $0 \leq x \leq 1$ and shape parameters $\alpha, \beta > 0$, is a power function of the variable $x$ and of its reflection $(1-x)$, as follows:
$$f(x;\alpha,\beta) = \mathrm{constant}\cdot x^{\alpha-1}(1-x)^{\beta-1}$$

$$= \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du}$$

$$= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$$

$$= \frac{1}{B(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$$

$$E[X] = \frac{\alpha}{\alpha+\beta}$$

where $\Gamma$ is the Gamma function. The Gamma function, as an extension of the factorial, is a meromorphic function defined on the complex plane.

(1) On the real domain, the Gamma function is defined as:

$$\Gamma(x) = \int_0^{+\infty} t^{x-1} e^{-t}\,dt \quad (x > 0)$$

For any positive integer $n$:

$$\Gamma(n) = (n-1)!$$

(2) On the complex domain, the Gamma function is defined by the same integral:

$$\Gamma(z) = \int_0^{+\infty} t^{z-1} e^{-t}\,dt \quad (\mathrm{Re}(z) > 0)$$
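
The identities above can be checked numerically (plain Python/NumPy; just a sanity check, not part of the method):

```python
import math
import numpy as np

# Gamma(n) = (n-1)! for positive integers n.
for n in range(1, 8):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

# E[X] = alpha / (alpha + beta) for X ~ Beta(alpha, beta).
rng = np.random.default_rng(0)
a, b = 2.0, 5.0
samples = rng.beta(a, b, size=200_000)
print(samples.mean(), a / (a + b))   # the two numbers should be close
```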

Dirichlet Process

The Dirichlet process (DP) is "a distribution over probability distributions", determined by two parameters $\alpha$ and $G_0$, i.e., $G \sim DP(\alpha, G_0)$. $\alpha$ is the concentration (or scaling) parameter: the larger its value, the more uniformly spread the resulting distribution is; the smaller its value, the more concentrated it becomes. $G_0$ is the base distribution.

You can think of the DP as a black box: the input is the base distribution $G_0$, the output is a distribution $G$, and $\alpha$ controls what the output looks like.

Problem Setting

We have a dataset generated by a mixture of Gaussians and want to cluster it, but we do not know how many Gaussian components the data was generated from.

Problem Characteristics

  1. The number of clusters is unknown
  2. Non-parametric, i.e., the parameters are not fixed in advance; if needed, the number of parameters can grow
  3. The number of clusters follows a probability distribution

Possible Approaches

Run Expectation Maximization (EM) on a Gaussian Mixture Model (GMM), analyze the result, and iterate. Alternatively, use hierarchical clustering, e.g., Hierarchical Agglomerative Clustering (HAC), followed by manual pruning.

However, what we would really prefer is a statistically grounded method that avoids subjective choices (such as manually fixing the number of clusters or manually pruning the hierarchy) as far as possible when clustering the data.
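
As a sketch of that statistical route (assumed scikit-learn; the toy data and threshold are made up), a Dirichlet-process mixture lets the data decide how many Gaussian components are effectively used:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy data from 3 Gaussians; the model itself is not told this number.
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(200, 2))
               for m in ([0, 0], [3, 3], [0, 4])])

dpgmm = BayesianGaussianMixture(
    n_components=10,                                   # upper bound only
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

# Components with non-negligible weight are the clusters actually used.
print(np.sum(dpgmm.weights_ > 1e-2))
```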

https://www.zhihu.com/question/26751755

Related Sources

https://github.com/vcl-iisc/ZSKD

Presentation

Poster
